
Cluster nodes leaving break component routing

Description

When a node in an established cluster shuts down, component routing goes awry. More details in https://discourse.igniterealtime.org/t/after-one-cluster-node-goes-down-clients-cannot-rejoin-rooms

Environment

None

Acceptance Test - Entry

For the purpose of this test, we can distinguish two categories of caches:

  1. Those used to make the results of costly computations (e.g. database lookups) available for quick access

  2. Those that contain state (e.g. routing tables)

The former are expected to contain data that, when not available in the cache, can be recomputed. These are of lesser importance in the context of clustering (as cluster nodes can likely recompute cache entries if they're missing)

The latter typically cannot be recomputed, as their state lives in memory only. These caches are subject to test.

I'm assuming that we can determine to which category each cache belongs by looking at its configuration: when a cache is configured to allow for a finite amount of data that eventually expires, then cache entries can likely be recomputed, and the cache is probably of the first category (not of interest to this test). When a cache is configured to hold an unlimited amount of data for an unlimited amount of time, it is likely to be of the second category. We should test these caches in the context of this JIRA issue.
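
To make that classification reproducible, something along the lines of the sketch below could be run from within Openfire (e.g. from a plugin). It assumes the CacheFactory/Cache API, and that a maximum size or lifetime of -1 denotes 'unlimited':

{code:java}
import org.jivesoftware.util.cache.Cache;
import org.jivesoftware.util.cache.CacheFactory;

// Sketch: classify each cache by its configuration, as described above.
// Assumption: a maximum size or lifetime of -1 means 'unlimited'.
public class CacheClassifier {
    public static void listCaches() {
        for (Cache cache : CacheFactory.getAllCaches()) {
            boolean unbounded = cache.getMaxCacheSize() == -1 && cache.getMaxLifetime() == -1;
            System.out.printf("%s -> %s%n", cache.getName(),
                unbounded ? "state (category 2, subject to test)" : "recomputable (category 1)");
        }
    }
}
{code}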

Using the definition above, the caches that need not be tested are: Client Session Info Cache, Entity Capabilities, Entity Capabilities Users, Favicon Hits, Favicon Misses, File Transfer, File Transfer Cache, Group, Group Metadata Cache, JID Domain-parts, JID Node-parts, JID Resource-parts, Last Activity Cache, LDAP UserDN, Locked Out Accounts, Multicast Service, Offline Message Size, Offline Presence Cache, PEPServiceManager, Privacy Lists, Remote Users Existence, Roster, RosterItems, User, VCard

The caches that are of interest (and for which I'll outline expected behavior in the text below): Components, Components Sessions, Connection Managers Sessions, Directed Presences, Disco Server Features, Disco Server Items, Incoming Server Sessions, Pubsub InMemory Default Node Config, Pubsub InMemory Nodes, Pubsub InMemory Published Items, Routing AnonymousUsers Cache, Routing Components Cache, Routing Servers Cache, Routing User Sessions, Routing Users Cache, Sessions by Hostname, Validated Domains

Note that I'm considering only the caches that are available in Openfire 4.6.0-beta, without any plugins.

Given two cluster nodes (a and b):

The assertions defined below should all hold true on all cluster nodes, before, during, and after clustering (two nodes times three phases: six verification checks are needed).

The 'before', 'during' and 'after' clustering scenarios assume that the individual nodes all remain running, and that clustering is enabled/disabled either through configuration (disabling the feature in Openfire) or through network interruptions. The 'before' and 'after' scenarios could also be realized by starting up/shutting down Openfire instances completely. It is expected that all scenarios lead to the same behavior, but this should be verified by testing.
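
To make the per-node verification concrete, a check could be scripted along the lines of the sketch below. The key/value types are assumptions based on the description of the 'Components' cache further down; any other means of inspecting cache content on each node works equally well:

{code:java}
import java.util.Map;
import java.util.Set;
import org.jivesoftware.openfire.XMPPServer;
import org.jivesoftware.openfire.cluster.NodeID;
import org.jivesoftware.util.cache.Cache;
import org.jivesoftware.util.cache.CacheFactory;

// Sketch: dump a routing-style cache on the local node, marking which entries
// reference the local node ID. Key/value types are assumed, based on the
// description of the 'Components' cache below.
public class CacheDump {
    public static void dumpComponentsCache() {
        Cache<String, Set<NodeID>> cache = CacheFactory.createCache("Components");
        NodeID local = XMPPServer.getInstance().getNodeID();
        for (Map.Entry<String, Set<NodeID>> entry : cache.entrySet()) {
            System.out.printf("%s -> %s (includes local node: %b)%n",
                entry.getKey(), entry.getValue(), entry.getValue().contains(local));
        }
    }
}
{code}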

Expected behavior scenarios for cache 'Components' (key is a domain-part only JID, value is an array of node IDs)

Scenario A
The cache will initially (after a cold start) hold the same data (reflecting the internal components that are always active) on all nodes.

  • Before clustering: various entries, each entry has a value containing one node ID (that of the local node)

  • During clustering: the same number of entries, each entry has a value that contains two node IDs (one for each node)

  • After clustering: the same number of entries, each entry has a value containing one node ID (that of the local node)

Scenario B
Additional entries can be added by connecting an external component or installing a plugin that provides an additional internal component. Assuming that the component is connected to exactly one node (a sketch of such a connection follows the list below):

  • Before clustering: the additional entry should be visible on only the node to which the external component connects, its value containing one nodeID (that of the node to which the component is connected).

  • During clustering: the additional entry should be visible on both nodes, its value containing one nodeID (that of the node to which the component is connected)

  • After clustering: the additional entry should be visible on only the node to which the external component connects, its value containing one nodeID (that of the node to which the component is connected).
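
A minimal sketch of connecting such an external component to a single node, using the Whack library and Tinder's AbstractComponent; host, port, subdomain and shared secret are hypothetical and need to match the node's external component settings:

{code:java}
import org.jivesoftware.whack.ExternalComponentManager;
import org.xmpp.component.AbstractComponent;
import org.xmpp.component.ComponentException;

// Sketch: connect a trivial external component (XEP-0114) to a single cluster
// node, so that an additional entry appears in the 'Components' cache.
public class TestComponent extends AbstractComponent {
    @Override
    public String getName() { return "test-component"; }

    @Override
    public String getDescription() { return "Component used to verify the 'Components' cache"; }

    public static void main(String[] args) throws ComponentException {
        // Hypothetical host, port, subdomain and secret.
        ExternalComponentManager manager = new ExternalComponentManager("node-a.example.org", 5275);
        manager.setSecretKey("test", "secret");
        manager.addComponent("test", new TestComponent()); // becomes 'test.<xmpp-domain>'
    }
}
{code}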

Scenario C
A variant of scenario B, but with an instance of the component now connected to each node (as opposed to only one node)

  • Before clustering: the additional entry should be visible on each node, its value containing one nodeID (that of the local node)

  • During clustering: the additional entry should be visible on both nodes, its value containing two nodeIDs

  • After clustering: the additional entry should be visible on each node, its value containing one nodeID (that of the local node)

Expected behavior scenarios for cache 'Disco Server Features' (key is a feature, value is an array of node IDs)

Scenario A
The cache will initially (after a cold start) hold the same data (reflecting the default features) on all nodes.

  • Before clustering: various entries, each entry has a value containing one node ID (that of the local node)

  • During clustering: the same number of entries, each entry has a value that contains two node IDs (one for each node)

  • After clustering: the same number of entries, each entry has a value containing one node ID (that of the local node)

Scenario B
Additional entries can be added by installing a plugin that implements a Server Feature Provider. Assuming that the plugin is loaded on exactly one node (a sketch of such a plugin follows the list below):

  • Before clustering: the additional entry should be visible on only the node on which the plugin is loaded, its value containing one nodeID (that of the node on which the plugin is loaded).

  • During clustering: the additional entry should be visible on both nodes, its value containing one nodeID (that of the node on which the plugin is loaded)

  • After clustering: the additional entry should be visible on only the node on which the plugin is loaded, its value containing one nodeID (that of the node on which the plugin is loaded).
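
A minimal sketch of such a plugin (class name and feature namespace are hypothetical, and a plugin.xml descriptor is needed as well); for brevity it registers the extra feature directly through the IQDiscoInfoHandler rather than by exposing a Server Feature Provider:

{code:java}
import java.io.File;
import org.jivesoftware.openfire.XMPPServer;
import org.jivesoftware.openfire.container.Plugin;
import org.jivesoftware.openfire.container.PluginManager;

// Sketch: a plugin that registers one extra server feature, so that an
// additional entry appears in the 'Disco Server Features' cache on the node
// where the plugin is loaded. The feature namespace is hypothetical.
public class ExtraFeaturePlugin implements Plugin {
    private static final String FEATURE = "urn:example:cluster-cache-test";

    @Override
    public void initializePlugin(PluginManager manager, File pluginDirectory) {
        XMPPServer.getInstance().getIQDiscoInfoHandler().addServerFeature(FEATURE);
    }

    @Override
    public void destroyPlugin() {
        XMPPServer.getInstance().getIQDiscoInfoHandler().removeServerFeature(FEATURE);
    }
}
{code}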

Scenario C
A variant of scenario B, but with an instance of the Server Feature Provider plugin loaded on each node (as opposed to only one node)

  • Before clustering: the additional entry should be visible on each node, its value containing one nodeID (that of the local node)

  • During clustering: the additional entry should be visible on both nodes, its value containing two nodeIDs

  • After clustering: the additional entry should be visible on each node, its value containing one nodeID (that of the local node)

Expected behavior scenarios for cache 'Disco Server Items' (key is a feature, value is a custom object that includes an element as well as a collection of nodes)

Implementation-wise, the usage of this cache differs considerably from the ones above. Those had collection-based values, while the values of this cache are always a single object (that internally contains a collection).
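
To illustrate the difference, the value type is presumably shaped roughly as in the sketch below (an illustrative reconstruction based on the description above, not the actual Openfire class); a leaving node can therefore not simply be removed from a collection-valued entry, the wrapping object has to be rewritten or evicted as a whole:

{code:java}
import java.util.HashSet;
import java.util.Set;
import org.dom4j.Element;
import org.jivesoftware.openfire.cluster.NodeID;

// Illustrative shape of a 'Disco Server Items' cache value: a single object
// wrapping the advertised item's element plus the cluster nodes providing it.
public class ServerItemEntry {
    Element element;                     // the disco#items child element
    Set<NodeID> nodes = new HashSet<>(); // cluster nodes advertising this item

    // When a node leaves, the whole cached value must be replaced (or removed
    // when no nodes remain), rather than just dropping one element from a
    // cached collection, as is possible for the caches above.
    boolean removeNode(NodeID leavingNode) {
        nodes.remove(leavingNode);
        return nodes.isEmpty();          // caller removes the cache entry if true
    }
}
{code}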

Scenario A
The cache will initially (after a cold start) hold the same data (reflecting the default items being provided) on all nodes.

  • Before clustering: various entries, each entry refers to one node ID (that of the local node)

  • During clustering: the same number of entries, each entry refers to two node IDs (one for each node)

  • After clustering: the same number of entries, each entry refers to one node ID (that of the local node)

Scenario B
Additional entries can be added by installing a plugin that implements a Server Item Provider. Assuming that the plugin is loaded on exactly one node:

  • Before clustering: the additional entry should be visible on only the node on which the plugin is loaded, its value referring to one nodeID (that of the node on which the plugin is loaded).

  • During clustering: the additional entry should be visible on both nodes, its value referring to one nodeID (that of the node on which the plugin is loaded)

  • After clustering: the additional entry should be visible on only the node on which the plugin is loaded, its value referring to one nodeID (that of the node on which the plugin is loaded).

Scenario C
A variant of scenario B, but with an instance of the Server Item Provider plugin loaded on each node (as opposed to only one node)

  • Before clustering: the additional entry should be visible on each node, its value referring to one nodeID (that of the local node)

  • During clustering: the additional entry should be visible on both nodes, its value referring to two nodeIDs

  • After clustering: the additional entry should be visible on each node, its value referring to one nodeID (that of the local node)

Expected behavior scenarios for cache 'Routing Components Cache'

The content and behavior of this cache should be identical to that of the 'Components' cache.

Expected behavior scenarios for cache 'Components Sessions'
(to be determined)

Expected behavior scenarios for cache 'Connection Managers Sessions'
(to be determined)

Expected behavior scenarios for cache 'Directed Presences'
(to be determined)

Expected behavior scenarios for cache 'Incoming Server Sessions'
(to be determined)

Expected behavior scenarios for cache 'Pubsub InMemory Default Node Config'
(to be determined)

Expected behavior scenarios for cache 'Pubsub InMemory Nodes'
(to be determined)

Expected behavior scenarios for cache 'Pubsub InMemory Published Items'
(to be determined)

Expected behavior scenarios for cache 'Routing AnonymousUsers Cache'
(to be determined)

Expected behavior scenarios for cache 'Routing Servers Cache'
(to be determined)

Expected behavior scenarios for cache 'Routing User Sessions'
(to be determined)

Expected behavior scenarios for cache 'Routing Users Cache'
(to be determined)

Expected behavior scenarios for cache 'Sessions by Hostname'
(to be determined)

Expected behavior scenarios for cache 'Validated Domains'
(to be determined)

Assignee

Guus der Kinderen

Reporter

Guus der Kinderen

Labels

None

Expected Effort

None

Components

Fix versions

Priority

Critical