As reported in https://discourse.igniterealtime.org/t/openfire-cluster-unable-to-recover-from-nodes-crashing/76594, the following steps lead to NullPointerExceptions and ClassCastExceptions:
1. Send a message from client A (connected to node A) to client B (connected to node B)
2. Client B receives the message
3. Send a SIGTERM to the Openfire process running on node A
4. Restart Openfire on node A
5. Reconnect client A
6. Send a message from client A (connected to node A) to client B (connected to node B)
7. Client B receives the message
8. Send a SIGTERM to the Openfire process running on node B
9. Restart Openfire on node B
10. Reconnect client B
11. Send a message from client A (connected to node A) to client B (connected to node B)
Logs from Node B:
FYI, in case it helps: I found that a "fast" restart of Openfire causes this every time, but if you leave it down for a while (over 30 seconds by default, I think) between steps 3-4 and 8-9, it is generally able to come back online.
I thought I had opened a JIRA issue on this, but I can't seem to find it now.
Ah, it was in the community, not JIRA: https://discourse.igniterealtime.org/t/openfire-hazelcast-cant-handle-fast-restart-of-node-but-ok-if-slow/62775
Moved to GitHub at https://github.com/igniterealtime/openfire-hazelcast-plugin/issues/4