Apparent thread lock (deadlock?) in MUC after clustering reimplementation

Description

This issue affects the as-yet unreleased rewrite of Openfire clustering and MUC. This bug has not been found in any released version of Openfire.

This morning, while ignite was running Openfire 4.7.0 Alpha (build f8a96d1), I was unable to join the open_chat room. Looking at the logs of the second server node, I found that the thread dump plugin was regularly creating thread dumps: it detected that the core thread pool was exhausted.

Thread dumps from both server nodes show threads that appear 'stuck', waiting in MUC-related code.

A dump from each server node has been attached to this issue.

In the thread dumps of the second node, 7 threads are waiting to acquire a lock that is supposedly being held by a thread named socket_c2s_ssl-thread-5. That thread itself is waiting in a lock-related routine. At this time, I don't know if this relates to the same lock.

The last entries in the log file from the offending thread are from roughly 1.5 hours earlier (which suggests that the code is properly stuck).

Resource usage on the servers is normal (low).

For all the threads that seem related to this issue, the time of the last message that was logged is added to the table below. Sorted by time, this data might hold clues about the order of events.

| last log | server | thread name | log message |
|----------|--------|-------------|-------------|
| 07:15:40 | node 1 | pool-5-thread-84 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Obtaining lock for room 'open_chat' of service 'conference' |
| 07:15:40 | node 1 | Server SR - 572505387 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Obtaining lock for room 'open_chat' of service 'conference' |
| 07:15:42 | node 1 | pool-5-thread-88 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Obtaining lock for room 'open_chat' of service 'conference' |
| 07:15:44 | node 2 | Server SR - 1410996709 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Obtaining lock for room 'open_chat' of service 'conference' |
| 07:15:54 | node 2 | socket_c2s_ssl-thread-2 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Obtaining lock for room 'open_chat' of service 'conference' |
| 07:24:05 | node 2 | socket_c2s_ssl-thread-5 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Obtaining lock for room 'open_chat' of service 'conference' |
| 07:29:32 | node 2 | socket_c2s_ssl-thread-10 | org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Stanza recipient: open_chat@conference.igniterealtime.org/takurafelix, room name: open_chat, sender: takurafelix@igniterealtime.org/phone.QhtUl-Z6KZoNsQ |
| 07:35:20 | node 1 | socket_c2s_ssl-thread-3 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Obtaining lock for room 'open_chat' of service 'conference' |
| 07:41:04 | node 2 | socket_c2s_ssl-thread-13 | org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Stanza recipient: open_chat@conference.igniterealtime.org/takurafelix, room name: open_chat, sender: takurafelix@igniterealtime.org/phone.QhtUl-Z6KZoNsQ |
| 07:56:29 | node 1 | TaskEngine-pool-767 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Removing room 'mrtodo_897aa7_aa1' of service 'conference' |
| 08:08:02 | node 2 | socket_c2s_ssl-thread-23 | org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Stanza recipient: open_chat@conference.igniterealtime.org/takurafelix, room name: open_chat, sender: takurafelix@igniterealtime.org/phone.QhtUl-Z6KZoNsQ |
| 08:13:46 | node 2 | socket_c2s_ssl-thread-28 | org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Stanza recipient: open_chat@conference.igniterealtime.org/takurafelix, room name: open_chat, sender: takurafelix@igniterealtime.org/phone.QhtUl-Z6KZoNsQ |
| 08:22:07 | node 2 | socket_c2s_ssl-thread-34 | org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Stanza recipient: open_chat@conference.igniterealtime.org/takurafelix, room name: open_chat, sender: takurafelix@igniterealtime.org/phone.QhtUl-Z6KZoNsQ |
| 08:31:43 | node 2 | socket_c2s_ssl-thread-41 | org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Stanza recipient: open_chat@conference.igniterealtime.org/takurafelix, room name: open_chat, sender: takurafelix@igniterealtime.org/phone.QhtUl-Z6KZoNsQ |
| 08:38:06 | node 2 | socket_c2s_ssl-thread-43 | org.jivesoftware.openfire.muc.spi.MultiUserChatServiceImpl - Stanza recipient: open_chat@conference.igniterealtime.org/takurafelix, room name: open_chat, sender: takurafelix@igniterealtime.org/phone.QhtUl-Z6KZoNsQ |
| 08:48:27 | node 1 | socket_c2s-thread-3 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Obtaining lock for room 'open_chat' of service 'conference' |
| 08:56:29 | node 1 | TaskEngine-pool-789 | org.jivesoftware.openfire.muc.spi.LocalMUCRoomManager - Removing room 'mrtodo_897aa7_aa1' of service 'conference' |

Related partial stack traces for each thread that is waiting on the lock held by socket_c2s_ssl-thread-5 follow, starting with the thread that owns the lock being waited on by the other threads.

Also, on the second node, this code seems to be waiting on possibly the same lock:

On the first node, these threads seem to be waiting on possibly the same lock:

Environment

None

Attachments: 2

Activity

Guus der Kinderen December 2, 2021 at 5:40 PM

Cautiously declaring victory.

Daryl Herzmann November 30, 2021 at 5:23 PM

Igniterealtime’s Openfire is now dogfooding a change referencing this ticket to see if the deadlocks are gone.

Guus der Kinderen November 30, 2021 at 4:11 PM

One thing that’s a potential source of problems: when a stanza gets ‘bounced’ by the S2S implementation, the bounce is processed inline with the original code (meaning that all locks held by the S2S processing remain in place). This arguably isn’t correct, as processing of the bounce shouldn’t need to happen under the S2S lock (unlike the original stanzas that they are a bounce for, the bounces themselves are not really ‘outgoing’ stanzas).
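
To make the distinction concrete, here is a minimal sketch of the two shapes being described, in plain Java; domainPairLock, trySend and processBounce are hypothetical placeholders, not actual Openfire identifiers:

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class BounceHandlingSketch {
    // Stand-in for the per-DomainPair lock held during S2S processing.
    private final Lock domainPairLock = new ReentrantLock();

    // Problematic shape: the bounce is handled while the S2S lock is still held,
    // so anything the bounce handler does (e.g. acquiring a MUC room lock) nests under it.
    void deliverInline(String stanza) {
        domainPairLock.lock();
        try {
            if (!trySend(stanza)) {
                processBounce(stanza); // still under domainPairLock
            }
        } finally {
            domainPairLock.unlock();
        }
    }

    // Alternative shape: note the failure under the lock, but process the bounce only
    // after the S2S lock has been released.
    void deliverDeferred(String stanza) {
        final boolean failed;
        domainPairLock.lock();
        try {
            failed = !trySend(stanza);
        } finally {
            domainPairLock.unlock();
        }
        if (failed) {
            processBounce(stanza); // no S2S lock held here
        }
    }

    private boolean trySend(String stanza) { return false; }          // placeholder for the S2S delivery attempt
    private void processBounce(String stanza) { /* placeholder for bounce/error handling */ }
}
```

Whether the actual change takes exactly this shape is not stated here; the point is only that the bounce handler no longer runs while the S2S lock is held.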

I’m seeing the same mechanism in a new deadlock, described on .

Guus der Kinderen November 30, 2021 at 1:58 PM

pool-5-thread-84 is busy removing a user from all chat rooms. It seems to do so in reaction to a 'bounced stanza'. That stanza was bounced by LocalOutgoingServerSession, as it could not process/deliver the original stanza. Processing of the bounced stanza is blocked in the routine that tries to obtain a lock on a room from which to remove the user.

The bounced stanza seems to be this one (I've formatted the stanza for readability):

pool-5-thread-84 seems to have already attempted to remove user 'guus.der.kinderen' from chat room 'smack':

pool-5-thread-84 next tries to remove the same user from room 'open_chat':

That's where pool-5-thread-84 gets 'stuck'. The thread remains at the location where it is attempting to obtain that lock, as shown in this stack trace:

The corresponding code snippet from MultiUserChatServiceImpl.removeChatUser:

It should be noted that pool-5-thread-84 holds a lock on DomainPair while attempting to remove the chat user.

Side-note: I can't immediately see a reason why the stanza that caused this bounce was processed by LocalOutgoingServerSession in the first place, as the addressee (guus.der.kinderen at ignite) is a local user. This should be the subject of further analysis.

The other thread, TaskEngine-pool-752, is broadcasting a chatroom leave presence, triggered by a session having been closed.
Processing is blocked in the routine that broadcasts the corresponding presence update to a remote domain.

Note that the addressing of the stanza in the last line seems off. The intended recipient is likely not yet modified to be 'holger' when this line is logged. This is probably fine, just a quirk of the way things are logged.

That's where TaskEngine-pool-752 gets 'stuck'. The thread remains at the location where it is attempting to obtain a lock for a DomainPair (presumably representing the remote domain of holger), as shown in this stack trace:

What I do not get: why does thread pool-5-thread-84 hold a lock on the DomainPair for Holger's server? Is that the original outgoing session that caused a bounce?

Guus der Kinderen November 29, 2021 at 9:20 AM

Lock objects themselves do not do anything. Calling the lock() method on them is where things start to get interesting. However, this action should be ‘re-entrant’, meaning that when a thread has already acquired the mutex on the lock, it can immediately do so (successfully) again.
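
As a generic illustration of that re-entrancy (plain java.util.concurrent, not Openfire's clustered lock implementation):

```java
import java.util.concurrent.locks.ReentrantLock;

public class ReentrancyDemo {
    public static void main(String[] args) {
        final ReentrantLock lock = new ReentrantLock();
        lock.lock();                // first acquisition succeeds
        try {
            lock.lock();            // same thread re-acquires immediately; hold count becomes 2
            try {
                System.out.println("hold count: " + lock.getHoldCount()); // prints 2
            } finally {
                lock.unlock();      // one unlock per lock() call
            }
        } finally {
            lock.unlock();
        }
    }
}
```

Because the same thread can acquire the same lock again without blocking, a single thread re-entering MUC code under a lock it already holds would not deadlock by itself; the scenario below needs two threads and two locks.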

I do not think this is a problem. The problem here is that we have two flows that are at odds with each other:

  • One flow first acquires lock A, does some stuff, and then wants to acquire lock B (while still holding lock A).

  • Another flow first acquires lock B, then does some other stuff, and then wants to acquire lock A (while still holding lock B).

This results in a situation where both flows (threads) are waiting indefinitely on each other.

I do not expect that this is caused by a bug in the locking mechanism itself. Instead, I suspect a bug in how we use the locks. Ideally, we should prevent a scenario where two different locks can be obtained by different flows in different orders.
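
For completeness, a minimal self-contained sketch of that scenario in plain Java; lockA and lockB are generic placeholders (think of the DomainPair lock and the MUC room lock), not Openfire classes:

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderingDeadlock {
    private static final Lock lockA = new ReentrantLock(); // e.g. a DomainPair lock
    private static final Lock lockB = new ReentrantLock(); // e.g. a MUC room lock

    public static void main(String[] args) {
        final Thread flowOne = new Thread(() -> {
            lockA.lock();                          // flow 1: acquires A first ...
            try {
                pause();                           // "does some stuff"
                lockB.lock();                      // ... then wants B while still holding A
                try { /* work */ } finally { lockB.unlock(); }
            } finally {
                lockA.unlock();
            }
        }, "flow-1");

        final Thread flowTwo = new Thread(() -> {
            lockB.lock();                          // flow 2: acquires B first ...
            try {
                pause();                           // "does some other stuff"
                lockA.lock();                      // ... then wants A while still holding B
                try { /* work */ } finally { lockA.unlock(); }
            } finally {
                lockB.unlock();
            }
        }, "flow-2");

        flowOne.start();
        flowTwo.start();                           // both threads now wait on each other indefinitely
    }

    private static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```

The usual remedy is the one suggested above: agree on a single global acquisition order (for example, always the room lock before the DomainPair lock, or the reverse), so that no two flows can ever hold the two locks in opposite orders.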

Details

Resolution: Fixed

Created November 24, 2021 at 12:49 PM
Updated March 10, 2025 at 3:57 PM
Resolved December 2, 2021 at 5:40 PM