LocalOutgoingServer's canProcess error handling introduces deadlock

Description

IgniteRealtime (running 5.0.0-SNAPSHOT) was in a deadlock-like state. No actual deadlock was reported: locks held across the cluster contributed to the erroneous state, but such a state is not reported as a deadlock.

This was logged on one of the cluster nodes:

What is confusing is that an outgoing server session is routing a stanza to a local entity (in this case, a MUC room).

Judging from the line numbers in the stack trace, this stanza must have been an error stanza triggered by LocalSession#canProcess returning false.

A few things are notable here:

  • The else block of the canProcess() check assumes that a false value is always caused by a Privacy List (XEP-0016) condition. That does not hold for every implementation of canProcess().

  • The LocalSession in play here must have been a LocalOutgoingServerSession, as it is triggered by OutgoingSessionPromise. The canProcess() implementation in this class already sends out its own error. It appears that a second error is then sent by LocalSession.

  • In LocalOutgoingServerSession’s implementation, we previously found that sending the error asynchronously helps to prevent deadlocks (OF-2341, OF-2342). The ‘second’ error that LocalSession sends is not sent asynchronously (and seems to trigger a deadlock).
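The asynchronous approach mentioned in the last point can be sketched roughly as follows. This is a minimal illustration, not Openfire's actual code: the class `AsyncErrorSketch`, the method `sendErrorAsync` and the `router` callback are hypothetical names chosen for the example, and the stanza is a plain string. The idea is only that the thread that detects the problem hands the error to another thread instead of routing it itself while it may still hold locks.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names exist in Openfire.
final class AsyncErrorSketch {

    /**
     * Hands the error stanza to the given executor and returns immediately.
     * The router callback runs on an executor thread, so the calling thread
     * does not route the error itself.
     */
    static CompletableFuture<Void> sendErrorAsync(ExecutorService executor,
                                                  Consumer<String> router,
                                                  String errorStanza) {
        return CompletableFuture.runAsync(() -> router.accept(errorStanza), executor);
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            String callerThread = Thread.currentThread().getName();
            // join() is used here only so the demo waits for the output.
            sendErrorAsync(executor,
                    stanza -> System.out.println("Routed " + stanza
                            + " on another thread: "
                            + !Thread.currentThread().getName().equals(callerThread)),
                    "<message type='error'/>").join();
        } finally {
            executor.shutdown();
        }
    }
}
```

In real code the caller would typically not wait on the returned future, since waiting while holding a lock would reintroduce the blocking that the asynchronous dispatch is meant to avoid.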

The processing of the return value of canProcess should be modified. The resulting error must be consistent with the type of problem (e.g. Privacy Lists require strict processing, different from connectivity issues), and it must not be duplicated.
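One possible shape for such a change is sketched below, under the assumption that canProcess could report a reason instead of a bare boolean. Everything here is hypothetical and does not reflect the actual fix or Openfire's API: the `Outcome` enum, the simplified `Session` interface, and the `process` and `fixedOutcome` helpers exist only for illustration. The caller picks the error from the reason, and sends nothing when the session has already sent its own error.

```java
// Hypothetical sketch: these types are not part of Openfire.
final class CanProcessSketch {

    /** Hypothetical replacement for the boolean that canProcess() returns. */
    enum Outcome {
        ACCEPTED,
        BLOCKED_BY_PRIVACY_LIST,
        REJECTED_ERROR_ALREADY_SENT
    }

    /** Simplified stand-in for a session, with stanzas modelled as strings. */
    interface Session {
        Outcome canProcess(String stanza);
        void deliver(String stanza);
    }

    /** Returns a label describing what the caller does for each outcome. */
    static String process(Session session, String stanza) {
        switch (session.canProcess(stanza)) {
            case ACCEPTED:
                session.deliver(stanza);
                return "delivered";
            case BLOCKED_BY_PRIVACY_LIST:
                // Only this case gets the Privacy List (XEP-0016) handling.
                return "privacy-list error";
            case REJECTED_ERROR_ALREADY_SENT:
                // The session already sent its own error: do not send a second one.
                return "no further error";
            default:
                throw new IllegalStateException("Unhandled outcome");
        }
    }

    /** Test helper: a session whose canProcess() always yields the given outcome. */
    static Session fixedOutcome(Outcome outcome) {
        return new Session() {
            @Override
            public Outcome canProcess(String stanza) {
                return outcome;
            }

            @Override
            public void deliver(String stanza) {
                // Delivery is a no-op in this sketch.
            }
        };
    }

    public static void main(String[] args) {
        for (Outcome outcome : Outcome.values()) {
            System.out.println(outcome + " -> " + process(fixedOutcome(outcome), "<message/>"));
        }
    }
}
```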

Environment

None

Fixed

Created March 10, 2025 at 3:57 PM
Updated March 10, 2025 at 7:14 PM
Resolved March 10, 2025 at 7:14 PM