Peer closing stream leads to timeout
Description
Environment
Attachments
Activity
Guus der Kinderen November 9, 2023 at 9:01 PM
The discussion with Netty community members on Discord took place here:
They concluded with the text below. Although I do believe that this is a Smack bug, the suggested approach might offer some robustness. I’m not sure how easy it would be to implement.
<quote starts below>
I don't think I would really call this a client bug. So here is my theory:
Client sends stream end message
Server processes message
Server sends stream end message
Server queues message somewhere / makes some syscalls / whatever
The ChannelFuture is "done" (Note: this doesn't mean that the TCP/IP stack actually sent anything yet, or that the client actually received the message, or that it already processed it)
Netty executes the listener and closes the connection
The close stuff is also queued somewhere (I don't know enough about TCP/IP stacks xD)
Both segments hit the TCP/IP stack of the client and I guess it prioritizes the connection termination? Dunno. Either way, it never ACKs the other segment and never forwards it to your client application.
How I would address this while still letting the client close the connection:
Client sends stream end message
Server processes message
Server sends stream end message
Server changes the state of the connection; future packets from the client will be ignored from now on.
Client receives and processes server end message
Client closes the connection
Server detects that the connection is closed; everything is fine now, and it executes whatever connection-closed logic it has (or does nothing special, depending on your use case)
If the client for some reason fails to close the connection after the stream end messages are exchanged, the state of the connection should ensure that the server hits a read timeout sooner or later (make sure to place the timeout handler after your decoding logic).
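<quote ends above>

For illustration, a minimal Netty sketch of the suggested server-side flow might look like the one below. This is not Openfire’s actual code: the handler, the placeholder stream-end check, and the assumption that a suitable XML decoder and an outbound String encoder exist in the pipeline are all hypothetical.

import io.netty.channel.ChannelHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelPipeline;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.util.ReferenceCountUtil;

public class StreamCloseHandler extends ChannelInboundHandlerAdapter {

    private boolean closing = false;

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        if (closing) {
            // Stream end already exchanged: ignore anything else the client sends.
            ReferenceCountUtil.release(msg);
            return;
        }
        if (isStreamEnd(msg)) {
            closing = true;
            // Answer with our own closing element, but do NOT close the channel here.
            // The client is expected to close the TCP connection once it has processed
            // this element. (Assumes an outbound encoder for Strings is in the pipeline.)
            ctx.writeAndFlush("</stream:stream>");
            return;
        }
        super.channelRead(ctx, msg);
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        // Client closed the connection: run whatever connection-closed logic applies.
        super.channelInactive(ctx);
    }

    // Placeholder check; real code would inspect the decoded XML element.
    private static boolean isStreamEnd(Object msg) {
        return msg instanceof String && ((String) msg).contains("</stream:stream>");
    }

    // Pipeline wiring: the read timeout sits after the decoding logic, as the quote
    // suggests, so the server eventually times out if the client never closes.
    static void configure(ChannelPipeline pipeline, ChannelHandler xmlDecoder) {
        pipeline.addLast("decoder", xmlDecoder);                     // the server's own decoding logic
        pipeline.addLast("readTimeout", new ReadTimeoutHandler(30)); // seconds
        pipeline.addLast("streamClose", new StreamCloseHandler());
    }
}

The key point matches the advice above: the server never closes the socket itself after answering the stream end; it relies on the client closing, with the ReadTimeoutHandler as a backstop.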

Dan Caseley November 9, 2023 at 7:00 PM

Dan Caseley November 9, 2023 at 5:25 PM
There was one timeout visible when running Smack with
-Dsinttest.securityMode=required
-Dsinttest.enabledConnections=tcp
-Dsinttest.enabledTests=ModularXmppClientToServerConnectionLowLevelIntegrationTest
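These are JVM system properties for Smack’s integration test framework (sinttest) and are appended to whatever command launches the tests. Purely as an assumed example (the Gradle task name and any further required properties, such as the target service, depend on the actual setup):

./gradlew integrationTest \
  -Dsinttest.securityMode=required \
  -Dsinttest.enabledConnections=tcp \
  -Dsinttest.enabledTests=ModularXmppClientToServerConnectionLowLevelIntegrationTest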
Although that test passes, the output reports that the Smack client never received the </stream:stream> from the server.
Openfire trace logging shows this (amongst some boilerplate creation and deletion of unused users and connections):
I was initially suspicious of this:
since the end-of-stream can’t happen on a session without Stream Management (SM), but Guus rightly pointed out that SM is ended in the first line, so by the time the second line occurs the message is correct (albeit slightly suboptimal text).
Worse, this timeout happened on most runs, but not all of them.
On digging further, we spotted some risky async operations with Futures that could lead to exactly this behaviour, where a connection could be closed before the stream closure XML had been emitted over the network.
We fixed that in and deployed it to the internet-facing server referenced in the above logs (netty03.goodbytes.im), but the issue continued on most attempts (although, curiously, never on the first attempt after a restart).
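In simplified, hypothetical form (not the actual Openfire code), the risky pattern and the fix look roughly like this:

import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;

class StreamCloseSketch {

    // Risky: close() can race ahead of the write, so the closing element may
    // never be flushed to the socket before the connection is torn down.
    static void riskyClose(Channel channel) {
        channel.writeAndFlush("</stream:stream>");
        channel.close();
    }

    // Safer: only close once the write has completed.
    static void closeAfterWrite(Channel channel) {
        channel.writeAndFlush("</stream:stream>")
               .addListener(ChannelFutureListener.CLOSE);
    }
}

Note that even the second form only guarantees that the bytes were handed to the local TCP/IP stack (see the quoted Netty comment above), which is consistent with the failure persisting after the fix.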
After consulting with the lovely folks in Netty’s Discord community, we used Wireshark on the local machine running Spark. This required importing the private key of netty03 into Wireshark and disabling TLS 1.3 and all ECDHE ciphers, so that the captured traffic could be decrypted. This proved that the network adapter of the machine running Spark received the stream close packet, despite Smack reporting it as missing. Comparing captures of working and failing runs showed a small timing difference between the stream close and the server notifying the client of an impending closure (approx. 0.7 ms in the failing case, ~1.5 ms in the passing case), as well as a graceful two-way FIN/ACK conversation in the passing case.
We tested adding a 50 ms delay between the send and the close on the server side, which reliably solved the problem. Reducing this to a mere 1 ms was also successful.
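The experiment amounted to something along these lines (again a hypothetical sketch, not the actual patch):

import java.util.concurrent.TimeUnit;
import io.netty.channel.Channel;

class DelayedCloseSketch {
    static void sendThenCloseLater(Channel channel) {
        channel.writeAndFlush("</stream:stream>");
        // Give the peer a moment to read the closing element before tearing the
        // connection down; 1-50 ms was enough in the experiments described above.
        channel.eventLoop().schedule(() -> { channel.close(); }, 50, TimeUnit.MILLISECONDS);
    }
}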
Having considered this change further, we proved that the standard TCP connection in waitForClosingStreamElementTest doesn’t have the same problems as the ModularXmppClientToServerConnection in testDisconnectAfterConnect. We have decided that making further changes to Openfire, which would add to the maintenance, configuration and documentation burden in order to solve what is likely a bug in a non-standard configuration of the Smack client, isn’t useful. Instead, we’ll open a bug in the Smack repo.

Dan Caseley October 20, 2023 at 5:47 PM
I’ve attached a log file from the destination Openfire and its accompanying Smack log.
The wait happens after 16:47:13 in Openfire.
The error reads:
Exception while waiting for closing stream element from the server ModularXmppClientToServerConnection[sinttest-ez3tk-22@example.org/3c56de17-849f-4147-a111-b00660e06f9f]
Whilst Openfire reports:
Closing session with address sinttest-ez3tk-22@example.org/3c56de17-849f-4147-a111-b00660e06f9f and streamID 2b817gdydm does not have SM enabled.
One side is mistaken about streams…

Dan Caseley October 20, 2023 at 5:39 PM
I can’t reproduce it when running the tests that Openfire uses.
I can’t reproduce it when running any of the individual tests that Openfire excludes.
I can reproduce it when running all of them.
This issue affects a version of Openfire that is as yet unreleased. It should not affect anyone running a proper release of Openfire. It is introduced by .
Since updating to Netty, the Smack Integration Tests, which are executed as part of Openfire’s CI pipeline, are taking a lot longer to complete.
Upon inspection, it appears that Smack is timing out while waiting for a reaction to it closing the stream on its end. Openfire probably neglects to send a </stream:stream> back. This would break the XMPP specification.
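For reference, the closing handshake that RFC 6120 expects when one party ends the stream looks roughly like this:

C: </stream:stream>
S: </stream:stream>
(either party may then terminate the underlying TCP connection)

The timeout suggests that, from Smack’s point of view, the second step never arrives.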