getCachedSize() of Cacheable returns incorrect value (causing OOMs!)
Daniel Haigh (StaticVortex) reports the cause of a memory-leak-like problem that is causing outages in his environment. In summary, the outage is caused by an incorrect calculation in Roster#getCachedSize(), which causes the Roster cache not to evict instances when it should.
We must review all implementations of Cacheable#getCachedSize() and make sure that correct sizes are returned.
More details in this discussion: http://www.igniterealtime.org/community/message/199961#199961
Modified the calculation of cache sizes for Roster and RosterItem.
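To illustrate the kind of fix involved: a correct getCachedSize() must account for every field the cached object retains. The sketch below is a simplified, self-contained approximation - the class name, field names, and size constants are illustrative assumptions, not the actual Openfire Roster/RosterItem code or its CacheSizes utility:

```java
import java.util.List;

// Illustrative stand-in for a cached roster entry; NOT the real RosterItem.
public class CachedRosterItem {
    private static final int OBJECT_OVERHEAD = 4;   // assumed per-object bookkeeping cost
    private static final int REF_SIZE = 4;          // assumed cost of an object reference

    private final String jid;
    private final String nickname;
    private final List<String> groups;

    public CachedRosterItem(String jid, String nickname, List<String> groups) {
        this.jid = jid;
        this.nickname = nickname;
        this.groups = groups;
    }

    // Rough estimate of a String's footprint: object overhead plus two bytes per char.
    private static int sizeOfString(String s) {
        return s == null ? 0 : OBJECT_OVERHEAD + 2 * s.length();
    }

    // The key point of the bug fix: sum the size of *every* retained field.
    // Under-counting here makes the cache believe it is smaller than it really is,
    // so eviction never triggers and memory grows until an OutOfMemoryError.
    public int getCachedSize() {
        int size = OBJECT_OVERHEAD;
        size += sizeOfString(jid);
        size += sizeOfString(nickname);
        size += OBJECT_OVERHEAD;                    // the list object itself
        for (String group : groups) {
            size += REF_SIZE + sizeOfString(group);
        }
        return size;
    }
}
```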
I would welcome changing the cache implementation. In the same effort, we should move Coherence out, and introduce an open source alternative. This will make the clustering solution available to everyone.
I would object to the use of WeakReference-based caches in Openfire. Although I like the general idea, and it does have advantages, it has a couple of serious drawbacks too:
The LRU-biased eviction behavior that comes with soft/weak reference-based caches is only a recommendation - there is no hard requirement in the VM specification. The javadoc for SoftReference states: "Virtual machine implementations are [...] encouraged to bias against clearing recently-created or recently-used soft references." The HotSpot FAQ states: "This behavior is not part of the VM specification, however, and is subject to change in future releases. Likewise the -XX:SoftRefLRUPolicyMSPerMB flag is not guaranteed to be present in any given release." We should not base critical functionality in Openfire on features that come with this kind of disclaimer.
When applying WeakReference-based caches, memory usage patterns will change dramatically - this will make monitoring the application harder and more confusing.
All caches will have similar eviction policies - cache-specific behavior will be hard to achieve.
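The first drawback above can be made concrete with a minimal sketch of a soft-reference-backed cache (a generic illustration, not Openfire's actual cache code): the application never decides when entries disappear - the garbage collector does, under its own implementation-specific memory-pressure heuristics.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Minimal soft-reference cache sketch. Eviction timing and ordering are
// entirely up to the garbage collector; the LRU bias mentioned in the
// SoftReference javadoc is a non-binding recommendation, not a guarantee.
public class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    // May return null at any time after the GC reclaims the value, even for
    // a recently used entry - cache-specific eviction policies (per-entry
    // lifetimes, per-cache size limits) cannot be expressed this way.
    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        return ref == null ? null : ref.get();
    }
}
```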
(I have logged a related e-mail discussion in )
Changing the cache implementation will be quite a lot of work, and will have considerable impact. I'm not comfortable doing that in an effort to fix this specific bug (which I consider a blocker for the 3.6.5 release). As I want to release 3.6.5 soon, I feel we should push back changing the caching implementation to a later version.
One should replace the caches with a state-of-the-art cache implementation, either from an existing cache provider or by using WeakReferences to make sure that the caches get cleared when more memory is needed.
Daniel provides a workaround too:
We corrected the issue in the meantime by adding the property cache.username2roster.maxLifetime and setting it to 419430400, which is 20 minutes (rather than its default of 6 hours). For anyone interested who may be having a similar issue now, this property can be added through the admin interface - no programming required.
I guess the other option would be to reduce cache.username2roster.size to about a quarter of the value you actually want, to account for the calculation issue explained above.
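The arithmetic behind the "quarter" workaround can be sketched as follows. The factor of four is an assumption inferred from the suggestion above, not a measured value: if getCachedSize() under-reports sizes by roughly 4x, the cache really holds about four times its configured byte limit, so dividing the configured limit by four restores the intended footprint.

```java
public class CacheSizeWorkaround {
    // Assumed under-reporting factor, implied by the "quarter" suggestion above.
    private static final long UNDER_REPORT_FACTOR = 4;

    // Returns the value to configure so that the cache's *actual* memory use
    // lands near desiredBytes despite the size-calculation bug.
    public static long adjustedMaxSize(long desiredBytes) {
        return desiredBytes / UNDER_REPORT_FACTOR;
    }

    public static void main(String[] args) {
        long desired = 10L * 1024 * 1024;  // want ~10 MB of real cache footprint
        System.out.println(adjustedMaxSize(desired));
    }
}
```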