Title:
ColdFusion uses unsynchronized WeakHashMap in Remote Method Invocation during cache replication. This occasionally leads to infinite looping, hence 100% CPU usage.
Status/Resolution/Reason: To Fix//Investigate
Reporter/Name(from Bugbase): A. Bakia / ()
Created: 06/13/2018
Components: Caching, General, Performance
Versions: 11.0
Failure Type: Crash
Found In Build/Fixed In Build: 11,0,14,307976 /
Priority/Frequency: Normal / All users will encounter
Locale/System: English / Win 2012 Server x64
Vote Count: 0
Problem Description:
Our system is made up of 3 Windows servers. Each server runs its own ColdFusion 11 application server, comprising 13 ColdFusion instances. Our sites get thousands of unique visitors daily. A load-balancer manages the load between the three servers.
We also use ColdFusion's built-in Ehcache for cache management between the ColdFusion instances. ColdFusion implements Ehcache replication internally with Java Remote Method Invocation (RMI) for the point-to-point messaging between the cache nodes.
Occasionally (about every other week), the RMI threads in some of the instances loop indefinitely. The result is invariably 100% CPU usage, leading to a server crash. The only solution is then to restart the affected instances, which can be one, two, or more.
When we do a thread-dump analysis during each high-CPU incident, we find that the cause is always RMI TCP Connection threads.
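The check we do by hand on each thread dump can be sketched in Java as below. This is only an illustration, not part of ColdFusion: the class and method names (RmiLoopDetector, isSuspect, findSuspects) are hypothetical, and the code would have to run inside the affected JVM (for example from a diagnostic page) to see its threads. It flags runnable threads whose name starts with "RMI TCP Connection" and whose stack contains a java.util.WeakHashMap frame.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: lists threads that match the pattern seen in our dumps.
class RmiLoopDetector {

    // True if the thread is an RMI connection thread currently inside WeakHashMap.
    public static boolean isSuspect(String threadName, StackTraceElement[] stack) {
        if (!threadName.startsWith("RMI TCP Connection")) {
            return false;
        }
        for (StackTraceElement frame : stack) {
            if ("java.util.WeakHashMap".equals(frame.getClassName())) {
                return true;
            }
        }
        return false;
    }

    // Scans all live threads of the current JVM and returns the suspect names.
    public static List<String> findSuspects() {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            // A looping thread burns CPU, so it shows up as RUNNABLE.
            if (t.getState() == Thread.State.RUNNABLE && isSuspect(t.getName(), e.getValue())) {
                names.add(t.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Suspect threads: " + findSuspects());
    }
}
```

A thread that is listed in several consecutive scans, always inside WeakHashMap, is a likely candidate for the loop; a single hit may just be a normal, short-lived call.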
Steps to Reproduce:
1) Setup: ColdFusion 11 Enterprise on 3 Windows servers, each running at least 3 ColdFusion instances.
2) Use a load-balancer between the servers and configure the instances for Ehcache Replicated Caching using RMI. That is, the instances may share cached data.
3) Test the system under high load (thousands of unique users per day), with an application that reads, updates, stores and deletes large amounts of cache data.
Actual Result:
The server's CPU usage will occasionally rise to 100%. The cause is apparently the RMI TCP Connection threads involved in cache replication. See stack traces attached.
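The RMI TCP Connection threads invoke the method java.util.WeakHashMap.put(WeakHashMap.java:453), which sooner or later results in an infinite loop. WeakHashMap is not synchronized, so concurrent put() calls from several threads can corrupt its internal entry chain, and a thread traversing the corrupted chain never terminates. We noticed that the looping threads always originate from at least two different servers. See the thread-dump analysis attached (it was performed at http://fastthread.io/).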
In fact, this same issue has been identified and confirmed elsewhere. Examples are:
http://adambien.blog/roller/abien/entry/endless_loops_in_unsychronized_weakhashmap
https://access.redhat.com/solutions/55161
https://issues.jenkins-ci.org/browse/JENKINS-47725
https://issues.jenkins-ci.org/browse/JENKINS-41797
https://jira.pentaho.com/browse/PDI-14882
Expected Result:
Ehcache Replicated Caching using RMI does not result in infinite calls to java.util.WeakHashMap.put(WeakHashMap.java:453) and to 100% CPU usage.
Any Workarounds:
None known. For the time being, we just restart the affected instances.
Nevertheless, I would like to share with the ColdFusion Team two main ideas for a solution, which I came across in my research:
(1) replace the unsynchronized WeakHashMap with a synchronized Java Collections alternative;
or
(2) wrap object values in WeakReferences before inserting them into a thread-safe map, as in: myMap.put(key, new WeakReference(value)), and then unwrap them on each get() call.
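A minimal Java sketch of both ideas follows. It is not ColdFusion's code: the names WeakMapSketch, synchronizedWeakMap and WeakValueMap are hypothetical, and the use of ConcurrentHashMap as the backing map for idea (2) is my assumption. Note that idea (2) changes the semantics from weakly held keys to weakly held values, so it only fits if that is acceptable for the affected map.

```java
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration of the two ideas; names are not from ColdFusion.
class WeakMapSketch {

    // Idea (1): keep weak-key semantics, but guard every call with one lock.
    public static <K, V> Map<K, V> synchronizedWeakMap() {
        return Collections.synchronizedMap(new WeakHashMap<K, V>());
    }

    // Idea (2): thread-safe map (assumed: ConcurrentHashMap) with weakly held values.
    public static final class WeakValueMap<K, V> {
        private final ConcurrentMap<K, WeakReference<V>> map = new ConcurrentHashMap<>();

        public void put(K key, V value) {
            map.put(key, new WeakReference<>(value));
        }

        public V get(K key) {
            WeakReference<V> ref = map.get(key);
            if (ref == null) {
                return null;
            }
            V value = ref.get();
            if (value == null) {
                // The value was garbage-collected; drop the stale entry.
                map.remove(key, ref);
            }
            return value;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final Map<Integer, String> shared = synchronizedWeakMap();
        Thread[] writers = new Thread[4];
        for (int t = 0; t < writers.length; t++) {
            final int offset = t * 10_000;
            // Several threads write concurrently, as the RMI threads do.
            writers[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    shared.put(offset + i, "v");
                }
            });
            writers[t].start();
        }
        for (Thread w : writers) {
            w.join();
        }
        System.out.println("all writers finished without hanging");
    }
}
```

Idea (1) is the smaller change, at the cost of serializing all access to the map. With the synchronized wrapper, iteration over the map still has to be wrapped in a synchronized block on the map itself.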
Further references:
https://github.com/jenkinsci/workflow-api-plugin/compare/1d1c6833fe1a...76f501eb6880
https://github.com/jenkinsci/workflow-api-plugin/commit/76f501eb688095d032bc5b9e4c9a55d0aa6b4bdd
https://github.com/jenkinsci/script-security-plugin/compare/7d56ac842117...598d7ef3040b
https://github.com/pentaho/pdi-osgi-bridge/pull/38
Attachments:
Comments: