tracker issue : CF-4202859

select a category, or use search below
(searches all categories and all time range)
Title:

ColdFusion uses unsynchronized WeakHashMap in Remote Method Invocation during cache replication. This occasionally leads to infinite looping, hence 100% CPU usage.

| View in Tracker

Status/Resolution/Reason: To Fix//Investigate

Reporter/Name(from Bugbase): A. Bakia / ()

Created: 06/13/2018

Components: Caching, General, Performance

Versions: 11.0

Failure Type: Crash

Found In Build/Fixed In Build: 11,0,14,307976 /

Priority/Frequency: Normal / All users will encounter

Locale/System: English / Win 2012 Server x64

Vote Count: 0

Problem Description: 
Our system is made up of 3 Windows servers. Each server runs its own ColdFusion 11 application server, comprising 13 ColdFusion instances. Our sites get thousands of unique visitors daily. A load-balancer manages the load between the three servers. 

We also use ColdFusion's in-built Ehcache for cache management between the ColdFusion instances. ColdFusion implements Ehcache by internally using Java Remote Method Invocation (RMI) for the point-to-point messaging between the cache nodes. 
 
Occasionally (every other week), the RMI threads in some of the instances loop indefinitely. The result is invariably 100% CPU usage, leading to a server crash. The only solutuion is then to restart the affected instances. this can be 1 instance, two or more. 

When we do a thread-dump analysis during each high-CPU incident, we find that the cause is always RMI TCP Connection threads. 

Steps to Reproduce:
1) Setup: ColdFusion 11 Enterprise on 3 Windows servers, each running at least 3 ColdFusion instances. 
2) Use a load-balancer between the servers and configure the instances for Ehcache Replicated Caching using RMI. That is, the instances may share cached data. 
3) Test the system with high load (thoudsands of unique users per day), on an application that involves the reading, updating, storing and deleting of large amounts of cache data.  

Actual Result:
The server's CPU usage will occasionally rise to 100%. The cause is apparently the RMI TCP Connection threads involved in cache replication. See stack traces attached.

The RMI TCP Connection threads invoke the method java.util.WeakHashMap.put(WeakHashMap.java:453), which sooner or later results in an infinite loop. We noticed that the looping threads always originate from at least two different servers. See the thread-dump analysis attached (It was performed at http://fastthread.io/).

In fact, this same issue has been identified and confirmed elsewhere. Examples are:

http://adambien.blog/roller/abien/entry/endless_loops_in_unsychronized_weakhashmap
https://access.redhat.com/solutions/55161
https://issues.jenkins-ci.org/browse/JENKINS-47725
https://issues.jenkins-ci.org/browse/JENKINS-41797
https://jira.pentaho.com/browse/PDI-14882

Expected Result:
Ehcache Replicated Caching using RMI does not result in infinite calls to java.util.WeakHashMap.put(WeakHashMap.java:453) and to 100% CPU usage.

Any Workarounds:
None known. For the time being, we just restart the affected instances.

Nevertheless, I would like to share with the ColdFusion Team 2 main ideas for a solution, which I came across in my research: 
(1) replace the WeakHashMap.put() call either with a synchronized Java Collections alternative;
or
(2)  wrap object values within WeakReferences before inserting, as in: myMap.put(key, new WeakReference(value)), and then unwrapping upon each get() call.

Further references:
https://github.com/jenkinsci/workflow-api-plugin/compare/1d1c6833fe1a...76f501eb6880
https://github.com/jenkinsci/workflow-api-plugin/commit/76f501eb688095d032bc5b9e4c9a55d0aa6b4bdd
https://github.com/jenkinsci/script-security-plugin/compare/7d56ac842117...598d7ef3040b
https://github.com/pentaho/pdi-osgi-bridge/pull/38

Attachments:

Comments:

Please read "solution" for "solutuion"
Comment by A. B.
29031 | June 13, 2018 03:30:19 PM GMT
Related bug reports: https://tracker.adobe.com/#/view/cf-4099820 https://tracker.adobe.com/#/view/CF-3295469
Comment by A. B.
29111 | June 20, 2018 02:57:03 PM GMT
A Bakia, don't see an entry point for CFML code in the stack trace you've shared. Is that something you can drill down to in the thread dump or CF logs or in you CFM application perhaps. That could help with simulating the CPU spike at my end. One of the bugs you've mentioned was fixed and released with update 11 of CF11.
Comment by Piyush K.
29133 | June 25, 2018 10:54:01 AM GMT
Hi Piyush Kumar, Thanks for your reply. To start with, the top of the stacktrace is: java.lang.Thread.State: RUNNABLE at java.util.WeakHashMap.put(WeakHashMap.java:453) at coldfusion.runtime.StructWrapper.readResolve(StructWrapper.java:52) ... The call that is looping indefinitely is java.util.WeakHashMap.put().
Comment by A. B.
29136 | June 26, 2018 03:33:52 PM GMT
To respond to the NeedMoreInfo status of this report: it is apparent that the call java.util.WeakHashMap.put() originates from coldfusion.runtime.StructWrapper. Please let me know if you require more information.
Comment by A. B.
29367 | July 22, 2018 04:07:44 PM GMT
A. Bakia, I am not sure if it is practical or a good idea to take up the fixes you are proposing based on your analysis and research of the issue you are facing on your environment. I've tried a simple test script where-in I an serializing and deserializing a struct and put it under load testing. 100 concurrent users for a duration of 15 mins on CF11 HF18. I did not observe any errors or slowed response. I took some thread dumps while the server was under load.. but did not find stack traces with signature similar to yours. I have attached test case I used for the simulation. May be you can provide some additional input on recreating the issue at my end. I understand your application set up is on CF11. In case you have the same or something similar on CF2016 or 2018, do we see the issue there as well?
Comment by Piyush K.
30741 | May 15, 2019 02:11:02 PM GMT
Hi Piyush K., Thank you for looking into this. Because of this issue, we months ago stopped using Ehcache Replicated Caching by means of RMI. We now use distributed caching by means of Terracotta. We also upgraded all our instances to the latest version of ColdFusion 2018. I am therefore unable to investigate the issue any further. Nevertheless, the root cause of the problems remains in ColdFusion. Recall that the offending stacktrace is: java.lang.Thread.State: RUNNABLE at java.util.WeakHashMap.put(WeakHashMap.java:453) at coldfusion.runtime.StructWrapper.readResolve(StructWrapper.java:52) ... The call that is looping indefinitely is java.util.WeakHashMap.put(). This call is not threadsafe.
Comment by A. B.
30786 | May 20, 2019 06:24:08 PM GMT
Hi Piyush, You and I have exchanged quite some correspondence on this. Things have moved on. In the past months, our company has deprecated ColdFusion 11. ColdFusion 2016, too, has been archived and the only version of ColdFusion running on our servers is ColdFusion 2018. Therefore I am unable to research this issue further. Nevertheless, thank you for staying with this and for your helpful suggestions.
Comment by A. B.
33271 | March 09, 2020 10:12:46 AM GMT
Hi Piyush, This is to update you. I am yet unable to reproduce the issue underlying CF-4205867 and CF-4202859. Nevertheless, from what I know so far, they are related: deadlock due to lack of thread-safety in HashMap. So you were justified in withdrawing CF-4205867 as a duplicate of CF-4202859. As I could not recreate our original test environment, I decided to write new test code. I have kept it simple: no distributed (Terracotta) cache to start with. Just the Ehcache in ColdFusion. In the test, one request uses a while-true loop to continually do a cachePut, then a cacheRemove of the same object. In the meantime, a second request, having a request-timeout of X seconds, attempts a cacheGet of the object. I continue to play with different combinations of variables in the hope of reproducing deadlock or 100% CPU. As soon as I get a result of interest, I shall let you know. Kind regards,
Comment by A. B.
33272 | March 13, 2020 04:27:22 PM GMT