tracker issue : CF-4204518

Title:

Deadlock arising from CFC function requests involving distributed cache and cachePut, cacheGet and cacheRemove calls


Status/Resolution/Reason: Closed/Withdrawn/Duplicate

Reporter/Name(from Bugbase): A. B. / ()

Created: 06/17/2019

Components: Caching, Distributed Caching

Versions: 2018

Failure Type: Crash

Found In Build/Fixed In Build: 2018.0.03.314033 /

Priority/Frequency: Normal / Some users will encounter

Locale/System: English / Win 2016

Vote Count: 0

Description of problem:
Our application is mostly based on CFC function requests. These are HTTP requests that invoke functions of the CFCs deployed on the instances of our ColdFusion application. In the Administrator of each ColdFusion instance we have set the value of "Maximum Number of Simultaneous CFC Function Requests" to 80.

The problem is that deadlock often occurs when the number of cachePut, cacheGet and cacheRemove calls exceeds 80. See the attached thread-dump.

You will see from the thread-dump that the total number of HTTP calls to the component Caching.cfc is exactly 80, which is the value of the setting "Maximum Number of Simultaneous CFC Function Requests".

This setting is in essence a first-in-first-out (FIFO) wait queue. ColdFusion implements it by means of the java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject class. Apparently, when the number of cachePut, cacheGet and cacheRemove calls exceeds 80, deadlock ensues.
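
The starvation described above can be pictured in miniature outside ColdFusion. The following is a minimal Java sketch, not ColdFusion's actual implementation. It assumes the request limit behaves like a fair (FIFO) semaphore with N permits, and it uses a latch that stays closed as a stand-in for a cache call that never returns. The class name RequestLimitDemo and its method are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; ColdFusion's real request limiter is not shown here.
final class RequestLimitDemo {

    /**
     * Starts `limit` workers that each take a permit and then block on a
     * stand-in for a stalled cache call. Returns true if one extra request
     * could still obtain a permit within waitMillis, false if it was starved.
     */
    static boolean extraRequestAdmitted(int limit, long waitMillis) {
        Semaphore permits = new Semaphore(limit, true);       // fair = FIFO wait queue
        CountDownLatch stalledCache = new CountDownLatch(1);  // stays closed while we measure
        CountDownLatch allHolding = new CountDownLatch(limit);
        ExecutorService pool = Executors.newFixedThreadPool(limit);
        try {
            for (int i = 0; i < limit; i++) {
                pool.submit(() -> {
                    try {
                        permits.acquire();
                        allHolding.countDown();
                        try {
                            stalledCache.await();  // stand-in for a cache call that never returns
                        } finally {
                            permits.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allHolding.await();  // every permit is now held by a blocked worker
            boolean admitted = permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
            if (admitted) {
                permits.release();
            }
            return admitted;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        } finally {
            stalledCache.countDown();  // unblock the workers so the demo can exit
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("extra request admitted: " + extraRequestAdmitted(80, 200));
    }
}
```

With all 80 permits held by blocked workers, the extra request is not admitted, which mirrors the queued requests seen in the thread-dump.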

This is happening despite the fact that net.sf.ehcache.Cache is thread-safe. Another consequence of the deadlock is that the requests run past any value you set for the application's request-timeout (120 seconds in our case).

Steps to Reproduce:
1) Create several instances of ColdFusion Enterprise 2018 Update 3 on each of two or more 64-Bit Windows Server 2016 machines. Each machine has 100GB RAM or more. 
2) Use a load-balancer to manage requests between the machines.
3) Use Java SE 11.0.3 (LTS), the build published by Adobe. For example, ensure that the instances use, in total, less than 80% of RAM, and configure the Java Virtual Machine of each instance with -Xms8192m -Xmx8192m -XX:+UseG1GC.
4) Configure each instance to use Ehcache 2.10.6 together with Terracotta 4.3.6. (Our ehcache.xml and terracotta-server.log files are attached.)
5) Configure the instances for distributed cache. That is, configure them to share the same cacheRegions. 
6) In the Administrator of each ColdFusion instance, set the value of "Maximum Number of Simultaneous CFC Function Requests" to 50, say. 
7) The instances should share an identical test application. The application will consist of CFCs, whose functions are invoked by means of HTTP. In particular, the requests in the test will involve a combination of functions containing the calls cachePut, cacheGet and cacheRemove.
8) Test by sending a high number of CFC function requests to the (load-balanced) application. Suppose the total number of ColdFusion instances in your test environment is T. Suppose also that your setting for "Maximum Number of Simultaneous CFC Function Requests" is 50. Then test with a load of 50*T or more simultaneous requests.
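
The load in step 8 can be driven by any tool that releases many requests at the same moment. As a rough Java sketch under stated assumptions: the class LoadSketch and its methods are hypothetical, and the Runnable passed in stands in for one HTTP call to a CFC function on the load-balanced application (the actual URL and HTTP client are left to the tester).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical load driver; the request itself is supplied by the caller.
final class LoadSketch {

    /** Minimum load from step 8: the per-instance limit times the number of instances T. */
    static int minimumLoad(int perInstanceLimit, int instances) {
        return Math.multiplyExact(perInstanceLimit, instances);
    }

    /**
     * Releases `load` copies of `request` at the same moment and returns
     * how many of them completed within timeoutSeconds.
     */
    static int fireSimultaneously(int load, Runnable request, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(load);
        CountDownLatch ready = new CountDownLatch(load);  // all threads are parked
        CountDownLatch go = new CountDownLatch(1);        // single start signal
        AtomicInteger completed = new AtomicInteger();
        try {
            for (int i = 0; i < load; i++) {
                pool.submit(() -> {
                    ready.countDown();
                    try {
                        go.await();
                        request.run();  // stand-in for one CFC function request
                        completed.incrementAndGet();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            ready.await();
            go.countDown();  // release every request at once
            pool.shutdown();
            pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        int load = minimumLoad(50, 4);  // e.g. limit 50 on each of T = 4 instances
        System.out.println(fireSimultaneously(load, () -> { }, 10) + " of " + load + " completed");
    }
}
```

If the deadlock occurs, the returned count stays below the load, because the blocked requests never complete within the timeout.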

Actual Result:
Deadlock.
ColdFusion holds the N request threads in a perpetual running state, where N is the setting for "Maximum Number of Simultaneous CFC Function Requests". Every subsequent CFC function request is queued. See the FusionReactor screenshot.

Expected Result:
No deadlock.
No requests exceeding the request-timeout limit.

Any Workarounds:
None known

Attachments:

Comments:

I have corresponded often about this issue with the ColdFusion Team. My gratitude goes to Deepraj and Sandip for their valuable suggestions. Here then is the latest update.

I did in the meantime drop a few questions in the Terracotta-Ehcache forum. The ensuing discussions ended in what we now consider a possible solution. The solution involves the 4 settings nonstop, rejoin, l2.l1reconnect.enabled and l2.l1reconnect.timeout.millis. They are to be implemented thus:

In the client's ehcache.xml:

<cache>
    <terracotta clustered="true">
        <nonstop enabled="true"/>
    </terracotta>
</cache>
<terracottaConfig rejoin="true" />

In Terracotta server's tc-config.xml:

<tc-properties>
    <property name="l2.l1reconnect.enabled" value="true" />
    <property name="l2.l1reconnect.timeout.millis" value="60000" />
</tc-properties>

With the likely possibility of extending the nonstop setting to

<nonstop immediateTimeout="false" timeoutMillis="60000">
    <timeoutBehavior type="exception" />
</nonstop>

It is a relief that this is more or less the same solution that the ColdFusion Team eventually found. Nevertheless, a ColdFusion issue remains: the deadlock starts because the requests continue running beyond the application’s request-timeout (set in the ColdFusion Administrator).
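
The intent behind a nonstop timeout with an exception behavior is that a stalled cache operation fails after a bounded wait, so it cannot hold a request thread indefinitely. The following Java sketch only illustrates that idea with standard library classes. It is not how Ehcache implements nonstop, and the names NonStopSketch, callWithTimeout and OperationTimedOut are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration of "time out and throw"; not Ehcache code.
final class NonStopSketch {

    /** Unchecked signal that the wrapped operation did not finish in time. */
    static final class OperationTimedOut extends RuntimeException {
        OperationTimedOut(long millis) {
            super("cache operation exceeded " + millis + " ms");
        }
    }

    /**
     * Runs the operation on a helper thread and waits at most timeoutMillis.
     * On timeout the helper is interrupted and OperationTimedOut is thrown,
     * so the calling (request) thread is released.
     */
    static <T> T callWithTimeout(Callable<T> cacheOperation, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<T> result = worker.submit(cacheOperation);
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new OperationTimedOut(timeoutMillis);
        } catch (ExecutionException e) {
            throw new IllegalStateException("cache operation failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            worker.shutdownNow();  // interrupts the helper if it is still blocked
        }
    }

    public static void main(String[] args) {
        System.out.println(callWithTimeout(() -> "value", 1000));
    }
}
```

A caller would catch the timeout exception and decide whether to retry or to fail the request, instead of leaving the request thread blocked.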
Comment by A. B.
31063 | July 23, 2019 02:01:07 PM GMT
As there is a workaround to prevent deadlock, this report is now about requests exceeding the application's request-time-out. Please see https://tracker.adobe.com/#/view/CF-4204976
Comment by A. B.
31110 | August 07, 2019 03:20:26 PM GMT
A. B., I'm closing this since you've logged CF-4204976. Not sure why we needed a new bug, though.
Comment by Piyush K.
31766 | November 06, 2019 01:08:23 PM GMT