Description of problem:
Our application is mostly based on CFC function requests. These are HTTP requests that invoke CFC functions in CFCs in the instances of our ColdFusion application. In the Administrator of each ColdFusion instance we have set the value of "Maximum Number of Simultaneous CFC Function Requests" to 80.
The problem is that deadlock often occurs when the number of simultaneous cachePut, cacheGet and cacheRemove calls exceeds 80. See the attached thread-dump.
You will see from the thread-dump that the total number of HTTP calls to the component Caching.cfc is exactly 80, which is the value of the setting "Maximum Number of Simultaneous CFC Function Requests".
This setting is in essence a first-in-first-out (FIFO) wait queue. ColdFusion implements it by means of the java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject class. Apparently, when the number of cachePut, cacheGet and cacheRemove functions exceeds 80, deadlock ensues.
This is happening irrespective of the value you set for the application's request-time-out (currently 120 seconds in our case).
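The starvation described above can be illustrated with a minimal Java sketch. It is not ColdFusion's actual implementation: a fair java.util.concurrent.Semaphore (itself built on AbstractQueuedSynchronizer) stands in for the "Maximum Number of Simultaneous CFC Function Requests" limit, and a CountDownLatch stands in for a cache call that waits on a reply that never arrives. The class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class RequestLimitSketch {

    /**
     * Returns true if one extra request obtains a slot within waitMillis
     * while stuckRequests requests sit inside a blocking cache call.
     */
    static boolean extraRequestGetsSlot(int limit, int stuckRequests, long waitMillis)
            throws InterruptedException {
        Semaphore slots = new Semaphore(limit, true);            // fair: waiters are served FIFO
        CountDownLatch cacheReply = new CountDownLatch(1);        // reply that does not arrive during the probe
        CountDownLatch allInside = new CountDownLatch(stuckRequests);
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, stuckRequests));
        try {
            for (int i = 0; i < stuckRequests; i++) {
                pool.execute(() -> {
                    try {
                        slots.acquire();                          // take one request slot
                        try {
                            allInside.countDown();
                            cacheReply.await();                   // simulated cacheGet/cachePut that hangs
                        } finally {
                            slots.release();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allInside.await();                                    // every stuck request now holds a slot
            boolean served = slots.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
            if (served) {
                slots.release();
            }
            return served;
        } finally {
            cacheReply.countDown();                               // unblock the simulated cache calls
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 80 slots, all 80 held by requests stuck in a cache call.
        System.out.println("extra request served: " + extraRequestGetsSlot(80, 80, 200));
    }
}
```

In this sketch, once every slot is held by a request that cannot finish, any further request waits in the queue for as long as the holders stay blocked, which matches the pattern seen in the thread-dump.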
Steps to Reproduce:
1) Create several instances of ColdFusion Enterprise 2018 Update 3 on each of two or more 64-Bit Windows Server 2016 machines. Each machine has 100GB RAM or more.
2) Use a load-balancer to manage requests between the machines.
3) The Java version to use is Java SE 11.0.3 (LTS), the one published by Adobe. For example, ensure that the instances together use less than 80% of RAM, and configure the Java Virtual Machine of each instance with -Xms8192m -Xmx8192m -XX:+UseG1GC.
4) Configure each instance to use Ehcache 2.10.6 together with Terracotta 4.3.6 (Our ehcache.xml and terracotta-server.log files are attached)
5) Configure the instances for distributed cache. That is, configure them to share the same cacheRegions.
6) In the Administrator of each ColdFusion instance, set the value of "Maximum Number of Simultaneous CFC Function Requests" to 50, say.
7) The instances should share an identical test application. The application will consist of CFCs, whose functions are invoked by means of HTTP. In particular, the requests in the test will involve a combination of functions containing the calls cachePut, cacheGet and cacheRemove.
8) Test by sending a high number of CFC function requests to the (load-balanced) application. Suppose the total number of ColdFusion instances in your test environment is T. Suppose also that your setting for "Maximum Number of Simultaneous CFC Function Requests" is 50. Then test with a load of 50*T or more simultaneous requests.
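The load in step 8 could be generated with a small Java 11 client along the following lines. This is only a sketch under assumptions: the host name lb.example.test, the path /app/Caching.cfc and the method name getItem are hypothetical placeholders, and the "?method=" query string is the usual way of invoking a remote CFC function over HTTP. A dedicated load-testing tool would serve equally well.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class CfcLoadSketch {

    /** Minimum load from step 8: the per-instance limit times the number of instances T. */
    static int minimumLoad(int maxSimultaneousCfcRequests, int instanceCount) {
        return Math.multiplyExact(maxSimultaneousCfcRequests, instanceCount);
    }

    /** Builds the URL of a remote CFC function call (all three parts are placeholders). */
    static URI cfcUri(String baseUrl, String cfcPath, String method) {
        return URI.create(baseUrl + cfcPath + "?method=" + method);
    }

    /** Sends `load` requests without waiting in between and returns how many came back with HTTP 200. */
    static long fire(URI target, int load, Duration timeout) {
        HttpClient client = HttpClient.newBuilder()
                .version(HttpClient.Version.HTTP_1_1)
                .connectTimeout(timeout)
                .build();
        HttpRequest request = HttpRequest.newBuilder(target).timeout(timeout).GET().build();
        List<CompletableFuture<Boolean>> pending = new ArrayList<>();
        for (int i = 0; i < load; i++) {
            pending.add(client.sendAsync(request, HttpResponse.BodyHandlers.discarding())
                    .thenApply(response -> response.statusCode() == 200)
                    .exceptionally(error -> false));      // timeouts and connection errors count as failures
        }
        return pending.stream().filter(CompletableFuture::join).count();
    }

    public static void main(String[] args) {
        int load = minimumLoad(50, 4);                     // e.g. limit 50, T = 4 instances
        System.out.println("requests to send: " + load);
        if (args.length == 1) {                            // only contact a server when a base URL is given
            URI target = cfcUri(args[0], "/app/Caching.cfc", "getItem");
            System.out.println("HTTP 200 responses: " + fire(target, load, Duration.ofSeconds(120)));
        }
    }
}
```

The 120-second timeout mirrors the application's request-time-out mentioned in the description; requests that outlive it are counted as failures rather than left hanging on the client side.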
Actual Result:
Deadlock.
ColdFusion holds the N request threads in a perpetual running state, where N is the setting for "Maximum Number of Simultaneous CFC Function Requests". Every subsequent CFC function request is queued. See the FusionReactor screenshot.
Expected Result:
No deadlock.
No requests exceeding the request-timeout limit.
Any Workarounds:
1) The broken connection between the client (CF instance) and the cache server was responsible for the deadlock. This resulted in the client maintaining requests that wait for the connection to resume.
The workaround therefore forces Terracotta/Ehcache to reconnect whenever it drops a connection with the client. The workaround involves the 4 settings nonstop, rejoin, l2.l1reconnect.enabled and l2.l1reconnect.timeout.millis. They are to be implemented thus:
In the client's ehcache.xml:
<cache>
<terracotta clustered="true">
<nonstop enabled="true"/>
</terracotta>
</cache>
<terracottaConfig rejoin="true" />
In Terracotta server's tc-config.xml:
<tc-properties>
<property name="l2.l1reconnect.enabled" value="true" />
<property name="l2.l1reconnect.timeout.millis" value="60000" />
</tc-properties>
The nonstop setting in ehcache.xml may additionally be extended to:
<nonstop immediateTimeout="false" timeoutMillis="60000">
<timeoutBehavior type="exception" />
</nonstop>
2) No workaround is as yet known for stopping requests from exceeding the request-timeout limit.
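The idea behind the nonstop timeout in workaround 1 is that a cache call waits for a bounded time and then fails with an exception, so the request thread is freed instead of waiting indefinitely. The following Java sketch illustrates only that general idea. It is not how Ehcache implements nonstop, and the class and method names are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedCacheCallSketch {

    /**
     * Runs a (simulated) cache call and gives up after timeoutMillis,
     * raising an exception instead of blocking the caller forever.
     */
    static String callWithTimeout(Callable<String> cacheCall, long timeoutMillis) throws Exception {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<String> result = worker.submit(cacheCall);
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Comparable in spirit to timeoutBehavior type="exception".
            throw new IllegalStateException("cache call timed out after " + timeoutMillis + " ms", e);
        } finally {
            worker.shutdownNow();                          // interrupt the call if it is still running
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(callWithTimeout(() -> "cached value", 60000));
    }
}
```

With a bounded wait of this kind, a request that hits an unreachable cache server ends with an error after the timeout and gives its slot back, rather than occupying one of the N request threads indefinitely.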
Attachments:
Comments: