tracker issue : CF-4129174

Title:

Coldfusion 10 instance randomly will not restart

Status/Resolution/Reason: Closed/Withdrawn/CannotReproduce

Reporter/Name(from Bugbase): A S / A S (A S)

Created: 03/16/2016

Components: Installation/Config, Connector

Versions: 10.0

Failure Type: Non Functioning

Found In Build/Fixed In Build: Final /

Priority/Frequency: Major / Some users will encounter

Locale/System: English / Linux RH Enterprise 6

Vote Count: 0

Problem Description:
After updating to any update from 14 onward, we experience random dead instances on restart. Running ps aux shows that the java/cfusion process for the instance is running, but you cannot connect to the CFIDE. The mod_jk status shows the hung instance flapping between the ERR/PRB state and the ERR state: requests are occasionally sent to the hung backend, time out, and roll over to the other live instance. Restarting the hung instance has a 50/50 chance of working or hanging again. Issuing a stop, waiting a minute, and then issuing a start works no better than a simple restart.

This issue affects both instances, and is reproducible across the four physical machines. All conflicting ports and multicast addresses have already been changed to unique values.

When the instance hangs on restart, nothing is written to coldfusion-out.log

Additionally, when the instance is hung, issuing a status command hangs as well. When the instance is up and running, the status command incorrectly reports that CF is not running, and in the enterprise instance manager none of the instances show as running.

Steps to Reproduce:
Load balance two sub-instances, install CF update 18, restart instances.

Actual Result:
Occasional hung instance.

Expected Result:
Clean restart every time.

Any Workarounds:
Keep restarting the CF instance until it comes up.
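
The workaround above could be automated along these lines. This is only a minimal sketch, not anything from the original report: `restart_until_healthy` and `http_responds` are hypothetical helper names, and the restart command and probe URL are left to the caller because the report does not give them. The probe treats any HTTP answer as "up", since the symptom described is a running process that never answers.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_responds(url: str, timeout: float = 10.0) -> bool:
    """Return True if the server answers the request at all (assumed probe)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except OSError:
        # Connection refused, timeout, TLS failure, and similar errors.
        return False


def restart_until_healthy(
    restart: Callable[[], None],
    is_healthy: Callable[[], bool],
    max_attempts: int = 5,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Restart until is_healthy() is True; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        restart()
        sleep(wait_seconds)  # give the instance time to finish starting
        if is_healthy():
            return attempt
    raise RuntimeError(f"instance still hung after {max_attempts} restarts")
```

In use, `restart` would wrap whatever command restarts the instance on the affected host, and `is_healthy` would be something like `lambda: http_responds(url)` with the instance's own URL. Capping the attempts keeps a persistently hung instance from being restarted forever.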

----------------------------- Additional Watson Details -----------------------------

Watson Bug ID:	4129174

Reason:	PRHaveInfo

External Customer Info:
External Company:  
External Customer Name: A S
External Customer Email:  
External Test Config: My Hardware and Environment details:

Attachments:

Comments:

I forgot to mention that the master as well as both sub-instances have been modified to use SSL as per the ColdFusion 10 hardening guide.
Comment by External U.
3285 | March 16, 2016 10:46:18 AM GMT
After backing out our older PCI changes (that worked fine until update 14) the server instances have a 100% success rate. It must have something to do with the SSL/TLS modifications.
Comment by External U.
3286 | March 16, 2016 11:45:28 AM GMT
Seems to be very dependent on the cipher string. Environment: CF10 Update 18, Redhat 6.7, Java 1.7.0_95. Initially I had only one cipher in the string, "TLS_RSA_WITH_AES_128_CBC_SHA", and this caused the most problems on restart, with less than a 50% success rate. I have since switched to the cipher string below, and now have an 80%-90% success rate when restarting. Without anything interesting being logged, I can't figure out which ciphers help and which hurt.

TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_RC4_128_SHA, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, SSL_RSA_WITH_RC4_128_SHA
Comment by External U.
3287 | March 16, 2016 03:51:54 PM GMT
The cipher string wound up being a red herring, and led us in the wrong direction. Even with SSL completely removed, CF10 update 18 was still broken. I decided to try a fresh install of CentOS 6 and CF10 update 18, and test again using Apache JMeter to generate some load on the box. CF10 update 18 is broken out of the box. There is a problem with DeltaManager in Tomcat: it won't replicate the sessions (even after a 120 minute timeout), and therefore CF refuses to start. I played around with various settings, multicast and static, but I cannot get DeltaManager to play nice while under load. BackupManager seems to work under load, but is limited to clusters of only 2 instances. As a temporary workaround, we are using BackupManager with smaller clusters, and it seems to have been working well enough in production for the last couple of days.
Comment by External U.
3288 | March 31, 2016 04:22:04 PM GMT
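
For anyone comparing their own setup against the DeltaManager/BackupManager observation above, the following is a rough sketch for checking which session manager a Tomcat `server.xml` declares. It is an illustration, not part of the original report: the `SAMPLE` fragment is hand-written, and the function name `cluster_managers` is hypothetical. It assumes the usual Tomcat clustering layout of a `<Manager>` element nested inside a `<Cluster>` element.

```python
import xml.etree.ElementTree as ET
from typing import List

# Illustrative fragment only; a real server.xml contains far more than this.
SAMPLE = """
<Server>
  <Service name="Catalina">
    <Engine name="Catalina" defaultHost="localhost">
      <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
        <Manager className="org.apache.catalina.ha.session.DeltaManager"/>
      </Cluster>
    </Engine>
  </Service>
</Server>
"""


def cluster_managers(server_xml_text: str) -> List[str]:
    """Return the className of every <Manager> nested inside a <Cluster>."""
    root = ET.fromstring(server_xml_text.strip())
    return [
        manager.get("className", "")
        for cluster in root.iter("Cluster")
        for manager in cluster.iter("Manager")
    ]


if __name__ == "__main__":
    for name in cluster_managers(SAMPLE):
        print(name)
```

An instance using the workaround described above would be expected to report `org.apache.catalina.ha.session.BackupManager` instead.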
A S, can you confirm whether this issue still occurs with ColdFusion 10 hotfix 19?
Comment by Immanuel N.
3289 | May 24, 2016 01:14:05 AM GMT
Yes, I can confirm that DeltaManager is still broken after update 19. I reverted my fresh test box to use DeltaManager on update 18, and generated some load with jmeter on a page that does a few cfhttp calls in a session. I then restarted cf instances and verified that servers would hang when starting back up. After removing load and getting both servers back up, I updated them both to update 19, and restarted again. They restart fine without any load, but when I apply load with jmeter again, they hang on startup the same exact way. Please let me know what information you would like me to post up.
Comment by External U.
3290 | May 26, 2016 11:23:12 AM GMT
The hung server instance shows this in the nohup.out file:

SEVERE: Manager [localhost#]: No session state send at 5/26/16 12:48 PM received, timing out after 60,092 ms.
May 26, 2016 12:49:54 PM org.apache.catalina.session.StandardSession tellNew
SEVERE: Session event listener threw exception
java.lang.NullPointerException
    at coldfusion.bootstrap.HttpFlexSessionBootstrap.getListener(HttpFlexSessionBootstrap.java:154)
    at coldfusion.bootstrap.HttpFlexSessionBootstrap.sessionCreated(HttpFlexSessionBootstrap.java:69)
    at org.apache.catalina.session.StandardSession.tellNew(StandardSession.java:416)
    at org.apache.catalina.session.StandardSession.setId(StandardSession.java:388)
    at org.apache.catalina.ha.session.DeltaSession.setId(DeltaSession.java:275)
    at org.apache.catalina.ha.session.DeltaManager.handleSESSION_CREATED(DeltaManager.java:1277)
    at org.apache.catalina.ha.session.DeltaManager.messageReceived(DeltaManager.java:1155)
    at org.apache.catalina.ha.session.DeltaManager.getAllClusterSessions(DeltaManager.java:777)
    at org.apache.catalina.ha.session.DeltaManager.startInternal(DeltaManager.java:730)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:147)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5593)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:147)
    at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1572)
    at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1562)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
May 26, 2016 12:49:54 PM org.apache.catalina.session.StandardSession tellNew
SEVERE: Session event listener threw exception
java.lang.NullPointerException
    at coldfusion.bootstrap.HttpFlexSessionBootstrap.getListener(HttpFlexSessionBootstrap.java:154)
    at coldfusion.bootstrap.HttpFlexSessionBootstrap.sessionCreated(HttpFlexSessionBootstrap.java:69)
    at org.apache.catalina.session.StandardSession.tellNew(StandardSession.java:416)
    at org.apache.catalina.session.StandardSession.setId(StandardSession.java:388)
    at org.apache.catalina.ha.session.DeltaSession.setId(DeltaSession.java:275)
    at org.apache.catalina.ha.session.DeltaManager.handleSESSION_CREATED(DeltaManager.java:1277)
    at org.apache.catalina.ha.session.DeltaManager.messageReceived(DeltaManager.java:1155)
    at org.apache.catalina.ha.session.DeltaManager.getAllClusterSessions(DeltaManager.java:777)
    at org.apache.catalina.ha.session.DeltaManager.startInternal(DeltaManager.java:730)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:147)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5593)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:147)
    at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1572)
    at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1562)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
May 26, 2016 12:49:54 PM org.apache.catalina.session.StandardSession tellNew
Comment by External U.
3291 | May 26, 2016 01:12:31 PM GMT
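
Because a hung start leaves nothing in coldfusion-out.log, the nohup.out message quoted above may be the only signal that session-state transfer timed out. Below is a small sketch for pulling those timeouts out of a log. The regular expression is derived solely from the single line quoted in this report, so treat the pattern (and the hypothetical function name `session_state_timeouts`) as an assumption that may need adjusting for other Tomcat versions or locales.

```python
import re
from typing import List, Tuple

# Pattern inferred from the one log line quoted in this report.
TIMEOUT_RE = re.compile(
    r"No session state send at (.+?) received, timing out after ([\d,]+) ms"
)


def session_state_timeouts(log_text: str) -> List[Tuple[str, int]]:
    """Return (request time, timeout in ms) for each DeltaManager timeout line."""
    results = []
    for match in TIMEOUT_RE.finditer(log_text):
        when = match.group(1)
        millis = int(match.group(2).replace(",", ""))  # "60,092" -> 60092
        results.append((when, millis))
    return results


if __name__ == "__main__":
    sample = (
        "SEVERE: Manager [localhost#]: No session state send at "
        "5/26/16 12:48 PM received, timing out after 60,092 ms."
    )
    print(session_state_timeouts(sample))
```

Running this over nohup.out after each restart attempt would show whether a given hang coincides with a state-transfer timeout.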
For what it's worth, if the artificial load hits a much simpler .cfm page that does not include any cfhttp calls, there is a much lower chance of having an instance hang. Perhaps it's something to do with the size of the sessions that are being replicated? Unfortunately, extending the timeouts, even to hours, does not help the situation. Also, we already have -Djava.security.egd=file:/dev/./urandom in our jvm.config to speed up web calls.
Comment by External U.
3292 | May 26, 2016 02:14:26 PM GMT