tracker issue | What iT iS dESign studios

Title:

CF10 Update 14 breaks cluster replication / communication

Status/Resolution/Reason: Closed/Fixed/Fixed

Reporter/Name(from Bugbase): A S / A S (A S)

Created: 11/25/2014

Components: Web Container (Tomcat)

Versions: 10.0

Failure Type: Non Functioning

Found In Build/Fixed In Build: Final / Update 23

Priority/Frequency: Major / Some users will encounter

Locale/System: English / Linux RH Enterprise 6

Vote Count: 1

Problem Description: We have had a small cluster of two instances working until updating to CF10 update 14.  Now we receive communication errors.

Steps to Reproduce: Create CF cluster.  Update to update 14.  Restart instances.  View coldfusion-error.log of instances.

Actual Result: One or both instances are hung/dead

Expected Result: Both instances restart

Any Workarounds:  Downgrade to update 13.

----------------------------- Additional Watson Details -----------------------------

Watson Bug ID:	3857664

Reason:	PRNeedInfo

External Customer Info:
External Company:  
External Customer Name: GuitsBoy
External Customer Email:  
External Test Config: My Hardware and Environment details:

CentOS 6.6 64 bit, 16GB ram.



On a reboot, or with both instances off, usually both instances will come up.   Once they are up, and you try to restart one instance, you get replication communication errors.  If I remove the secondary IP address to remove traffic, restarting an instance will occasionally work.  But with any traffic at all, restarting the instance results in the instance never fully coming up.



Downgrading to update 13 results in normal operation.

Attachments:

November 26, 2014 00:00:00: 1_cf-error-logs.txt
December 11, 2014 00:00:00: 2_cfusion1-coldfusion-error.log.txt
December 11, 2014 00:00:00: 3_cfusion1-coldfusion-out.log.txt
December 11, 2014 00:00:00: 4_cfusion1-server.xml.txt
December 11, 2014 00:00:00: 5_cfusion2-coldfusion-error.log.txt
December 11, 2014 00:00:00: 6_cfusion2-coldfusion-out.log.txt
December 11, 2014 00:00:00: 7_cfusion2-server.xml.txt

Comments:

Additional info found in forums:
https://forums.adobe.com/thread/1643372 

When restarting an instance:
# /opt/coldfusion10/cfusion2/bin/coldfusion restart
Restarting ColdFusion 10 server instance named cfusion2 ...
Stopping ColdFusion 10 server instance named cfusion2, please wait
Nov 25, 2014 11:39:57 AM com.adobe.coldfusion.launcher.Launcher stopServer
SEVERE: Shutdown Port 8009is not active. Stop the server only after it is started.
ColdFusion 10 server instance named cfusion2 has been stopped
Starting ColdFusion 10 server instance named cfusion2 ...
The ColdFusion 10 server instance named cfusion2 is starting up and will be available shortly.
nohup: appending output to `nohup.out'
======================================================================
ColdFusion 10 server instance named cfusion2 has been started.
ColdFusion 10 will write logs to /opt/coldfusion10/cfusion2/logs/coldfusion-out.log
======================================================================

Comment by External U.

9954 | November 25, 2014 11:34:36 AM GMT

I should note that this is a production-affecting issue.  When the restarted instance does not fully start up, Tomcat DeltaManager still adds the member, resulting in requests being forwarded to a dead/hung instance.  Uncompleted requests will queue up until Apache hangs, bringing down the web server.

Comment by External U.

9955 | November 25, 2014 12:35:53 PM GMT

@GuitsBoy We tried to set up the configuration you describe here and on the forums, but are unable to reproduce this issue. The error logs you have attached do not provide a lot of useful information. 

Could you also send across the server logs (coldfusion-out.log)?
Also, please make note of any configuration changes that deviate from the defaults.

Comment by Immanuel N.

9956 | December 11, 2014 08:24:48 AM GMT

Hi Immanuel, thanks for the response.

This issue seems to depend on at least moderate load on the machine, which has proven difficult to reproduce in a lab environment.  Since our production boxes have all been rolled back to update 13 for some time, I have attached log files from one of our development boxes.  It does not have nearly the same load as a production box, but it seems to have the same issue from time to time.  On some restarts it comes back OK, but on others it seems to break communication between the two cluster members.  Again, the hung instance is in a half-dead state, so Apache continues to load balance half the requests to it, and those requests never complete.  If I restart the hung CF instance, Tomcat senses that a cluster member is down, Apache moves the requests over to the functioning instance, and all is well.

There's really no heavy customization here.  We added the SSL port for configuration, and made some security changes for PCI compliance, but nothing having to do with the functionality of clustering.

Please let me know what else you would like to take a look at, and I'd be happy to attach / test / troubleshoot.

Thanks, 
-Tony

Comment by External U.

9957 | December 11, 2014 09:24:52 AM GMT

Does anyone know if this bug has been fixed in update 15?  I have not yet applied update 15 for fear it will break our production environment again.

Comment by External U.

9958 | February 27, 2015 03:01:05 PM GMT

Any update on this? I can't even get the clustered instances to start after applying update 14. If I revert back to 13, everything works normally. 

Here is the error:

  INFO: Manager [localhost#/]; session state send at 4/4/15 1:20 AM received in 6,617 ms.
Apr 04 01:21:19 WWW6 coldfusion-error.log:  Apr 04, 2015 1:20:42 AM org.apache.catalina.session.StandardSession tellNew
Apr 04 01:21:19 WWW6 coldfusion-error.log:  SEVERE: Session event listener threw exception
Apr 04 01:21:19 WWW6 coldfusion-error.log:  java.lang.NullPointerException
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at coldfusion.bootstrap.HttpFlexSessionBootstrap.getListener(HttpFlexSessionBootstrap.java:154)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at coldfusion.bootstrap.HttpFlexSessionBootstrap.sessionCreated(HttpFlexSessionBootstrap.java:69)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.session.StandardSession.tellNew(StandardSession.java:422)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.session.StandardSession.setId(StandardSession.java:394)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.ha.session.DeltaSession.setId(DeltaSession.java:275)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.ha.session.DeltaManager.handleSESSION_CREATED(DeltaManager.java:1336)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.ha.session.DeltaManager.messageReceived(DeltaManager.java:1214)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.ha.session.DeltaManager.getAllClusterSessions(DeltaManager.java:803)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.ha.session.DeltaManager.startInternal(DeltaManager.java:756)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5476)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1559)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1549)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Apr 04 01:21:19 WWW6 coldfusion-error.log:  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Comment by External U.

9959 | April 03, 2015 11:56:11 PM GMT

Just wanted to add that I'm on Win2K

Comment by External U.

9960 | April 04, 2015 12:10:53 AM GMT

Entered too quickly: Windows 2008 R2

Comment by External U.

9961 | April 04, 2015 12:11:46 AM GMT

Hi Sumit,
Is it possible to provide the web.xml file that is in place when the issue occurs (under C:\ColdFusion10\cfusion\wwwroot\WEB-INF\)?

Comment by Krishna R.

9962 | May 02, 2015 09:06:23 AM GMT

Krishna,

I just emailed you the files under support issue #186573646. Anit was working on it. All my log files are there as well.

Sumit

Comment by External U.

9963 | May 02, 2015 09:15:58 AM GMT

As of today, this bug is reproducing for us in ColdFusion 11 Update 6. Session sharing is the entire reason we purchased enterprise-level licenses, and we cannot get it to work correctly in our production environment.

Vote by External U.

9972 | August 31, 2015 09:20:39 AM GMT

Changing the multicast address / port fixed the issue. 
A possible cause was that the same multicast address and port were being used across independent clusters. 
Instances belonging to one cluster were pulling information off instances belonging to another cluster. 

Please respond in case you still face this issue.
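
As an illustration of the change described above, the multicast membership entry in an instance's server.xml could look like the following sketch. The element and attribute names are those of Tomcat's standard McastService; the address and port values are hypothetical placeholders, not values taken from this report, and the other child elements of the Channel are omitted. Every instance in one cluster would share the same pair, and each independent cluster would use a different pair.

```xml
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
  <Channel className="org.apache.catalina.tribes.group.GroupChannel">
    <!-- Tomcat's stock defaults are address 228.0.0.4 and port 45564.
         The values below are placeholders: choose a pair that no other
         cluster on the same network segment is using. -->
    <Membership className="org.apache.catalina.tribes.membership.McastService"
                address="228.0.0.14"
                port="45574"
                frequency="500"
                dropTime="3000"/>
  </Channel>
</Cluster>
```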

Comment by Immanuel N.

9964 | September 29, 2015 03:31:27 AM GMT

Changing the multicast address/port did fix the reboot issue. But since we made this change, we have started noticing errors in some of our file upload (to RAM) processes. We keep seeing a file-not-found error, which means the sticky session is probably not working. 

Also, this worked without issues prior to update 14, so it would be great to find the root cause and fix that, so we don't need to change the multicast address/port.

Comment by External U.

9965 | September 29, 2015 08:49:17 AM GMT

Sumit,
Do you see any exceptions thrown for the file not found error? 
Could you please send us those, along with any snippet of code you may have, to help us investigate?

Comment by Immanuel N.

9966 | December 22, 2015 01:32:43 AM GMT

Here is the error:
java.io.FileNotFoundException: ram:///dataImport
	at coldfusion.vfs.VFSFileFactory.getInputStream(VFSFileFactory.java:236)
	at coldfusion.tagext.io.FileUtils.readFile(FileUtils.java:958)
	at coldfusion.tagext.io.FileTag.read(FileTag.java:484)
	at coldfusion.tagext.io.FileTag.doStartTag(FileTag.java:302)
	at coldfusion.runtime.CfJspPage._emptyTcfTag


Here is what I'm doing:
Request 1 uploads the file:

<cffile action="upload" filefield="uploadfile" destination="ram:///dataImport" nameconflict="overwrite" >

Then Request 2 Reads it:
<cffile action="read" file="ram:///dataImport" variable="fileData" >

The error happens after changing the Multicast address/port. No issue with default setup. 

Sticky session is enabled in cluster.

Comment by External U.

9967 | December 22, 2015 10:27:31 AM GMT

We recently updated to CF10 Update 18, and I've been working on this issue once again.

I found a number of compounding problems.

First, we were able to get rid of a lot of errors in our logs by giving each box a unique multicast address.  We have 4 physical boxes, each with two instances, and it seems we were getting some crosstalk.

Second, we were able to remove more errors by setting channelSendOptions="6".

Third, it appears that the connector in update 14 was indeed broken, since it would leave a hung instance in the cluster, still forwarding requests to the dead instance.  This appears to have been fixed somewhere between update 14 and update 18.

Lastly, we found that forcing SSL on our CFIDE/administrator instances was causing problems where more than half the time, the instance would die on restart.  Nothing written to logs, no errors, still shows in the process list, but the instance was dead dead dead.  By removing SSL, we had 100% success rate on restarts.  Unfortunately we still need SSL, so we kept troubleshooting and found that the cipher string seems to affect restart reliability.  We had less than 50% success with just one cipher in the string, "TLS_RSA_WITH_AES_128_CBC_SHA".   But by adding a common string containing a handful of ciphers, we have increased the success to about 80-90%.  Not sure why this would be the case, and again, there are no errors to go off of.

I have created a separate bug for the last issue, but figured I would cross post here in case it helps.
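
For reference, channelSendOptions is an attribute of the Cluster element in server.xml. In Tomcat's Tribes channel the value is a bit mask: 6 combines USE_ACK (2) and SYNCHRONIZED_ACK (4), so replication messages are sent synchronously and acknowledged, whereas the Tomcat default of 8 is asynchronous. A minimal sketch, assuming the rest of the cluster block is left unchanged:

```xml
<!-- channelSendOptions bit flags: 2 = USE_ACK, 4 = SYNCHRONIZED_ACK, 8 = ASYNCHRONOUS (default) -->
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
         channelSendOptions="6">
  <!-- existing Manager, Channel, Valve and ClusterListener elements stay as they are -->
</Cluster>
```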

Comment by External U.

9968 | March 16, 2016 04:01:40 PM GMT

Unfortunately my joy was short-lived.  As soon as we put the box back into production, it immediately started failing on restart.  It seems any moderate load is enough to cause DeltaManager to hang indefinitely.

After much trial and error with a fresh CentOS install, a fresh CF10 update 18 install, and some load generated with Apache JMeter, I believe that DeltaManager is fundamentally broken, at least in ColdFusion's implementation of it.  Neither the default multicast nor static clustering with DeltaManager would result in a working test while under load.  Even changing all session replication timeouts to obscenely high values (120 minutes, etc.) would still result in a hung instance.

But we do find that using BackupManager instead of DeltaManager, in conjunction with higher timeouts, does result in a working test box, at least under artificial load.  Unfortunately this limits our clusters to only two nodes.

Does anyone have any suggestions as to why DeltaManager is broken, but BackupManager seems to work?
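
The swap described above comes down to the className of the Manager element inside the Cluster block of server.xml. What follows is a sketch assuming a stock Tomcat 7 cluster definition. The attribute values are the ones shown in Tomcat's own clustering documentation, not settings confirmed in this thread. The timeout changes are not shown because the thread does not say which ones were raised.

```xml
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
  <!-- Before: DeltaManager replicates every session to all cluster members
  <Manager className="org.apache.catalina.ha.session.DeltaManager"
           expireSessionsOnShutdown="false"
           notifyListenersOnReplication="true"/>
  -->
  <!-- After: BackupManager replicates each session to a single backup member -->
  <Manager className="org.apache.catalina.ha.session.BackupManager"
           expireSessionsOnShutdown="false"
           notifyListenersOnReplication="true"
           mapSendOptions="6"/>
</Cluster>
```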

Comment by External U.

9969 | March 28, 2016 12:59:48 PM GMT

After a few days in production, it seems that changing from DeltaManager to BackupManager is working well.  Unfortunately this limits us to smaller clusters of only 2 nodes/instances, but it's certainly better than having to pull the box out of production just to restart an instance.

Comment by External U.

9970 | March 31, 2016 04:26:50 PM GMT

There are multiple issues mentioned in the comments below, and some are already being addressed. 

We request that you reach out to us at cfinstal@adobe.com, or log separate bugs to address the specific issues mentioned here.

Comment by Immanuel N.

9971 | September 30, 2016 07:47:59 AM GMT

A handful of issues have been fixed on later updates. 
Please report any issues with clustering that are still present with the latest update (HF23) installed.

Comment by Immanuel N.

29563 | November 08, 2017 05:27:16 AM GMT

tracker issue : CF-3857664