

Nodes randomly losing communication with cluster

Question

Wednesday, February 20, 2013 6:07 PM

We have a 6-node production cluster on Windows Server 2008 R2 and SQL Server 2008 R2.  At any time, a node can lose communication with the cluster, causing every instance on that node to fail over to other nodes.  The event logs are very generic - event IDs 1006 and 1335.  We have disabled TCP offloading, updated the NIC drivers, and installed various patches (KB2524478, 2552040, 2685891, 2687741, 2754804), but it is still happening.  If anyone has any information that can help, please let me know.  Here is what appears in the cluster log at the time of the disconnect.

00000950.00000b14::2013/02/20-12:37:09.511 WARN  [CHANNEL ~] failure, status WSAETIMEDOUT(10060)

00000950.00000ae4::2013/02/20-12:37:09.511 WARN  [CHANNEL ~] failure, status WSAECONNRESET(10054)

00000950.000009cc::2013/02/20-12:37:09.518 INFO  [ACCEPT] :::~3343~: Accepted inbound connection from remote endpoint:~51451~.

00000950.0000133c::2013/02/20-12:37:09.518 INFO  [SV] Route local (~) to remote  (:~51451~) exists. Forwarding to alternate path.

00000950.0000133c::2013/02/20-12:37:09.518 INFO  [SV] Securing route from (~) to remote  (:~51451~).

00000950.0000133c::2013/02/20-12:37:09.518 INFO  [SV] Got a new incoming stream from:~51451~

00000950.00000b14::2013/02/20-12:37:09.519 INFO  [PULLER evproddb13] Parent stream has been closed.

00000950.00000b14::2013/02/20-12:37:09.519 ERR   [NODE] Node 4: Connection to Node 7 is broken. Reason Closed(1236)' because of 'channel to remote endpoint 3343~ has failed with status WSAETIMEDOUT(10060)'

00000950.00000b14::2013/02/20-12:37:09.519 WARN  [NODE] Node 4: Initiating reconnect with n7.

00000950.00000b14::2013/02/20-12:37:09.519 INFO  [MQ-evproddb13] Pausing

00000950.00001988::2013/02/20-12:37:09.519 INFO  [Reconnector-evproddb13] Reconnector from epoch 1 to epoch 2 waited 00.000 so far.

00000950.00001988::2013/02/20-12:37:09.519 INFO  [CONNECT]:~3343~ from local ~: Established connection to remote endpoint:~3343~.

00000950.00001988::2013/02/20-12:37:09.519 INFO  [Reconnector-evproddb13] Successfully established a new connection.

00000950.00001988::2013/02/20-12:37:09.520 INFO  [SV] Route local (:~52834~) to remote evproddb13 (~) exists. Forwarding to alternate path.

00000950.00001988::2013/02/20-12:37:09.520 INFO  [SV] Securing route from (:~52834~) to remote evproddb13 (3343~).

00000950.00001988::2013/02/20-12:37:09.520 INFO  [SV] Got a new outgoing stream to evproddb13 at 3343~

00000950.00000ae4::2013/02/20-12:37:09.525 ERR   [NODE] Node 4: channel (write) to node 7 is broken. Reason Closed(1236)' because of 'channel to remote endpoint:~3343~ has failed with status WSAECONNRESET(10054)'

00000950.00000ae4::2013/02/20-12:37:09.525 WARN  [NODE] Node 4: Initiating reconnect with n7.

00000950.00000ae4::2013/02/20-12:37:09.525 INFO  [MQ-evproddb13] Pausing

00000950.00000b14::2013/02/20-12:37:09.525 INFO  [NODE] Node 4: Cancelling reconnector...

00000950.00002318::2013/02/20-12:37:09.525 INFO  [Reconnector-evproddb13] Reconnector from epoch 1 to epoch 2 waited 00.000 so far.

00000950.00000b14::2013/02/20-12:37:09.525 INFO  [CONNECT] 3343~ from local 14:~0~: Established connection to remote endpoint 3343~.

00000950.00000b14::2013/02/20-12:37:09.525 INFO  [Reconnector-evproddb13] Successfully established a new connection.

00000950.00000b14::2013/02/20-12:37:09.525 INFO  [SV] Route local (:~52836~) to remote evproddb13 (:~3343~) exists. Forwarding to alternate path.

00000950.00000b14::2013/02/20-12:37:09.526 INFO  [SV] Securing route from (:~52836~) to remote evproddb13 (:~3343~).

00000950.00000b14::2013/02/20-12:37:09.526 INFO  [SV] Got a new outgoing stream to evproddb13 at:~3343~

All replies (8)

Friday, March 8, 2013 1:30 AM ✅Answered

In this case, I suggest you contact Microsoft support for further investigation.


Thursday, February 21, 2013 5:49 AM

Hi,

Regarding the error you mentioned, Event ID 1335: is that a typo for 1135?

Event IDs 1006 and 1135 are related to network connectivity issues between cluster nodes.

Refer to the following procedures to troubleshoot this issue:

  • Run the Validate a Configuration Wizard, selecting only the network and inventory tests.
  • Check the system event log for hardware or software errors related to the network adapters on this node.
  • Check the network adapter, cables, and network configuration for the networks that connect the nodes.
  • Check hubs, switches, or bridges in the networks that connect the nodes.
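On Windows Server 2008 R2, the network-only validation run described above can also be scripted; a minimal sketch (the node names are placeholders for your six nodes):

```shell
@echo off
REM Sketch: run only the Inventory and Network validation tests from the
REM FailoverClusters PowerShell module (node names are placeholders).
powershell -NoProfile -Command ^
  "Import-Module FailoverClusters; Test-Cluster -Node node1,node2,node3 -Include 'Inventory','Network'"
```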

For more information, please refer to the following Microsoft articles:

Event ID 1006 — Cluster Service Startup
http://technet.microsoft.com/en-us/library/cc773418(v=ws.10).aspx
Event ID 1135 — Cluster Service Startup
http://technet.microsoft.com/en-us/library/cc773460(v=ws.10).aspx

Hope this helps!

TechNet Subscriber Support

If you are a TechNet Subscription user and have any feedback on our support quality, please send your feedback here.

 

Lawrence

TechNet Community Support


Wednesday, February 27, 2013 2:39 AM

Hi,

I would like to check on the current situation. Have you resolved the problem, or have you made any further progress?

If there is anything that we can do for you, please do not hesitate to let us know, and we will be happy to help.

Lawrence

TechNet Community Support


Thursday, February 28, 2013 10:13 PM

Thank you for your response.  The issue is still not resolved.  I appreciate your suggestions, but we ran the cluster network validation checks, checked all the network adapters, cables, driver updates, etc., and everything came up clean.  The current status is that we have custom Network Monitor scripts running on each node that stop the capture once an event ID 1135 is generated on any node.   This should give us a complete network trace at the time of the failure, and we should be able to pinpoint the problem.  I will keep you posted.   Other suggestions are welcome.  Thank you.
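For anyone attempting the same approach, here is a minimal sketch of such a watcher. It assumes a circular `netsh trace` capture (built in to Windows Server 2008 R2) rather than Network Monitor, and the poll interval and file path are arbitrary; a real script should also filter on the event time so old 1135 entries are ignored.

```shell
@echo off
REM Sketch only: start a circular packet capture, then poll the System log
REM and stop the trace once FailoverClustering event 1135 appears.
REM Trace file path and 30-second poll interval are assumptions.
netsh trace start capture=yes persistent=yes filemode=circular maxsize=512 tracefile=C:\traces\cluster.etl

:watch
powershell -NoProfile -Command ^
  "exit [int](@(Get-WinEvent -FilterHashtable @{LogName='System';Id=1135} -MaxEvents 1 -ErrorAction SilentlyContinue).Count)"
if %errorlevel% gtr 0 goto stop
timeout /t 30 /nobreak >nul
goto watch

:stop
netsh trace stop
```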


Wednesday, March 6, 2013 6:51 AM

Hi

Please proceed with the following steps:

== Disable TCP Chimney
http://support.microsoft.com/kb/945977
http://blogs.technet.com/b/networking/archive/2008/11/14/the-effect-of-tcp-chimney-offload-on-viewing-network-traffic.aspx
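For reference, the global TCP Chimney Offload setting can be disabled and verified with netsh on Windows Server 2008 R2:

```shell
REM Disable TCP Chimney Offload globally (Windows Server 2008 R2)
netsh int tcp set global chimney=disabled

REM Verify the resulting global TCP settings
netsh int tcp show global
```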

== Disable multicast by typing the following at the command prompt:
cluster CLUSTERNAME /priv MulticastClusterDisabled=1:DWORD
ref: http://support.microsoft.com/kb/307962

Also ensure that network teaming is disabled.


Wednesday, March 6, 2013 1:45 PM

Thank you, but we have already disabled TCP Chimney, and the multicast article applies to Windows Server 2003; we are on Windows Server 2008 R2.  Also, we don't use network teaming.  Thank you for your responses; I should have mentioned this earlier.  Let me know if you have any other ideas.


Thursday, March 21, 2013 2:53 AM

Hi,

This problem may require further debugging. We suggest you contact Microsoft Customer Support Services (CSS) so that a dedicated Support Professional can help you with this issue.

To obtain the phone numbers for a specific technology request, please refer to the website listed below:

http://support.microsoft.com/default.aspx?scid=fh;EN-US;PHONENUMBERS

If you are outside the US, please refer to http://support.microsoft.com  for regional support phone numbers.

Thanks for your understanding.

Lawrence

TechNet Community Support


Sunday, December 15, 2013 1:30 PM

I would like to know whether you have resolved this issue.

In my workplace we have the same problem (for about a year now).

We have already contacted Microsoft, and the response is the same: that we have network problems. But our network team doesn't see anything on our network.

Thanks.