Question
Monday, May 19, 2014 1:31 PM
Hi everyone!
We have a 5-node SQL Server 2012 failover cluster running Windows Server 2012 Datacenter on IBM BladeCenter HS23 (type 7875) blades. The cluster nodes boot from SAN on an IBM Storwize v3700 and use LUNs from an IBM Storwize v7000.
Periodically, on different nodes of the cluster, the following errors appear: Event ID 1073 "The Cluster service was halted to prevent an inconsistency within the failover cluster. The error code was '668'."; Event ID 7031 "The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service."; and Event ID 7024 "The Cluster Service service terminated with the following service-specific error: An assertion failure has occurred." After these errors appear, the affected node hangs in the "joining" state, and the same happens to any other node that is subsequently rebooted or powered off. Every operation I try to perform on the cluster (stopping the cluster service, pause, evict, etc.) fails. The cluster returns to a normal state only after all of its nodes are rebooted. Here is the part of the cluster log from the time the error occurred:
00000b4c.00000c7c::2014/04/21-03:32:25.939 INFO [VSS] Backing up part of the system state [VSS] OnPrepareBackup: starting new session dfb4fbf0-db28-40d2-af3a-82e66a271267
00000b4c.00000c7c::2014/04/21-03:32:25.939 INFO [VSS] OnPrepareBackup returning - true
00000b4c.00001194::2014/04/21-03:32:26.704 INFO [GUM] Node 7: Processing RequestLock 4:4744
00000b4c.00001198::2014/04/21-03:32:26.704 INFO [GUM] Node 7: Processing GrantLock to 4 (sent by 3 gumid: 11271)
00000b4c.00000e2c::2014/04/21-03:32:26.704 ERR mscs::GumAgent::ExecuteQueuedUpdate: TransactionInProgress(5918)' because of 'Cannot restart an in-progress transaction'
00000b4c.00001194::2014/04/21-03:32:26.719 ERR Failed type check .?AUBoxedNodeSet@mscs@@
00000b4c.00001194::2014/04/21-03:32:26.719 ERR [CORE] mscs::ClusterCore::DeliverMessage: TypeMismatch(1629)' because of 'failed type check'
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [VSS] HandleBackupGum - Initiating the backup
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [VSS] HandleOnFreezeGum - Stopping the Death Timer
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [VSS] HandleBackupGum - Completed the backup Request
00000b4c.00000e2c::2014/04/21-03:32:26.750 ERR [GUM] Node 7: sequenceNumber + 1 == payload->GumId (5129, 11272)
00000b4c.00000e2c::2014/04/21-03:32:26.750 ERR mscs::GumAgent::ExecuteQueuedUpdate: AssertionFailed(668)' because of 'failed assertion'(sequenceNumber + 1 == payload->GumId is false)
00000b4c.00000e2c::2014/04/21-03:32:26.750 ERR GumHandler failed (status = 668)
00000b4c.00000e2c::2014/04/21-03:32:26.750 ERR GumHandler failed (status = 668), executing OnStop
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [DM]: Shutting down, so unloading the cluster database.
00000b4c.00000e2c::2014/04/21-03:32:26.750 INFO [DM] Shutting down, so unloading the cluster database (waitForLock: false).
00000b4c.00000e2c::2014/04/21-03:32:26.813 ERR FatalError is Calling Exit Process.
00000b4c.00000b50::2014/04/21-03:32:26.813 INFO [CS] About to exit process...
000015d0.000015d4::2014/04/21-03:32:26.828 WARN [RHS] Cluster service has terminated.
00001618.0000161c::2014/04/21-03:32:26.828 WARN [RHS] Cluster service has terminated.
00001588.0000158c::2014/04/21-03:32:26.828 WARN [RHS] Cluster service has terminated.
000015f4.000015f8::2014/04/21-03:32:26.828 WARN [RHS] Cluster service has terminated.
All of the recommended failover cluster updates and hotfixes are installed, and the cluster has been validated.
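For anyone triaging a similar log, a minimal Python sketch for pulling the ERR entries out of an excerpt like the one above may help. It assumes every line follows the `pid.tid::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message` layout seen in this excerpt; the names `LogEntry`, `parse_line` and `errors_only` are hypothetical helpers, not part of any Microsoft tool.

```python
import re
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional

# Layout assumed from the excerpt: hex process id, hex thread id,
# timestamp, level, then free-form message text.
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>INFO|WARN|ERR)\s+(?P<msg>.*)$"
)


class LogEntry(NamedTuple):
    pid: int
    tid: int
    timestamp: datetime
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one cluster log line; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return LogEntry(
        pid=int(m.group("pid"), 16),
        tid=int(m.group("tid"), 16),
        timestamp=datetime.strptime(m.group("ts"), "%Y/%m/%d-%H:%M:%S.%f"),
        level=m.group("level"),
        message=m.group("msg"),
    )


def errors_only(lines: Iterable[str]) -> List[LogEntry]:
    """Return the parsed ERR entries, skipping lines that do not parse."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e.level == "ERR"]
```

Running `errors_only` over the excerpt keeps the TransactionInProgress, TypeMismatch and AssertionFailed lines together, which makes it easier to see that the errors start right after the GrantLock message rather than at the FatalError line.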
All replies (5)
Saturday, May 31, 2014 5:56 PM ✅Answered
I'd focus on why the cluster service is terminating rather than on why it can't recover, as both are likely caused by the same issue.
You should check to see if you have updated NIC drivers available. This type of issue is typically caused by some sort of glitch in network communications between nodes.
Tuesday, June 3, 2014 7:13 PM ✅Answered
Hello Turinus,
According to the cluster log above, it seems that while the DPM backup was running, GUM updates failed on Node 7, possibly because of network congestion. This led to the Cluster service terminating.
Please make sure there are dedicated physical networks (no VLANs) for the heartbeat and public networks on all cluster nodes.
Also, please verify the file versions listed in the following hotfix article and install the first four hotfixes in the list if your files are outdated.
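To illustrate what the failed assertion `sequenceNumber + 1 == payload->GumId (5129, 11272)` appears to be checking, here is a simplified Python model. It is only a sketch of an in-order update check, not the Cluster service's actual implementation; the class and method names are hypothetical, and reading the logged pair as "local counter 5129, incoming id 11272" is an assumption about the log format.

```python
class SequenceGap(Exception):
    """Raised when an update does not directly follow the last applied one."""


class UpdateReceiver:
    """Toy model of a node that applies globally ordered updates in sequence."""

    def __init__(self, last_applied: int) -> None:
        self.sequence_number = last_applied

    def apply(self, gum_id: int) -> None:
        # Mirrors the shape of the logged assertion:
        # sequenceNumber + 1 == payload->GumId
        if self.sequence_number + 1 != gum_id:
            raise SequenceGap(
                f"expected update {self.sequence_number + 1}, got {gum_id}"
            )
        self.sequence_number = gum_id


node = UpdateReceiver(last_applied=5129)
node.apply(5130)           # next in sequence, accepted
try:
    node.apply(11272)      # far ahead of the local counter, rejected
except SequenceGap as exc:
    print(f"update rejected: {exc}")
```

In this model a node whose local counter is far behind the incoming update id cannot apply the update safely, which is consistent with the service halting itself "to prevent an inconsistency" rather than continuing.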
http://support.microsoft.com/kb/2784261/en-us
Regards,
Monday, May 19, 2014 1:42 PM
Hi,
It looks like a backup was running at the time the error occurred.
Which backup solution is used to back up the cluster?
Regards
Sebastian
Tuesday, May 20, 2014 2:48 PM
SC DPM 2012 Rollup 5
Thursday, May 29, 2014 1:31 PM
No, we don't back up the cluster disks; we back up the system state, BMR, and the SQL databases.