Question
Friday, August 8, 2014 10:57 AM
Hello,
I will describe the current situation, the lab and the cluster log.
I'm looking forward to any hint to get the cluster running again.
THE SITUATION
I am working on a cluster resource dll, porting it from Server 2003 to Server 2012R2.
After a reboot of my lab, the cluster service no longer starts on either node. The error messages in the event log and cluster log say that the system cannot find the file specified, but they do not say which file.
I can access the file share witness from both nodes. The machines can ping each other.
Before the reboot, all cluster resources could be brought online and moved between the nodes.
THE LAB
The lab was set up by a colleague and consists of three VMs running in VirtualBox on Windows 7 Enterprise 64-bit:
1. Server 2008R2 as Domain Controller and provider of the file share witness.
2. Server 2012R2 as Node 1
3. Server 2012R2 as Node 2
THE LOG
The last lines of the cluster log, containing the error messages, are shown below.
The line with the first occurrence of 'file not found' (code 2) contains ClRtlOpenFileEx, but an internet search gives no results at all for this function.
00000bd4.00000ab0::2014/08/08-08:50:31.504 ERR mscs::ReservationAgent::UpdateClusDiskMembership: (87)' because of 'DeviceIoControl failed'
00000bd4.00000ab0::2014/08/08-08:50:31.504 ERR [DCM] StartRdrCsvInstance: instance \Device\SmbCsv, status 5
00000bd4.00000ab0::2014/08/08-08:50:31.504 ERR [CORE] Node 1: exception caught (2)' because of 'ClRtlOpenFileEx(&hControlRdr_, NULL, (LPWSTR)RdrDeviceName.c_str(), SYNCHRONIZE, FILE_SHARE_VALID_FLAGS, FILE_SYNCHRONOUS_IO_NONALERT)'
00000bd4.00000ab0::2014/08/08-08:50:31.504 ERR Exception in the PostForm is fatal (status = 2)
00000bd4.00000ab0::2014/08/08-08:50:31.504 ERR Exception in the PostForm is fatal (status = 2), executing OnStop
00000bd4.00000ab0::2014/08/08-08:50:31.504 INFO [DM]: Shutting down, so unloading the cluster database.
00000bd4.00000ab0::2014/08/08-08:50:31.504 INFO [DM] Shutting down, so unloading the cluster database (waitForLock: false).
00000bd4.00000ab0::2014/08/08-08:50:31.566 ERR FatalError is Calling Exit Process.
00000af0.00000928::2014/08/08-08:50:31.566 WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
00000b38.00000b20::2014/08/08-08:50:31.566 WARN [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.
00000b38.00000b20::2014/08/08-08:50:31.566 INFO [RHS] Exiting.
00000af0.00000928::2014/08/08-08:50:31.582 INFO [RHS] Exiting.
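The numeric codes in the ERR lines above can be looked up mechanically. The following is only a rough sketch: it assumes the numbers are plain Win32 error codes (the log does not say so explicitly), and the function name `annotate` and the regular expression are made up for this excerpt, not part of any cluster tooling.

```python
import re

# Win32 error code names from winerror.h for the codes seen in the excerpt.
# Assumption: the log reports Win32 codes, not NTSTATUS values.
WIN32_CODES = {
    2: "ERROR_FILE_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
    87: "ERROR_INVALID_PARAMETER",
}

# Matches "(87)", "(status = 2)" and "status 5" as they appear in the excerpt.
CODE_RE = re.compile(r"\((?:status = )?(\d+)\)|\bstatus (\d+)")


def annotate(line):
    """Return (code, name) for the first code in an ERR line, else None."""
    if " ERR " not in line:
        return None
    match = CODE_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1) or match.group(2))
    return code, WIN32_CODES.get(code, "unknown")


if __name__ == "__main__":
    sample = [
        r"00000bd4.00000ab0::2014/08/08-08:50:31.504 ERR [DCM] StartRdrCsvInstance: instance \Device\SmbCsv, status 5",
        r"00000bd4.00000ab0::2014/08/08-08:50:31.504 ERR Exception in the PostForm is fatal (status = 2)",
    ]
    for entry in sample:
        print(annotate(entry))
```

Read this way, the "status 5" line just before the fatal "file not found" would be an access-denied failure, which may be worth checking before looking for a missing file.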
Thank you for reading and supporting.
SchLois
All replies (11)
Friday, August 8, 2014 3:03 PM ✅Answered
It starts by complaining about disk resource reservation. Other than the share witness, what disks are on your cluster and how are they set up?
To start off, shut down both cluster nodes, boot up only one node, and see if you can connect to it locally, not by the cluster name.
You can also try some of the options in PowerShell or cluster.exe to start the cluster service in recovery mode and maybe get more details.
Make sure your file witness share is online and accessible, and check its permissions.
Any changes that were made prior to the reboot?
Monday, August 11, 2014 12:56 PM ✅Answered
Personally, since this is a lab environment, I would rebuild the cluster and start anew. Yes, there is value in understanding the root cause, but too often in situations like this, documentation was not kept of the exact steps in which things were done. That is often a requirement to find root cause.
Because you are performing this in a virtual environment, I would also be sure to make use of snapshots as I make major changes, such as adding your dll. That way you can roll back to a known good point if something bad happens again. Then if something bad happens again, you can better document the exact steps you took to recreate the problem.
. : | : . : | : . tim
Tuesday, August 12, 2014 1:28 PM ✅Answered
I do recall that in 2008 there were specific cluster.exe commands that let you actually remove a DLL resource from the cluster's resource library. Removing the .dll file itself will not remove that reference, which is probably why it didn't help you.
However, I can't seem to find the equivalent command for this in PowerShell in 2012. You can remove specific resource types, but I'm not sure if that will help you.
I would agree that you should rebuild.
I would also be cautious with snapshots for MSCS; I'm not sure how they would work under VirtualBox. I suppose if you shut down your virtual cluster nodes and then take the snapshots it may be OK, but I doubt a live snapshot would be of any use.
Wednesday, August 13, 2014 3:04 AM ✅Answered
Hi SchLois,
As far as I know, the latest released version of VirtualBox is 4.3.14, and it does not support Server 2012 R2; you can ask your software vendor for further help. In your case, I personally think the problem may be caused by the shared storage, so please confirm that your shared storage meets the failover cluster requirements.
More information:
Support policy for Microsoft software that runs on non-Microsoft hardware virtualization software
http://support.microsoft.com/kb/897615
Understanding Cluster Validation Tests: Storage
http://technet.microsoft.com/en-us/library/cc771259.aspx
The VirtualBox manual section on guest operating system support:
https://www.virtualbox.org/manual/ch03.html#intro-64bitguests
I’m glad to be of help to you!
Friday, August 8, 2014 2:08 PM
"I am working on a cluster resource dll, porting it from Server 2003 to Server 2012R2."
Does this problem occur only when you have your cluster resource dll as part of the cluster, or does it occur in the cluster without your cluster dll?
. : | : . : | : . tim
Friday, August 8, 2014 2:28 PM
Everything worked with my cluster resource dll before the reboot. Now I cannot view or change anything in the cluster configuration, because cluadmin stays empty while the cluster service is not running.
The event log shows that the service starts and then terminates unexpectedly a few seconds later.
Friday, August 8, 2014 10:41 PM
In addition to what Armin is suggesting, I would suggest you try to remove your dll. Although the cluster appeared to be working fine before the reboot, the fact that it failed on reboot makes me suspicious of your dll.
. : | : . : | : . tim
Monday, August 11, 2014 9:01 AM
"what disks are on your cluster"
Each of the three nodes only has one basic disk as boot and system volume.
One resource (SQL Server Express) uses a share (also hosted on the lab's domain controller) for its data, but the cluster has no knowledge of and no dependency on this share.
"boot up only one node, and see if you can connect"
Just after starting cluadmin, the message is "No items found".
Connect to Cluster <Cluster on this server...> results in
"The cluster service is not running. Make sure that the service is running on all nodes in the cluster.
Error Code: 0x800706d9 There are no more endpoints available from the endpoint mapper"
"try some of the options in powershell"
get-cluster : The cluster service is not running. Make sure that the service is running on all nodes in the cluster.
There are no more endpoints available from the endpoint mapper
start-cluster
WARNING: The node cannot be contacted. Ensure that the node is powered on and is connected to the network.
start-cluster : An error occurred while loading state information for the node 'WIN2k12-Node1'.
There are no more endpoints available from the endpoint mapper
While this command was running, a popup appeared saying that one node is down and one is joining.
"witness share is online"
The witness.log's timestamp is current, so it seems to be accessible.
PS C:\Users\administrator.GERTEST> dir \\win2k8-domaen\QuorumDevice\f275053c-1472-4a30-b560-240e6ae52138
Directory: \\win2k8-domaen\QuorumDevice\f275053c-1472-4a30-b560-240e6ae52138
Mode LastWriteTime Length Name
-a 06.08.2014 17:20 0 VerifyShareWriteAccess.txt
-a 11.08.2014 10:59 88 Witness.log
"changes that were made prior to the reboot"
No changes have been made.
Monday, August 11, 2014 9:06 AM
As I cannot do it with cluadmin, do you mean just deleting the dll, or do you know another way, such as manipulating the cluster hive?
Update:
Deleting my cluster dll in C:\Windows\Cluster\ leads to these additional lines in the cluster log:
00000854.00000670::2014/08/11-11:33:52.885 ERR [RHS] s_RhsRpcCreateResType: (126)' because of 'Error loading resource DLL akaridcluster1.dll.'
000007dc.0000088c::2014/08/11-11:33:52.885 INFO [RCM] result of first load attempt for type AkaridCluster1: 126
000007dc.0000088c::2014/08/11-11:33:52.885 WARN [RCM] The DLL for resource type 'AkaridCluster1' is not present.
But the cluster service still doesn't start, and the end of the log is unchanged.
Wednesday, August 20, 2014 1:25 AM
Hi,
I just want to confirm the current situation.
Please feel free to let us know if you need further assistance.
Regards.
Monday, October 6, 2014 10:31 AM
Solution
After some iterations of rebuilding my cluster environment, with more frequent snapshots and reboots, the reason for the service not starting is clear and easy to fix.
My cluster resource dll needs a non-standard privilege. To grant this privilege to the cluster service, I wrote a batch file a year ago and called it add_privs.bat. Using it now, I assumed it does what its name says: it adds privileges. In fact, it only sets privileges (sc privs clussvc ... has no option to add, just to set). So calling add_privs.bat granted exactly the privileges that were needed last year on Server 2008 R2. When I called it now on Server 2012, the cluster service was granted my non-standard privilege but lost the privileges to load a driver and to shut down the system, which it had held before. Restarting the cluster service alone worked, and with the new privileges my cluster resource dll also worked. The missing privileges only caused problems after a reboot of the system, which happened many steps after calling add_privs.bat. This, together with the misleading name of the batch file, made it rather time-consuming to track down the problem.
Improving my batch file so that it first parses the output of sc qprivs clussvc and then really adds my privilege solved this and will avoid future problems.
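The query-then-set approach can be sketched roughly as follows. This is an illustration in Python, not the actual batch file: the sample text only approximates the shape of `sc qprivs` output, the two listed privileges are just the ones named above (not the full default set of the cluster service), and `SeBackupPrivilege` is a hypothetical stand-in for the unnamed non-standard privilege. The function names are invented for this sketch.

```python
import re


def parse_privileges(qprivs_output):
    """Extract privilege names from text shaped like `sc qprivs` output."""
    return re.findall(r"\bSe[A-Za-z]+Privilege\b", qprivs_output)


def merged_privs_command(service, qprivs_output, extra):
    """Build an `sc privs` command line that keeps the current privileges
    and appends the ones in `extra` that are not already present."""
    privileges = parse_privileges(qprivs_output)
    for name in extra:
        if name not in privileges:
            privileges.append(name)
    # `sc privs` expects the full list, separated by forward slashes.
    return "sc privs {} {}".format(service, "/".join(privileges))


# Illustrative sample only; real output on a cluster node lists more entries.
SAMPLE = """[SC] QueryServiceConfig2 SUCCESS

SERVICE_NAME: clussvc
        PRIVILEGES       : SeLoadDriverPrivilege
                         : SeShutdownPrivilege
"""

if __name__ == "__main__":
    # SeBackupPrivilege is a placeholder for the privilege the dll needs.
    print(merged_privs_command("clussvc", SAMPLE, ["SeBackupPrivilege"]))
```

Because the command is built from whatever the service currently holds, running it a second time should not remove anything, which is the property the original add_privs.bat lacked.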
VirtualBox 4.3.12 works great on Windows 7 Enterprise with guests like Server 2008 (DC, iSCSI target, quorum share) and Server 2012 (cluster, iSCSI initiator, SQLExpress).