Question
Wednesday, June 14, 2017 10:47 AM
Hi everyone,
I am experiencing strange behavior while trying to drain one node of a 2-node S2D cluster on Server 2016:
In Failover Cluster Manager -> Nodes -> right-click the node -> Pause -> Drain Roles.
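The same drain can also be triggered from PowerShell; a minimal sketch, assuming node#1 is the node being paused:
# Sketch: pause the node and drain all roles off it; -Wait blocks until the drain completes
Suspend-ClusterNode -Name "node#1" -Drain -Wait
# After maintenance, resume the node and optionally fail roles back
Resume-ClusterNode -Name "node#1" -Failback Immediate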
The status changes to "Draining", but then it shows "Drain failed" with the information:
One or more roles were not moved from this node. Use the Roles tab to see these roles, and view their critical events to determine why they were not moved from this node.
And under Show Critical Events:
Node drain failed on Cluster node <NodeName>.
Reference the node's System and Application event logs and cluster logs to investigate the cause of the drain failure. When the problem is resolved, you can retry the drain operation.
But I could not find any errors in any of these event logs...
To check whether some cluster resources had not been moved, I executed Get-ClusterResource, but all cluster resources were on the other (running) node (node#2):
PS C:\Windows\system32> Get-ClusterResource | select Name, IsCoreResource, State, OwnerNode
Name                                   IsCoreResource State  OwnerNode
----                                   -------------- -----  ---------
Cluster IP Address                     False          Online node#2
Cluster Name                           True           Online node#2
Cluster Pool 1                         False          Online node#2
File Share Witness                     True           Online node#2
Health                                 False          Online node#2
Storage Qos Resource                   False          Online node#2
Virtual Machine <testVM>               False          Online node#2
Virtual Machine Cluster WMI            False          Online node#2
Virtual Machine Configuration <testVM> False          Online node#2
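The cluster groups and node states can be checked the same way; a quick sketch using cmdlets from the FailoverClusters module:
# Sketch: confirm which node owns each group and whether the paused node still reports a drain in progress
Get-ClusterGroup | Select-Object Name, OwnerNode, State
Get-ClusterNode | Select-Object Name, State, DrainStatus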
So I don't understand why the drain fails... any suggestions?
Thanks!
All replies (11)
Tuesday, June 27, 2017 5:33 PM ✅Answered
Hi,
thanks for the additional information... we continued to investigate the issue and ended up recreating the whole storage pool and vdisks.
I am convinced that some SMB settings were set up wrong and this caused the whole system to become unbalanced. I will write again after these steps - let's see.
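For the SMB side, this is a rough sketch of the cmdlets that could be used to review the configuration on each node; not a definitive checklist:
# Sketch: inspect the SMB server/client configuration and the multichannel connections between the nodes
Get-SmbServerConfiguration
Get-SmbClientConfiguration
Get-SmbMultichannelConnection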
Wednesday, June 14, 2017 11:32 AM
Hi,
where is this testVM located, and what happens if you live migrate this VM?
bye,
Marcel
https://www.windowspro.de/marcel-kueppers
I write here only in private interest
Disclaimer: This posting is provided AS IS with no warranties or guarantees, and confers no rights.
Wednesday, June 14, 2017 11:34 AM
Hi,
the testVM is owned by node#2 (I'm draining node#1). Live migration is working flawlessly...
Even when I drain node#1 while the VM is running on node#1, it is live migrated correctly; afterwards the same error is shown.
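For completeness, the live migration can also be triggered from PowerShell; a sketch, assuming the clustered role is named testVM:
# Sketch: live migrate the clustered VM role to the other node
Move-ClusterVirtualMachineRole -Name "testVM" -Node "node#2" -MigrationType Live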
Wednesday, June 14, 2017 11:45 AM
Is the file share witness accessible from both nodes?
Try to take over the resources manually and check step by step.
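For example, a rough sketch (group and role names assumed, they may differ in your cluster):
# Sketch: move the core cluster group and the VM role to node#1 one step at a time
Move-ClusterGroup -Name "Cluster Group" -Node "node#1"
Move-ClusterVirtualMachineRole -Name "testVM" -Node "node#1" -MigrationType Live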
https://www.windowspro.de/marcel-kueppers
I write here only in private interest
Disclaimer: This posting is provided AS IS with no warranties or guarantees, and confers no rights.
Thursday, June 15, 2017 5:05 PM | 1 vote
When I tested a two-node S2D cluster in my home lab I ended up with almost the same issue as this guy; except in my case, I managed to bring the pool back online.
My failure started during planned maintenance, when I tried to drain the node exactly as you did. I was using Azure Cloud Witness and it seemed that latency caused the issue. Where is your File Share Witness located?
Also, based on my experience, a two-node S2D cluster is a pretty weak setup. S2D does a great job starting from 4 nodes, from what I can tell. Consider using StarWind Free or HPE VSA Free. For example, StarWind was initially designed for use in 2-node deployments, while HPE VSA works great in "2+witness" or 3-node configurations.
Thursday, June 15, 2017 7:15 PM
The file share is accessible - pingable and reachable via SMB.
Is there any cluster log file where I could find more details about what could possibly be wrong?
Thursday, June 15, 2017 7:17 PM
The file share witness is a simple share on my 2008 R2 physical DC. Both cluster nodes and the DC are on the same switch. Ping < 1 ms.
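A couple of quick checks that can be run from each node; a sketch, where dc01 stands in for the DC hosting the share:
# Sketch: verify SMB reachability of the witness host and which witness the cluster is actually using
Test-NetConnection -ComputerName "dc01" -Port 445
Get-ClusterQuorum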
Monday, June 19, 2017 2:46 AM
Hi,
You could also follow the blog on troubleshooting from the logs in a Server 2016 cluster.
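Generating the text logs is usually done with Get-ClusterLog; a minimal sketch:
# Sketch: dump the cluster log from every node into C:\Temp using local time stamps
Get-ClusterLog -Destination C:\Temp -UseLocalTime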
Best Regards,
Mary
Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].
Monday, June 19, 2017 2:59 PM
Hi, thank you for that hint!
In the Get-ClusterLog log files there are several occurrences of these errors:
WARN [RHS] Error 50 from resource type control for restype Storage Replica.
ERR [API] ApipGetLocalCallerInfo: Error 3221356570 calling RpcBindingInqLocalClientPID.
And this error seems to be the reason why the drain roles command fails:
ERR [RCM] rcm::DrainMgr::SetStorageMaintenanceMode: [DrainMgr] Storage Maintenance Mode enable:true fail. Error 0x9 - One or more physical disks host data for virtual disks that have a lower fault domain awareness than the fault domain object specified.
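The awareness values the error refers to can be compared directly; a quick sketch using the Storage Spaces cmdlets:
# Sketch: compare the pool's default fault domain awareness with that of each virtual disk
Get-StoragePool -IsPrimordial $false | Select-Object FriendlyName, FaultDomainAwarenessDefault
Get-VirtualDisk | Select-Object FriendlyName, FaultDomainAwareness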
Any idea on that?
Tuesday, June 20, 2017 2:08 AM
Hi YankeeP,
For now, I couldn't find official documentation from Microsoft that describes this error.
For more professional support with log analysis, I suggest you contact CSS, as this is likely going to require deeper technical analysis beyond the scope of the forums. In addition, please check the article about configuring S2D with 2 nodes.
https://www.tech-coffee.net/fault-domain-awareness-with-storage-spaces-direct/
Please Note: Since the web site is not hosted by Microsoft, the link may change without notice. Microsoft does not guarantee the accuracy of this information.
Best Regards,
Mary
Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].
Wednesday, June 28, 2017 1:37 AM
Hi YankeeP,
Glad that it worked out, and thanks for sharing your workaround.
Best Regards,
Mary
Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].