Failover Cluster Service takes extremely long to start after power outage

Question

Monday, March 21, 2016 2:20 PM

Dear all

These past weeks we have been experiencing intermittent power outages. After our UPS batteries drain, our Hyper-V cluster shuts down. However, when power is restored, the cluster takes a very long time to start up. During that time I cannot ping it or browse the folders containing our VM VHD files, although I can ping our blade server nodes. After about an hour and a half the cluster service comes back on and I am then able to manage it via Failover Cluster Manager. All our VMs then start up successfully.

Let me explain in brief our current setup. We have an IBM BladeCenter with 3 blade servers and cluster storage. One of our VMs is running as a Primary Domain Controller (PDC). However for redundancy purposes I have an additional domain controller running on a physical server and this is also hosting DNS. 

What I cannot understand is why the cluster service takes so very long to initialise. When I do a netdom query fsmo I notice that all roles including that of PDC are residing on the VM Primary Domain Controller. Should I transfer all these roles to our second domain controller so as to resolve this issue? 

Your comments and help would be much appreciated.

Thanks

Pierre

All replies (10)

Monday, March 21, 2016 8:33 PM ✅Answered

It looks like there was a problem with your DNS, possibly as a result of the system 'crash'.  That would explain why browse and ping (by name) would not work.  It would also explain why connecting to the cluster console would not work because it is trying to get information from DNS, too.

Personal preference coming up:  I know that Microsoft supports domain controllers as highly available virtual machines.  However, I never configure mine that way.  Or if I do, I do not configure it on a cluster that is dependent upon it.  I just like to keep things clean.  I may configure a DC as a VM on a cluster node, but I do not configure it as an HA VM.  I configure it on local storage.  That way the VM start is not dependent upon the cluster start, and the cluster does not have to have the VM running for it to start.  Yes, your cluster should have tried communicating with the second DC, but I have seen some things happen when cached data is attempted first.  It could be the cluster nodes were trying to contact the DC on the cluster while it was still in the process of recovering its DNS information.  So the cluster saw the DC was alive, but it was not responding because it wasn't done bringing up one of its critical components.  I'm just guessing on this, but it's the sort of thing I try to avoid by keeping my DCs off the cluster.
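
To make the circular dependency concrete, here is a minimal Python sketch. The dependency edges are an assumed, simplified reading of the guess above (cluster needs DNS, DNS is answered by the DC VM, the HA VM needs cluster storage, cluster storage needs the cluster service); the names are hypothetical labels, not real service names, and nothing here is measured from the actual environment.

```python
from typing import Dict, List, Optional

# Assumed, simplified startup dependencies: "key needs every item in its list first".
# These edges only illustrate the guess in this reply; they are not measured behaviour.
DEPENDS_ON: Dict[str, List[str]] = {
    "cluster_service": ["dns"],
    "dns": ["domain_controller_vm"],              # DNS answered by the DC running as a VM
    "domain_controller_vm": ["cluster_storage"],  # an HA VM lives on cluster storage
    "cluster_storage": ["cluster_service"],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    visiting, done = set(), set()
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        if node in done:
            return None
        if node in visiting:
            # We came back to a node on the current path: that is a cycle.
            return path[path.index(node):] + [node]
        visiting.add(node)
        path.append(node)
        for dep in graph.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(node)
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    print("HA DC VM:    ", find_cycle(DEPENDS_ON))
    # Same model, but with the DC VM stored on local (non-shared) storage.
    non_ha = dict(DEPENDS_ON, domain_controller_vm=["local_storage"])
    print("non-HA DC VM:", find_cycle(non_ha))
```

In this toy model the cycle disappears once the DC VM sits on local storage, which is the point of keeping it non-HA. A real environment with a second, physical DC has more paths than this, so treat it as an illustration only.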

. : | : . : | : . tim


Tuesday, March 22, 2016 2:36 AM ✅Answered

To eliminate the variable of name resolution from a remote client, you can log on to the local console, open Failover Cluster Manager, and put in a period "." for the name of the cluster. It will then make a local connection to that node's cluster.  That removes any remote connections and any inability to manage due to name resolution.

Then you will be able to truly assess the state of the cluster.
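
If you want to confirm from the management machine that name resolution is the failing piece, a small Python sketch like the one below can help. It only asks the operating system's resolver whether a name resolves; the cluster name in the usage line is a hypothetical placeholder, and a successful lookup says nothing about whether the cluster service itself is up.

```python
import socket
from typing import List


def resolve(name: str) -> List[str]:
    """Return the sorted unique addresses a name resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def diagnose(name: str) -> str:
    """Summarise whether name resolution looks like the blocker for this name."""
    addresses = resolve(name)
    if not addresses:
        return f"{name}: does not resolve; suspect DNS rather than the cluster itself"
    return f"{name}: resolves to {', '.join(addresses)}; name resolution is not the blocker"


if __name__ == "__main__":
    # "hvcluster.example.local" is a made-up name; substitute your own cluster name.
    print(diagnose("hvcluster.example.local"))
```

If the name does not resolve but the node IP addresses answer a ping, that points at DNS rather than at the cluster service.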

Thanks!
Elden


Monday, March 21, 2016 3:47 PM

Are you seeing anything in your event logs?

"After our UPS batteries drain our Hyper-V Cluster shuts down."

Are you letting the system 'crash' or are you managing a controlled shutdown by draining the nodes and cleanly shutting down the hosts before the batteries drain?

. : | : . : | : . tim


Monday, March 21, 2016 3:52 PM

Hi Tim

Thanks for replying. We noted that they weren't being shut down properly due to the UPS agents not working well. I have now rectified these issues so the nodes should now shut down properly in the event the UPS batteries drain.

In the event log of our second domain controller I was seeing Event ID 4013. 

The DNS server is waiting for Active Directory Domain Services (AD DS) to signal that the initial synchronization of the directory has been completed. The DNS server service cannot start until the initial synchronization is complete because critical DNS data might not yet be replicated onto this domain controller. If events in the AD DS event log indicate that there is a problem with DNS name resolution, consider adding the IP address of another DNS server for this domain to the DNS server list in the Internet Protocol properties of this computer. This event will be logged every two minutes until AD DS has signaled that the initial synchronization has successfully completed.

TIA

Pierre


Monday, March 21, 2016 9:04 PM

The original PDC used to reside on an old external server and when that died I brought in a second domain controller which had already been configured. However we had to set the VM DC as the PDC. Then should I transfer the role of PDC from this VM DC to the second domain controller running on a physical server? Would that speed things up?

Further we are running Hyper-V R2 and CSV2.0. Is the latest CSV still dependent on a DC in order to start up?

P

 


Monday, March 21, 2016 10:59 PM

There is no such thing as a Primary Domain Controller in Active Directory.  That was something in NT 4.  There is a PDC emulator FSMO role, and some people call the system that holds that role the PDC, but AD is a multi-master configuration where all domain controllers are equal.  That's just for clarification of terminology.  I doubt the terminology has anything to do with the issue you are seeing.
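
For reference, `netdom query fsmo` lists five roles, of which "PDC" is just one. The Python sketch below groups the roles by the server that holds them, which makes it easy to see when every role sits on a single DC. The sample text follows the shape of that command's output as I understand it, and the host name is a hypothetical placeholder. This only reports where the roles are; it does not tell you whether moving them would change startup time.

```python
from collections import defaultdict
from typing import Dict, List

# Sample text in the shape printed by `netdom query fsmo`.
# The host name is a hypothetical placeholder, not taken from this thread.
SAMPLE = """\
Schema master               dc-vm.example.local
Domain naming master        dc-vm.example.local
PDC                         dc-vm.example.local
RID pool manager            dc-vm.example.local
Infrastructure master       dc-vm.example.local
The command completed successfully.
"""

ROLES = (
    "Schema master",
    "Domain naming master",
    "PDC",
    "RID pool manager",
    "Infrastructure master",
)


def roles_by_holder(output: str) -> Dict[str, List[str]]:
    """Map each role holder to the list of FSMO roles it owns."""
    holders: Dict[str, List[str]] = defaultdict(list)
    for line in output.splitlines():
        for role in ROLES:
            if line.startswith(role):
                # Everything after the role label is the holder's name.
                holders[line[len(role):].strip()].append(role)
                break
    return dict(holders)


if __name__ == "__main__":
    for holder, roles in roles_by_holder(SAMPLE).items():
        print(f"{holder}: {', '.join(roles)}")
```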

I have no idea if transferring the FSMO role to another system would 'speed things up'.  Are you continuing to see the issue? Or, since you have fixed the issue with properly shutting down the nodes, has the issue disappeared?  I am just guessing that the 'crash' caused by not cleanly shutting down the servers may have been the root cause of the issue.  It also might be something that happened just once and is not going to happen again.  I can't tell from what you have reported.

As a suggestion, I would also reconfigure the AD VM as a non-HA VM.  I used to have some AD VMs as HA VMs back when running 2008 R2 and had some issues.  Since then I have always made my AD VMs non-HA and have never had an issue.

You say you are running Hyper-V R2, which could be Hyper-V 2008 R2 or Hyper-V 2012 R2.  CSV2.0?  Not sure what you are talking about there.  Do you mean you are using CSV on Windows Server 2012 or Windows Server 2012 R2?

. : | : . : | : . tim


Tuesday, March 22, 2016 7:45 AM

Hello Tim

Thanks for clarifying about the PDC. I was still thinking about good old NT4 :) We are running Hyper-V 2008 R2 and we are using a cluster shared volume to host all our VMs and our Hyper-V core boot partitions. Hyper-V 2008 R2 has as its preferred DNS the additional physical domain controller and the VM DC as the alternate DNS. When Hyper-V boots up it takes very long to apply computer and user settings. I was told that this long delay is probably due to the DNS. Eventually they do boot up but cannot connect to the cluster storage because as reported earlier it wouldn't have fully started up. After about 15 mins then the cluster storage successfully starts up and all VMs are started.

Is there a particular tech doc you could point me to regarding how to reconfigure an AD VM as a non-HA VM?

Thanks

 


Tuesday, March 22, 2016 7:58 AM

Thanks for your reply. How can I log on to the local console and open Failover Cluster Manager? I normally access Failover Cluster Manager via our dedicated management server.

Thanks


Tuesday, March 22, 2016 12:20 PM

"we are using a cluster shared volume to host all our VMs and our Hyper-V core boot partitions"

Still not sure you are understanding a cluster.  Boot volumes cannot be on the cluster shared volume.  Boot volumes are not shared among nodes of the cluster.

How to make a VM non-HA?  Using Hyper-V console on a cluster node, not the cluster console, create a VM that is stored on non-shared storage, such as the boot volume.

"How can I log on to the local console and open Failover Cluster Manager? "

RDP to the IP of one of the nodes of the cluster to log in locally.

I'm still not clear from your explanation whether the slowness in booting is something that happens all the time or something that happened once.  Is this still happening every time you take a node down?

. : | : . : | : . tim


Monday, April 4, 2016 12:48 PM

Hi Pmc1964,

I assume the replies were helpful, so I have marked them as answers.

If you have any further questions, you are welcome to post in the forum.

Best Regards,

Leo

Please remember to mark the replies as answers if they help and unmark them if they provide no help. If you have feedback for TechNet Support, contact [email protected].