Hyper-V VMMS service stuck

Question

Thursday, November 22, 2018 1:25 PM

Hi,

This is the second time we've encountered this issue this year; it also happened a few months ago. Suddenly, on a host (it can be any host in the cluster), the VMMS service stops responding. You cannot do anything with VMs anymore. The ones that are running keep running, but as soon as you try to reboot a VM (a proper reboot from inside the VM), it gets stuck in the Stopping state. Live migrations to other hosts also fail. Last time we forced a reboot of the host, but unfortunately that gave us quite a few headaches with corrupt VHDX files.

After googling a lot, I thought I'd give the TechNet forum a shot.

Any advice? I have an MS case open, but as usual it takes forever to get any kind of help.

thanks in advance!

regards,

Jeroen

All replies (14)

Thursday, November 22, 2018 2:14 PM

Stop and start the VMMS service. This can be done from the Hyper-V GUI, Task Manager, or PowerShell.
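For example, from an elevated PowerShell prompt (a minimal sketch; vmms is the short service name of the Hyper-V Virtual Machine Management service):

    # restart the Hyper-V Virtual Machine Management service
    Restart-Service -Name vmms

    # or in two steps, checking the state afterwards
    Stop-Service -Name vmms
    Start-Service -Name vmms
    Get-Service -Name vmms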

tim


Friday, November 23, 2018 8:29 AM

Hi,

Thanks for posting in our forum!

Instead of restarting the host server, you can try to terminate the process using one of the process tools from Sysinternals (for example, Process Explorer or PsKill).
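A minimal sketch of that approach with PsKill from the Sysinternals PsTools (this assumes PsKill has been downloaded and is on the PATH; force-killing Hyper-V processes can leave VMs in an inconsistent state, so treat it as a last resort):

    # find the process ID hosting the vmms service, then force-kill the process tree
    $vmmsPid = (Get-CimInstance Win32_Service -Filter "Name='vmms'").ProcessId
    pskill -t $vmmsPid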

You can refer to the following link:

https://social.technet.microsoft.com/Forums/en-US/06f06fe7-3df9-4eb6-ad1c-f0ad231dd4a3/host-not-responding-hyperv-virtual-machine-management-service-in-stopping-state?forum=virtualmachingmgrhyperv

This documentation is about Sysinternals, just for your reference:

/en-us/sysinternals/

In addition, you can check the status of other hardware in the host system, including the motherboard, storage hardware, and on-board network cards, and see whether any drivers need to be updated.
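As a starting point, a quick sketch of listing the NIC driver versions and VMQ state with PowerShell (only a suggestion, since Hyper-V networking problems are often tied to NIC drivers):

    # driver version and date for each network adapter
    Get-NetAdapter | Select-Object Name, InterfaceDescription, DriverVersionString, DriverDate, DriverProvider

    # which adapters have VMQ enabled
    Get-NetAdapterVmq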

http://techgenix.com/hyper-v-troubleshooting/

Please Note: Since the web site is not hosted by Microsoft, the link may change without notice. Microsoft does not guarantee the accuracy of this information.

I hope this information helps; if you have any questions, please feel free to let me know.

Thanks for your time and have a nice day!

Best Regards,

Daniel

Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].


Friday, November 23, 2018 9:32 AM

@Tim: that would indeed be the easy way, but when we do that, the service remains stuck in the Stopping state. Even if we try to reboot the host, it hangs at shutting down and cannot terminate the VMMS service.


Friday, November 23, 2018 9:38 AM

@Daniel: thank you for the reply, but we also tried killing the process that way; it doesn't help.

I opened a case with Nutanix, which is the underlying layer, and this is the outcome (I will send it to MS but don't expect much feedback; by the way, in the meantime we were forced to hard-reset two nodes, both with the same issue).

Feedback from the Nutanix engineer (the guy knows what he's talking about):

I have had a look at the dump. It looks to be a different issue; however, it is likely a Microsoft issue and their engagement is required to analyze the dump.

I can see that these two threads were likely stuck behind this thread:

0: kd> !mex.t ffff8f02ffc59800

Process                   Thread           CID       UserTime KernelTime ContextSwitches Wait Reason Time State

System (ffff8708ea8ab700) ffff8f02ffc59800 4.5d00          0s      469ms         1333708 Executive   15ms Waiting

WaitBlockList:

    Object           Type                 Other Waiters

    ffffdd81dc42c8e0 SynchronizationTimer             0

# Child-SP         Return           Call Site

0 ffffdd81dc42c570 fffff803a10d0d6d nt!KiSwapContext+0x76

1 ffffdd81dc42c6b0 fffff803a10d080f nt!KiSwapThread+0x17d

2 ffffdd81dc42c760 fffff803a10d25e7 nt!KiCommitThreadWait+0x14f

3 ffffdd81dc42c800 fffff804f4e02aa5 nt!KeWaitForSingleObject+0x377

4 ffffdd81dc42c8b0 fffff804f67fbd7e NDIS!NdisMSleep+0x55

5 ffffdd81dc42c930 fffff804f67b1552 vmswitch!VmsVmqPvtDeleteVmq+0x570be

6 ffffdd81dc42ca00 fffff804f67a02fd vmswitch!VmsVmqPvtDeleteVmqsOnPtNic+0x6a

7 ffffdd81dc42ca30 fffff804f6847f61 vmswitch!VmsVmqDoVmqOperation+0xd5

8 ffffdd81dc42ca90 fffff803a111e561 vmswitch!VmsPtNicPvtHandleVmqCapsChangeWorkItem+0xe1

9 ffffdd81dc42cb10 fffff803a10bf549 nt!IopProcessWorkItem+0x81

a ffffdd81dc42cb80 fffff803a10f6101 nt!ExpWorkerThread+0xe9

b ffffdd81dc42cc10 fffff803a11e3ca6 nt!PspSystemThreadStartup+0x41

c ffffdd81dc42cc60 0000000000000000 nt!KiStartSystemThread+0x16

There are two vmwp.exe processes, each with only one thread, waiting on the above thread:

0: kd> dt nt!_KMUTANT fffff804f68b2e40

==================================================================================

   +0x000 Header               : _DISPATCHER_HEADER

   +0x018 MutantListEntry      : _LIST_ENTRY [ 0xffff8f02`ffc59b08 - 0xffff8f02`ffc59b08 ] [EMPTY OR 1 ELEMENT]

   +0x028 OwnerThread          : 0xffff8f02`ffc59800 _KTHREAD (System)

   +0x030 Abandoned            : 0 ''

   +0x031 ApcDisable           : 0x1 ''

Number of waiters: 104

Owning Thread: ffff8f02ffc59800

Two vmwp.exe processes with only one thread each. They are good candidates, as we know the VM got stuck in the Stopping state, which means the process was supposed to be terminated.

0: kd> !us -p ffff8f0303c1a080

1 thread [stats]: ffff87090da59080

    fffff803a11e35b6 nt!KiSwapContext+0x76

    fffff803a10d0d6d nt!KiSwapThread+0x17d

    fffff803a10d080f nt!KiCommitThreadWait+0x14f

    fffff803a10d25e7 nt!KeWaitForSingleObject+0x377

    fffff804f67c5b63 vmswitch!VmsVmNicPvtKmclChannelOpened+0x73

    fffff804f673f2e0 vmbkmclr!KmclpServerOpenChannel+0x148

    fffff804f673e946 vmbkmclr!KmclpServerOfferChannel+0x262

    fffff804f673e664 vmbkmclr!VmbChannelEnable+0x104

    fffff804f6772f17 vmswitch!VmsVmNicMorph+0x54f

    fffff804f67725e9 vmswitch!VmsCdpNicMorphToVmNic+0x145

    fffff804f67909bd vmswitch!VmsCdpDeviceControl+0x65d

    fffff803a1560150 nt!IopSynchronousServiceTail+0x1a0

    fffff803a155f4ec nt!IopXxxControlFile+0xd9c

    fffff803a155e746 nt!NtDeviceIoControlFile+0x56

    fffff803a11ec503 nt!KiSystemServiceCopyEnd+0x13

    00007ffacc1c5be4 0x7ffacc1c5be4

1 stack(s) with 1 threads displayed (1 Total threads)

0: kd> !us -p ffff8f02e5fa2840

1 thread [stats]: ffff8708f3ff3080

    fffff803a11e35b6 nt!KiSwapContext+0x76

    fffff803a10d0d6d nt!KiSwapThread+0x17d

    fffff803a10d080f nt!KiCommitThreadWait+0x14f

    fffff803a10d25e7 nt!KeWaitForSingleObject+0x377

    fffff804f679cf70 vmswitch!VmsVmNicPvtDisableOptimizations+0x50

    fffff804f679c809 vmswitch!VmsCdpNicDisableOptimizations+0x11d

    fffff804f6790903 vmswitch!VmsCdpDeviceControl+0x5a3

    fffff803a1560150 nt!IopSynchronousServiceTail+0x1a0

    fffff803a155f4ec nt!IopXxxControlFile+0xd9c

    fffff803a155e746 nt!NtDeviceIoControlFile+0x56

    fffff803a11ec503 nt!KiSystemServiceCopyEnd+0x13

    00007ffacc1c5be4 0x7ffacc1c5be4

1 stack(s) with 1 threads displayed (1 Total threads)

The thread seems to be stuck in an infinite loop, calling NdisMSleep:

0: kd> .frame /r 5

05 ffffdd81`dc42c930 fffff804`f67b1552 vmswitch!VmsVmqPvtDeleteVmq+0x570be

rax=0000000000000000 rbx=ffff8f02f243f010 rcx=0000000000000000

rdx=0000000000000000 rsi=00000000000a2c9c rdi=ffff8f02f9cf9920

rip=fffff804f67fbd7e rsp=ffffdd81dc42c930 rbp=ffffdd81dc42c999

r8=0000000000000000  r9=0000000000000000 r10=0000000000000000

r11=0000000000000000 r12=0000000000000000 r13=ffff8f02e59b1000

r14=000000000000000c r15=fffff804f6895638

iopl=0         nv up di pl nz na pe nc

cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000

vmswitch!VmsVmqPvtDeleteVmq+0x570be:

fffff804`f67fbd7e ffc6            inc     esi

0: kd> ub fffff804f67fbd7e L20

vmswitch!VmsVmqPvtDeleteVmq+0x5702f:

fffff804`f67fbcef 41b95b000000    mov     r9d,5Bh

fffff804`f67fbcf5 4889742428      mov     qword ptr [rsp+28h],rsi

fffff804`f67fbcfa 4889442420      mov     qword ptr [rsp+20h],rax

fffff804`f67fbcff 458d41b8        lea     r8d,[r9-48h]

fffff804`f67fbd03 e8380bfdff      call    vmswitch!WPP_RECORDER_SF_s (fffff804`f67cc840)

fffff804`f67fbd08 440fb765fb      movzx   r12d,word ptr [rbp-5]

fffff804`f67fbd0d 33c0            xor     eax,eax

fffff804`f67fbd0f 4c8d3d22990900  lea     r15,[vmswitch!WPP_08264671732f3223572c011804749aa6_Traceguids (fffff804`f6895638)]

fffff804`f67fbd16 4889451f        mov     qword ptr [rbp+1Fh],rax

fffff804`f67fbd1a 214523          and     dword ptr [rbp+23h],eax

fffff804`f67fbd1d c7451f80010c00  mov     dword ptr [rbp+1Fh],0C0180h

fffff804`f67fbd24 33f6            xor     esi,esi

fffff804`f67fbd26 448d700c        lea     r14d,[rax+0Ch]

fffff804`f67fbd2a 8b4338          mov     eax,dword ptr [rbx+38h]

fffff804`f67fbd2d 894527          mov     dword ptr [rbp+27h],eax

fffff804`f67fbd30 8364245000      and     dword ptr [rsp+50h],0

fffff804`f67fbd35 488d451f        lea     rax,[rbp+1Fh]

fffff804`f67fbd39 488b4b28        mov     rcx,qword ptr [rbx+28h]

fffff804`f67fbd3d 41b924020100    mov     r9d,10224h                                                                            <<<<< OID_RECEIVE_FILTER_FREE_QUEUE

fffff804`f67fbd43 4489742448      mov     dword ptr [rsp+48h],r14d

fffff804`f67fbd48 41b801000000    mov     r8d,1

fffff804`f67fbd4e 4889442440      mov     qword ptr [rsp+40h],rax

fffff804`f67fbd53 498bd5          mov     rdx,r13

fffff804`f67fbd56 8364243800      and     dword ptr [rsp+38h],0

fffff804`f67fbd5b 8364243000      and     dword ptr [rsp+30h],0

fffff804`f67fbd60 8364242800      and     dword ptr [rsp+28h],0

fffff804`f67fbd65 8364242000      and     dword ptr [rsp+20h],0

fffff804`f67fbd6a e8a5bff9ff      call    vmswitch!VmsPtNicProcessPublicOid (fffff804`f6797d14)

fffff804`f67fbd6f 85c0            test    eax,eax

fffff804`f67fbd71 7441            je      vmswitch!VmsVmqPvtDeleteVmq+0x570f4 (fffff804`f67fbdb4)

fffff804`f67fbd73 b9a8610000      mov     ecx,61A8h <<<< 25000 in dec

fffff804`f67fbd78 ff1562230c00    call    qword ptr [vmswitch!_imp_NdisMSleep (fffff804`f68be0e0)]

Looking at the logical block of the function:

vmswitch!VmsVmqPvtDeleteVmq+0x57070:

fffff804`f67fbd30 8364245000      and     dword ptr [rsp+50h],0

fffff804`f67fbd35 488d451f        lea     rax,[rbp+1Fh]

fffff804`f67fbd39 488b4b28        mov     rcx,qword ptr [rbx+28h]

fffff804`f67fbd3d 41b924020100    mov     r9d,10224h

fffff804`f67fbd43 4489742448      mov     dword ptr [rsp+48h],r14d

fffff804`f67fbd48 41b801000000    mov     r8d,1

fffff804`f67fbd4e 4889442440      mov     qword ptr [rsp+40h],rax

fffff804`f67fbd53 498bd5          mov     rdx,r13

fffff804`f67fbd56 8364243800      and     dword ptr [rsp+38h],0

fffff804`f67fbd5b 8364243000      and     dword ptr [rsp+30h],0

fffff804`f67fbd60 8364242800      and     dword ptr [rsp+28h],0

fffff804`f67fbd65 8364242000      and     dword ptr [rsp+20h],0

fffff804`f67fbd6a e8a5bff9ff      call    vmswitch!VmsPtNicProcessPublicOid (fffff804`f6797d14)

fffff804`f67fbd6f 85c0            test    eax,eax

fffff804`f67fbd71 7441            je      vmswitch!VmsVmqPvtDeleteVmq+0x570f4 (fffff804`f67fbdb4)  Branch

vmswitch!VmsVmqPvtDeleteVmq+0x570b3:

fffff804`f67fbd73 b9a8610000      mov     ecx,61A8h

fffff804`f67fbd78 ff1562230c00    call    qword ptr [vmswitch!_imp_NdisMSleep (fffff804`f68be0e0)]

fffff804`f67fbd7e ffc6            inc     esi

fffff804`f67fbd80 81fe90010000    cmp     esi,190h                                                                              <<<<< seems to be retry count

fffff804`f67fbd86 72a8            jb      vmswitch!VmsVmqPvtDeleteVmq+0x57070 (fffff804`f67fbd30)  Branch                    <<<<< jump if below 400 (retry count)

vmswitch!VmsVmqPvtDeleteVmq+0x570c8:

fffff804`f67fbd88 488b0d29740b00  mov     rcx,qword ptr [vmswitch!VmsIfrLog (fffff804`f68b31b8)]

fffff804`f67fbd8f 488d051aadfdff  lea     rax,[vmswitch! ?? ::FNODOBFM::`string' (fffff804`f67d6ab0)]

fffff804`f67fbd96 41b95c000000    mov     r9d,5Ch

fffff804`f67fbd9c 4889442428      mov     qword ptr [rsp+28h],rax

fffff804`f67fbda1 4c897c2420      mov     qword ptr [rsp+20h],r15

fffff804`f67fbda6 458d41b7        lea     r8d,[r9-49h]

fffff804`f67fbdaa e8910afdff      call    vmswitch!WPP_RECORDER_SF_s (fffff804`f67cc840)

fffff804`f67fbdaf e97cffffff      jmp     vmswitch!VmsVmqPvtDeleteVmq+0x57070 (fffff804`f67fbd30)  Branch

0: kd> da fffff804`f67d6ab0

fffff804`f67d6ab0  "retryCount < 400"

However, we can see that the retry count is definitely above 400:

0: kd> r esi

Last set context:

esi=a2c9c

0: kd> ? a2c9c

Evaluate expression: 666780 = 00000000`000a2c9c

So it looks like it is trying to set an OID on the vmswitch interface, and for some reason it does not complete successfully.
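Based on this analysis, the loop is waiting for the physical NIC driver to complete OID_RECEIVE_FILTER_FREE_QUEUE while tearing down a VMQ. One angle worth raising with the NIC and Nutanix vendors (a sketch only, not a confirmed fix) is to review the VMQ configuration of the physical adapters behind the vSwitch:

    # review and, only if agreed with the vendor, disable VMQ on the affected
    # physical adapter (the adapter name 'Ethernet 1' is a placeholder)
    Get-NetAdapterVmq -Name 'Ethernet 1'
    Disable-NetAdapterVmq -Name 'Ethernet 1'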


Monday, November 26, 2018 12:35 PM

Hi,

Thanks for your reply!

For the VMMS service stuck issue, in general we need to collect a dump file of the VMMS service while it is stuck in this state.
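A minimal sketch of capturing such a dump with Sysinternals ProcDump (assuming ProcDump is available on the host and the target folder exists; the paths are placeholders):

    # write a full user-mode memory dump of the process hosting the vmms service
    $vmmsPid = (Get-CimInstance Win32_Service -Filter "Name='vmms'").ProcessId
    procdump -accepteula -ma $vmmsPid C:\dumps\vmms.dmp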

Since this issue requires dump analysis, I would suggest opening a case with Microsoft.

https://support.microsoft.com/en-us/gp/contactus81?Audience=Commercial&wa=wsignin1.0

Thanks for your understanding; if you have any questions, please feel free to let me know.

Best Regards,

Daniel

Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].


Tuesday, November 27, 2018 9:49 AM

Thanks for the reply. A Nutanix engineer also created this dump file and I have sent it to Microsoft. They will not help us because the cluster validation report showed an error. It's just unbelievable and not acceptable... They even closed the case already. In the past few years I have opened several cases with MS and never, but really NEVER, had a good experience! I was an MS fanboy back in the day, but after the last few years I've really changed my mind... I'm not going to go into too much detail here, but it's really disappointing.


Tuesday, November 27, 2018 9:51 AM

Oh, and by the way, we hard-reset the nodes and the VMs came back. One VM had VHDX corruption, so we did a snapshot restore. And now MS REALLY doesn't want to help anymore because they don't do RCA (root cause analysis) :-(


Tuesday, November 27, 2018 12:51 PM

Hi Jeroen,

Thanks for your reply!

First of all, I am very happy to hear that your problem has progressed. In addition, I fully understand your concern; please also understand that our team will do our best to solve each customer's problem. Sometimes a problem exceeds the scope of our support and we need to seek further help from the senior team.

Finally, please rest assured that we will do our best to resolve each customer's posts; if you have any other questions in the future, please feel free to post in our forum.

Please continue to believe in Microsoft, and we will continue to improve our services.

Thanks again for your reply and understanding; I wish you all the best!

Best Regards,

Daniel

Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].


Tuesday, November 27, 2018 8:49 PM

"They will not help us because the cluster validation report showed an error. It's just unbelievable and not acceptable... They even closed the case already. In the past few years, I opened several cases @ MS, and never, but really NEVER, had a good experience! I was a MS fanboy back in the days but after the last few years, I've really changed my mind... Not going to go in too much detail here, but it's really dissapointing."

I can understand being disappointed, but I can also understand Microsoft's position. If the cluster validation report is showing an error, there is something wrong with the cluster. It is that simple. Errors must be fixed. Warnings are a little different: the cluster validation report will report warnings for things that could cause potential problems, but there is also the possibility that the cluster can run most things without issue. However, each warning needs to be investigated for your particular situation. An error, on the other hand, is cut-and-dried. You must fix the error, or there is no guarantee the cluster will work properly.

The solution to Microsoft not providing support is for you to fix the error.

tim


Wednesday, November 28, 2018 12:33 PM

more info about the cluster validation error:

In Failover Cluster Manager there are three cluster networks. One of them is used for storage communication to the underlying Nutanix layer. This cluster network is bound to a vSwitch on each host, pointing to the same IP address on each host, which is normal and by design.

This is what's giving these errors in the report (duplicate IPs).

We could make the errors go away by doing the following: in the properties of the cluster network used for storage, disable cluster communication. This will exclude that cluster network from the validation test. BUT: we are unsure whether it would cause interruptions in our production environment, so we decided not to change it...
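For reference, a sketch of the change being described, using the FailoverClusters PowerShell module (the network name 'Storage' is a placeholder; a Role of 0 means the network carries no cluster communication):

    # list the cluster networks with their current roles
    Get-ClusterNetwork | Format-Table Name, Role, Address

    # exclude the storage network from cluster communication
    (Get-ClusterNetwork -Name 'Storage').Role = 0   # 0 = None, 1 = Cluster only, 3 = Cluster and Client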


Wednesday, November 28, 2018 2:24 PM | 1 vote

Duplicate IPs are never a good thing, whether in a cluster or not. That is sure to end in unwanted results. It seems a very strange thing to leave in.

In general, you should not have cluster communication on a storage network.  You should have a completely separate network for cluster communication.  You can allow cluster communication on multiple networks, but my recommendation would be to ensure cluster communication is selected only for management networks, as it is a management protocol.  I never would configure cluster communication on a storage network, even as a backup for another network that had cluster communication defined on it.  Will it work?  Sure.  Can it cause issues? Potentially.  Is it a recommended practice? Nope.

tim


Thursday, November 29, 2018 6:52 AM

Hi,

This is Daniel; I wish you all the best!

Just to confirm the current status of the issue: if Tim's reply was useful to you, please mark it as an answer!

Thanks for your cooperation.

Best Regards,

Daniel

Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].


Thursday, November 29, 2018 11:39 AM

Duplicate IPs are never a good thing, whether in a cluster or not. That is sure to end in unwanted results. It seems a very strange thing to leave in.

In general, you should not have cluster communication on a storage network.  You should have a completely separate network for cluster communication.  You can allow cluster communication on multiple networks, but my recommendation would be to ensure cluster communication is selected only for management networks, as it is a management protocol.  I never would configure cluster communication on a storage network, even as a backup for another network that had cluster communication defined on it.  Will it work?  Sure.  Can it cause issues? Potentially.  Is it a recommended practice? Nope.

tim

You're absolutely right (we double-checked with Nutanix that it's OK to disable it), and that's why we are going to change the setting (it needs to be planned, outside critical hours, just in case). Afterwards we'll re-run the cluster validation report.
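If it helps, a minimal sketch of re-running just the network portion of cluster validation afterwards (running the full validation gives the complete picture):

    # re-run only the network tests of cluster validation
    Test-Cluster -Include 'Network'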


Friday, November 30, 2018 2:46 AM

Hi,

Thanks for your reply!

I will closely monitor this case, and if you have any questions, please feel free to post in our forum.

Thanks again for your time and have a nice day!

Best Regards,

Daniel

Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].