Question
Thursday, November 22, 2018 1:25 PM
Hi,
this is the second time we have encountered this issue this year; it also happened some months ago. Suddenly, on a host (it can be any host in the cluster), the VMMS service stops responding. You cannot do anything with the VMs anymore. The ones that are running keep running, but as soon as you try to reboot a VM (a proper reboot from inside the VM), it gets stuck in the Stopping state. Also, live migrations to other hosts fail. Last time we forced a reboot of the host, but unfortunately that gave us quite some headaches with corrupt VHDX files.
After a lot of googling, I thought I'd give the TechNet forum a shot.
Any advice? I have an MS case open, but as usual it takes forever to get some kind of help.
thanks in advance!
regards,
Jeroen
All replies (14)
Thursday, November 22, 2018 2:14 PM
Stop and start the VMMS service. This can be done from the Hyper-V GUI, Task Manager, or PowerShell.
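For example, a minimal PowerShell sketch (run elevated on the affected host):
# Restart the Hyper-V Virtual Machine Management service
Restart-Service -Name vmms -Force
# or, equivalently, stop and start it explicitly
Stop-Service -Name vmms -Force
Start-Service -Name vmms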
tim
Friday, November 23, 2018 8:29 AM
Hi,
Thanks for posting in our forum!
Instead of restarting the host server, you can try to terminate the process using Process Explorer from Sysinternals.
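If the GUI tools cannot kill it, a quick sketch of forcing the process down from an elevated PowerShell prompt (note: if a kernel thread is stuck, even a forced kill may hang):
# Force-terminate the VMMS process
Stop-Process -Name vmms -Force
# or with the classic command-line tool
taskkill /F /IM vmms.exe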
You can refer to the Sysinternals documentation, just for your reference.
In addition, you can also check the status of other hardware on the host system, including the system motherboard, storage hardware and on-board network cards. See if there is a driver that needs to be updated.
http://techgenix.com/hyper-v-troubleshooting/
Please Note: Since the web site is not hosted by Microsoft, the link may change without notice. Microsoft does not guarantee the accuracy of this information.
Hope this information helps; if you have any questions, please feel free to let me know.
Thanks for your time and have a nice day!
Best Regards,
Daniel
Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].
Friday, November 23, 2018 9:32 AM
@Tim: that would indeed be the easy way, but if we do that, the service remains in the stopping state. Even if we try to reboot the host, it hangs at shutting down and cannot terminate the VMMS service.
Friday, November 23, 2018 9:38 AM
@Daniel: thank you for the reply, but we also tried killing the process that way; it doesn't help.
I opened a case with Nutanix, which is the underlying layer, and this is the outcome (I will send it to MS but don't expect much feedback). By the way, in the meantime we were forced to hard reset two nodes (both had the same issue).
Feedback from the Nutanix engineer (the guy knows what he's talking about):
I have had a look at the dump. It looks to be a different issue; however, it is likely a Microsoft issue and their engagement is required to analyze the dump.
I can see that these two threads were likely stuck behind this thread:
0: kd> !mex.t ffff8f02ffc59800
Process Thread CID UserTime KernelTime ContextSwitches Wait Reason Time State
System (ffff8708ea8ab700) ffff8f02ffc59800 4.5d00 0s 469ms 1333708 Executive 15ms Waiting
WaitBlockList:
Object Type Other Waiters
ffffdd81dc42c8e0 SynchronizationTimer 0
# Child-SP Return Call Site
0 ffffdd81dc42c570 fffff803a10d0d6d nt!KiSwapContext+0x76
1 ffffdd81dc42c6b0 fffff803a10d080f nt!KiSwapThread+0x17d
2 ffffdd81dc42c760 fffff803a10d25e7 nt!KiCommitThreadWait+0x14f
3 ffffdd81dc42c800 fffff804f4e02aa5 nt!KeWaitForSingleObject+0x377
4 ffffdd81dc42c8b0 fffff804f67fbd7e NDIS!NdisMSleep+0x55
5 ffffdd81dc42c930 fffff804f67b1552 vmswitch!VmsVmqPvtDeleteVmq+0x570be
6 ffffdd81dc42ca00 fffff804f67a02fd vmswitch!VmsVmqPvtDeleteVmqsOnPtNic+0x6a
7 ffffdd81dc42ca30 fffff804f6847f61 vmswitch!VmsVmqDoVmqOperation+0xd5
8 ffffdd81dc42ca90 fffff803a111e561 vmswitch!VmsPtNicPvtHandleVmqCapsChangeWorkItem+0xe1
9 ffffdd81dc42cb10 fffff803a10bf549 nt!IopProcessWorkItem+0x81
a ffffdd81dc42cb80 fffff803a10f6101 nt!ExpWorkerThread+0xe9
b ffffdd81dc42cc10 fffff803a11e3ca6 nt!PspSystemThreadStartup+0x41
c ffffdd81dc42cc60 0000000000000000 nt!KiStartSystemThread+0x16
And there are two vmwp.exe processes, each with only one thread, waiting on the above thread:
dt nt!_KMUTANT fffff804f68b2e40 () Recursive: [ -r1 -r2 -r ] Verbose Normal dt
==================================================================================
+0x000 Header : _DISPATCHER_HEADER
+0x018 MutantListEntry : _LIST_ENTRY [ 0xffff8f02`ffc59b08 - 0xffff8f02`ffc59b08 ] [EMPTY OR 1 ELEMENT]
+0x028 OwnerThread : 0xffff8f02`ffc59800 _KTHREAD (System)
+0x030 Abandoned : 0 ''
+0x031 ApcDisable : 0x1 ''
Number of waiters: 104
Owning Thread: ffff8f02ffc59800
Two vmwp.exe processes with one thread only. They are good candidates, as we know the VMs got stuck in the Stopping state, which means the processes were supposed to be terminated.
0: kd> !us -p ffff8f0303c1a080
1 thread [stats]: ffff87090da59080
fffff803a11e35b6 nt!KiSwapContext+0x76
fffff803a10d0d6d nt!KiSwapThread+0x17d
fffff803a10d080f nt!KiCommitThreadWait+0x14f
fffff803a10d25e7 nt!KeWaitForSingleObject+0x377
fffff804f67c5b63 vmswitch!VmsVmNicPvtKmclChannelOpened+0x73
fffff804f673f2e0 vmbkmclr!KmclpServerOpenChannel+0x148
fffff804f673e946 vmbkmclr!KmclpServerOfferChannel+0x262
fffff804f673e664 vmbkmclr!VmbChannelEnable+0x104
fffff804f6772f17 vmswitch!VmsVmNicMorph+0x54f
fffff804f67725e9 vmswitch!VmsCdpNicMorphToVmNic+0x145
fffff804f67909bd vmswitch!VmsCdpDeviceControl+0x65d
fffff803a1560150 nt!IopSynchronousServiceTail+0x1a0
fffff803a155f4ec nt!IopXxxControlFile+0xd9c
fffff803a155e746 nt!NtDeviceIoControlFile+0x56
fffff803a11ec503 nt!KiSystemServiceCopyEnd+0x13
00007ffacc1c5be4 0x7ffacc1c5be4
1 stack(s) with 1 threads displayed (1 Total threads)
0: kd> !us -p ffff8f02e5fa2840
1 thread [stats]: ffff8708f3ff3080
fffff803a11e35b6 nt!KiSwapContext+0x76
fffff803a10d0d6d nt!KiSwapThread+0x17d
fffff803a10d080f nt!KiCommitThreadWait+0x14f
fffff803a10d25e7 nt!KeWaitForSingleObject+0x377
fffff804f679cf70 vmswitch!VmsVmNicPvtDisableOptimizations+0x50
fffff804f679c809 vmswitch!VmsCdpNicDisableOptimizations+0x11d
fffff804f6790903 vmswitch!VmsCdpDeviceControl+0x5a3
fffff803a1560150 nt!IopSynchronousServiceTail+0x1a0
fffff803a155f4ec nt!IopXxxControlFile+0xd9c
fffff803a155e746 nt!NtDeviceIoControlFile+0x56
fffff803a11ec503 nt!KiSystemServiceCopyEnd+0x13
00007ffacc1c5be4 0x7ffacc1c5be4
1 stack(s) with 1 threads displayed (1 Total threads)
The thread seems to be stuck in an infinite loop, calling NdisMSleep:
0: kd> .frame /r 5
05 ffffdd81`dc42c930 fffff804`f67b1552 vmswitch!VmsVmqPvtDeleteVmq+0x570be
rax=0000000000000000 rbx=ffff8f02f243f010 rcx=0000000000000000
rdx=0000000000000000 rsi=00000000000a2c9c rdi=ffff8f02f9cf9920
rip=fffff804f67fbd7e rsp=ffffdd81dc42c930 rbp=ffffdd81dc42c999
r8=0000000000000000 r9=0000000000000000 r10=0000000000000000
r11=0000000000000000 r12=0000000000000000 r13=ffff8f02e59b1000
r14=000000000000000c r15=fffff804f6895638
iopl=0 nv up di pl nz na pe nc
cs=0000 ss=0000 ds=0000 es=0000 fs=0000 gs=0000 efl=00000000
vmswitch!VmsVmqPvtDeleteVmq+0x570be:
fffff804`f67fbd7e ffc6 inc esi
0: kd> ub fffff804f67fbd7e L20
vmswitch!VmsVmqPvtDeleteVmq+0x5702f:
fffff804`f67fbcef 41b95b000000 mov r9d,5Bh
fffff804`f67fbcf5 4889742428 mov qword ptr [rsp+28h],rsi
fffff804`f67fbcfa 4889442420 mov qword ptr [rsp+20h],rax
fffff804`f67fbcff 458d41b8 lea r8d,[r9-48h]
fffff804`f67fbd03 e8380bfdff call vmswitch!WPP_RECORDER_SF_s (fffff804`f67cc840)
fffff804`f67fbd08 440fb765fb movzx r12d,word ptr [rbp-5]
fffff804`f67fbd0d 33c0 xor eax,eax
fffff804`f67fbd0f 4c8d3d22990900 lea r15,[vmswitch!WPP_08264671732f3223572c011804749aa6_Traceguids (fffff804`f6895638)]
fffff804`f67fbd16 4889451f mov qword ptr [rbp+1Fh],rax
fffff804`f67fbd1a 214523 and dword ptr [rbp+23h],eax
fffff804`f67fbd1d c7451f80010c00 mov dword ptr [rbp+1Fh],0C0180h
fffff804`f67fbd24 33f6 xor esi,esi
fffff804`f67fbd26 448d700c lea r14d,[rax+0Ch]
fffff804`f67fbd2a 8b4338 mov eax,dword ptr [rbx+38h]
fffff804`f67fbd2d 894527 mov dword ptr [rbp+27h],eax
fffff804`f67fbd30 8364245000 and dword ptr [rsp+50h],0
fffff804`f67fbd35 488d451f lea rax,[rbp+1Fh]
fffff804`f67fbd39 488b4b28 mov rcx,qword ptr [rbx+28h]
fffff804`f67fbd3d 41b924020100 mov r9d,10224h <<<<< OID_RECEIVE_FILTER_FREE_QUEUE
fffff804`f67fbd43 4489742448 mov dword ptr [rsp+48h],r14d
fffff804`f67fbd48 41b801000000 mov r8d,1
fffff804`f67fbd4e 4889442440 mov qword ptr [rsp+40h],rax
fffff804`f67fbd53 498bd5 mov rdx,r13
fffff804`f67fbd56 8364243800 and dword ptr [rsp+38h],0
fffff804`f67fbd5b 8364243000 and dword ptr [rsp+30h],0
fffff804`f67fbd60 8364242800 and dword ptr [rsp+28h],0
fffff804`f67fbd65 8364242000 and dword ptr [rsp+20h],0
fffff804`f67fbd6a e8a5bff9ff call vmswitch!VmsPtNicProcessPublicOid (fffff804`f6797d14)
fffff804`f67fbd6f 85c0 test eax,eax
fffff804`f67fbd71 7441 je vmswitch!VmsVmqPvtDeleteVmq+0x570f4 (fffff804`f67fbdb4)
fffff804`f67fbd73 b9a8610000 mov ecx,61A8h <<<< 25000 in dec
fffff804`f67fbd78 ff1562230c00 call qword ptr [vmswitch!_imp_NdisMSleep (fffff804`f68be0e0)]
Looking at the logical block of the function:
vmswitch!VmsVmqPvtDeleteVmq+0x57070:
fffff804`f67fbd30 8364245000 and dword ptr [rsp+50h],0
fffff804`f67fbd35 488d451f lea rax,[rbp+1Fh]
fffff804`f67fbd39 488b4b28 mov rcx,qword ptr [rbx+28h]
fffff804`f67fbd3d 41b924020100 mov r9d,10224h
fffff804`f67fbd43 4489742448 mov dword ptr [rsp+48h],r14d
fffff804`f67fbd48 41b801000000 mov r8d,1
fffff804`f67fbd4e 4889442440 mov qword ptr [rsp+40h],rax
fffff804`f67fbd53 498bd5 mov rdx,r13
fffff804`f67fbd56 8364243800 and dword ptr [rsp+38h],0
fffff804`f67fbd5b 8364243000 and dword ptr [rsp+30h],0
fffff804`f67fbd60 8364242800 and dword ptr [rsp+28h],0
fffff804`f67fbd65 8364242000 and dword ptr [rsp+20h],0
fffff804`f67fbd6a e8a5bff9ff call vmswitch!VmsPtNicProcessPublicOid (fffff804`f6797d14)
fffff804`f67fbd6f 85c0 test eax,eax
fffff804`f67fbd71 7441 je vmswitch!VmsVmqPvtDeleteVmq+0x570f4 (fffff804`f67fbdb4) Branch
vmswitch!VmsVmqPvtDeleteVmq+0x570b3:
fffff804`f67fbd73 b9a8610000 mov ecx,61A8h
fffff804`f67fbd78 ff1562230c00 call qword ptr [vmswitch!_imp_NdisMSleep (fffff804`f68be0e0)]
fffff804`f67fbd7e ffc6 inc esi
fffff804`f67fbd80 81fe90010000 cmp esi,190h <<<<< seems to be retry count
fffff804`f67fbd86 72a8 jb vmswitch!VmsVmqPvtDeleteVmq+0x57070 (fffff804`f67fbd30) Branch <<<<< jump if below 400 (retry count)
vmswitch!VmsVmqPvtDeleteVmq+0x570c8:
fffff804`f67fbd88 488b0d29740b00 mov rcx,qword ptr [vmswitch!VmsIfrLog (fffff804`f68b31b8)]
fffff804`f67fbd8f 488d051aadfdff lea rax,[vmswitch! ?? ::FNODOBFM::`string' (fffff804`f67d6ab0)]
fffff804`f67fbd96 41b95c000000 mov r9d,5Ch
fffff804`f67fbd9c 4889442428 mov qword ptr [rsp+28h],rax
fffff804`f67fbda1 4c897c2420 mov qword ptr [rsp+20h],r15
fffff804`f67fbda6 458d41b7 lea r8d,[r9-49h]
fffff804`f67fbdaa e8910afdff call vmswitch!WPP_RECORDER_SF_s (fffff804`f67cc840)
fffff804`f67fbdaf e97cffffff jmp vmswitch!VmsVmqPvtDeleteVmq+0x57070 (fffff804`f67fbd30) Branch
0: kd> da fffff804`f67d6ab0
fffff804`f67d6ab0 "retryCount < 400"
However, we can see that the retry count is definitely above 400:
0: kd> r esi
Last set context:
esi=a2c9c
0: kd> ? a2c9c
Evaluate expression: 666780 = 00000000`000a2c9c
So it looks like it is trying to set an OID on the vmswitch interface, and for some reason the request does not complete successfully.
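(As an aside, not from the engineer's notes: since the stuck call is freeing a VMQ receive queue, one quick way to look at VMQ state on the affected host is the NetAdapter cmdlets, assuming they are available for your OS and NIC driver:)
# Show which physical adapters have VMQ enabled
Get-NetAdapterVmq
# Show the VMQ queues currently allocated; the queue the OID is trying to free may still be listed
Get-NetAdapterVmqQueue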
Monday, November 26, 2018 12:35 PM
Hi,
Thanks for your reply!
For the VMMS service stuck issue, in general we need to collect a dump file of the VMMS service while it is hung.
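For example, one common way to capture such a dump is Sysinternals ProcDump (a sketch; the exact tool and switches your support engineer asks for may differ, and the output path is just an example):
# Write a full memory dump of the hung VMMS process to C:\Temp
procdump.exe -accepteula -ma vmms.exe C:\Temp\vmms_hang.dmp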
Since this issue requires dump analysis, I would suggest opening a case with Microsoft.
https://support.microsoft.com/en-us/gp/contactus81?Audience=Commercial&wa=wsignin1.0
Thanks for your understanding; if you have any questions, please feel free to let me know.
Best Regards,
Daniel
Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].
Tuesday, November 27, 2018 9:49 AM
Thanks for the reply. A Nutanix engineer also created this dump file and I have sent it to Microsoft. They will not help us because the cluster validation report showed an error. It's just unbelievable and not acceptable... They even closed the case already. In the past few years I have opened several cases with MS and never, but really NEVER, had a good experience! I was an MS fanboy back in the day, but after the last few years I've really changed my mind... Not going to go into too much detail here, but it's really disappointing.
Tuesday, November 27, 2018 9:51 AM
Oh, and by the way, we force hard-reset the nodes and the VMs came back. One VM had VHDX corruption, so we did a snapshot restore. And now MS REALLY doesn't want to help anymore because they don't do RCA (root cause analysis) :-(
Tuesday, November 27, 2018 12:51 PM
Hi Jeroen,
Thanks for your reply!
First of all, I am very happy to hear that your problem has progressed. In addition, I fully understand your concern; please also understand that our team will do our best to solve each customer's problem. Sometimes a problem exceeds our level of support and we need to seek further help from the senior team.
Finally, please rest assured that we will do our best to resolve each customer's posts. If you have any other question in the future, please feel free to post in our forum.
Please continue to believe in Microsoft, and we will continue to improve our services.
Thanks again for your reply and understanding, wish you all the best!
Best Regards,
Daniel
Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].
Tuesday, November 27, 2018 8:49 PM
"They will not help us because the cluster validation report showed an error. It's just unbelievable and not acceptable... They even closed the case already. In the past few years, I opened several cases @ MS, and never, but really NEVER, had a good experience! I was a MS fanboy back in the days but after the last few years, I've really changed my mind... Not going to go in too much detail here, but it's really dissapointing."
I can understand being disappointed, but I can also understand Microsoft's position. If the cluster validation report is showing an error, there is something wrong with the cluster. It is that simple. Errors must be fixed. Warnings are a little different. The cluster validation report will report warnings for things that could cause potential problems, but there is also the possibility that the cluster can run most things without issue. However, each warning needs to be investigated for your particular situation. An error, on the other hand, is cut-and-dried. You must fix the error or there is no guarantee the cluster will work properly.
The solution to Microsoft not providing support is for you to fix the error.
tim
Wednesday, November 28, 2018 12:33 PM
More info about the cluster validation error:
In Failover Cluster Manager there are three cluster networks. One of them is used for storage communication to the underlying Nutanix layer. This cluster network is bound to a vSwitch on each host, pointing to the same IP address on each host, which is normal and by design.
This is what's producing the errors in the report (duplicate IPs).
We could make the errors go away by doing the following: in the properties of the cluster network used for storage, disable cluster communication. This will exclude that cluster network from the validation test. BUT we are unsure whether it could cause interruptions in our production environment, so we decided not to change it...
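If it helps, that same change can also be made from PowerShell on a cluster node; a minimal sketch (requires the FailoverClusters module, and "Storage" is just a placeholder for the real network name):
# Role 0 = no cluster communication, 1 = cluster only, 3 = cluster and client
(Get-ClusterNetwork -Name "Storage").Role = 0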
Wednesday, November 28, 2018 2:24 PM | 1 vote
Duplicate IPs are never a good thing, whether in a cluster or not. That is sure to end in unwanted results. It seems a very strange thing to leave in.
In general, you should not have cluster communication on a storage network. You should have a completely separate network for cluster communication. You can allow cluster communication on multiple networks, but my recommendation would be to ensure cluster communication is selected only for management networks, as it is a management protocol. I never would configure cluster communication on a storage network, even as a backup for another network that had cluster communication defined on it. Will it work? Sure. Can it cause issues? Potentially. Is it a recommended practice? Nope.
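To check at a glance which networks currently carry cluster communication, something like this should do (a sketch; 0 = none, 1 = cluster only, 3 = cluster and client):
Import-Module FailoverClusters
Get-ClusterNetwork | Format-Table Name, Role, Address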
tim
Thursday, November 29, 2018 6:52 AM
Hi,
This is Daniel, and I wish you all the best!
Just to confirm the current status of the issue: if Tim's reply was useful to you, please mark it as an answer!
Thanks for your cooperation.
Best Regards,
Daniel
Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].
Thursday, November 29, 2018 11:39 AM
Duplicate IPs are never a good thing, whether in a cluster or not. That is sure to end in unwanted results. It seems a very strange thing to leave in.
In general, you should not have cluster communication on a storage network. You should have a completely separate network for cluster communication. You can allow cluster communication on multiple networks, but my recommendation would be to ensure cluster communication is selected only for management networks, as it is a management protocol. I never would configure cluster communication on a storage network, even as a backup for another network that had cluster communication defined on it. Will it work? Sure. Can it cause issues? Potentially. Is it a recommended practice? Nope.
tim
You're absolutely right (we double-checked with Nutanix whether it's OK to disable it), and that's why we are going to change the setting (it needs to be planned, outside critical hours, just in case). Afterwards we'll re-run the cluster validation report.
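When you do, validation can also be kicked off from PowerShell on one of the nodes; a minimal sketch (it runs against the cluster the local node belongs to and writes an HTML report):
# Re-run the full cluster validation
Test-Cluster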
Friday, November 30, 2018 2:46 AM
Hi,
Thanks for your reply!
I will closely monitor this case, and if you have any questions, please feel free to post in our forum.
Thanks again for your time and have a nice day!
Best Regards,
Daniel
Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].