Replicated guest stuck merging on Hyper-V Core 2016

Article
2017-06-21

Question

_{Wednesday, June 21, 2017 12:51 PM | 4 votes}

Since upgrading some host servers to Hyper-V Core 2016, I've been having some problems with replication coupled with checkpoints.

In my present example, I have a replicated Hyper-V guest (Gen2, V5.0). We run a nightly backup process which will take a checkpoint of the replica, copy the VHDX files to a backup store then delete the checkpoint.

Some nights, the guest seems to get stuck applying replica changes and will not allow the checkpoint to be deleted. Our backup process will wait until the machine has finished applying changes, but in this case it never happens and the replica guest is stuck with "Applying Registered Delta..." in the Status column of Hyper-V Manager. Replication becomes critical and changes are no longer replicated as the backlog is too great.

NOTE: This has also occurred once with "Applying Replica Changes..." in the column

Through the front-end I have the ability to right-click and select "Cancel Applying Replica Changes", but I have used this in the past and it has no effect other than removing the option from the menu. The only solution I've so far discovered is to completely reboot the host.

I have attempted to restart the VMMS service as similar issues in the past have resolved themselves after doing this, but in this case the service freezes with the status "Stopping" and can no longer be interacted with. I had to do a cold reboot to recover as the host got stuck doing a soft reboot.

I'm a bit at my wits' end now as this has halted my upgrade plans until I can resolve this problem, and I'm unable to find anyone posting about exactly the same issue in Hyper-V 2016.

I have found similar posts where people have suggested the problem is Windows Defender. I did try uninstalling this Windows feature but it didn't make a difference. After this, I found another post (on these forums) which suggested the issue was Windows Defender NOT running when a third-party anti-virus was also installed. I can confirm that prior to uninstalling, Windows Defender was running with the default installation settings and there was no third-party AV. I am wondering if perhaps Windows Defender behaves differently on Hyper-V Core as opposed to Windows Server 2016.

The forum post was here: https://social.technet.microsoft.com/Forums/en-US/feb3e9a6-9520-461f-bd4c-79be0a7a2837/hyperv-server-2016-replication-issues-stuck-on-creating-a-reference-point?forum=winserverhyperv

Other points of interest:

This is not specific to only one of the hosts or one of the guests
The issue will occur whether or not the guest is v5 or v8
This issue will occur whether or not the source host is Hyper-V 2012 or Hyper-V 2016
This issue did not/does not occur on Hyper-V Core 2012R2
There is plenty of disk space remaining (>600GB)
Replication is set up to keep 24 hourly checkpoints and changes are replicated every 30 seconds.
The host has several other running guests and replicated guests, but the issue has occurred on a host where it had only one replicated guest.
The latest Windows updates have been applied.
This seems to happen almost exclusively on large guests (>600GB), which are likely to be replicating the most changes. I have had it occur once on a smaller guest.

All replies (18)

_{Monday, July 30, 2018 11:17 AM ✅Answered}

I wanted to chime in that I have been fighting this issue almost since the OP. Roughly once a week I have to reboot a random hypervisor to get the replicas to merge, and when I go to reboot I 100% of the time have to remotely kill vmms with psexec.exe as it won't reboot by itself. There is one server that I have to do almost every other day. 10% of the time I have to completely delete and set up the replica again after a reboot. Every month I watch the current update release notes and hope it will be fixed. I wanted to post that you are not alone, and I wanted to be in this thread in case others posted more info.

Hi Chance,

Thanks so much for your post. Your mention of PSExec got me thinking, and I've just managed to recover a server which has this problem.

Simply run "psexec.exe /i /s cmd.exe", then do "taskkill /f /pid ###" on the PIDs for the VMMS service and the WinMgmt service, and then restart both services.

This should clear all the stuck delta problems and let you resync replicas!
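For anyone who would rather script those steps, here is a minimal PowerShell sketch, not a tested fix. It assumes it is run from an elevated shell (ideally as SYSTEM, as with the psexec command above), and the helper names Get-ServicePidFromScOutput and Restart-HungService are made up for illustration. It reads each service's PID from "sc.exe queryex" rather than from WMI, since WinMgmt itself may be hung. Be aware that WinMgmt can share its svchost.exe process with other services, so killing that PID may take those down too.

```powershell
# Sketch only. Helper names are hypothetical, not part of Hyper-V or PsExec.

# Pull the process ID out of the text that "sc.exe queryex <service>" prints.
function Get-ServicePidFromScOutput {
    param([string[]]$ScOutput)
    foreach ($line in $ScOutput) {
        if ($line -match '^\s*PID\s*:\s*(\d+)') {
            return [int]$Matches[1]
        }
    }
    return 0   # no PID line found
}

# Force-kill the processes behind the given services, then start the services again.
function Restart-HungService {
    param([string[]]$ServiceName = @('vmms', 'winmgmt'))

    foreach ($name in $ServiceName) {
        $servicePid = Get-ServicePidFromScOutput -ScOutput (sc.exe queryex $name)
        if ($servicePid -gt 0) {
            # A PID of 0 means the service has no running process, so there is nothing to kill.
            taskkill.exe /f /pid $servicePid
        }
    }

    foreach ($name in $ServiceName) {
        Start-Service -Name $name
    }
}

# Usage (deliberately commented out so that nothing is killed on load):
# Restart-HungService
```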

Cheers,

Kez

_{Thursday, June 22, 2017 7:28 AM}

Hi VSKez,

This issue is being tested; we'll report back as soon as we have an update.

Best Regards,

Anne

Please remember to mark the replies as answers if they help.
If you have feedback for TechNet Subscriber Support, contact [email protected].

_{Thursday, June 22, 2017 10:00 AM}

Hi VSKez,

1. The following blog post explains the reason for the failure (it applies to Server 2012 R2):

https://blogs.technet.microsoft.com/virtualization/2014/04/24/backup-of-a-replica-vm/

2. For a higher backup success rate, please try using DPM:

https://blogs.technet.microsoft.com/dpm/2014/04/24/backing-up-of-replica-vms-using-dpm/

Check whether this helps resolve the issue.

Best Regards,

Anne


_{Thursday, June 22, 2017 11:14 AM}

Hi Anne,

Sorry, thanks for your answer but I'm not entirely sure how this helps?

The issue is not with the backup process, as the backups taken are absolutely fine. We have been doing this for 4 years without any issue on Hyper-V Core 2012R2, but all servers which have been upgraded to Hyper-V Core 2016 are locking up.

The problem is that the Hyper-V Virtual Machine Management service becomes entirely unresponsive while applying replica changes to a replicated guest which has a checkpoint. I suspect that copying the original .vhdx files has something to do with it. However, while a checkpoint is in place the original .vhdx files shouldn't be modified, so it might instead be something to do with disk contention.

Either way, this is certainly a new issue for Hyper-V 2016.

Thanks,

Kez

_{Tuesday, July 4, 2017 2:56 PM}

Just a quick update. I've had to start the process of downgrading back to Hyper-V 2012 R2 as this was an untenable situation.

Hopefully this will be resolved in a future update or the R2 release.

_{Monday, September 11, 2017 5:09 PM | 2 votes}

Just for everyone's reference: as of Sept 11, this remains an ongoing problem - in my case with Server 2016 Datacenter, not as core. Otherwise, the symptoms and fixes are exactly as described by VSKez.

Stuck with the issue until a fix is found; can't roll back to 2012 R2.

Kind of surprised this isn't being broadly reported as an issue - thus willing to accept that perhaps I'm "doing something wrong" but what that might be I know not. BTW, not new at Hyper-V, replication, etc. I'm rarely stumped.

Note that the VM service hangs on a restart - such as after updates - so having a hardware remote management solution in place (such as iDRAC on Dell hardware) is necessary unless one can be on-site to restart hardware.

_{Friday, September 15, 2017 1:47 AM | 1 vote}

I have a similar problem.
In my case, the primary host is Windows Server 2012 R2 Std, and the replica host is Windows Server 2016 Std.
Once the problem occurs, restarting the vmms service hangs in the "Stopping" state, and the problem is not resolved until the OS is restarted.
The replica host is a simple configuration and there is no anti-virus software running other than Windows Defender.
Does anyone else have the same problem?

_{Tuesday, November 14, 2017 2:06 PM}

I'm running into the same thing here. Source and destination hosts are both Windows Server 2016 Standard with the Hyper-V role installed. Using Altaro to back up the replica VM, and last night these started showing up in the event logs. There seems to be no way to cancel the replication, or delete the temporary checkpoint left behind by Altaro.

'SERVERNAME' timed out while waiting to perform the 'Cleaning up stale reference point(s)' operation. The virtual machine is currently performing the following operation: 'Apply Registered Delta'.

When this happens and you reboot your host, are you then able to delete the checkpoint and resume replication? I don't want to reboot if it won't help the situation :)

_{Wednesday, November 29, 2017 11:00 AM}

I can confirm rebooting always worked for me.

I'm disappointed to hear this is also happening with Altaro as I have been looking into this as a backup solution which would allow us to upgrade to Hyper-V 2016.

I'm really surprised that this hasn't been addressed yet!

_{Wednesday, November 29, 2017 1:47 PM}

I've only had it happen once (maybe twice, see below) so far. After rebooting, I ended up having to manually delete the stuck checkpoint using PowerShell (Remove-VMSnapshot), then remove replication, and set up replication again. However, when I re-enabled replication, I chose to do the initial replication using an existing VM on the destination host, and it was able to do it that way, so at least I didn't have to do a full initial replication all over again.
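Roughly, that sequence might look like the sketch below. The thread does not give the exact commands that were run, so treat this as one plausible reading rather than a record of them. The function name Reset-StuckReplica and its parameters are hypothetical, and Kerberos authentication over port 80 is an assumption about the environment. The cmdlets themselves come from the Hyper-V PowerShell module.

```powershell
# Sketch only: names, port and authentication type are assumptions, adjust to your setup.
function Reset-StuckReplica {
    param(
        [Parameter(Mandatory)][string]$VMName,
        [Parameter(Mandatory)][string]$CheckpointName,
        [Parameter(Mandatory)][string]$PrimaryServer,
        [Parameter(Mandatory)][string]$ReplicaServer,
        [int]$ReplicaPort = 80
    )

    # 1. Delete the checkpoint that was left behind on the replica host.
    Remove-VMSnapshot -ComputerName $ReplicaServer -VMName $VMName -Name $CheckpointName

    # 2. Remove the replication relationship on both sides.
    Remove-VMReplication -ComputerName $ReplicaServer -VMName $VMName
    Remove-VMReplication -ComputerName $PrimaryServer -VMName $VMName

    # 3. Re-enable replication from the primary host.
    Enable-VMReplication -ComputerName $PrimaryServer -VMName $VMName `
        -ReplicaServerName $ReplicaServer -ReplicaServerPort $ReplicaPort `
        -AuthenticationType Kerberos

    # 4. Use the VM that already exists on the replica host as the starting point,
    #    instead of sending a full initial copy again.
    Start-VMInitialReplication -ComputerName $PrimaryServer -VMName $VMName -UseBackup
}

# Usage (all values are placeholders):
# Reset-StuckReplica -VMName 'MyGuest' -CheckpointName 'Backup checkpoint' `
#     -PrimaryServer 'HV-PRIMARY' -ReplicaServer 'HV-REPLICA'
```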

The second time I'm not sure was the same issue. I once again had a stuck checkpoint from Altaro, but I was able to cancel the merge via the Hyper-V management console, and then remove it again with Remove-VMSnapshot, and replication worked fine without a reboot.

Altaro has been pretty solid, and I don't think this is an Altaro bug, but more a Hyper-V bug. What backup solution are you currently using?

_{Thursday, November 30, 2017 11:54 AM}

Thanks for the update. We currently use an internally built backup solution, which has been fine but is starting to buckle under the strain.

Yesterday I reinstalled Hyper-V 2016 on a server which only hosts a couple of replica guests, but this time I've used ReFS as this allows near instant creation and removal of checkpoints.

So far, so good - no issues yet encountered. Will update if I do have any problems.

Out of interest, is your Hyper-V 2016 server up to date on Windows Updates?

Cheers,

Kez

_{Tuesday, December 12, 2017 9:44 AM}

I can confirm that this STILL happens when using ReFS, but it is significantly less likely, I think due to the fact that it's simply not in the merging state for so long. I started replicating a huge guest to the 2016 server yesterday, and today I found it had both failed to replicate and was stuck merging a different replica.

Would someone from Microsoft please acknowledge this issue? It's significantly impacting our ability to keep our infrastructure up to date and presently there doesn't seem to be any resolution in sight. We are simply unable to run any production servers on Hyper-V 2016.

_{Friday, February 9, 2018 4:36 AM}

Try not disabling Windows Defender (and check your GPO settings too), then restart.

Once it's enabled, run the following in PowerShell to add exclusions:

Set-MpPreference -ExclusionPath c:\clusterstorage, %ProgramData%\Microsoft\Windows\Hyper-V, %ProgramFiles%\Hyper-V, %SystemDrive%\ProgramData\Microsoft\Windows\Hyper-V\Snapshots, "%Public%\Documents\Hyper-V\Virtual Hard Disks"

Set-MpPreference -ExclusionProcess %systemroot%\System32\Vmwp.exe, %systemroot%\System32\Vmms.exe -Force

Set-MpPreference -ExclusionExtension *.vhd, *.vhdx, *.avhd, *.avhdx, *.vsv, *.iso, *.rct, *.vmrs, *.vmcx

The above helps especially with event ID 19062 in Hyper-V-VMMS/Admin:

****************************************************************************************************

Timeout for "VM****" while waiting for execution of the "Create a reference point" operation. The virtual machine is currently doing the following: "Deprecated reference points are cleaned". (ID of virtual machine: ******)

Log name: Microsoft-Windows-Hyper-V-VMMS/Admin
Source: Hyper-V-VMMS
Event ID: 19062
Level: Error
User: SYSTEM

****************************************************************************************************

_{Friday, February 23, 2018 2:20 PM}

Tried this out on my host but ran into the same issue again last night. I'm going to try setting up a Scheduled Task to pause replication during the backup window to see if that helps.
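A rough sketch of what such a scheduled script could look like is below. The function names Test-InBackupWindow and Sync-ReplicationPause and the window times are invented for illustration. It assumes Get-VMReplication reports the states 'Replicating' and 'Suspended', and that pausing and resuming is enough to keep replica changes from being applied during the backup, which this thread has not confirmed.

```powershell
# Sketch only: run this every few minutes from a scheduled task on the replica host.

# True when $Now falls inside the backup window; also handles windows that cross midnight.
function Test-InBackupWindow {
    param(
        [datetime]$Now,
        [timespan]$WindowStart,
        [timespan]$WindowEnd
    )
    $timeOfDay = $Now.TimeOfDay
    if ($WindowStart -le $WindowEnd) {
        return ($timeOfDay -ge $WindowStart -and $timeOfDay -lt $WindowEnd)
    }
    return ($timeOfDay -ge $WindowStart -or $timeOfDay -lt $WindowEnd)
}

# Pause replication inside the window and resume it outside the window.
function Sync-ReplicationPause {
    param(
        [string]$VMName,
        [timespan]$WindowStart,
        [timespan]$WindowEnd
    )
    $replication = Get-VMReplication -VMName $VMName
    $inWindow = Test-InBackupWindow -Now (Get-Date) -WindowStart $WindowStart -WindowEnd $WindowEnd

    if ($inWindow -and $replication.State -eq 'Replicating') {
        Suspend-VMReplication -VMName $VMName
    }
    elseif (-not $inWindow -and $replication.State -eq 'Suspended') {
        Resume-VMReplication -VMName $VMName
    }
}

# Usage (VM name and times are placeholders):
# Sync-ReplicationPause -VMName 'MyReplica' -WindowStart '01:00' -WindowEnd '04:00'
```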

_{Friday, May 4, 2018 7:19 AM}

Hi there, another one here with the same problem.

Replica stops at "Applying registered delta..." and the only way to recover is a hard reboot, as the server hangs trying to stop VMMS.

Several 2016 hosts receiving replicas from 2016 and 2012 R2, all with storage tiering, dedup and NTFS.

Windows Defender has been uninstalled to be sure it isn't interfering.

Sometimes it gets stuck again after the reboot: it applies some HRL files and then stops.

We currently have a check that looks at the last time a log was applied; if that was more than a few hours ago, it raises an alert so the host can be rebooted. Before we had this, we ran into "Too many logs pending..." and replication broke.
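A check along those lines could be sketched as follows. This is not the actual check described above: the function name Get-StaleReplica and the three-hour default are illustrative, and it assumes the objects returned by Get-VMReplication expose a LastAppliedLogTime property (the property name is a parameter in case it differs on your build).

```powershell
# Sketch only: returns the replication objects whose last applied log is older than the threshold.
function Get-StaleReplica {
    param(
        [object[]]$Replication,
        [datetime]$Now = (Get-Date),
        [double]$MaxHours = 3,
        [string]$TimeProperty = 'LastAppliedLogTime'   # assumed property name
    )
    $Replication | Where-Object { ($Now - $_.$TimeProperty).TotalHours -gt $MaxHours }
}

# Usage on a replica host (commented out because it needs the Hyper-V module):
# Get-StaleReplica -Replication (Get-VMReplication) -MaxHours 3 |
#     Select-Object VMName, LastAppliedLogTime
```

Anything this returns would be a candidate for an alert; how the alert is delivered is left out on purpose.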

Has MS said anything?

We have one or two events each week. Fortunately we have only upgraded a handful of servers to 2016, and they mostly host replicas or test VMs, so we can help debug it (as we have done in the past).

Cheers,

Sergi

_{Monday, May 21, 2018 3:41 PM | 1 vote}

I wanted to chime in that I have been fighting this issue almost since the OP. Roughly once a week I have to reboot a random hypervisor to get the replicas to merge, and when I go to reboot I 100% of the time have to remotely kill vmms with psexec.exe as it won't reboot by itself. There is one server that I have to do almost every other day. 10% of the time I have to completely delete and set up the replica again after a reboot. Every month I watch the current update release notes and hope it will be fixed. I wanted to post that you are not alone, and I wanted to be in this thread in case others posted more info.

_{Wednesday, July 10, 2019 7:19 AM}

I experience this issue too, which is really annoying. I upgraded the Hyper-V 2016 core to Hyper-V 2019 core but even then the issue occurs. @Microsoft: please solve this!!

_{Tuesday, April 7, 2020 2:52 PM}

We're still seeing this in 2019 as well. Primarily on a system where Altaro is installed but backups are not being taken. Mostly replica VMs (that also have extended replication configured).
