All our S2D Clusters suddenly freeze and fail when one node is offline

Question

Jemandanders 15

Hello,

We are running 12 hyperconverged Hyper-V / S2D failover clusters on Server 2019. They were installed over the last 3 years, and everything ran fine and stable until a few months ago.

3 of them are 2-node clusters with 4-way mirroring (nested resiliency), HDDs for capacity, and NVMe drives for journal/cache.

9 of them are 3-node clusters with 3-way mirrored all-flash SSD storage.

All clusters use their own separate file share witnesses.

All of them have had the same issue for a few months now (I guess since Windows updates):

When a node goes offline (whether planned or due to a failure), after a few minutes to hours the S2D storage and all VMs on the remaining nodes become slower and slower, to the point where they no longer respond and may even fail completely and can't be started again.

A few minutes after the missing node is back online, everything becomes fast again and failed VMs can be started again (even before the storage repair jobs have finished).

This is catastrophic for us, because instead of high availability, we now have 1 failing node taking out 2 or 3 nodes in total.

The problem seems to get worse the more load is on the cluster, although even very little load is enough: we are talking about single-digit % CPU load and single-digit MB/s on the storage. It seems the remaining nodes are waiting for something they can't reach, or are stuck in something they can't handle, and respond slowly or not at all to anything else in the meantime. I can't find any cause in the logs, and there is no unusual load visible on the hosts.

The symptoms are:

VMs:

become slower and slower.
Even simple things like opening the Start menu or Windows Explorer, opening another folder, or clicking a button inside an application first take seconds to respond, then minutes, then dozens of minutes.
and finally the VM crashes:
- it's not reachable over the network
- the VM Connection / console window does not respond at all; can't even click shutdown/reboot/start.
- Also can't click anything in the VM's drop-down menu in Failover Cluster Manager.
  - Can't even open new VM console windows in Failover Cluster Manager.
  - Hyper-V Manager doesn't show any VMs, just displays "there are no vms on this host"
  - Failover Cluster Manager shows some of them as failed, others as suspended, some as running (while none is responding)
if the failed node is back online, some VMs just resume, others are off, others are in an error state but can now be started again.

C:\ClusterStorage....

first opens slowly, again starting with seconds, then minutes, and finally shows just empty folders or errors, while PowerShell and Failover Cluster Manager show the virtual disks as online/degraded but NOT failed or detached.

PowerShell:

everything related to Hyper-V and S2D takes very long, again seconds to minutes to dozens of minutes (VirtualDisk, PhysicalDisk, StorageJob, etc.).
Get-VirtualDisk shows the disks as degraded and never as failed or detached, even when all VMs have crashed and C:\ClusterStorage can't be accessed. It must be some timeout problem.

A queue seems to pile up on the S2D storage because the remaining nodes don't respond while waiting for something they won't get, or are stuck in something they can't do, until the failed node is back.

There are entries in the event log, but none of them points to the source of the problem, only to its outcomes, like:

Hyper-V-StorageVSP

Event-ID 9:

I/O-Request for "C:\ClusterStorage....\VM-Name...vhdx" took 549629 milliseconds.

This is displayed for READ and WRITE for all VMs multiple times per second.

The time varies but is above 100000 right from the first occurrence.

In the case I'm looking at right now, these events, as well as the slowdown of the VMs, started around 13 minutes after gracefully shutting down a node.
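
To put numbers on the Hyper-V-StorageVSP Event ID 9 entries described above, here is a minimal sketch in Python. It assumes the event messages were exported to plain text, one message per line; the export step, the function names, and the sample lines are hypothetical. Only the "took N milliseconds" wording is taken from the message quoted above.

```python
import re
from statistics import median

# Matches the latency figure in messages such as
# 'I/O-Request for "..." took 549629 milliseconds.'
LATENCY_RE = re.compile(r"took\s+(\d+)\s+milliseconds", re.IGNORECASE)


def extract_latencies_ms(lines):
    """Return the latency in milliseconds from every line that contains one."""
    latencies = []
    for line in lines:
        match = LATENCY_RE.search(line)
        if match:
            latencies.append(int(match.group(1)))
    return latencies


def summarize(latencies_ms):
    """Return count, min, median, and max in seconds, or None if the list is empty."""
    if not latencies_ms:
        return None
    seconds = [ms / 1000 for ms in latencies_ms]
    return {
        "count": len(seconds),
        "min_s": min(seconds),
        "median_s": median(seconds),
        "max_s": max(seconds),
    }


if __name__ == "__main__":
    # Hypothetical sample lines; the paths are placeholders, not real VM names.
    sample = [
        'I/O-Request for "C:\\ClusterStorage\\...\\vm.vhdx" took 549629 milliseconds.',
        'I/O-Request for "C:\\ClusterStorage\\...\\vm.vhdx" took 100500 milliseconds.',
        "Unrelated event text without a latency value.",
    ]
    print(summarize(extract_latencies_ms(sample)))
```

Comparing such a summary between a run with all nodes online and a run with one node offline would at least show when the latencies start to climb, even if it does not explain why.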

Does anyone have a solution or guidance on where to look? I don't see any hints as to what the remaining nodes are busy with while not answering the storage and VM requests.

I've read articles about something like this happening in Server 2016 after updates, where the workaround was to put all disks into maintenance mode before taking a node offline (although I don't see the matching event ID in the logs). Maintenance mode seems to improve the problem (this could also be coincidence, since I only did this outside working hours) but doesn't solve it completely. Also, if a node fails unplanned, we can't put the disks into maintenance mode afterwards. And even if we could do so manually, we would still have downtime in the meantime. The problem also sometimes happens just after we take the disks out of maintenance mode again.

Thanks

Girg G 0 Reputation points

2024-08-29T07:59:07.6+00:00

Dear colleagues, was it possible to determine what the problem was? Did the installation of Windows 2022 solve the problem?
Jemandanders 15 Reputation points

2024-08-29T09:17:19.9966667+00:00

Hi, I have installed 2022 on one of the 2-node 4-way mirror clusters. It seems to run stable, but I haven't tested it under heavy load yet, since these are all production systems.

I've also read through a bunch of how-tos, best practices, and the Microsoft website again, but everybody does it differently, especially when it comes to networking, details, and optimizations.

I will reinstall some more systems with 2022 and might even do a clean install of Server 2019 with the most recent ISO on the systems which don't support 2022.

I will post results, but it could take a few more months.
Girg G 0 Reputation points

2024-08-29T11:43:32.6433333+00:00

Does the cluster work stably when one of the nodes fails?
Jemandanders 15 Reputation points

2024-08-29T11:59:51.8966667+00:00

So far it has, but I haven't taken down a node under heavy load yet. Will test this soon.
Jemandanders 15 Reputation points

2024-09-03T20:38:13.98+00:00

I have just tested this under normal load (100-200 MB/s on the storage, 10-20% CPU load, 50% RAM).
And the storage is freezing again, even under 2022.
It wasn't as bad as on 2019, but that could be coincidence.
Cluster storage slowed down to a few KB/s and even hung at 0, but the machines didn't crash this time. Everything went back to normal when the offline node was resumed, even before the storage job was complete.
Rene Hartmann 0 Reputation points

2025-01-23T07:24:35.35+00:00

Same here.

Azure Stack HCI S2D 2-node cluster with 2022.

When a node goes offline, HA is functional, but some hours later the last node gets slower and slower, to the point of being unusable.

If we reboot the last node, all cluster storage goes offline. It isn't possible to force it online with a command. Only when the first node is available again does the storage get back into a functional state.

Does anybody have a solution for this behavior?
Bunk, Alexander 11 Reputation points

2025-01-26T13:34:20.3733333+00:00

We have the same problem in our 2-node cluster with Windows Server 2019. When it was installed in 2021, everything worked fine. As with @Jemandanders, it suddenly started, and we have had this behaviour for 2 years or so now. We haven't found any solution yet.
RICHARD PLOECHL 5 Reputation points

2025-06-04T12:47:19.03+00:00

Hi!

Is there any solution to this problem?

It turned out that at least one of our clusters has the same problem: it is not a problem during normal CAU updating, but it was a problem when we shut down the nodes for a longer time to clean the server hardware.

The problems start 20 min after shutdown of the other node (2-node cluster with S2D).

This was not the case in earlier times; maybe it was introduced by some update.

Best regards,

Richard
Jemandanders 15 Reputation points

2025-06-05T07:31:31.2+00:00

Hi,

sadly not.
By now I have been able to reproduce the issue on every S2D cluster I could get my hands on, regardless of old or new firmware, drivers, and Windows updates.
I tried it with 2019, 2022, and 2025 on different hardware from different vendors at different S2D certification levels, and it happens everywhere.
The only thing they share is being connected switchless for cluster traffic. But it doesn't look like a network issue.
It also doesn't matter if it's one big volume, several small ones, or even just one small one.
It looks like there is only a limited amount of changed blocks or so that S2D can take while a node is in maintenance or offline before the storage freezes.
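
One rough way to check the "limited amount of changed blocks" guess is to estimate how much data was written between the node going offline and the freeze, and to see whether that figure is similar across test runs. The following is only a sketch of that arithmetic in Python. The function name and the run values are made up for illustration, and nothing in this thread confirms that S2D actually has such a limit.

```python
def changed_data_gb(write_rate_mb_per_s, minutes_until_freeze):
    """Estimate the data written while a node is offline, in GB (1 GB = 1000 MB)."""
    return write_rate_mb_per_s * minutes_until_freeze * 60 / 1000


if __name__ == "__main__":
    # Hypothetical test runs, not measurements:
    # (average write rate in MB/s, minutes from node offline to storage freeze)
    runs = [(5.0, 120.0), (50.0, 12.0), (100.0, 6.0)]
    for rate, minutes in runs:
        written = changed_data_gb(rate, minutes)
        print(f"{rate:6.1f} MB/s for {minutes:6.1f} min: about {written:.1f} GB changed")
```

If the estimated amount comes out roughly constant across runs with very different write rates, that would support the guess; if it varies widely, a fixed time-based timeout would be the more likely explanation.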

Could you share some detail about your setup?

best regards
RICHARD PLOECHL 5 Reputation points

2025-08-13T11:00:41.0466667+00:00

Sorry for the late answer.

We have two network connections direct between the two nodes for Cluster traffic, no switch.

We have a third network connection on each node connected to a switch, connecting the nodes and the other hosts (PCs).

Best regards,

Richard

Answer 1

Net Runner 625

We had similar problems with the 4+ node clusters (including S2D-ready nodes). I think there is a problem with the storage resync/rebalance queue that does not switch to use available nodes/storage and keeps building up to the point where the entire cluster gets stuck in 3/4-way mirrored scenarios. We never had such a problem with 2-way mirrored storage pools.

Possible fixes/workarounds we used to stabilize our customer's environments:

Switch to 2-way mirroring where applicable https://learn.microsoft.com/en-us/azure-stack/hci/concepts/nested-resiliency (I believe you need four nodes for this to work reliably).
Upgrade to Windows Server 2022 (never witnessed that behavior on the latest Windows Server).
Replace S2D with Virtual SAN https://www.starwindsoftware.com/vsan.

Jemandanders 15 Reputation points

2024-06-04T13:24:49.33+00:00

Hi,

thanks for your reply.
We only have 3- and 4-way mirrors and can't switch to 2-way.

I will try updating to 2022, but half of our hosts only support up to Server 2019.
More and more problems are emerging in Server 2019 which never happened before and which don't happen in 2022.

I will try a fresh install with the most recent Server 2019 setup for the others; I hope that helps.

Answer 2

Anonymous

Hi Jemandanders,

Hope you're doing well.

As you suspect updates might be the cause, first verify if there have been any recent updates that coincide with the start of your issues. Check if any updates related to Hyper-V, Failover Clustering, or Storage Spaces Direct have been installed.
Check the health of your S2D cluster using "Get-ClusterPerf" and "Get-ClusterLog". The cluster log can provide detailed information about what's happening when nodes go offline. Then use the "Test-Cluster" cmdlet to run a full diagnostic test on your cluster.
Ensure that your network configuration is optimal for S2D and Failover Clustering. Any network issues can cause significant delays in I/O operations. Verify network performance and latency between the nodes using tools like "ping" and "Test-Cluster" cmdlet.
Check event logs for any errors or warnings related to clustering, storage, and Hyper-V.
Ensure that all firmware and drivers, especially for storage and network components, are up to date.

Best Regards,

Ian Xue

If the Answer is helpful, please click "Accept Answer" and upvote it.

Jemandanders 15 Reputation points

2024-06-04T13:20:09.2433333+00:00

Hi Ian,

thanks for your reply. Sadly we can't figure out which update caused this, because we can't really tell when it started. We usually restart nodes outside of business hours, when the issue is not noticeable.

None of the logs show the root cause of this, just timeouts for the storage. I've updated all drivers and firmware, but the problem stays the same.
Bunk, Alexander 11 Reputation points

2025-01-26T13:36:01.92+00:00

Same problem here. Systems are updated with the newest driver/firmware. In our case this also happens outside of business hours.
