Brand new cluster experiences catastrophic failures when primary MC-LAG peer (Dell VLT) reloads

Ricky Saull 0 Reputation points
2024-08-26T13:59:24.4033333+00:00

This may or may not be considered outside the scope of this support forum, but I'm wondering if there is anyone here who has experience with a cluster attached to Dell switches configured in VLT...

We have a four-node Microsoft Failover Cluster, with each server equipped with a pair of NICs configured in a Switch Embedded Team (SET). Each NIC within the team is connected to one of the two peers in the VLT domain, with a single link per connection. The VLT domain connects to an “access” switch via a VLT port-channel with LACP, facilitating client access.
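The cabling described above can be sketched as a small model to make the expected behavior explicit: with one SET member per VLT peer, losing either peer should leave every node with one live link. This is only an illustration. The node and peer names are placeholders, not names from the actual environment.

```python
# Hypothetical model of the topology described in the post.
# Node and peer names are placeholders, not taken from the real environment.
NODES = ["node1", "node2", "node3", "node4"]
PEERS = ["vlt-peer-1", "vlt-peer-2"]


def build_links(nodes, peers):
    """Each node has one SET member NIC cabled to each VLT peer."""
    return {(node, peer) for node in nodes for peer in peers}


def surviving_links(links, failed_peer):
    """Links that remain up when one VLT peer is down."""
    return {(node, peer) for (node, peer) in links if peer != failed_peer}


def isolated_nodes(nodes, links):
    """Nodes left with no link at all."""
    connected = {node for (node, _) in links}
    return [node for node in nodes if node not in connected]


if __name__ == "__main__":
    links = build_links(NODES, PEERS)
    for peer in PEERS:
        remaining = surviving_links(links, peer)
        print(peer, "down: isolated nodes =", isolated_nodes(NODES, remaining))
```

On paper the design is symmetric, so neither peer failure should isolate a node. That is why the asymmetric behavior we see is so surprising.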

We have followed best practices and official documentation to ensure that SET and VLT are configured correctly. However, during fault/failure simulations, we consistently observe catastrophic outages affecting the cluster, but only when the tests are conducted against the primary VLT peer. These issues include nodes being dropped from the cluster, VMs failing, crashing, or entering a paused state, and Cluster Shared Volumes (CSVs) disconnecting.

For example, the following conditions will cause our cluster to enter a failed state and lose network connectivity for an unacceptable amount of time:

  • Reloading the primary VLT peer, either by pulling the power or by issuing the reload command
  • Administratively shutting down all server ports, the VLT port-channel uplink, and the VLTi on the primary VLT peer

The individual links to the servers fail over gracefully. Killing the VLTi on the secondary VLT peer also results in a graceful failover. Reloading the secondary VLT peer causes a graceful failover as well.
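To put a number on "an unacceptable amount of time", a probe along the following lines could be run from a client while each test is performed. It is a minimal sketch under assumptions: the host name is a placeholder, and TCP port 445 (SMB) is assumed to be reachable on the cluster nodes. Any port known to be listening would work as well.

```python
import socket
import time


def probe(host, port=445, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples):
    """Turn time-ordered (timestamp, reachable) samples into outage windows.

    Each window is (first_failed_timestamp, first_recovered_timestamp).
    If the target never recovers, the window ends at the last sample.
    """
    windows = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows


def monitor(host, duration_s=120, interval_s=0.5):
    """Probe host repeatedly for duration_s seconds and return the samples."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(host)))
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    # "node1.example.local" is a placeholder host name.
    for start, end in outage_windows(monitor("node1.example.local")):
        print(f"unreachable for {end - start:.1f}s")
```

Running one probe per node during a primary reload and again during a secondary reload would give directly comparable outage durations for the two cases.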

We expect each peer to handle failures similarly, but they clearly do not. We have a feeling this is a switch issue, but we're not certain; it could also be a configuration issue with the networking of our cluster. We’re out of ideas… and almost out of drywall to bang our heads against. Any assistance would be greatly appreciated.

Tags: Windows Server 2019, Windows, Windows Network, Windows Server Clustering