Azure Virtual Machine Scale Set instances aren't repaired even when the automatic repairs policy is enabled

2024-04-10

Azure Virtual Machine Scale Set (VMSS) instances remain in an "Unhealthy" state and aren't repaired even when the automatic repairs policy is enabled. This article describes the possible causes of this issue and their solutions:

Automatic repairs policy isn't correctly enabled in the scale set.
Health monitoring isn't correctly configured in the scale set.
The instance is marked unhealthy due to a provisioning failure.
Automatic repairs have been suspended in the scale set due to too many failed repairs.
The instance is in its grace period.

Automatic repairs policy isn't correctly enabled in the scale set

Confirm that your VMSS is opted into automatic repairs by viewing its service state.

Under the orchestrationServices property, if the serviceState for automatic repairs is Running, the VMSS is opted into automatic repairs.

If the serviceState is NotRunning or the automatic repairs policy doesn't show up under the orchestrationServices property, you must enable the automatic repairs policy in the scale set. For more information, see Enabling automatic repairs policy when updating an existing scale set.

If the serviceState is Suspended, go to Automatic repairs have been suspended in the scale set due to too many failed repairs.
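As a rough illustration of the check described above, the following Python sketch reads the orchestrationServices list from a scale set instance view that has already been fetched and parsed into a dictionary, and maps the serviceState to the next troubleshooting step. The serviceName value AutomaticRepairs and the exact JSON shape are assumptions based on the property names mentioned in this article, and the function name is hypothetical.

```python
def automatic_repairs_next_step(instance_view: dict) -> str:
    """Map the automatic repairs serviceState to a troubleshooting step."""
    # Assumed shape (not confirmed by this article):
    # {"orchestrationServices": [{"serviceName": ..., "serviceState": ...}]}
    services = instance_view.get("orchestrationServices") or []
    state = next(
        (
            service.get("serviceState")
            for service in services
            if service.get("serviceName") == "AutomaticRepairs"
        ),
        None,
    )
    if state == "Running":
        return "opted in: continue with the other causes"
    if state == "Suspended":
        return "suspended: resume automatic repairs"
    # NotRunning, or the policy is missing from orchestrationServices.
    return "not enabled: enable the automatic repairs policy"
```

A missing orchestrationServices property is treated the same way as NotRunning, which matches the guidance above.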

Health monitoring isn't correctly configured in the scale set

If all the instances in the scale set show up as "Unhealthy", your health monitoring probe might not have been configured correctly during setup. Make sure that your application emits the expected HTTP/HTTPS/TCP responses to the configured endpoints.

To report a "Healthy" status, the application health extension probes or load balancer health probes require, at minimum, a 2xx HTTP(S) response or a successful TCP handshake from your application at the configured endpoint. If the expected response isn't received, the instance is reported as "Unhealthy".

For more information about the expected TCP/HTTP(S) responses for load balancer health probes, see Load Balancer Custom Probes.

For more information about the expected TCP/HTTP(S) responses for application health extension probes, see the "Configure endpoint to provide health status" section in Requirements for using automatic instance repairs.
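To make the expected health signal concrete, here is a minimal Python sketch of an application endpoint that answers an HTTP probe with a 2xx response when the application is ready and a non-2xx response otherwise. The /health path, port 8080, and the app_ready flag are illustrative assumptions. Use whatever path and port your probe is actually configured with.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def health_response(path: str, app_ready: bool) -> Tuple[int, bytes]:
    """Return the status code and body for a probe request."""
    # Hypothetical probe path; it must match the configured endpoint.
    if path == "/health" and app_ready:
        return 200, b"OK"
    # Any non-2xx response is reported as "Unhealthy".
    return 503, b"Unavailable"


class HealthHandler(BaseHTTPRequestHandler):
    # Illustrative readiness flag; a real application would compute this.
    app_ready = True

    def do_GET(self) -> None:
        status, body = health_response(self.path, self.app_ready)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args) -> None:
        # Suppress per-request logging from frequent probe calls.
        pass


def make_server(port: int = 8080) -> HTTPServer:
    """Create a server that listens on all interfaces on the given port."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling make_server().serve_forever() starts the endpoint. If the probe targets a different path or port than the one the application serves, every instance reports "Unhealthy" even though the application itself is running.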

The instance is marked unhealthy due to a provisioning failure

Use the Get Instance View API with API version 2019-12-01 or later to view the provisioning state of the instances in the VMSS. The provisioning state is listed under statusesSummary in the virtualMachine property.

REST API

GET '/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmScaleSetName}/instanceView?api-version=2019-12-01'

"virtualMachine": {
        "statusesSummary": [
            {
                "code": "ProvisioningState/succeeded",
                "count": 2
            }
        ]
}

If statusesSummary contains a ProvisioningState/failed code, delete the failed instance and add a new instance to your scale set. Automatic instance repairs currently don't support scenarios where a virtual machine is marked "Unhealthy" due to a provisioning failure.

To remove the failed instance from your scale set, see Remove VMs from a scale set.

To add a new instance to your scale set, see Change the capacity of a scale set.
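The check for failed provisioning states described in this section can be sketched in Python as follows. The sketch assumes that the Get Instance View response has already been retrieved and parsed into a dictionary with the virtualMachine and statusesSummary properties shown earlier. The function name is hypothetical.

```python
def failed_instance_count(instance_view: dict) -> int:
    """Count instances whose provisioning state is reported as failed."""
    summary = instance_view.get("virtualMachine", {}).get("statusesSummary", [])
    return sum(
        entry.get("count", 0)
        for entry in summary
        # Compare case-insensitively in case the casing of the code varies.
        if entry.get("code", "").lower() == "provisioningstate/failed"
    )
```

A result greater than zero means that at least one instance must be deleted and replaced manually, because automatic repairs won't act on it.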

Automatic repairs have been suspended in the scale set due to too many failed repairs

If your application continues to emit an "Unhealthy" signal after repeated repair attempts, the platform will eventually suspend instance repairs as a safety measure by changing the serviceState for automatic repairs to Suspended.

Confirm the serviceState of your automatic repairs policy. To do this, see Viewing and updating the service state of automatic instance repairs policy.

If the serviceState is Suspended, resume automatic repairs by setting the serviceState back to Running. To do this, use the setOrchestrationServiceState API or the cmdlet examples in Viewing and updating the service state of automatic instance repairs policy.
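For orientation only, the following Python sketch assembles the pieces of a setOrchestrationServiceState request without sending it. The management.azure.com host, the request body fields (serviceName and action), the Resume action value, and the default API version are assumptions that aren't stated in this article. Verify them against the linked article before you use them. The function name is hypothetical, and authentication is out of scope.

```python
import json
from typing import Tuple


def build_resume_request(
    subscription_id: str,
    resource_group: str,
    scale_set_name: str,
    api_version: str = "2019-12-01",  # assumed; confirm the supported version
) -> Tuple[str, str, str]:
    """Return (method, url, body) for resuming automatic repairs."""
    url = (
        "https://management.azure.com"  # assumed Azure Resource Manager host
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.Compute/virtualMachineScaleSets"
        f"/{scale_set_name}/setOrchestrationServiceState"
        f"?api-version={api_version}"
    )
    # Assumed body shape; check it against the API reference.
    body = json.dumps({"serviceName": "AutomaticRepairs", "action": "Resume"})
    return "POST", url, body
```

Resuming repairs doesn't address why the earlier repairs failed. If the application still emits an "Unhealthy" signal, the platform is likely to suspend repairs again.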

The instance is in its grace period

If none of the preceding causes apply, the instance might be in its grace period.

The grace period is the amount of time that automatic repairs wait after any state change on the instance before performing repairs. This delay helps avoid premature or accidental repairs. The repair action should occur after the grace period ends for the instance. For more information on the grace period setting for automatic repairs, see Grace Period.
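The timing described above amounts to simple arithmetic, shown in this Python sketch. It only illustrates the idea. The platform tracks state changes and the grace period internally, and the function name, parameters, and example values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def grace_period_remaining(
    last_state_change: datetime,
    grace_period: timedelta,
    now: datetime,
) -> timedelta:
    """Return how long repairs are still deferred (zero if elapsed)."""
    elapsed = now - last_state_change
    return max(grace_period - elapsed, timedelta(0))


# Example: a 30-minute grace period, checked 10 minutes after a state change.
changed = datetime(2024, 4, 10, 12, 0, tzinfo=timezone.utc)
remaining = grace_period_remaining(
    changed, timedelta(minutes=30), changed + timedelta(minutes=10)
)
print(remaining)  # 0:20:00
```

Because any state change on the instance restarts the wait, an instance that changes state repeatedly can stay unrepaired for longer than a single grace period.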

Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.