In a high availability configuration of Azure Data Factory, how can I reduce the 5-minute failover/recovery time for nodes that fail during pipeline runs?

Alex WB 0 Reputation points
2025-12-01T00:23:14.81+00:00

I am developing enterprise-grade pipelines that ingest from on-premises data sources using a self-hosted integration runtime, and I am running a multi-node high-availability configuration in Azure Data Factory. During runs, I've noticed that if a node fails, it takes about 5 minutes for it to become available again. That is a long time to wait at scale, especially in an enterprise environment like the one I am working in. Is there a way to reduce the 5-minute failover/recovery time for nodes that fail during pipeline runs? I'm looking for specific steps if possible.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Smaran Thoomu 32,520 Reputation points Microsoft External Staff Moderator
    2025-12-02T17:18:49.43+00:00

    Hey Alex, it sounds like you’re dealing with some frustrating delays due to node failures in your Azure Data Factory setup. The standard 5-minute failover time is indeed a limitation that can affect your pipeline runs, especially in enterprise-scale applications.

    Here are a few strategies you can try to reduce this downtime:

    1. Enable Parallel Execution: When configuring your Data Flow, enable the Run in parallel option so that multiple sinks are processed in parallel. This can reduce overall run time and soften the impact of startup delays when many pipelines are running.
      • In the Azure portal, go to your Data Factory resource, select the Data Flow, and enable the Run in parallel option.
    2. Monitor Node Performance: Regularly use Azure Monitor to track the performance and status of your integration runtime nodes. Set up alerts and dashboards for node status and startup times so you can spot struggling nodes early and make timely decisions about resource allocation (a small status-polling sketch follows after this list).
    3. Consider Over-Provisioning Nodes: Since you're using a self-hosted integration runtime (SHIR), spread your nodes across separate machines, and across availability zones if they run on Azure VMs. Running more node capacity than you strictly need lets the setup tolerate a node failure without a significant hit to throughput.
    4. Increase Compute Resources: If your pipelines are hitting resource limits during peak times, consider increasing the compute available to your IR nodes, or raising how many concurrent jobs each node handles. The extra headroom helps the surviving nodes absorb work when one node drops out (a per-node concurrent-jobs sketch follows after this list).
    5. Use Retry Policies: Make sure your pipeline activities have appropriate retry policies configured. That way, if a transient failure happens, Azure Data Factory automatically reruns the affected steps instead of failing the whole run (see the retry-policy sketch after this list).
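
    For point 2, if you want something lighter-weight than full dashboards, here is a minimal sketch of polling the self-hosted IR node status with the azure-mgmt-datafactory Python SDK. It assumes you have azure-identity and azure-mgmt-datafactory installed; the subscription, resource group, factory, and IR names are placeholders, and the attribute names are taken from the SDK models as I understand them, so double-check them against the package version you install:

    ```python
    import os
    import time

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # Placeholders - substitute your own resource names.
    SUBSCRIPTION_ID = os.environ["AZURE_SUBSCRIPTION_ID"]
    RESOURCE_GROUP = "my-rg"
    FACTORY_NAME = "my-adf"
    IR_NAME = "my-shir"

    client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    while True:
        status = client.integration_runtimes.get_status(RESOURCE_GROUP, FACTORY_NAME, IR_NAME)
        # For a self-hosted IR, the status properties include the registered nodes.
        nodes = getattr(status.properties, "nodes", None) or []
        for node in nodes:
            # node.status is typically "Online", "Offline", "Limited", "Upgrading", ...
            print(f"{node.node_name}: {node.status} (last connect: {node.last_connect_time})")
            if node.status != "Online":
                # Hook your alerting here (webhook, Log Analytics, email, ...).
                print(f"WARNING: {node.node_name} is not online")
        time.sleep(60)
    ```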
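
    For point 4, one knob you can adjust programmatically (rather than only adding hardware) is how many concurrent jobs each SHIR node is allowed to run. Another hedged sketch with placeholder names; only raise the limit as far as the node's CPU and memory can actually sustain:

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import UpdateIntegrationRuntimeNodeRequest

    # "<subscription-id>", "my-rg", "my-adf", "my-shir", and "Node_1" are placeholders.
    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Raise the number of activities a single SHIR node may run concurrently.
    client.integration_runtime_nodes.update(
        "my-rg", "my-adf", "my-shir", "Node_1",
        UpdateIntegrationRuntimeNodeRequest(concurrent_jobs_limit=24),
    )
    ```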
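
    For point 5, retries are configured per activity through its policy (retry count and retry interval). Below is a minimal sketch of deploying a copy activity with 3 retries at 2-minute intervals using the same Python SDK; the dataset, pipeline, and resource names are placeholders, and you should swap the source/sink model types for whatever your actual datasets use:

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        ActivityPolicy,
        BlobSink,
        CopyActivity,
        DatasetReference,
        PipelineResource,
        SqlServerSource,
    )

    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")  # placeholder

    # Retry up to 3 times, 2 minutes apart, with a 2-hour activity timeout, so a
    # transient node failure is retried automatically instead of failing the run.
    policy = ActivityPolicy(retry=3, retry_interval_in_seconds=120, timeout="0.02:00:00")

    copy_activity = CopyActivity(
        name="CopyFromOnPremSql",  # placeholder activity name
        inputs=[DatasetReference(type="DatasetReference", reference_name="OnPremSqlDataset")],   # placeholder
        outputs=[DatasetReference(type="DatasetReference", reference_name="BlobSinkDataset")],   # placeholder
        source=SqlServerSource(),
        sink=BlobSink(),
        policy=policy,
    )

    client.pipelines.create_or_update(
        "my-rg", "my-adf", "IngestOnPremData",  # placeholder names
        PipelineResource(activities=[copy_activity]),
    )
    ```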

    These steps can help you mitigate those annoying delays from node failures, though it's worth noting that some limitations might still apply based on your specific configuration and Azure service conditions.

    If you need more tailored suggestions or if these options do not seem feasible, here are a few follow-up questions to consider:

    • How many nodes are configured in your multi-node setup?
    • Have you checked if there are any ongoing incidents reported in Azure Service Health impacting your region?
    • What is the load on your integration runtime when the failovers occur?
    • Are you utilizing any specific monitoring tools or configurations to assist in diagnosing these issues?

    If you implement these suggestions and continue to face issues, feel free to reach out again for more help! Hope this helps!


