Hi @Alex WB
Thank you for reaching out to Microsoft Q&A. Below is an analysis along with detailed mitigation steps to help resolve your issue.
Azure Data Factory (ADF) is a managed service, so you can’t tweak an internal “failover timer” to shave seconds off recovery when a node drops. That 5‑minute window you’re seeing is part of how the platform orchestrates health and rebalancing. The way to keep pipelines moving is to design around that window: spread the load across multiple nodes, add a second IR/region to fall back to, and make your pipelines self‑healing.
-- Run Self‑Hosted IR in active‑active (2–4 nodes)
If one node fails, others keep working—no need to wait for the failed node to come back. ADF natively supports up to four nodes per self‑hosted IR for higher availability and throughput. Set this up and keep nodes healthy (CPU/memory headroom, OS auto‑start of the IR service, standard patching) so the cluster can absorb a node drop gracefully.
Helpful KBs:
- Create & configure self‑hosted IR (Microsoft Learn) – shows multi‑node HA and considerations.
- Q&A: High availability/multiple nodes – quick overview of benefits and the four‑node limit.
- Step‑by‑step HA cluster setup (community walkthrough).
-- Add pipeline‑level failover between IRs/regions
Treat IRs like lanes on a highway. If the primary lane is blocked, have your pipeline hop to the alternate:
- Use global parameters or activity expressions to reference the IR/linked service dynamically.
- Wrap critical activities with retry + an If Condition/Until pattern that reruns on the secondary IR when the first attempt fails.
- For scheduled triggers that don’t have native pipeline retries, use a small control‑flow scaffold (Wait → re‑Execute Pipeline → cap retries).
Helpful KBs:
- Trigger comparison & retry behavior (tumbling vs. scheduled) – Stack Overflow summary + doc links.
- Retry patterns for ADF/Synapse (worked examples).
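The retry‑then‑hop pattern above can be sketched as plain control flow. This is a minimal sketch, not ADF's API: `run_activity`, the IR names, and the retry settings are illustrative stand‑ins for the retry policy on an execution activity plus the If Condition path that reruns on the secondary IR.

```python
import time

def run_with_failover(run_activity, primary_ir, secondary_ir,
                      retries=2, retry_interval_s=60):
    """Try an activity on the primary IR with retries, then once on the
    secondary IR. `run_activity(ir)` is a hypothetical callable that
    raises on failure and returns the activity result on success."""
    for attempt in range(retries + 1):
        try:
            return run_activity(primary_ir)
        except Exception:
            if attempt < retries:
                time.sleep(retry_interval_s)  # back off before the next try
    # All primary attempts exhausted: hop to the alternate "lane"
    return run_activity(secondary_ir)
```

In a real pipeline the same shape is expressed declaratively: the activity's retry policy covers the inner loop, and the failure path (If Condition/Until) re‑executes the step against the secondary IR.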
-- Consider Azure IR and cross‑region options (where possible)
If your workloads can run on Azure IR (no on‑prem private network dependency), you get elastic, managed compute and can scale Data Integration Units (DIUs) for copy performance. For Data Flows, ADF supports running in another region as a recovery mechanism, which can help during localized issues.
Helpful KBs:
- Integration runtime concepts & types (Microsoft Learn).
- DIU scaling guidance for Copy (discussion + doc reference).
- Data Flow on cross‑region IR (Microsoft Tech Community update).
-- Monitor IR health and automate your response
Don’t wait until run time to discover that a node is unhealthy. Wire up Azure Monitor alerts on IR status, queue length, CPU/memory, and pipeline failures, and trigger a Logic App or Automation runbook to flip pipelines to the secondary IR or restart services as needed.
Helpful KBs:
- Monitor Integration Runtime (statuses & PowerShell).
- Monitor Data Factory (metrics, alerts, diagnostics to Log Analytics).
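The routing decision the Logic App makes can be reduced to a tiny function. A hedged sketch: the status map shape and IR names below are hypothetical, standing in for whatever your alert or a `Get-AzDataFactoryV2IntegrationRuntime` check surfaces.

```python
def choose_ir(ir_status, primary="ir-primary", secondary="ir-secondary"):
    """Pick the IR for the next run from a status map, e.g.
    {"ir-primary": "Offline", "ir-secondary": "Online"} (hypothetical shape).
    Prefers the primary, falls back to the secondary, fails loudly if
    neither is healthy so the alert escalates instead of silently queuing."""
    if ir_status.get(primary) == "Online":
        return primary
    if ir_status.get(secondary) == "Online":
        return secondary
    raise RuntimeError("No healthy integration runtime available")
```

The design choice worth copying is the last line: if both IRs are down, raise rather than guess, so your Action Group pages a human instead of letting runs pile up.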
-- Right‑size and scale the IR cluster
If the cluster is constantly busy, ADF has fewer options when a node disappears. Use IR metrics (CPU, memory, queue length, queue duration) to decide when to scale out (add nodes) or scale up (bigger VMs/concurrent job limits).
Helpful KB:
- Self‑Hosted IR scaling & DR best practices (metrics to watch, dashboard JSON) – Pythian blog.
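That scale‑out/scale‑up decision can be captured as a simple rule. A minimal sketch: the 80% CPU/memory and 120‑second queue‑duration thresholds are illustrative, not Microsoft‑recommended values; only the four‑node cap comes from the docs.

```python
def scale_decision(cpu_pct, mem_pct, avg_queue_duration_s,
                   node_count, max_nodes=4):
    """Decide a scaling action from SHIR metrics.
    Thresholds are illustrative; tune them to your own baseline."""
    overloaded = (cpu_pct > 80 or mem_pct > 80
                  or avg_queue_duration_s > 120)
    if overloaded and node_count < max_nodes:
        return "scale-out"   # add another node (ADF caps SHIR at 4 nodes)
    if overloaded:
        return "scale-up"    # bigger VM or raise the concurrent-job limit
    return "no-action"
```

Evaluate this against sustained metrics (e.g. a 15‑minute average), not single samples, so a momentary spike doesn't trigger churn.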
-- Why you won’t find a “reduce 5‑minute failover” switch
That recovery period is tied to the platform’s managed orchestration; ADF doesn’t expose a granular knob to shorten it. Microsoft’s reliability guidance points you to redundancy (multi‑node IR), regional diversity, idempotent pipelines, and alerts/automation—all aimed at keeping work moving even while the platform stabilizes.
Helpful KB:
- Reliability in Azure Data Factory (resiliency patterns & SLA notes) – GitHub, Microsoft Docs.
-- A concrete blueprint you can implement this week
- Primary SHIR (Region A): 3 nodes (active‑active) on separate VMs; IR service set to auto‑start; patching in rotation.
- Secondary SHIR (Region B): 2 nodes, same configuration; linked services duplicated; credentials in Key Vault.
- Pipeline control‑flow:
  - Execution activities: retry = 2–3, retryIntervalInSeconds = 60–120.
  - On failure path: switch linkedServiceName to the secondary IR and re‑execute the specific step.
- Alerting & automation: Azure Monitor alerts on IR Offline, QueueLength, ActivityFailedRuns → Action Group triggers Logic App to mark primary “unhealthy” and route subsequent runs to the secondary IR for the next 15–30 minutes.
- Capacity hygiene: Watch CPU/memory and queue duration; if sustained high, add a node (up to four) or increase concurrent jobs on the node.
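The activity‑level settings in the blueprint map onto the standard `policy` block of an ADF execution activity. A sketch of that JSON fragment, expressed as a Python dict; the specific values are the illustrative ones from the blueprint, not recommendations, and the 1‑hour timeout is an assumed example.

```python
# Sketch of an ADF execution-activity policy implementing the blueprint.
# retry / retryIntervalInSeconds / timeout are standard activity policy
# fields; the concrete numbers here are examples from the blueprint above.
activity_policy = {
    "policy": {
        "timeout": "0.01:00:00",        # days.hh:mm:ss - 1 hour (example)
        "retry": 3,                     # 2-3 retries per the blueprint
        "retryIntervalInSeconds": 120,  # 60-120 s between attempts
        "secureInput": False,
        "secureOutput": False,
    }
}
```

Paste the equivalent JSON onto each critical execution activity; the failover hop to the secondary IR still lives in the pipeline's failure path, not in this policy block.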
-- A quick note on VM Scale Sets
Customers often ask whether SHIR can run on VM Scale Sets to auto‑scale. Today, VMSS isn’t supported for SHIR installation; you scale by adding discrete nodes (VMs), up to four per IR.
-- If your workloads can use Azure IR
Prefer Azure IR (managed) whenever your data paths allow—it removes most node management concerns, and you can tune DIU for throughput rather than worrying about failover. Pair that with cross‑region options for Data Flows if needed.