Share via

"How can i optimize a 5 TB daily ETL workload in Azure Databricks orchestrated through Azure Data Factory?"

Arpit Shukla 21 Reputation points
2026-05-17T06:25:04.67+00:00

How can I optimize a 5 TB daily ETL workload in Azure Databricks orchestrated through Azure Data Factory?

Please include specific recommendations for cluster sizing, partitioning strategy, Delta Lake optimization, incremental loading, file compaction, parallel notebook execution, retry/restart logic, monitoring, cost control, and handling failures within the ADF pipeline.

Azure Databricks
Azure Databricks

An Apache Spark-based analytics platform optimized for Azure.

0 comments No comments

4 answers

Sort by: Most helpful
  1. Pilladi Padma Sai Manisha 8,405 Reputation points Microsoft External Staff Moderator
    2026-05-18T04:12:15.1066667+00:00

    Hey Arpit Shukla,

    it sounds like you’re moving roughly 5 TB of data through an Azure Databricks ETL every day via Azure Data Factory. That’s a serious workload, so you’ll want to tune everything end-to-end—from cluster sizing to failure handling. Here’s a collection of best practices you can apply:

    1. Cluster sizing & configuration • Start with a medium-to-large Spark cluster—e.g. 32-core driver + 16-core workers or larger—and enable autoscaling (min/max nodes) so you don’t over-provision overnight but can scale up under peak load. • Use spot/low-priority VMs for worker nodes if your jobs are fault-tolerant (retry on preemption). • Configure auto-termination (e.g. 15–30 minutes idle) to control costs. • Tune Spark configs to boost parallelism—e.g. spark.sql.shuffle.partitions to match your core count, and ensure you’re not artificially limiting parallel tasks.
    2. Partitioning strategy • Ingest data into time-based partitions (e.g. year=, month=, day= folders) so downstream reads and deletes are scoped to the latest partitions. • Within Spark, avoid manual coalesce(1) or single-partition writes—use hash or dynamic range partitioning on your high-cardinality keys to evenly distribute data. • Only repartition when you detect skew (through Spark UI); default partitioning is usually best.
    3. Delta Lake optimization • After your daily write, run Delta’s OPTIMIZE command on hot tables to compact small files into large Parquet files (~256 MB–1 GB each). • Use ZORDER BY on columns you frequently filter or join. • Schedule regular VACUUM jobs (e.g. daily) to purge stale files and keep storage efficient.
    4. Incremental loading • Leverage Delta Time Travel or a watermark column (e.g. ingest_timestamp) to only process new/changed data. • Ingestion frameworks: use Auto Loader (cloudFiles) to continuously pick up new files with schema inference and incremental markers. • If you’re running batch daily, filter your source on the date partition or watermark so you only read the 5 TB of “new” data, not historical.
    5. File compaction & maintenance • Create a dedicated Databricks job that runs nightly/weekly to compact small files on large tables. • Avoid excessive tiny file creation by tuning your streaming/batch micro-batch sizes (for structured streaming) or merge strategies in batch.
    6. Parallel notebook execution • Break your ETL into logical stages and spin up multiple Databricks job tasks or separate notebooks in parallel. • In ADF, use the “For Each” activity with notebook tasks and set a degree of parallelism to match your cluster’s cores. • Or convert to a Databricks multi-task job and kick it off from ADF.
    7. Retry & restart logic in ADF • In your ADF pipeline’s Databricks activity, set “Retry” count and interval—e.g. 3 attempts with a 5-minute wait. • Use ADF’s “On Failure” or “Until” paths to trap errors, send alerts, and optionally re-ingest only failed partitions.
    8. Monitoring & alerting • Use Azure Monitor + Log Analytics to capture cluster metrics (CPU, memory, shuffle read/write). • Enable Spark UI for each run to pinpoint long-running stages or skewed partitions. • In ADF, configure pipeline alerts on failed runs, high activity duration, or missed SLAs.
    9. Cost control • Right-size clusters via autoscale and auto-terminate. • Where possible, use spot instances for workers. • Monitor your Databricks DBU spend and set budgets/alerts in Azure Cost Management. • Archive cold data off high-performance storage tiers.
    10. Handling failures & idempotency

    • Make your ETL idempotent: use merge/upsert patterns in Delta rather than blind overwrites.

    • If a job fails halfway, record processed partitions in a control table so retries can pick up where they left off.

    • In ADF, leverage checkpoints (e.g. storing last-processed date in a metadata table) so you don’t reprocess everything.

    Hope that gives you a solid playbook—tweak sizes and frequencies based on your actual run times and SLAs. Good luck!

    Reference list

    Was this answer helpful?


  2. Sina Salam 29,516 Reputation points Volunteer Moderator
    2026-05-17T14:22:53.65+00:00

    Hello Arpit Shukla,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you would like to know how you can optimize a 5 TB daily ETL workload in Azure Databricks orchestrated through Azure Data Factory.

    You will need to put into consideration running a large daily workload with right Databricks job compute, incremental ingestion, Delta Lake layout, compaction strategy, bounded parallelism, restart logic, and monitoring controls.

    For best practice, follow the steps:

    • Use Azure Databricks job compute, not all-purpose clusters, for scheduled production ETL workloads.
    • Process only incremental data, instead of reprocessing the full 5 TB every day.
    • Store all durable ETL layers in Delta Lake using a Bronze/Silver/Gold pattern.
    • Use liquid clustering for new Delta tables where possible, instead of blindly partitioning or ZORDERing every table.
    • Run OPTIMIZE and VACUUM as controlled maintenance operations, not as unbounded work inside the main ETL path.
    • Control Azure Data Factory notebook parallelism to avoid Databricks job throttling and unnecessary cluster pressure.
    • Configure ADF retries, timeouts, failure branches, and Databricks idempotent writes so failed stages can be safely rerun.
    • Monitor using ADF Monitor, Databricks Jobs/System Tables, and Azure Monitor for root-cause analysis and cost visibility.

    After moving the workload to Databricks job compute, applying incremental ingestion, optimizing Delta table layout, bounding ADF parallel notebook execution, and adding restartable/idempotent processing, the ETL pipeline becomes more reliable, easier to troubleshoot, and significantly more cost-controlled for a daily 5 TB workload.

    Use the below resource links for more reading and steps:

    I hope this is helpful! Do not hesitate to let me know if you have any other questions, steps or clarifications.


    Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.

    Was this answer helpful?

    0 comments No comments

  3. kagiyama yutaka 3,420 Reputation points
    2026-05-17T09:50:34.0133333+00:00

    I think u keep a 5TB/day ETL fast by touching the partitions that seem to have changed and letting Delta do the incremental merge, with autoscale, daily OPTIMIZE, and ADF fan‑out+retries keeping each step light enough to recover cleanly.

    Was this answer helpful?

    0 comments No comments

  4. AI answer

    2026-05-17T06:25:30.1966667+00:00

    For a 5 TB daily ETL workload in Azure Databricks orchestrated by Azure Data Factory (ADF), optimization focuses on compute configuration, data layout, Delta Lake features, and cost controls.

    Cluster sizing and autoscaling

    • Use workload-specific clusters instead of a single generic large cluster. For heavy ETL, choose job clusters sized for high memory/CPU; for lighter tasks, use smaller clusters to avoid overprovisioning.
    • Enable automatic scaling for variable workloads so clusters grow and shrink based on load, with clear min/max node limits to control costs.
    • For unpredictable or bursty ETL, consider serverless jobs for parts of the pipeline to eliminate idle time costs and pay only for compute used.

    Partitioning strategy

    • Organize Delta Lake tables with partitioning aligned to common filters (for example, ingestion date or business date) so Spark can prune partitions and avoid full-table scans.
    • Ensure partition granularity is balanced: too many small partitions cause metadata overhead; too few create large partitions that slow queries.
    • Align data layout with typical query patterns to minimize full-table scans.

    Delta Lake optimization

    • Use Delta Lake features like table statistics and the cost-based optimizer to improve query planning.
    • Run OPTIMIZE with Z-ordering on large Delta tables, choosing Z-order columns that match frequent filter and join predicates (for example, date, customer, or key columns). This improves data skipping and reduces I/O.
    • Use Photon acceleration for supported workloads to reduce compute time through vectorized execution.

    Incremental loading

    • Minimize data movement and reprocessing by implementing incremental ETL: process only new or changed data instead of full reloads.
    • Use partition-based processing (for example, process only the latest date partition) and maintain change tracking in metadata tables.

    File compaction and file size control

    • Use Delta Lake compaction to merge many small files into fewer large files, reducing storage overhead and improving query performance.
    • Enable auto compaction on write workloads:
      • At table level: set autoOptimize.autoCompact table property.
      • At session level: set spark.databricks.delta.autoCompact.enabled.
    • Prefer the auto option for auto compaction to let Databricks autotune target file size based on table size.
    • Adjust spark.databricks.delta.autoCompact.maxFileSize and spark.databricks.delta.autoCompact.minNumFiles as needed to control compaction behavior.

    Caching and query efficiency

    • Turn on Delta Cache for clusters that repeatedly access the same datasets to accelerate reads using SSD-based caching.
    • Use efficient file formats like Parquet with appropriate compression (for example, Zstandard) to reduce I/O and storage.
    • Avoid unnecessary full-table scans; rely on partition pruning, predicate pushdown, and join optimization.

    Parallel notebook execution and pipeline design

    • Use workload-specific compute configurations and separate job clusters per major ETL stage to parallelize independent steps.
    • Apply patterns like Compute Resource Consolidation and Queue-Based Load Leveling to balance load across clusters.

    Retry/restart logic and failure handling in ADF

    • Design ADF pipelines so each ETL stage is idempotent and partition-based, allowing safe retries of failed partitions without reprocessing the entire 5 TB.
    • Implement retry policies at activity level and ensure Delta operations can be re-run without data duplication (for example, using MERGE patterns and partition overwrite semantics).
    • Use automated spending guardrails and policies to limit cluster size and lifetime, preventing runaway jobs.

    Monitoring and cost control

    • Optimize cluster utilization by enabling automatic termination for idle clusters and encouraging cluster sharing where appropriate.
    • Use Databricks usage monitoring and system tables to track cluster usage and DBU consumption, and set cost alerts and budgets via Cost Management.
    • Monitor performance metrics (runtime, shuffle, cache hit ratios) and adjust partitioning, file sizes, and cluster sizes accordingly.
    • Implement automated data lifecycle policies (for example, VACUUM and retention rules) to remove old data versions and reduce storage costs.
    • Move cold, infrequently accessed data to cheaper storage tiers and use lifecycle rules for archival.

    Network and I/O

    • Choose high-performance storage (for example, SSD-backed) and design the architecture to process data close to where it is stored, minimizing data movement and bandwidth costs.

    These combined practices help sustain a 5 TB daily ETL workload with predictable performance and controlled cost in Azure Databricks orchestrated by ADF.


    References:

    AI-generated content may be incorrect. Read our transparency notes for more information.

    Was this answer helpful?

Your answer

Answers can be marked as 'Accepted' by the question author and 'Recommended' by moderators, which helps users know the answer solved the author's problem.