Hey Arpit Shukla,
It sounds like you’re moving roughly 5 TB of data through an Azure Databricks ETL every day via Azure Data Factory. That’s a serious workload, so you’ll want to tune everything end-to-end, from cluster sizing to failure handling. Here’s a collection of best practices you can apply:
- Cluster sizing & configuration
• Start with a medium-to-large Spark cluster (e.g. a 32-core driver with 16-core workers, or larger) and enable autoscaling (min/max nodes) so you don’t over-provision overnight but can scale up under peak load.
• Use spot/low-priority VMs for worker nodes if your jobs are fault-tolerant (retry on preemption).
• Configure auto-termination (e.g. 15–30 minutes idle) to control costs.
• Tune Spark configs to boost parallelism, e.g. set spark.sql.shuffle.partitions to match your total worker core count (or a small multiple of it), and ensure you’re not artificially limiting parallel tasks.
- Partitioning strategy
• Ingest data into time-based partitions (e.g. year=, month=, day= folders) so downstream reads and deletes are scoped to the latest partitions.
• Within Spark, avoid manual coalesce(1) or single-partition writes; use hash or range partitioning on your high-cardinality keys to distribute data evenly.
• Only repartition when you detect skew (through the Spark UI); default partitioning is usually best.
- Delta Lake optimization
• After your daily write, run Delta’s OPTIMIZE command on hot tables to compact small files into large Parquet files (~256 MB–1 GB each).
• Use ZORDER BY on columns you frequently filter or join on.
• Schedule regular VACUUM jobs (e.g. daily) to purge stale files that are past the retention threshold (7 days by default) and keep storage efficient.
- Incremental loading
• Leverage Delta Time Travel or a watermark column (e.g. ingest_timestamp) to only process new/changed data.
• Ingestion frameworks: use Auto Loader (cloudFiles) to continuously pick up new files with schema inference and incremental markers.
• If you’re running batch daily, filter your source on the date partition or watermark so you only read the 5 TB of “new” data, not historical.
- File compaction & maintenance
• Create a dedicated Databricks job that runs nightly/weekly to compact small files on large tables.
• Avoid excessive tiny file creation by tuning your micro-batch sizes (for Structured Streaming) or your merge strategies (for batch).
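To make the OPTIMIZE / ZORDER BY / VACUUM advice above a bit more concrete, here is a minimal Python sketch that only builds the SQL statements a maintenance job might run. The table name, column names, and date are hypothetical, and the 168-hour retention simply mirrors Delta’s 7-day default. Note that the WHERE clause of OPTIMIZE needs to filter on partition columns, so this assumes event_date is one.

```python
from typing import Optional, Sequence


def optimize_sql(table: str,
                 zorder_cols: Sequence[str] = (),
                 where: Optional[str] = None) -> str:
    """Build a Delta OPTIMIZE statement, optionally scoped and Z-ordered."""
    sql = f"OPTIMIZE {table}"
    if where:
        # The predicate should reference partition columns only.
        sql += f" WHERE {where}"
    if zorder_cols:
        sql += f" ZORDER BY ({', '.join(zorder_cols)})"
    return sql


def vacuum_sql(table: str, retain_hours: int = 168) -> str:
    """Build a VACUUM statement; 168 hours mirrors the 7-day default."""
    return f"VACUUM {table} RETAIN {retain_hours} HOURS"


# Hypothetical table and column names, for illustration only.
statements = [
    optimize_sql("sales.events", ["customer_id"], "event_date >= '2024-01-01'"),
    vacuum_sql("sales.events"),
]

for stmt in statements:
    print(stmt)
    # On a Databricks cluster you would run: spark.sql(stmt)
```

Scoping OPTIMIZE to recent partitions keeps the nightly job from rewriting files that were already compacted on earlier runs.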
- Parallel notebook execution
• Break your ETL into logical stages and spin up multiple Databricks job tasks or separate notebooks in parallel.
• In ADF, use the “For Each” activity with notebook tasks, turn off sequential execution, and set the batch count (degree of parallelism) to a level your cluster can actually absorb.
• Or convert to a Databricks multi-task job and kick it off from ADF.
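If you would rather fan out from inside a single driver notebook, one possible sketch uses a thread pool. It assumes you pass in a runner with the same shape as dbutils.notebook.run(path, timeout_seconds, arguments); the stand-in runner, notebook paths, and run_date parameter below are hypothetical and exist only so the example runs outside Databricks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_in_parallel(run_notebook: Callable[[str, int, Dict[str, str]], str],
                    notebooks: List[str],
                    args: Dict[str, str],
                    max_workers: int = 4,
                    timeout_seconds: int = 3600) -> Dict[str, str]:
    """Run each notebook concurrently and return {path: exit value}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            path: pool.submit(run_notebook, path, timeout_seconds, args)
            for path in notebooks
        }
        # .result() re-raises any notebook failure in the caller.
        return {path: fut.result() for path, fut in futures.items()}


# Stand-in runner so the sketch executes anywhere; on Databricks you
# would pass dbutils.notebook.run instead.
def fake_runner(path: str, timeout_seconds: int,
                arguments: Dict[str, str]) -> str:
    return f"{path} done for {arguments['run_date']}"


results = run_in_parallel(
    fake_runner,
    ["/etl/load_orders", "/etl/load_customers"],  # hypothetical paths
    {"run_date": "2024-01-01"},
)
print(results)
```

Keep max_workers modest: every notebook shares the same cluster, so more threads than the cluster can serve just queues Spark jobs behind each other.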
- Retry & restart logic in ADF
• In your ADF pipeline’s Databricks activity, set the “Retry” count and retry interval, e.g. 3 attempts with a 5-minute wait.
• Use ADF’s “On Failure” dependency paths or an “Until” loop to trap errors, send alerts, and optionally re-ingest only failed partitions.
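For reference, the retry settings above end up in the activity’s policy block in the pipeline JSON. The sketch below builds that fragment as a Python dict and prints it; the activity name, linked service name, notebook path, parameter name, and 4-hour timeout are all assumptions for illustration, so compare against the JSON view of your own pipeline before copying anything.

```python
import json

# Hypothetical activity, linked service, and notebook path names.
notebook_activity = {
    "name": "RunDailyEtl",
    "type": "DatabricksNotebook",
    "linkedServiceName": {
        "referenceName": "AzureDatabricksLinkedService",
        "type": "LinkedServiceReference",
    },
    "policy": {
        "retry": 3,                     # up to 3 retry attempts
        "retryIntervalInSeconds": 300,  # 5-minute wait between attempts
        "timeout": "0.04:00:00",        # d.hh:mm:ss, assumed 4-hour cap
    },
    "typeProperties": {
        "notebookPath": "/etl/daily_load",
        "baseParameters": {
            "run_date": {
                "value": "@pipeline().parameters.runDate",
                "type": "Expression",
            }
        },
    },
}

print(json.dumps(notebook_activity, indent=2))
```

Retries re-run the whole notebook, so they are only safe if the notebook itself is idempotent.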
- Monitoring & alerting
• Use Azure Monitor + Log Analytics to capture cluster metrics (CPU, memory, shuffle read/write).
• Review the Spark UI for each run to pinpoint long-running stages or skewed partitions.
• In ADF, configure pipeline alerts on failed runs, high activity duration, or missed SLAs.
- Cost control
• Right-size clusters via autoscale and auto-terminate.
• Where possible, use spot instances for workers.
• Monitor your Databricks DBU spend and set budgets/alerts in Azure Cost Management.
• Archive cold data off high-performance storage tiers.
- Handling failures & idempotency
• Make your ETL idempotent: use merge/upsert patterns in Delta rather than blind overwrites.
• If a job fails halfway, record processed partitions in a control table so retries can pick up where they left off.
• In ADF, leverage checkpoints (e.g. storing last-processed date in a metadata table) so you don’t reprocess everything.
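The control-table idea above can be sketched in plain Python. Here an in-memory set stands in for the real control table (in practice probably a small Delta table), and load_partition is a hypothetical callback for whatever loads one day of data. Both are assumptions, not a prescribed design.

```python
from datetime import date, timedelta
from typing import Callable, Iterable, List, Set


def pending_partitions(wanted: Iterable[date],
                       processed: Set[date]) -> List[date]:
    """Return the partitions not yet recorded as processed, oldest first."""
    return sorted(d for d in wanted if d not in processed)


def run_with_checkpoints(wanted: Iterable[date],
                         processed: Set[date],
                         load_partition: Callable[[date], None]) -> List[date]:
    """Load only missing partitions, recording each one after it succeeds."""
    done = []
    for day in pending_partitions(wanted, processed):
        load_partition(day)   # should itself be idempotent, e.g. a Delta MERGE
        processed.add(day)    # record success only after the load returns
        done.append(day)
    return done


# Simulated control table: 2024-01-01 was already processed earlier.
control_table = {date(2024, 1, 1)}
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(3)]
loaded: List[date] = []

first_run = run_with_checkpoints(days, control_table, loaded.append)
second_run = run_with_checkpoints(days, control_table, loaded.append)
print(first_run)   # only the two days that were missing
print(second_run)  # nothing left to do on a retry
```

Because a partition is recorded only after its load returns, a crash mid-run leaves that partition pending, and the next retry picks it up without touching the ones already done.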
Hope that gives you a solid playbook—tweak sizes and frequencies based on your actual run times and SLAs. Good luck!
Reference list
- Troubleshoot long-running Databricks notebooks in ADF: https://learn.microsoft.com/azure/databricks/spark-performance
- Azure Monitor for Databricks & ADF: https://learn.microsoft.com/azure/azure-monitor/
- Optimize Data Flows in Azure Data Factory: https://learn.microsoft.com/azure/data-factory/concepts-data-flow-performance
- Monitor ADF pipelines: https://learn.microsoft.com/azure/data-factory/monitor-visually
- Architecture best practices for Azure Databricks: https://learn.microsoft.com/azure/well-architected/service-guides/azure-databricks?wt.mc_id=knowledgesearch_inproduct_azure-cxp-community-insider#performance-efficiency
- Delta Lake OPTIMIZE & VACUUM: https://learn.microsoft.com/azure/databricks/delta/optimizations
- Auto Loader (CloudFiles) incremental ingestion: https://learn.microsoft.com/azure/databricks/ingestion/cloud-object-storage/auto-loader/