Hello Lotus88!
Thank you for posting on Microsoft Learn.
The issue you're facing is common in Azure Synapse Data Flows when writing to large sink tables, especially when upserts or small incremental updates have to be applied against a large target.
If your sink is Parquet in ADLS Gen2, staging only the changed rows or switching the sink to Delta Lake format will noticeably improve incremental writes, because plain Parquet files can't be updated in place and force a rewrite of the whole dataset.
An "update if exists, insert otherwise" (upsert) pattern makes this worse, since it leads to costly row-by-row matching against the full sink.
Instead, filter down to only the changed rows before the sink, and then merge, overwrite, or bulk update just that slice if the format supports it.
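To make that concrete, here is a minimal PySpark + Delta Lake sketch of the "filter first, then merge" pattern. The storage paths, the id key, and the modified_date watermark column are assumptions for illustration only; swap in your own schema:

```python
# Minimal sketch (PySpark + Delta Lake on a Synapse Spark pool).
# Paths, the "id" key, and the "modified_date" watermark are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# 1) Read only the rows that actually changed since the last load.
changes = (
    spark.read.parquet("abfss://raw@<storageaccount>.dfs.core.windows.net/sales/")
         .filter(F.col("modified_date") >= "2024-01-01")  # your watermark
)

# 2) MERGE just that slice into the Delta sink instead of rewriting the whole table.
target = DeltaTable.forPath(
    spark, "abfss://curated@<storageaccount>.dfs.core.windows.net/sales_delta/"
)
(
    target.alias("t")
          .merge(changes.alias("s"), "t.id = s.id")
          .whenMatchedUpdateAll()
          .whenNotMatchedInsertAll()
          .execute()
)
```

With Delta, the merge only rewrites the data files that actually contain matching rows, which is why pre-filtering the source slice pays off.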
You mentioned using Round Robin and Hash with 8 partitions. If you're writing to a dedicated SQL pool:
- Match the sink partitioning to the distribution column of your target table if it is hash-distributed (a quick way to check that column is shown after this list).
- Use Sink partitioning: None if writing to a replicated table.
- 8 partitions may be too low. You could try 32 or 64 depending on your Spark pool size.
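If you're not sure which column the target is distributed on, you can look it up from the pool's metadata. A minimal sketch, assuming pyodbc and a hypothetical dbo.FactSales target (connection string values are placeholders):

```python
# Looks up the hash-distribution column of a dedicated SQL pool table so the
# Data Flow sink partitioning can be aligned with it. All names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;"
    "Database=<dedicated_pool>;Authentication=ActiveDirectoryInteractive;"
)

sql = """
SELECT c.name AS distribution_column
FROM sys.pdw_column_distribution_properties AS d
JOIN sys.columns AS c
  ON c.object_id = d.object_id AND c.column_id = d.column_id
WHERE d.distribution_ordinal = 1               -- 1 = the distribution column
  AND d.object_id = OBJECT_ID('dbo.FactSales');
"""

row = conn.cursor().execute(sql).fetchone()
print(row.distribution_column if row else "table is round-robin or replicated")
```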
Depending on what the flow does:
- If you're joining against a small reference table, use a broadcast join so the small side is replicated instead of shuffled.
- If you're reusing a dataset more than once, use the cache option before the sink to avoid recomputation (see the sketch after this list).
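For intuition, this is roughly what those two settings do in Spark terms (in the Data Flow UI they are the Broadcast option on the join and the cache option). Table paths and the product_id key are assumptions:

```python
# Broadcast join + caching, sketched in PySpark. Paths and the join key are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.read.format("delta").load("/curated/sales_delta")
dims = spark.read.format("delta").load("/reference/product_dim")  # small table

# Broadcast: ship the small dimension to every executor instead of shuffling the big side.
enriched = facts.join(broadcast(dims), on="product_id", how="left")

# Cache: materialize once when the same stream feeds several downstream branches/sinks.
enriched.cache()
enriched.count()  # forces materialization before the branches reuse it
```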
If your transformation is simple (like filtering or updating just a few records), use a Copy activity with a stored procedure or a pre-copy script to handle the updates via T-SQL. That is usually much faster than spinning up a full Spark session for a handful of rows.
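For reference, the T-SQL inside that stored procedure or pre-copy script could be a targeted MERGE like the one below. I've wrapped it in a small pyodbc harness only so the snippet is self-contained; the staging/target table names and connection values are made up:

```python
# Targeted T-SQL upsert from a staging table into the large target table.
# dbo.FactSales, dbo.FactSales_Staging, and the connection string are placeholders.
import pyodbc

MERGE_SQL = """
MERGE dbo.FactSales AS t
USING dbo.FactSales_Staging AS s
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET t.amount = s.amount, t.modified_date = s.modified_date
WHEN NOT MATCHED THEN
    INSERT (id, amount, modified_date) VALUES (s.id, s.amount, s.modified_date);
"""

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;"
    "Database=<dedicated_pool>;Authentication=ActiveDirectoryInteractive;"
)
conn.cursor().execute(MERGE_SQL)
conn.commit()
```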
If writing to Synapse SQL (dedicated pool):
- Use a partitioned table as the target.
- Filter only the affected partition(s) before writing, so each run touches a small slice of the table (see the sketch below).
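A minimal sketch of that "load only the affected slice" idea, assuming a sale_date partitioning column and a staging folder that the downstream Copy activity or sink then picks up (all names and paths are placeholders):

```python
# Restrict the incremental load to the partition(s) that actually changed.
# sale_date, the date range, and the ADLS paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

updates = spark.read.format("delta").load(
    "abfss://curated@<storageaccount>.dfs.core.windows.net/sales_delta/"
)

# Only June's partition was touched, so only that slice is loaded to the SQL pool.
june_slice = updates.filter(F.col("sale_date").between("2024-06-01", "2024-06-30"))

# Stage just that slice; the Copy activity / sink then loads it into the matching
# partition of the dedicated SQL pool table.
june_slice.write.mode("overwrite").parquet(
    "abfss://staging@<storageaccount>.dfs.core.windows.net/sales/2024-06/"
)
```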