How to improve data flow sink performance

Lotus88 116 Reputation points
2025-04-16T06:30:56.5566667+00:00

Hi,

I am using Synapse for ETL. Even when I update only 3 records, the sink processing time still takes about 7 minutes.

My target table contains about 6 million records. I tried to optimize the sink by setting round robin and hash partitioning with 8 partitions, but the sink processing time still seems very high. Is there a way to improve it?

Thank you!

Azure Synapse Analytics

1 answer

  1. Amira Bedhiafi 31,391 Reputation points
    2025-04-16T12:22:22.76+00:00

    Hello Lotus88!

    Thank you for posting on Microsoft Learn.

    The issue you're facing is common in Azure Synapse Data Flows when writing to large sink tables, especially when upserts or small updates are performed on large datasets.

    If your sink is Parquet or Delta Lake in ADLS Gen2, staging only the changed rows or switching to Delta Lake will noticeably speed up incremental writes, since Delta can merge changes instead of rewriting the whole dataset.

    If you're doing an upsert ("update if exists, insert otherwise") against the full table, the sink has to do costly row-by-row matching across all 6 million rows.

    Instead, filter down to only the changed rows before the sink, and then overwrite or bulk-update just those rows if the dataset supports it.
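
    If the sink is Delta Lake, the upsert can be expressed as one set-based merge that only rewrites the files containing matching rows. A minimal Spark SQL sketch, assuming a hypothetical Delta table sales_delta and a temp view changed_rows that holds only the few changed records:

    ```sql
    -- Hypothetical names: 'sales_delta' is the Delta sink table,
    -- 'changed_rows' is a view containing only the updated records.
    MERGE INTO sales_delta AS t
    USING changed_rows AS s
        ON t.SaleId = s.SaleId
    WHEN MATCHED THEN
        UPDATE SET t.Amount = s.Amount, t.ModifiedDate = s.ModifiedDate
    WHEN NOT MATCHED THEN
        INSERT (SaleId, Amount, ModifiedDate)
        VALUES (s.SaleId, s.Amount, s.ModifiedDate);
    ```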

    You mentioned using Round Robin and Hash with 8 partitions. If you're writing to a dedicated SQL pool:

    • Match the sink partitioning to the distribution column of your target table if it is hash-distributed (see the T-SQL sketch after this list).
    • Use Sink partitioning: None if writing to a replicated table.
    • 8 partitions may be too low; you could try 32 or 64 depending on your Spark pool size.
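
    As a reference, you can check how the target table is distributed and, if needed, recreate it hash-distributed on the key you upsert by, so the sink's hash partitioning can line up with it. A T-SQL sketch for a dedicated SQL pool, with hypothetical table and column names:

    ```sql
    -- Check the current distribution policy of the target table.
    SELECT t.name, dp.distribution_policy_desc
    FROM sys.tables AS t
    JOIN sys.pdw_table_distribution_properties AS dp
        ON t.object_id = dp.object_id
    WHERE t.name = 'FactSales';

    -- Hypothetical hash-distributed copy of the target; the data flow
    -- sink's hash partitioning can then use the same key (CustomerId).
    CREATE TABLE dbo.FactSales_Hash
    WITH
    (
        DISTRIBUTION = HASH(CustomerId),
        CLUSTERED COLUMNSTORE INDEX
    )
    AS SELECT * FROM dbo.FactSales;
    ```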

    If you are:

    • Joining a small reference table, use a broadcast join (see the Spark SQL illustration after this list).
    • Reusing a dataset, use the cache option before the sink to avoid recomputation.
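
    In a data flow, broadcast is just a setting on the join transformation; purely for illustration, the equivalent Spark SQL hint looks like this (hypothetical table names):

    ```sql
    -- The small reference table is copied to every executor,
    -- so the large fact stream is never shuffled for the join.
    SELECT /*+ BROADCAST(r) */
           f.*,
           r.RegionName
    FROM   fact_sales AS f
    JOIN   dim_region AS r
           ON f.RegionId = r.RegionId;
    ```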

    If your transformation is simple (like filtering or updating just a few records), consider a Copy activity with a stored procedure or a pre-copy script to handle the updates via T-SQL. This is much faster than spinning up a full Spark data flow for a handful of rows.
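
    For example, a Copy activity could land the handful of changed rows into a small staging table and then call a stored procedure that applies them set-based. A sketch with hypothetical object names, assuming MERGE is available in your dedicated SQL pool (otherwise an UPDATE plus INSERT pair achieves the same):

    ```sql
    -- Hypothetical procedure, called from the pipeline after a Copy
    -- activity has loaded only the changed rows into dbo.StageSales.
    CREATE PROCEDURE dbo.usp_ApplySalesChanges
    AS
    BEGIN
        -- Set-based upsert of the few staged rows into the large target.
        MERGE dbo.FactSales AS t
        USING dbo.StageSales AS s
            ON t.SaleId = s.SaleId
        WHEN MATCHED THEN
            UPDATE SET t.Amount = s.Amount, t.ModifiedDate = s.ModifiedDate
        WHEN NOT MATCHED THEN
            INSERT (SaleId, CustomerId, Amount, ModifiedDate)
            VALUES (s.SaleId, s.CustomerId, s.Amount, s.ModifiedDate);

        -- Reset staging for the next pipeline run.
        TRUNCATE TABLE dbo.StageSales;
    END;
    ```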

    If writing to Synapse SQL (dedicated pool):

    • Use a partitioned table.
    • Filter only the affected partition(s) before writing (see the sketch below).
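
    A T-SQL sketch of a date-partitioned, hash-distributed target (hypothetical names and boundaries); an update that carries a predicate on the partition column then only touches the relevant partition instead of scanning all 6 million rows:

    ```sql
    -- Hypothetical partitioned, hash-distributed target table.
    CREATE TABLE dbo.FactSales
    (
        SaleId       BIGINT        NOT NULL,
        CustomerId   INT           NOT NULL,
        Amount       DECIMAL(18,2) NULL,
        SaleDate     DATE          NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerId),
        CLUSTERED COLUMNSTORE INDEX,
        PARTITION (SaleDate RANGE RIGHT FOR VALUES
            ('2024-01-01', '2024-07-01', '2025-01-01'))
    );

    -- Filtering on the partition column lets the engine prune to the
    -- partition(s) that actually contain the affected rows.
    UPDATE dbo.FactSales
    SET    Amount = 0
    WHERE  SaleDate >= '2025-01-01'
      AND  SaleId IN (1001, 1002, 1003);
    ```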

