Hello Lotus88!
Thank you for posting on Microsoft Learn.
The issue you're facing is common in Azure Synapse Data Flows when writing to large sink tables, especially when upserts or small incremental updates have to be applied against a large target.
If your sink is Parquet in ADLS Gen2, staging only the changed rows or switching the sink to Delta Lake format will noticeably improve incremental writes, because plain Parquet files can't be updated in place and force a rewrite of the whole dataset.
An "update if exists, insert otherwise" (upsert) pattern makes this worse, since it leads to costly row-by-row matching against the full sink.
Instead, filter down to only the changed rows before the sink, and then merge, overwrite, or bulk update just that slice if the format supports it.
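To make that concrete, here is a minimal PySpark + Delta Lake sketch of the "filter first, then merge" pattern. The storage paths, the id key, and the modified_date watermark column are assumptions for illustration only; swap in your own schema:

```python
# Minimal sketch (PySpark + Delta Lake on a Synapse Spark pool).
# Paths, the "id" key, and the "modified_date" watermark are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# 1) Read only the rows that actually changed since the last load.
changes = (
    spark.read.parquet("abfss://raw@<storageaccount>.dfs.core.windows.net/sales/")
         .filter(F.col("modified_date") >= "2024-01-01")  # your watermark
)

# 2) MERGE just that slice into the Delta sink instead of rewriting the whole table.
target = DeltaTable.forPath(
    spark, "abfss://curated@<storageaccount>.dfs.core.windows.net/sales_delta/"
)
(
    target.alias("t")
          .merge(changes.alias("s"), "t.id = s.id")
          .whenMatchedUpdateAll()
          .whenNotMatchedInsertAll()
          .execute()
)
```

With Delta, the merge only rewrites the data files that actually contain matching rows, which is why pre-filtering the source slice pays off.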
You mentioned using Round Robin and Hash with 8 partitions. If you're writing to a dedicated SQL pool:
- Match the sink partitioning to the distribution column of your target table if it is hash-distributed (a quick way to check that column is shown after this list).
- Use Sink partitioning: None if writing to a replicated table.
- 8 partitions may be too low. You could try 32 or 64 depending on your Spark pool size.
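If you're not sure which column the target is distributed on, you can look it up from the pool's metadata. A minimal sketch, assuming pyodbc and a hypothetical dbo.FactSales target (connection string values are placeholders):

```python
# Looks up the hash-distribution column of a dedicated SQL pool table so the
# Data Flow sink partitioning can be aligned with it. All names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;"
    "Database=<dedicated_pool>;Authentication=ActiveDirectoryInteractive;"
)

sql = """
SELECT c.name AS distribution_column
FROM sys.pdw_column_distribution_properties AS d
JOIN sys.columns AS c
  ON c.object_id = d.object_id AND c.column_id = d.column_id
WHERE d.distribution_ordinal = 1               -- 1 = the distribution column
  AND d.object_id = OBJECT_ID('dbo.FactSales');
"""

row = conn.cursor().execute(sql).fetchone()
print(row.distribution_column if row else "table is round-robin or replicated")
```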
Depending on what the flow does:
- If you're joining against a small reference table, use a broadcast join so the small side is replicated instead of shuffled.
- If you're reusing a dataset more than once, use the cache option before the sink to avoid recomputation (see the sketch after this list).
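For intuition, this is roughly what those two settings do in Spark terms (in the Data Flow UI they are the Broadcast option on the join and the cache option). Table paths and the product_id key are assumptions:

```python
# Broadcast join + caching, sketched in PySpark. Paths and the join key are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

facts = spark.read.format("delta").load("/curated/sales_delta")
dims = spark.read.format("delta").load("/reference/product_dim")  # small table

# Broadcast: ship the small dimension to every executor instead of shuffling the big side.
enriched = facts.join(broadcast(dims), on="product_id", how="left")

# Cache: materialize once when the same stream feeds several downstream branches/sinks.
enriched.cache()
enriched.count()  # forces materialization before the branches reuse it
```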
If your transformation is simple (like filtering or updating just a few records), use a Copy activity with a stored procedure or a pre-copy script to handle the updates via T-SQL. That is usually much faster than spinning up a full Spark session for a handful of rows.
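For reference, the T-SQL inside that stored procedure or pre-copy script could be a targeted MERGE like the one below. I've wrapped it in a small pyodbc harness only so the snippet is self-contained; the staging/target table names and connection values are made up:

```python
# Targeted T-SQL upsert from a staging table into the large target table.
# dbo.FactSales, dbo.FactSales_Staging, and the connection string are placeholders.
import pyodbc

MERGE_SQL = """
MERGE dbo.FactSales AS t
USING dbo.FactSales_Staging AS s
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET t.amount = s.amount, t.modified_date = s.modified_date
WHEN NOT MATCHED THEN
    INSERT (id, amount, modified_date) VALUES (s.id, s.amount, s.modified_date);
"""

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>.sql.azuresynapse.net;"
    "Database=<dedicated_pool>;Authentication=ActiveDirectoryInteractive;"
)
conn.cursor().execute(MERGE_SQL)
conn.commit()
```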
If writing to Synapse SQL (dedicated pool):
- Use a partitioned table as the target.
- Filter only the affected partition(s) before writing, so each run touches a small slice of the table (see the sketch below).
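A minimal sketch of that "load only the affected slice" idea, assuming a sale_date partitioning column and a staging folder that the downstream Copy activity or sink then picks up (all names and paths are placeholders):

```python
# Restrict the incremental load to the partition(s) that actually changed.
# sale_date, the date range, and the ADLS paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

updates = spark.read.format("delta").load(
    "abfss://curated@<storageaccount>.dfs.core.windows.net/sales_delta/"
)

# Only June's partition was touched, so only that slice is loaded to the SQL pool.
june_slice = updates.filter(F.col("sale_date").between("2024-06-01", "2024-06-30"))

# Stage just that slice; the Copy activity / sink then loads it into the matching
# partition of the dedicated SQL pool table.
june_slice.write.mode("overwrite").parquet(
    "abfss://staging@<storageaccount>.dfs.core.windows.net/sales/2024-06/"
)
```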