How to create an incremental pipeline in Data Flow that deletes records in the target when they are not present in the source
Hi Team,
Could you please help us? We have to create an incremental pipeline in Synapse Data Flow that handles insert, update, and delete (if a record is not present in the source, it should be deleted in the target Azure SQL database). Can you please confirm whether this is possible?
If possible, can you please share the steps with screenshots and the logic? That would help a lot.
We also have a huge volume of data, around 40 million records. Can the incremental data flow handle this scenario quickly? Please also let us know if there are any other ways to speed up the processing.
Source: Delta file. Target: Azure SQL.
Thanks in Advance
Azure Synapse Analytics
-
J N S S Kasyap • 1,620 Reputation points • Microsoft External Staff
2025-04-17T18:20:42.6233333+00:00 Hi @SaiSekhar, MahasivaRavi (Philadelphia)
You can absolutely implement Insert, Update, and Delete logic using Change Data Capture (CDC) in Azure Synapse Analytics. CDC is a very efficient way to handle incremental data loads, especially with large datasets like your 40 million records.
Using CDC simplifies the process of detecting changes (inserts, updates, deletes) compared to a watermark-based approach, as it leverages metadata about changes in the Delta file.
Performance optimization for 40 million records:
- Use hash partitioning on ID in the Delta CDC source, and partition the Azure SQL table by ID or LastModifiedDate.
- Configure Azure Data Lake Storage Gen2 as a staging area in the Azure SQL sink to minimize direct database load.
- Increase the Azure SQL DTUs/vCores (e.g., Premium tier) and the Data Flow compute size (e.g., Medium/Large) with Enhanced compute.
- Set the sink write batch size to 10,000–100,000 rows for efficient processing.
- Create a clustered index on ID; disable nonclustered indexes during loading and rebuild them afterward using pre/post SQL scripts (see the sketch after the links below).
- Run OPTIMIZE and apply Z-Order indexing on ID in the Delta table to enhance read performance.
Please refer to the resources below to implement the Insert, Update, and Delete logic using Change Data Capture (CDC) in Azure Synapse Analytics.
https://www.youtube.com/watch?v=YTqu4Ppg66c
https://www.youtube.com/watch?v=02omAy7q2Zw&list=PL6Y5avAYJUaTBaBL0em6qyz6mTVLCiX6P&index=33
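As mentioned in the clustered-index and OPTIMIZE points above, here is a minimal sketch of what those scripts can look like. The table, index, and storage path names are hypothetical placeholders to adjust to your own schema, and OPTIMIZE ... ZORDER BY assumes your Spark pool's Delta Lake version supports it:
-- One-time setup on Azure SQL (hypothetical table/index names): clustered index on the key column.
CREATE CLUSTERED INDEX CIX_Contacts_ID ON dbo.Contacts (ID);
-- Pre-copy script on the Azure SQL sink: disable nonclustered indexes before the bulk load.
-- Keep the clustered index enabled, otherwise the table becomes unreadable.
ALTER INDEX IX_Contacts_Email ON dbo.Contacts DISABLE;
ALTER INDEX IX_Contacts_Name ON dbo.Contacts DISABLE;
-- Post-copy script on the Azure SQL sink: rebuild the indexes once the load completes.
ALTER INDEX IX_Contacts_Email ON dbo.Contacts REBUILD;
ALTER INDEX IX_Contacts_Name ON dbo.Contacts REBUILD;
-- On the Synapse Spark side, compact the Delta table and Z-Order on the join key (hypothetical path).
OPTIMIZE delta.`abfss://container@account.dfs.core.windows.net/path/to/delta` ZORDER BY (ID);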
Disclaimer: This response contains a reference to a third-party World Wide Web site. Microsoft is providing this information as a convenience to you. Microsoft does not control these sites and has not tested any software or information found on these sites; therefore, Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there. There are inherent dangers in the use of any software found on the Internet, and Microsoft cautions you to make sure that you completely understand the risk before retrieving any software from the Internet.
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Thank you.
-
SaiSekhar, MahasivaRavi (Philadelphia) • 140 Reputation points
2025-04-18T05:42:57.4033333+00:00 Hi @J N S S Kasyap,
Thanks for the update. The above videos cover Azure SQL to Azure SQL database, but my requirement is Delta file to Azure SQL. Can you please tell us whether there is any solution to do the full load in a faster way, or provide the steps for an incremental load? That would be more helpful.
We have around 40 million rows in the Delta file, and the count will increase on a daily basis. A full load is taking more than 4 hours (can you please tell us whether there are any techniques to improve the performance?).
Alternatively, is there any way to do an incremental load for 40 million rows that handles insert, update, and delete, and will it perform faster compared to a full load? Please help with the solution. Thanks in advance.
-
Singireddi JagadishKumar • 0 Reputation points • Microsoft External Staff
2025-04-21T09:57:34.2933333+00:00 Hello @SaiSekhar, MahasivaRavi (Philadelphia)
To perform an incremental data load from a Delta file to Azure SQL using Synapse Data Flows, a recommended approach involves using a Full Outer Join between the source (Delta) and the target (Azure SQL).
This allows efficient handling of insert, update, and delete operations, even with large datasets exceeding 40 million records. After the join, a Derived Column or Conditional Split is used to identify row types—whether they are inserts, updates, or deletes.
The Alter Row transformation is then applied to route each row type accordingly. The data is finally written to the Azure SQL sink with upsert and delete operations enabled using a key column.
For performance optimization, you can apply partitioning on your preferred partitioning column on both the source and the target, filter the source data using the ModifiedDate column to minimize the load volume, and use a compute-optimized Integration Runtime.
Additionally, tuning sink settings such as batch size and degree of parallelism is crucial for handling high-volume loads effectively. For further guidance, refer to resources such as this tip on incremental upserts, this article on incremental loads, and this Synapse-focused guide.
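For reference, the insert/update/delete routing described above is conceptually the same as a T-SQL MERGE against the target. A minimal sketch, assuming a hypothetical key column ContactId and compared columns Name and Email staged from the Delta source:
-- Conceptual T-SQL equivalent of the Full Outer Join + Alter Row pattern (hypothetical names).
-- Note: NOT MATCHED BY SOURCE deletes every target row absent from src, so src must hold the
-- full source snapshot, not just an incremental slice.
MERGE dbo.Contacts AS tgt
USING staging.Contacts_Delta AS src
    ON tgt.ContactId = src.ContactId
WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.Email <> src.Email)
    THEN UPDATE SET tgt.Name = src.Name, tgt.Email = src.Email
WHEN NOT MATCHED BY TARGET
    THEN INSERT (ContactId, Name, Email) VALUES (src.ContactId, src.Name, src.Email)
WHEN NOT MATCHED BY SOURCE
    THEN DELETE;
In the data flow itself, the Alter Row policies combined with the sink's upsert and delete options keyed on the same column produce this behavior.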
-
SaiSekhar, MahasivaRavi (Philadelphia) • 140 Reputation points
2025-04-21T11:01:09.2533333+00:00 Thanks for the update,
Can you please provide the code for the insert, update, and delete logic in the Alter Row transformation that we need to write after the full outer join?
Is there any way in Data Flow to extract only the incremental data?
E.g., correct me if I'm wrong: in the below screenshot there is a Delta file path for April 20th with three files. Can we process those three files as an incremental load? They contain the latest data. The table has 40 million rows, but the data processed on April 20th into those 3 files won't have 40 million records;
it will only have the latest data. Can we process that data incrementally into Azure SQL?
Can you please provide code to extract the latest file data from the Delta file and load it into Azure SQL (using Data Flow or any other way)? It needs to handle insert, update, and delete. Can you please confirm whether this scenario is achievable or not?
-
Singireddi JagadishKumar • 0 Reputation points • Microsoft External Staff
2025-04-22T08:17:28.87+00:00 Filter by file timestamp (in the Data Flow source): if the Delta files have a last-modified timestamp, select wildcards in your Data Flow source and use a Filter transformation after the source to keep only the files from April 20:
toDate(toTimestamp(lastModified)) == toDate('2025-04-20')
lastModified is an implicit metadata field in Synapse Data Flow sources if you enable metadata columns.
Full outer join and Alter Row code: the join type is Full Outer Join on ContactId.
Add a Derived Column named rowAction: use a Derived Column transformation after the join with the following expression:
iif(isNull(src@ContactId), 'Delete', iif(isNull(tgt@ContactId), 'Insert', iif(src@Name != tgt@Name || src@Email != tgt@Email, 'Update', 'NoChange')))
(Assuming the two join input streams are named src and tgt; duplicate column names from different streams are referenced as streamName@columnName.)
Call this column rowAction.
Alter Row expressions (in the Alter Row transformation), one condition per operation:
Insert if: rowAction == 'Insert'
Update if: rowAction == 'Update'
Delete if: rowAction == 'Delete'
You configure this directly in the Alter Row transformation's settings.
Sink configuration (Azure SQL): the sink type is Azure SQL; enable Upsert (for insert/update) and Delete (for delete); set Key column(s) = ContactId.
Alternative way to load only the incremental files: instead of filtering in the Data Flow, you can:
- Use a Get Metadata activity in the pipeline to get the file timestamps.
- Filter the file paths with a condition like LastModified >= variables('lastRunDate').
- Pass the filtered file paths to the Data Flow via parameters.
-
SaiSekhar, MahasivaRavi (Philadelphia) • 140 Reputation points
2025-04-22T11:58:23.2133333+00:00 Hi @Singireddi JagadishKumar
Thanks for the update,
You mentioned enabling metadata columns in the Data Flow and checking "Allow wildcards". Where can I find that option in the below screenshot?
For the Alter Row insert, update, and delete, we have to pass the conditions as below, right? Could you please confirm.
-
SaiSekhar, MahasivaRavi (Philadelphia) • 140 Reputation points
2025-04-22T13:13:43.25+00:00 Hi @Singireddi JagadishKumar ,
can you please provide update on the above request
-
Singireddi JagadishKumar • 0 Reputation points • Microsoft External Staff
2025-04-22T14:16:34.3933333+00:00 Hello @SaiSekhar, MahasivaRavi (Philadelphia)
You can check the wildcard paths in the Source options.
Regarding "For the Alter Row insert, update, and delete, we have to pass the conditions as below, right?": yes, you can use the same option.
-
SaiSekhar, MahasivaRavi (Philadelphia) • 140 Reputation points
2025-04-23T05:32:10.1133333+00:00 Hi @Singireddi JagadishKumar ,
The wildcard option is not working; we have tried the below, but it is throwing an error.
The folder is storing Delta files, and there is a change data capture option. Is there any way we can use it to extract the latest records?
at Source 'source1': Path /*.parquet does not resolve to any file(s). Please make sure the file/folder exists and is not hidden. At the same time, please ensure special character is not included in file/folder name, for example, name starting with _
Can you please help me with the whole flow and the code in the Data Flow? Please share screenshots; that would really help a lot.
-
Singireddi JagadishKumar • 0 Reputation points • Microsoft External Staff
2025-04-23T09:31:27.78+00:00 Hello @SaiSekhar, MahasivaRavi (Philadelphia)
Could you please provide the sample Delta file you have been working with, along with a screenshot of the error and details of the exact point where you are facing the issue? I will attempt to reproduce the scenario again on my end and get back to you with the findings.
-
SaiSekhar, MahasivaRavi (Philadelphia) • 140 Reputation points
2025-04-23T10:39:54.93+00:00 The folder we have is where the Delta files are being saved. I have created a new dataset with the Parquet format and am passing *.parquet as the wildcard path value. First, can you please confirm this, so that we can proceed further with development?
Can you please explain the below points?
- For Delta files, if we are extracting the data incrementally without a last-modified column field in the data, is it possible to do an incremental load to Azure SQL (can the Delta log support this)?
- Can you provide the logic for the incremental load, i.e., handling insert/update (without passing all the columns in the logic)/delete in one data flow? Please provide the logic and the code to use in the data flow (the blog suggested above doesn't have this information).
-
SaiSekhar, MahasivaRavi (Philadelphia) • 140 Reputation points
2025-04-23T11:55:45.9633333+00:00 Can you please provide an update on this?