How to increase throughput in ADF copy activity From Table Storage to Table Storage

Tomas Blažauskas 40 Reputation points
2024-09-26T13:10:27.4533333+00:00

Hello,
Short background: I'm using ADF to create Table Storage backups for specific storage accounts (each account has its own pipeline, with a separate copy activity per table), creating new tables from the existing ones with the Copy activity. The pipelines run on the Azure integration runtime (AutoResolveIntegrationRuntime) with 4 cores (+ 4 driver cores). The Data Factory and the storage account are in the same region.

At the moment, throughput varies between 70 KB/s and 800 KB/s for different tables. I tried changing the DIU count, max parallel copies, max concurrent connections, and write batch size for the copy activity, but it had minimal to no impact on throughput. While changing the settings, peak connections on the sink was always 4 and used parallel copies was 3. From what I have observed, writes to the sink seem to be the main source of throttling.

I have read through the troubleshooting and performance guides. Additionally, I have tried using a ForEach activity with the copy activity inside, but that did not help.

So the question is: are there other ways to increase throughput in my copy activities for Table Storage (for example, manual partitioning, or preloading the table and chunking it into separate copy activities) that would help?

My current pipeline:

{
    "name": "pipeline",
    "properties": {
        "activities": [
            {
                "name": "Foreach-activity",
                "type": "ForEach",
                "dependsOn": [],
                "userProperties": [],
                "typeProperties": {
                    "items": {
                        "value": "@pipeline().parameters.tablesToCopy",
                        "type": "Expression"
                    },
                    "isSequential": false,
                    "batchCount": 10,
                    "activities": [
                        {
                            "name": "Copy-activity",
                            "type": "Copy",
                            "dependsOn": [],
                            "policy": {
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "source": {
                                    "type": "AzureTableSource",
                                    "azureTableSourceIgnoreTableNotFound": false,
                                    "maxConcurrentConnections": 4
                                },
                                "sink": {
                                    "type": "AzureTableSink",
                                    "azureTableInsertType": "merge",
                                    "azureTablePartitionKeyName": "PartitionKey",
                                    "azureTableRowKeyName": "RowKey",
                                    "writeBatchSize": 20000,
                                    "maxConcurrentConnections": 16
                                },
                                "enableStaging": false,
                                "parallelCopies": 16,
                                "enableSkipIncompatibleRow": false,
                                "dataIntegrationUnits": 4
                            },
                            "inputs": [
                                {
                                    "referenceName": "DynamicAzureTableDatasetSource",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "TableName": "@{string(item().Source)}"
                                    }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "DynamicAzureTableDatasetOutput",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "TableName": "@{string(item().Destination)}"
                                    }
                                }
                            ]
                        }
                    ]
                }
            }
        ],
        "concurrency": 16,
        "parameters": {
            "tablesToCopy": {
                "type": "Array"
            }
        },
        "annotations": [],
        "lastPublishTime": "2024-09-26T09:34:49Z"
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

Accepted answer
  1. Vinodh247 23,666 Reputation points MVP
    2024-09-30T13:27:00.55+00:00

    Hi Tomas Blažauskas,

    Thanks for reaching out to Microsoft Q&A.

    Looking at your current pipeline configuration, there are a few areas where you could potentially enhance throughput for the copy activity between Table Storage tables.

    Here are a few suggestions that might help improve performance based on your setup:

    • Consider lowering the writeBatchSize if you observe throttling at the sink. Table Storage entity group transactions are capped at 100 entities per batch (all within a single partition), so a writeBatchSize of 20,000 far exceeds what one transaction can commit. Reducing it to around 100 entities per request might help (see the example configuration after this list).
    • You can also experiment with increasing maxConcurrentConnections beyond 16 if your Table Storage can handle more concurrent requests without throttling. Try 32 or 64 and observe the impact.
    • Increasing parallelCopies could be beneficial if your data is partitioned effectively. You can try raising it from 16 to 32 or 64, depending on how much the sink can handle in parallel without hitting throttling issues.
    • Ensure that your source data is well distributed across partitions. If most writes land on just a few partitions, the bottleneck is at the Table Storage partition level rather than in ADF. If possible, redistribute data across more partitions, or split the copy by PartitionKey ranges (see the example after the note below).
    • Increase the number of DIUs to 8 or 16 for greater parallel processing power. DIUs determine the compute allocated to each copy activity run on the Azure integration runtime, which could help if you're hitting resource limitations there.
    • In some cases, using an intermediate staging location (such as blob storage) might improve the overall performance, especially if direct transfers between Table Storage are hitting bottlenecks. This would allow you to copy data to a faster medium first and then to the sink in smaller chunks. However, this could add overhead, so it’s something to test.
    • Since throttling can cause intermittent slowdowns, it may be useful to increase the retry count to 3–5 with a small interval (retryIntervalInSeconds: 30–60) to help mitigate transient issues due to throttling or timeouts during sink writes.
    • Keep in mind that the pipeline-level concurrency setting (currently 16) controls how many runs of this pipeline can execute at the same time, not how parallel a single run is; raising it (for example, to 32) only helps if you trigger many backup runs concurrently.
    • The batchCount setting on the ForEach loop controls how many tables are copied concurrently. Since the loop is not sequential, increasing this value (for example, to 20 or more) lets more tables be processed in parallel, provided your storage accounts and integration runtime can handle the load.
    • Use Azure Storage metrics to check whether you're hitting Table Storage throttling limits. If so, further reduce the writeBatchSize, or spread out the load by adjusting concurrency or introducing delays between batch operations.
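
    For illustration, here is a minimal sketch of what the copy activity's policy and typeProperties could look like with the batch-size, retry, and parallelism adjustments above. The values are starting points to test against your own throttling metrics, not guaranteed optima; everything else in your pipeline stays as it is:

    {
        "policy": {
            "retry": 3,
            "retryIntervalInSeconds": 60,
            "secureOutput": false,
            "secureInput": false
        },
        "typeProperties": {
            "source": {
                "type": "AzureTableSource",
                "azureTableSourceIgnoreTableNotFound": false,
                "maxConcurrentConnections": 4
            },
            "sink": {
                "type": "AzureTableSink",
                "azureTableInsertType": "merge",
                "azureTablePartitionKeyName": "PartitionKey",
                "azureTableRowKeyName": "RowKey",
                "writeBatchSize": 100,
                "maxConcurrentConnections": 32
            },
            "enableStaging": false,
            "parallelCopies": 32,
            "enableSkipIncompatibleRow": false,
            "dataIntegrationUnits": 8
        }
    }

    The batchCount increase (for example, to 20) goes on the ForEach activity's typeProperties rather than on the copy activity itself.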

    Note: You can try these suggestions one by one to narrow down the bottleneck and improve throughput.
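
    If you want to pursue the manual-partitioning idea from your question, one option (an untested sketch; the range boundaries are hypothetical and depend on how your PartitionKey values are actually distributed) is to run two or more copy activities over the same table, each with an azureTableSourceQuery filter covering a different PartitionKey range, so reads and writes are spread across partitions in parallel. The source section of each copy activity would look something like:

    "source": {
        "type": "AzureTableSource",
        "azureTableSourceQuery": "PartitionKey ge 'A' and PartitionKey lt 'M'",
        "maxConcurrentConnections": 4
    }

    A second activity would then cover the remaining range (for example, "PartitionKey ge 'M'"), with both writing to the same destination table.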

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.
