How to increase throughput in ADF copy activity From Table Storage to Table Storage

Tomas Blažauskas 40 Reputation points
2024-09-26T13:10:27.4533333+00:00

Hello,
Short background: I'm using ADF to create Table Storage backups for specific storage accounts (each account has its own pipeline, with a separate copy activity per table), creating new tables from the existing ones with the Copy activity. The pipelines run on the Azure integration runtime (AutoResolveIntegrationRuntime) with 4 cores (+ 4 driver cores). The Data Factory and the storage account are in the same region.

At the moment, throughput varies between 70 KB/s and 800 KB/s for different tables. I tried changing the DIU count, max parallel copies, max concurrent connections, and write batch size for the copy activity, but it had minimal to no impact on throughput. While changing the settings, peak connections on the sink was always 4 and used parallel copies was 3. From what I have observed, writes to the sink seem to be the main source of throttling.

I have read through the troubleshooting and performance guides. Additionally, I have tried using a ForEach activity with the copy activity inside, but that did not help.

So the question is: are there other ways to increase throughput in my copy activities for Table Storage (for example, manual partitioning, or preloading the table and chunking it into separate copy activities) that would help?

My current pipeline:

{
    "name": "pipeline",
    "properties": {
        "activities": [
            {
                "name": "Foreach-activity",
                "type": "ForEach",
                "dependsOn": [],
                "userProperties": [],
                "typeProperties": {
                    "items": {
                        "value": "@pipeline().parameters.tablesToCopy",
                        "type": "Expression"
                    },
                    "isSequential": false,
                    "batchCount": 10,
                    "activities": [
                        {
                            "name": "Copy-activity",
                            "type": "Copy",
                            "dependsOn": [],
                            "policy": {
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "source": {
                                    "type": "AzureTableSource",
                                    "azureTableSourceIgnoreTableNotFound": false,
                                    "maxConcurrentConnections": 4
                                },
                                "sink": {
                                    "type": "AzureTableSink",
                                    "azureTableInsertType": "merge",
                                    "azureTablePartitionKeyName": "PartitionKey",
                                    "azureTableRowKeyName": "RowKey",
                                    "writeBatchSize": 20000,
                                    "maxConcurrentConnections": 16
                                },
                                "enableStaging": false,
                                "parallelCopies": 16,
                                "enableSkipIncompatibleRow": false,
                                "dataIntegrationUnits": 4
                            },
                            "inputs": [
                                {
                                    "referenceName": "DynamicAzureTableDatasetSource",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "TableName": "@{string(item().Source)}"
                                    }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "DynamicAzureTableDatasetOutput",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "TableName": "@{string(item().Destination)}"
                                    }
                                }
                            ]
                        }
                    ]
                }
            }
        ],
        "concurrency": 16,
        "parameters": {
            "tablesToCopy": {
                "type": "Array"
            }
        },
        "annotations": [],
        "lastPublishTime": "2024-09-26T09:34:49Z"
    },
    "type": "Microsoft.DataFactory/factories/pipelines"
}

Accepted answer
  1. Vinodh247 23,666 Reputation points MVP
    2024-09-30T13:27:00.55+00:00

    Hi Tomas Blažauskas,

    Thanks for reaching out to Microsoft Q&A.

    Looking at your current pipeline configuration, there are a few areas where you could potentially enhance throughput for the copy activity between Table Storage tables.

    Here are a few suggestions that might help improve performance based on your setup:

    • Consider lowering the writeBatchSize if you observe throttling at the sink. Table Storage entity group transactions are capped at 100 entities per batch (all within a single partition), so a writeBatchSize of 20,000 far exceeds what one transaction can commit. Reducing it to around 100 entities per request might help (see the example configuration after this list).
    • You can also experiment with increasing maxConcurrentConnections beyond 16 if your Table Storage can handle more concurrent requests without throttling. Try 32 or 64 and observe the impact.
    • Increasing parallelCopies could be beneficial if your data is partitioned effectively. You can try raising it from 16 to 32 or 64, depending on how much the sink can handle in parallel without hitting throttling issues.
    • Ensure that your source data is well distributed across partitions. If most writes land on just a few partitions, the bottleneck is at the Table Storage partition level rather than in ADF. If possible, redistribute data across more partitions, or split the copy by PartitionKey ranges (see the example after the note below).
    • Increase the number of DIUs to 8 or 16 for greater parallel processing power. DIUs determine the compute allocated to each copy activity run on the Azure integration runtime, which could help if you're hitting resource limitations there.
    • In some cases, using an intermediate staging location (such as blob storage) might improve the overall performance, especially if direct transfers between Table Storage are hitting bottlenecks. This would allow you to copy data to a faster medium first and then to the sink in smaller chunks. However, this could add overhead, so it’s something to test.
    • Since throttling can cause intermittent slowdowns, it may be useful to increase the retry count to 3–5 with a small interval (retryIntervalInSeconds: 30–60) to help mitigate transient issues due to throttling or timeouts during sink writes.
    • Keep in mind that the pipeline-level concurrency setting (currently 16) controls how many runs of this pipeline can execute at the same time, not how parallel a single run is; raising it (for example, to 32) only helps if you trigger many backup runs concurrently.
    • The batchCount setting on the ForEach loop controls how many tables are copied concurrently. Since the loop is not sequential, increasing this value (for example, to 20 or more) lets more tables be processed in parallel, provided your storage accounts and integration runtime can handle the load.
    • Use Azure Storage metrics to check whether you're hitting Table Storage throttling limits. If so, further reduce the writeBatchSize, or spread out the load by adjusting concurrency or introducing delays between batch operations.
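
    For illustration, here is a minimal sketch of what the copy activity's policy and typeProperties could look like with the batch-size, retry, and parallelism adjustments above. The values are starting points to test against your own throttling metrics, not guaranteed optima; everything else in your pipeline stays as it is:

    {
        "policy": {
            "retry": 3,
            "retryIntervalInSeconds": 60,
            "secureOutput": false,
            "secureInput": false
        },
        "typeProperties": {
            "source": {
                "type": "AzureTableSource",
                "azureTableSourceIgnoreTableNotFound": false,
                "maxConcurrentConnections": 4
            },
            "sink": {
                "type": "AzureTableSink",
                "azureTableInsertType": "merge",
                "azureTablePartitionKeyName": "PartitionKey",
                "azureTableRowKeyName": "RowKey",
                "writeBatchSize": 100,
                "maxConcurrentConnections": 32
            },
            "enableStaging": false,
            "parallelCopies": 32,
            "enableSkipIncompatibleRow": false,
            "dataIntegrationUnits": 8
        }
    }

    The batchCount increase (for example, to 20) goes on the ForEach activity's typeProperties rather than on the copy activity itself.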

    Note: You can try these suggestions one by one to narrow down the bottleneck and improve throughput.
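
    If you want to pursue the manual-partitioning idea from your question, one option (an untested sketch; the range boundaries are hypothetical and depend on how your PartitionKey values are actually distributed) is to run two or more copy activities over the same table, each with an azureTableSourceQuery filter covering a different PartitionKey range, so reads and writes are spread across partitions in parallel. The source section of each copy activity would look something like:

    "source": {
        "type": "AzureTableSource",
        "azureTableSourceQuery": "PartitionKey ge 'A' and PartitionKey lt 'M'",
        "maxConcurrentConnections": 4
    }

    A second activity would then cover the remaining range (for example, "PartitionKey ge 'M'"), with both writing to the same destination table.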

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.
