How to automatically delete migrated blobs (with Data Factory) based on the original last modified date?

Domenico Fasano 0 Reputation points
2024-09-30T13:09:07.0733333+00:00

I have a Storage Account with a lifecycle management policy. I need to migrate the blobs to a new Storage Account and ensure that these blobs get deleted x days after their original last modified date. What can I do?

So far, I have managed to create a Data Factory pipeline with a copy activity, and I see there is a way to create custom metadata holding the original last modified date (the $$LASTMODIFIED value). What is the best way to use it for what I want to achieve? Is there a better/simpler solution?

I think one can use Azure Functions to delete the blobs, but I have millions of blobs and a function times out after 5-10 minutes.

Azure Blob Storage
An Azure service that stores unstructured data in the cloud as blobs.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. Sina Salam 10,416 Reputation points
    2024-09-30T13:29:52.6533333+00:00

    Hello Domenico Fasano,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you would like to automatically delete migrated blobs (with Data Factory) based on their original last modified date.

    Follow these links for best practices on cleaning up files with the built-in Delete activity in Azure Data Factory (https://azure.microsoft.com/en-us/updates/clean-up-files-by-built-in-delete-activity-in-azure-data-factory) and on working with the Delete activity in Azure Data Factory (https://www.sqlservercentral.com/articles/working-with-the-delete-activity-in-azure-data-factory).

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.


  2. Vinodh247 20,141 Reputation points
    2024-09-30T13:44:41.4266667+00:00

    Hi Domenico Fasano,

    Thanks for reaching out to Microsoft Q&A.

    To automatically delete migrated blobs based on their original last modified date at large volumes, here’s a suggested approach using ADF, lifecycle management policies, and potentially Azure Functions for scalability.

    Copying Blobs with Metadata

    • You have already mentioned using ADF to migrate blobs and capturing the '$$LASTMODIFIED' value in custom metadata for the copied blobs in the destination storage account. This works because it preserves the original last modified date. The key is to make sure this metadata gets applied to the new blobs in the target storage account.
    • In the ADF copy activity, ensure the '$$LASTMODIFIED' value is captured and applied as custom metadata. You can configure this in the copy activity's "Sink" settings, in the metadata section, and store it with a custom name like 'OriginalLastModifiedDate'.
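As a rough illustration of the copy activity setting described above, the sketch below builds the sink portion of a copy activity definition as a Python dict and prints it as JSON. It assumes a binary copy into Azure Blob Storage; the helper name `build_sink` and the key `OriginalLastModifiedDate` are illustrative choices, and the fragment should be checked against your own pipeline JSON before use.

```python
import json

# Hypothetical metadata key; "$$LASTMODIFIED" is the reserved ADF variable
# mentioned in the question.
METADATA_KEY = "OriginalLastModifiedDate"


def build_sink(metadata_key: str = METADATA_KEY) -> dict:
    """Return the sink section of a copy activity as a plain dict.

    Assumes a binary copy into Azure Blob Storage; adjust the sink type
    to match the dataset format used in your pipeline.
    """
    return {
        "type": "BinarySink",
        "storeSettings": {
            "type": "AzureBlobStorageWriteSettings",
            # Each entry becomes one custom metadata pair on the copied blob.
            "metadata": [
                {"name": metadata_key, "value": "$$LASTMODIFIED"},
            ],
        },
    }


if __name__ == "__main__":
    # Print the fragment so it can be compared with the pipeline JSON.
    print(json.dumps(build_sink(), indent=2))
```

After a test run, it is worth checking one copied blob in the destination account to confirm that the metadata key is present and to see the exact timestamp format it holds.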

    Implement Lifecycle Management in the Destination Account

    • Azure Storage lifecycle management policies can delete blobs based on time conditions such as days since last modification, creation, or last access, optionally filtered by prefix or blob index tags. They cannot evaluate custom metadata values.
      • Option A: If counting from the copy date is acceptable, apply a lifecycle management policy in the destination Storage Account that deletes blobs 'x' days after their last modification. Note that 'Last Modified' is a system property set by the service when the blob is written, so a copied blob gets the copy time rather than the original date.
      • Option B: If you are storing the original 'Last Modified' date in metadata (e.g., 'OriginalLastModifiedDate'), a lifecycle management policy cannot use it, because lifecycle management doesn't support metadata-based conditions. You’ll need another method, like Azure Functions, to handle deletions based on metadata.
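For Option A, a lifecycle rule could look roughly like the sketch below, which builds the policy document as a Python dict. The rule name, the 30-day value, and the `migrated/` prefix are placeholders, and the rule counts days from the destination blob's own last modified time, not from the original date.

```python
import json
from typing import Iterable, Optional


def build_delete_policy(days: int, prefixes: Optional[Iterable[str]] = None) -> dict:
    """Build a lifecycle policy that deletes block blobs `days` days after
    their last modification in the destination account."""
    filters = {"blobTypes": ["blockBlob"]}
    if prefixes:
        # Optional: limit the rule to blobs whose names start with these prefixes.
        filters["prefixMatch"] = list(prefixes)
    return {
        "rules": [
            {
                "enabled": True,
                "name": f"delete-after-{days}-days",  # placeholder rule name
                "type": "Lifecycle",
                "definition": {
                    "actions": {
                        "baseBlob": {
                            "delete": {"daysAfterModificationGreaterThan": days}
                        }
                    },
                    "filters": filters,
                },
            }
        ]
    }


if __name__ == "__main__":
    # 30 days and the "migrated/" prefix are example values only.
    print(json.dumps(build_delete_policy(30, ["migrated/"]), indent=2))
```

The printed JSON can be pasted into the code view of the storage account's lifecycle management settings or applied with your usual deployment tooling.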

    Azure Functions for Metadata-Based Deletion

    Because the lifecycle policy cannot handle custom metadata-based expiration, you can use Azure Functions with a timer trigger to delete the blobs based on the 'OriginalLastModifiedDate'. Since your concern is with timeouts and large volumes, consider using Durable Functions to split the workload into smaller, scalable tasks.

    Here’s how:

    Durable Functions allow orchestration of long-running tasks and scale efficiently to process large datasets (millions of blobs).

    The function can:

    1. List blobs from the destination Storage Account.
    2. Check each blob’s 'OriginalLastModifiedDate' metadata.
    3. Delete blobs that exceed the retention threshold.
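The three steps above can be sketched as follows. This is a minimal sketch, not a finished function: it assumes the metadata key is 'OriginalLastModifiedDate' and that its value is an ISO 8601 timestamp (the exact format written by '$$LASTMODIFIED' should be checked on a real copied blob). `container_client` is expected to behave like `ContainerClient` from the `azure-storage-blob` package, of which only `list_blobs(include=["metadata"])` and `delete_blob(name)` are used.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping, Optional

METADATA_KEY = "OriginalLastModifiedDate"  # hypothetical key set during the copy


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp (assumed format), defaulting to UTC."""
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def is_expired(metadata: Optional[Mapping[str, str]], retention_days: int,
               now: datetime) -> bool:
    """Return True if the stored original date is older than the retention."""
    if not metadata:
        return False  # no metadata: leave the blob alone
    lookup = {key.lower(): value for key, value in metadata.items()}
    raw = lookup.get(METADATA_KEY.lower())
    if raw is None:
        return False
    return now - parse_timestamp(raw) > timedelta(days=retention_days)


def delete_expired(container_client, retention_days: int,
                   now: Optional[datetime] = None) -> int:
    """List blobs with their metadata and delete the expired ones."""
    now = now or datetime.now(timezone.utc)
    deleted = 0
    # include=["metadata"] asks the service to return metadata with each
    # listed blob, so no extra per-blob request is needed for the check.
    for blob in container_client.list_blobs(include=["metadata"]):
        if is_expired(blob.metadata, retention_days, now):
            container_client.delete_blob(blob.name)
            deleted += 1
    return deleted
```

Blobs without the metadata key are skipped rather than deleted, which is the conservative choice if some blobs in the container were not created by the migration.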

    Schedule this Durable Function to run at regular intervals (e.g., daily) and have it delete blobs in batches, so that no single invocation runs long enough to hit a timeout.
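One way to form the batches is to split the stream of blob names into fixed-size chunks and hand each chunk to a separate unit of work, for example one Durable Functions activity per chunk. The helper below is a plain-Python sketch of that splitting step only; the name `batched_names` and any batch size you pick are illustrative.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_names(names: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` blob names without loading all names
    into memory at once."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(names)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Each printed batch stands for the input of one short-lived task.
    for batch in batched_names(["a", "b", "c", "d", "e"], 2):
        print(batch)
```

Because the helper consumes its input lazily, it can be fed directly from a paged blob listing, and each batch stays small enough for one task to finish well within the function timeout.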

    Alternative: Logic Apps for Long-Running Operations

    If you prefer a low-code option and wish to avoid writing custom function code, Azure Logic Apps can provide a workflow-based approach for deleting blobs based on metadata:

    • Logic App runs can last far longer than a function's 5-10 minute timeout and can process blobs in chunks.
    • Use a Logic App with a "List blobs" action, filter blobs by the 'OriginalLastModifiedDate' metadata, and delete blobs that meet the condition.

    Scalable and Efficient Workflow

    Here’s a summary of the steps:

    • Migration with ADF: Ensure the '$$LASTMODIFIED' value is stored as metadata ('OriginalLastModifiedDate') during the copy operation.
    • Azure Storage Lifecycle Policy: If possible, apply a lifecycle policy based on the modified date (if it fits your requirement).
    • Durable Functions/Logic Apps: If custom metadata is required for deletion logic, set up a Durable Function or Logic App to check the 'OriginalLastModifiedDate' and delete expired blobs in scalable batches.

    Additional Considerations

    Monitoring and Logging: Ensure you have proper logging in place for tracking which blobs are deleted and if any failures occur.

    Cost: Consider the potential cost of listing and processing millions of blobs, especially if the lifecycle of these blobs is short and you need to delete them frequently.
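To get a feel for the listing volume, the number of list requests can be estimated from the blob count. The sketch below assumes the List Blobs maximum of 5,000 results per page; it counts requests only and says nothing about prices, which depend on your account and region.

```python
MAX_RESULTS_PER_LIST_CALL = 5000  # upper bound on results per List Blobs page


def list_calls_needed(blob_count: int,
                      page_size: int = MAX_RESULTS_PER_LIST_CALL) -> int:
    """Return how many list requests a full scan of `blob_count` blobs needs."""
    if blob_count < 0 or page_size < 1:
        raise ValueError("blob_count must be >= 0 and page_size >= 1")
    # Ceiling division using integers only.
    return -(-blob_count // page_size)


if __name__ == "__main__":
    # Example figure only: a full scan of ten million blobs.
    print(list_calls_needed(10_000_000))  # prints 2000
```

If metadata is returned with the listing, the expiry check adds no per-blob read, so the remaining request volume is one delete per expired blob.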

    This approach balances scalability and maintainability, especially when dealing with large volumes of blobs and ensuring they are deleted according to their original last modified date.

    Please 'Upvote'(Thumbs-up) and 'Accept' as an answer if the reply was helpful. This will benefit other community members who face the same issue.

