Loading large nested JSON files into Azure SQL DB using Azure Data Factory

A K 45 Reputation points
2024-11-22T11:00:34.1533333+00:00

Hi all,

I am currently working with large nested JSON files stored inside a blob container. Each file has a different structure, with varying fields, data types, and nested arrays.

My goal is to load these files into an Azure SQL database in an efficient and scalable way while maintaining flexibility for schema drift.

The method I have tried so far is to build a pipeline with a Get Metadata activity, then a ForEach activity, and inside the ForEach, a Lookup activity followed by a stored procedure. This worked, but only for the smaller files, since the Lookup activity has a size limit.

I've also tried using a data flow: I attempted to flatten the JSON files and serialize the rows using toJsonString(byName('*')) or toJsonString($$), but ran into constant Type Mismatch errors due to the nested structures and mixed data types.

I've reviewed solutions involving the copy data activity, but I'm unclear how to handle raw JSON content without flattening or how to handle schema drift dynamically for files with different structures.

If anyone can please advise or provide guidance on:

  1. What is the best practice for loading these JSON files into an Azure SQL database?
  2. How can I map the entire JSON content dynamically to a single SQL column (json_content) while also capturing the file name?
  3. Is there a way to bypass the need for flattening or unrolling nested structures in Data Flow?

Thank you in advance!

Azure SQL Database
Azure Data Lake Storage
Azure Data Factory

1 answer

  1. Sina Salam 12,976 Reputation points
    2024-11-22T19:33:25.5633333+00:00

    Hello A K,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you have challenges in loading large nested JSON files into Azure SQL DB using Azure Data Factory.

    The steps below will help resolve the issues described and are optimized for scalability, schema drift, and large nested JSON structures.

    1. Use the Copy activity for raw JSON ingestion. It loads large JSON files into Azure SQL efficiently, with no need for flattening:
    • Configure the source dataset to read the entire JSON file as a string.
    • Set the format to JSON.
    • Use a wildcard file path if you are processing multiple files.
    • Map the JSON string to a single column (e.g., json_content) in a SQL table.
    • Use the Copy activity's additional columns feature to add the file name as metadata. For example:
      • Add a column in the sink dataset schema (e.g., file_name).
      • Map the file name dynamically using @item().Name from the ForEach loop.
    2. Handle schema drift dynamically. Schema drift is inherently supported by the Copy activity when you store the raw JSON in a single column: the JSON structure is stored as-is, which maintains flexibility, and there is no need to define a schema in the source or sink datasets.
    3. Optional: if additional transformations are required:
    • Use Azure Functions to preprocess the large JSON files.
    • Key features:
      • Parse the nested JSON.
      • Flatten or reformat the data dynamically.
      • Write the processed data to another blob container or directly to Azure SQL.
    4. Optimize for extremely large files:
    • Split files: use a pre-processing step (Azure Logic Apps or Azure Functions) to split large JSON files into smaller chunks, and process the chunks individually with Azure Data Factory (a minimal splitting sketch is included after the flattening example below).
    • Enable parallelism: use the Parallel Copy option in the Copy activity to process multiple files concurrently.
    5. If the JSON files are too complex, use more advanced tools for the nested structures.

    Use Azure Databricks for distributed processing and transformations of JSON data.

    • Load JSON data into a DataFrame.
    • Apply transformations and write back to SQL.
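
    As a rough sketch of that approach (the storage path, JDBC connection details, and table name below are placeholders, and `spark` is the SparkSession that Databricks provides), you could keep the raw document and file name rather than fully flattening:

    from pyspark.sql import functions as F

    # Read the nested JSON; multiLine is needed when one document spans several lines.
    df = (spark.read
          .option("multiLine", "true")
          .json("abfss://<container>@<account>.dfs.core.windows.net/raw/*.json"))

    # Either flatten here with explode()/select(), or keep each document as raw JSON
    # plus the file it came from, matching the json_content/file_name table design.
    raw_df = df.select(
        F.to_json(F.struct("*")).alias("json_content"),
        F.input_file_name().alias("file_name"))

    # Write to Azure SQL over JDBC (placeholder connection details).
    (raw_df.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<database>")
        .option("dbtable", "dbo.JsonData")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())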

    Use Synapse Analytics serverless SQL to query the JSON files directly in blob storage, transform the data, and write it to Azure SQL.

    An example pipeline design:
    Pipeline components:
       - Get Metadata activity: retrieve the file list.
       - ForEach activity, containing inside the loop:
         - Copy activity:
           - Source: JSON file (entire content).
           - Sink: Azure SQL table (`json_content` + `file_name`).
    SQL table schema:
       - Table name: `JsonData`
       - Columns:
         - `json_content` (NVARCHAR(MAX)): stores the raw JSON.
         - `file_name` (VARCHAR): captures the file name.
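
    Purely to illustrate what the Copy activity mapping in step 1 produces in this table (this is not part of the pipeline itself; the storage connection string, container, blob name, and SQL credentials below are placeholders), a minimal Python sketch of the equivalent load would be:

    import json
    import pyodbc
    from azure.storage.blob import BlobClient

    # Placeholders: substitute your own storage connection string, container, and blob.
    blob = BlobClient.from_connection_string(
        "<storage-connection-string>", container_name="raw", blob_name="sample.json")
    raw_json = blob.download_blob().readall().decode("utf-8")
    json.loads(raw_json)  # optional sanity check that the file is valid JSON

    # Placeholders: substitute your own Azure SQL server, database, and credentials.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<server>.database.windows.net;Database=<database>;"
        "Uid=<user>;Pwd=<password>;Encrypt=yes;")
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO dbo.JsonData (json_content, file_name) VALUES (?, ?)",
        raw_json, "sample.json")
    conn.commit()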
    

    Here is a code snippet for preprocessing in Azure Functions, an example of flattening nested JSON with Python as described above:

    import json

    def flatten_json(json_obj, delimiter='_', prefix=''):
        """Recursively flatten a nested dict; nested keys are joined with the
        delimiter, and lists are kept as JSON strings."""
        flattened = {}
        for key, value in json_obj.items():
            if isinstance(value, dict):
                # Recurse into nested objects, extending the key prefix.
                flattened.update(flatten_json(value, delimiter, f"{prefix}{key}{delimiter}"))
            elif isinstance(value, list):
                flattened[f"{prefix}{key}"] = json.dumps(value)  # Convert list to string
            else:
                flattened[f"{prefix}{key}"] = value
        return flattened

    # Example usage
    input_json = {
        "id": 1,
        "info": {"name": "John", "details": {"age": 30, "city": "New York"}},
        "tags": ["developer", "blogger"]
    }
    flattened_json = flatten_json(input_json)
    print(flattened_json)
    # {'id': 1, 'info_name': 'John', 'info_details_age': 30,
    #  'info_details_city': 'New York', 'tags': '["developer", "blogger"]'}
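
    The same preprocessing approach can cover the file splitting suggested in step 4. A minimal sketch, assuming each large file is a top-level JSON array (the chunk size and file names are arbitrary, and the blob download/upload step is omitted):

    import json

    def split_json_array(raw_json, chunk_size=1000):
        """Split a JSON document whose top level is an array into smaller
        JSON documents of at most chunk_size elements each."""
        records = json.loads(raw_json)
        for start in range(0, len(records), chunk_size):
            yield json.dumps(records[start:start + chunk_size])

    # Example usage: write each chunk out as its own file/blob.
    sample = '[{"id": 1}, {"id": 2}, {"id": 3}]'
    for i, chunk in enumerate(split_json_array(sample, chunk_size=2)):
        print(f"chunk_{i}.json -> {chunk}")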
    

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close out the thread here by upvoting and accepting the answer if it is helpful.

