Loading large nested JSON files into Azure SQL DB using Azure Data Factory

A K 45 Reputation points
2024-11-22T11:00:34.1533333+00:00

Hi all,

I am currently working with large nested JSON files stored inside a blob container. Each file has a different structure, with varying fields, data types, and nested arrays.

My goal is to load these files into an Azure SQL database in an efficient and scalable way while maintaining flexibility for schema drift.

The method I have tried so far is to build a pipeline with a Get Metadata activity, then a ForEach activity, and inside the ForEach, a Lookup activity followed by a stored procedure. This worked, but only for the smaller files, since the Lookup activity has a size limit.

I've also tried using a data flow: I attempted to flatten the JSON files and serialize the rows using toJsonString(byName('*')) or toJsonString($$), but ran into constant Type Mismatch errors due to the nested structures and mixed data types.

I've reviewed solutions involving the copy data activity, but I'm unclear how to handle raw JSON content without flattening or how to handle schema drift dynamically for files with different structures.

If anyone can please advise or provide guidance on:

  1. What is the best practice for loading these JSON files into an Azure SQL database?
  2. How can I map the entire JSON content dynamically to a single SQL column (json_content) while also capturing the file name?
  3. Is there a way to bypass the need for flattening or unrolling nested structures in Data Flow?

Thank you in advance!

Azure SQL Database
Azure Data Lake Storage
Azure Data Factory

1 answer

  1. Sina Salam 12,976 Reputation points
    2024-11-22T19:33:25.5633333+00:00

    Hello A K,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you have challenges in loading large nested JSON files into Azure SQL DB using Azure Data Factory.

    The steps below will help resolve the issues described and are optimized for scalability, schema drift, and large nested JSON structures.

    1. Use the Copy activity for raw JSON ingestion. It loads large JSON files into Azure SQL efficiently, with no need for flattening:
    • Configure the source dataset to read the entire JSON file as a string.
    • Set the format to JSON.
    • Use a wildcard file path if you are processing multiple files.
    • Map the JSON string to a single column (e.g., json_content) in a SQL table.
    • Use the Copy activity's additional columns feature to add the file name as metadata. For example:
      • Add a column in the sink dataset schema (e.g., file_name).
      • Map the file name dynamically using @item().Name from the ForEach loop.
    2. Handle schema drift dynamically. Schema drift is inherently supported by the Copy activity when you store the raw JSON in a single column: the JSON structure is stored as-is, which maintains flexibility, and there is no need to define a schema in the source or sink datasets.
    3. Optional: if additional transformations are required:
    • Use Azure Functions to preprocess the large JSON files.
    • Key features:
      • Parse the nested JSON.
      • Flatten or reformat the data dynamically.
      • Write the processed data to another blob container or directly to Azure SQL.
    4. Optimize for extremely large files:
    • Split files: use a pre-processing step (Azure Logic Apps or Azure Functions) to split large JSON files into smaller chunks, and process the chunks individually with Azure Data Factory (a minimal splitting sketch is included after the flattening example below).
    • Enable parallelism: use the Parallel Copy option in the Copy activity to process multiple files concurrently.
    5. If the JSON files are too complex, use more advanced tools for the nested structures.

    Use Azure Databricks for distributed processing and transformations of JSON data.

    • Load JSON data into a DataFrame.
    • Apply transformations and write back to SQL.
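
    As a rough sketch of that approach (the storage path, JDBC connection details, and table name below are placeholders, and `spark` is the SparkSession that Databricks provides), you could keep the raw document and file name rather than fully flattening:

    from pyspark.sql import functions as F

    # Read the nested JSON; multiLine is needed when one document spans several lines.
    df = (spark.read
          .option("multiLine", "true")
          .json("abfss://<container>@<account>.dfs.core.windows.net/raw/*.json"))

    # Either flatten here with explode()/select(), or keep each document as raw JSON
    # plus the file it came from, matching the json_content/file_name table design.
    raw_df = df.select(
        F.to_json(F.struct("*")).alias("json_content"),
        F.input_file_name().alias("file_name"))

    # Write to Azure SQL over JDBC (placeholder connection details).
    (raw_df.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<database>")
        .option("dbtable", "dbo.JsonData")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())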

    Use Synapse Analytics serverless SQL to query the JSON files directly in blob storage, transform the data, and write it to Azure SQL.

    An example pipeline design:
    Pipeline components:
       - Get Metadata activity: retrieve the file list.
       - ForEach activity, containing inside the loop:
         - Copy activity:
           - Source: JSON file (entire content).
           - Sink: Azure SQL table (`json_content` + `file_name`).
    SQL table schema:
       - Table name: `JsonData`
       - Columns:
         - `json_content` (NVARCHAR(MAX)): stores the raw JSON.
         - `file_name` (VARCHAR): captures the file name.
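
    Purely to illustrate what the Copy activity mapping in step 1 produces in this table (this is not part of the pipeline itself; the storage connection string, container, blob name, and SQL credentials below are placeholders), a minimal Python sketch of the equivalent load would be:

    import json
    import pyodbc
    from azure.storage.blob import BlobClient

    # Placeholders: substitute your own storage connection string, container, and blob.
    blob = BlobClient.from_connection_string(
        "<storage-connection-string>", container_name="raw", blob_name="sample.json")
    raw_json = blob.download_blob().readall().decode("utf-8")
    json.loads(raw_json)  # optional sanity check that the file is valid JSON

    # Placeholders: substitute your own Azure SQL server, database, and credentials.
    conn = pyodbc.connect(
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=<server>.database.windows.net;Database=<database>;"
        "Uid=<user>;Pwd=<password>;Encrypt=yes;")
    cursor = conn.cursor()
    cursor.execute(
        "INSERT INTO dbo.JsonData (json_content, file_name) VALUES (?, ?)",
        raw_json, "sample.json")
    conn.commit()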
    

    Here is a code snippet for preprocessing in Azure Functions, an example of flattening nested JSON with Python as described above:

    import json

    def flatten_json(json_obj, delimiter='_', prefix=''):
        """Recursively flatten a nested dict; nested keys are joined with the
        delimiter, and lists are kept as JSON strings."""
        flattened = {}
        for key, value in json_obj.items():
            if isinstance(value, dict):
                # Recurse into nested objects, extending the key prefix.
                flattened.update(flatten_json(value, delimiter, f"{prefix}{key}{delimiter}"))
            elif isinstance(value, list):
                flattened[f"{prefix}{key}"] = json.dumps(value)  # Convert list to string
            else:
                flattened[f"{prefix}{key}"] = value
        return flattened

    # Example usage
    input_json = {
        "id": 1,
        "info": {"name": "John", "details": {"age": 30, "city": "New York"}},
        "tags": ["developer", "blogger"]
    }
    flattened_json = flatten_json(input_json)
    print(flattened_json)
    # {'id': 1, 'info_name': 'John', 'info_details_age': 30,
    #  'info_details_city': 'New York', 'tags': '["developer", "blogger"]'}
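
    The same preprocessing approach can cover the file splitting suggested in step 4. A minimal sketch, assuming each large file is a top-level JSON array (the chunk size and file names are arbitrary, and the blob download/upload step is omitted):

    import json

    def split_json_array(raw_json, chunk_size=1000):
        """Split a JSON document whose top level is an array into smaller
        JSON documents of at most chunk_size elements each."""
        records = json.loads(raw_json)
        for start in range(0, len(records), chunk_size):
            yield json.dumps(records[start:start + chunk_size])

    # Example usage: write each chunk out as its own file/blob.
    sample = '[{"id": 1}, {"id": 2}, {"id": 3}]'
    for i, chunk in enumerate(split_json_array(sample, chunk_size=2)):
        print(f"chunk_{i}.json -> {chunk}")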
    

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close out the thread here by upvoting and accepting the answer if it is helpful.

