Hello A K,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are facing challenges loading large nested JSON files into Azure SQL Database using Azure Data Factory.
The steps below will help resolve the issues you described and are optimized for scalability, schema drift, and large nested JSON structures.
- Use the Copy Activity for raw JSON ingestion. This loads large JSON files into Azure SQL efficiently, without the need for flattening:
  - Configure the dataset to read the entire JSON file as a string, and set the format to JSON.
  - Use a wildcard file path if processing multiple files.
  - Map the JSON string to a single column (e.g., `json_content`) in the SQL table.
  - Use the Copy Data Activity's additional columns feature to add the file name as metadata. For example:
    - Add a column in the sink dataset schema (e.g., `file_name`).
    - Map the file name dynamically using the system variable `@item().Name`.
- Schema drift is handled automatically by the Copy Activity when you store raw JSON in a single column: the JSON structure is stored as-is, so flexibility is maintained and there is no need to define a schema in the source or sink datasets.
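Even with the JSON stored raw, you can still project individual fields at query time with T-SQL's `OPENJSON`. Below is a minimal Python sketch of such a query, assuming the `JsonData` table described later in this answer; the connection string and the projected fields (`$.id`, `$.info.name`) are placeholders for illustration.

```python
# Minimal sketch: query raw JSON stored in JsonData with OPENJSON.
# The connection string and projected fields are placeholders.
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-db>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)

query = """
SELECT j.file_name, d.*
FROM dbo.JsonData AS j
CROSS APPLY OPENJSON(j.json_content)
WITH (
    id   INT           '$.id',
    name NVARCHAR(100) '$.info.name'
) AS d;
"""

with pyodbc.connect(conn_str) as conn:
    for row in conn.cursor().execute(query):
        print(row.file_name, row.id, row.name)
```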
- Optional: if additional transformations are required:
  - Use an Azure Function to preprocess large JSON files. The function can:
- Parse the nested JSON.
- Flatten or reformat data dynamically.
- Write the processed data to another blob storage or directly to Azure SQL.
- Optimize for large files. For extremely large files:
  - Split files: use a pre-processing step (Azure Logic Apps or Azure Functions) to split large JSON files into smaller chunks, then process the chunks individually with Azure Data Factory (a sketch of such a split step is shown after this list).
- Enable Parallelism: Use the Parallel Copy option in Copy Activity to process multiple files concurrently.
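As a rough illustration of the split-files step above, here is a minimal Python sketch (for example, inside an Azure Function) that splits one large JSON array file into smaller chunk files. The paths, chunk size, and the assumption that the file is a top-level JSON array are examples only; a truly huge file would call for a streaming parser such as `ijson`.

```python
# Minimal sketch of a pre-processing split step.
# Assumes the source file contains a single top-level JSON array.
import json

def split_json_array(input_path, output_prefix, chunk_size=10_000):
    with open(input_path, "r", encoding="utf-8") as f:
        records = json.load(f)  # expects a top-level JSON array

    for i in range(0, len(records), chunk_size):
        chunk = records[i:i + chunk_size]
        out_path = f"{output_prefix}_{i // chunk_size:05d}.json"
        with open(out_path, "w", encoding="utf-8") as out:
            json.dump(chunk, out)

# Example usage (placeholder paths)
# split_json_array("large_input.json", "chunks/part")
```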
- If the JSON files are very complex, use more advanced tools for nested structures:
  - Use Azure Databricks for distributed processing and transformation of the JSON data (a PySpark sketch follows this list):
    - Load the JSON data into a DataFrame.
    - Apply transformations and write the results back to Azure SQL.
  - Use Azure Synapse Analytics serverless SQL to query the JSON files directly in blob storage, transform the data, and write it to Azure SQL.
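For the Azure Databricks option, a minimal PySpark sketch of the load-transform-write flow might look like the following. The storage path, selected fields, JDBC URL, table name, and credentials are placeholders, and the projected columns simply mirror the sample JSON used later in this answer.

```python
# Minimal PySpark sketch for Azure Databricks; all paths and credentials are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_json

spark = SparkSession.builder.getOrCreate()

# Read nested JSON files (multiLine handles pretty-printed documents)
df = (spark.read
      .option("multiLine", True)
      .json("abfss://<container>@<account>.dfs.core.windows.net/raw/*.json"))

# Example transformation: project a few nested fields, keep arrays as JSON strings
flat = df.select(
    col("id"),
    col("info.name").alias("name"),
    col("info.details.city").alias("city"),
    to_json(col("tags")).alias("tags_json"),
)

# Write the flattened result to Azure SQL over JDBC
(flat.write
     .format("jdbc")
     .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
     .option("dbtable", "dbo.FlattenedJson")
     .option("user", "<user>")
     .option("password", "<password>")
     .mode("append")
     .save())
```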
# Example Pipeline Design
Pipeline Components:
- **Get Metadata Activity**: Retrieve the file list.
- **For Each Activity**:
- Inside loop:
- Copy Activity:
- Source: JSON file (entire content).
- Sink: Azure SQL table (`json_content` + `file_name`).
SQL Table Schema:
- Table Name: `JsonData`
- Columns:
- `json_content` (NVARCHAR(MAX)): Stores raw JSON.
- `file_name` (VARCHAR): Captures the file name.
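To make the pipeline's output concrete, here is a minimal Python sketch of what the Copy Activity effectively writes into `JsonData` (raw JSON plus file name), using `azure-storage-blob` and `pyodbc`. The connection strings and container name are placeholders; in the actual pipeline the Copy Activity does this for you.

```python
# Minimal sketch of the raw-JSON-plus-file-name pattern; placeholders only.
import pyodbc
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = blob_service.get_container_client("raw-json")

conn = pyodbc.connect("<azure-sql-odbc-connection-string>")
cursor = conn.cursor()

for blob in container.list_blobs():
    # Read each blob as a single string and store it untouched
    content = container.download_blob(blob.name).readall().decode("utf-8")
    cursor.execute(
        "INSERT INTO dbo.JsonData (json_content, file_name) VALUES (?, ?);",
        content,
        blob.name,
    )

conn.commit()
conn.close()
```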
Here is an example code snippet for preprocessing in an Azure Function, flattening nested JSON with Python as described above:
```python
import json

def flatten_json(json_obj, delimiter='_', prefix=''):
    flattened = {}
    for key, value in json_obj.items():
        if isinstance(value, dict):
            flattened.update(flatten_json(value, delimiter, f"{prefix}{key}{delimiter}"))
        elif isinstance(value, list):
            flattened[f"{prefix}{key}"] = json.dumps(value)  # Convert list to string
        else:
            flattened[f"{prefix}{key}"] = value
    return flattened

# Example usage
input_json = {
    "id": 1,
    "info": {"name": "John", "details": {"age": 30, "city": "New York"}},
    "tags": ["developer", "blogger"]
}

flattened_json = flatten_json(input_json)
print(flattened_json)
```
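For the sample input above, this prints `{'id': 1, 'info_name': 'John', 'info_details_age': 30, 'info_details_city': 'New York', 'tags': '["developer", "blogger"]'}`, which maps directly onto flat SQL columns.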
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting and accepting this as the answer if it was helpful.