Data consistency check

azure_learner 420 Reputation points
2024-11-23T09:15:57.14+00:00

Hello, I have Parquet data saved in the target/sink (Blob storage). Since Parquet files are not human-readable, we now need to verify the data consistency with business users. What is the best way to do this? Saving the data again in CSV or Excel format is one option, but that would be an inefficient workaround. Please help. Thank you.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Vinodh247 24,091 Reputation points MVP
    2024-11-23T10:10:31.9033333+00:00

    Hi,

    Thanks for reaching out to Microsoft Q&A.

    Testing data consistency of Parquet files with business users requires a readable and user-friendly approach that doesn't involve inefficient workarounds like converting to CSV or Excel repeatedly. Here are efficient methods to address the problem:

    1. Leverage BI Tools for Visualization
    • Tools: Power BI, Tableau, or Excel Power Query.
    • Approach:
      • Use these tools to connect directly to the Parquet files.
      • Visualize the data in tabular or graphical formats that business users can validate.
      • These tools allow direct inspection without requiring conversion.
    2. Query Parquet Files Using SQL Interfaces
    • Tools: Azure Synapse Analytics, Azure Databricks, or Apache Spark.
    • Approach:
      • Load the Parquet files into a temporary external table or view in Synapse, Databricks, or another SQL-compatible environment (see the PySpark sketch after this list).
      • Give business users access to query the data using SQL, which is often more comfortable for users familiar with relational databases.
      • Users can validate the data consistency through predefined queries or ad hoc analysis.
    3. Use Lightweight Web-Based Data Explorers
    • Tools: Data Explorer (in Azure or open source), Jupyter Notebooks, or Streamlit-based apps.
    • Approach:
      • Build or use existing tools to enable web-based exploration of the Parquet data.
      • The explorer can provide search, filter, and export options so users can review the data without needing specialized tools.
    4. Create Temporary Readable Outputs
    • Format: JSON (preferred over CSV for structure preservation).
    • Approach:
      • If necessary, convert the Parquet files to JSON with a small, representative sample of the data (see the sampling sketch after this list).
      • Share the JSON samples with business users through an easy-to-use viewer such as an online JSON editor or reader.
    5. Use Azure Data Explorer or Synapse Notebooks
    • Tools: Synapse Studio Notebooks, Databricks Notebooks.
    • Approach:
      • Use these notebooks to load the Parquet files and display the data in tabular format.
      • Share access or render the output as HTML/PDF for users to review.
    6. Leverage Custom Apps for Validation
    • Tools: Simple dashboards built with Python (Streamlit) or Power Apps.
    • Approach:
      • Build an interactive interface that reads the Parquet data and lets users validate it in real time (see the Streamlit sketch after this list).
      • Add search and filtering capabilities tailored to business requirements.
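
    As a concrete example of the SQL/notebook approaches above, here is a minimal PySpark sketch that registers the Parquet output as a temporary view and runs a few consistency checks. The storage path and the column names (order_id, amount) are placeholders for your actual sink location and schema, not something prescribed by ADF.

    ```python
    from pyspark.sql import SparkSession

    # In Synapse or Databricks notebooks a `spark` session already exists;
    # it is created here only so the sketch is self-contained.
    spark = SparkSession.builder.appName("parquet-consistency-check").getOrCreate()

    # Hypothetical path to the Parquet files written by the ADF sink.
    parquet_path = "abfss://curated@mystorageaccount.dfs.core.windows.net/sales/"

    # Read the Parquet output and expose it to SQL as a temporary view.
    df = spark.read.parquet(parquet_path)
    df.createOrReplaceTempView("sales_parquet")

    # Predefined checks business users can sign off on: row counts,
    # distinct business keys, and simple aggregates.
    spark.sql("""
        SELECT COUNT(*)                 AS row_count,
               COUNT(DISTINCT order_id) AS distinct_orders,
               SUM(amount)              AS total_amount
        FROM sales_parquet
    """).show()

    # A readable sample for row-level review directly in the notebook output.
    df.limit(20).show(truncate=False)
    ```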
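
    If a shareable sample is still needed (option 4), it can be produced without converting the full dataset. Below is a small pandas sketch, assuming pyarrow (or fastparquet) is installed; the file name and sample size are illustrative.

    ```python
    import pandas as pd

    # Hypothetical local or mounted path to one Parquet file from the sink.
    parquet_file = "sales_2024_11.parquet"

    # Read the Parquet file (uses pyarrow or fastparquet under the hood).
    df = pd.read_parquet(parquet_file)

    # Take a small, representative sample rather than exporting everything.
    sample = df.sample(n=min(100, len(df)), random_state=42)

    # Write the sample as indented, record-oriented JSON that business users
    # can open in any JSON viewer or plain text editor.
    sample.to_json("sales_sample.json", orient="records", indent=2, date_format="iso")
    ```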
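
    For the web-based explorer and custom-app options, a lightweight validation front end can be built in a few lines of Streamlit. This is only a sketch: the Parquet path and the filter column (region) are assumptions, and in practice the file would come from a mounted Blob container or be downloaded with the Azure Storage SDK.

    ```python
    import pandas as pd
    import streamlit as st

    # Hypothetical path; replace with a mounted Blob path or a downloaded file.
    PARQUET_PATH = "sales_2024_11.parquet"

    st.title("Parquet data validation")

    @st.cache_data
    def load_data(path: str) -> pd.DataFrame:
        # Cache the load so filtering stays responsive for reviewers.
        return pd.read_parquet(path)

    df = load_data(PARQUET_PATH)

    # Let business users narrow the view to the slice they know best;
    # 'region' is a placeholder column name.
    if "region" in df.columns:
        regions = sorted(df["region"].dropna().unique())
        selected = st.multiselect("Filter by region", regions, default=regions)
        df = df[df["region"].isin(selected)]

    st.write(f"{len(df):,} rows loaded")
    st.dataframe(df.head(500))  # searchable, sortable table for row-level checks
    ```

    Run it with `streamlit run validate_parquet.py` (the file name is illustrative) and share the URL with reviewers.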

    Recommendations

    The most scalable and user-friendly method would be to use a BI tool such as Power BI, or Synapse serverless SQL, with a direct connection to the Parquet data. This avoids manual data transformations and empowers users to query and validate the data on demand. For more technical users, Jupyter Notebooks or SQL-based exploration would be optimal.

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

