Databricks PySpark partition query

Pankaj Joshi 331 Reputation points
2024-11-23T04:47:09.25+00:00

I am writing logic in Azure Databricks PySpark.

Please see the attached input data and expected output. The id column will be used as the partition key.

For every id we need to choose the one record with the latest date, but if any id (e.g. id = 3) contains multiple records with the same latest date, then all of them should be picked.

Please advise how to achieve this.


1 answer

  1. Vinodh247 24,091 Reputation points MVP
    2024-11-23T10:15:53.31+00:00

    Hi Pankaj Joshi,

    Thanks for reaching out to Microsoft Q&A.

    To achieve the desired output in Azure Databricks using PySpark, you can use the following code. The logic finds the latest date for each id, then keeps every record that matches that date, so when the latest date is duplicated all of those records are retained.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, max as spark_max
    # Create a Spark session
    spark = SparkSession.builder.appName("LatestDateSelection").getOrCreate()
    # Input data
    data = [
        (1, "nm1", "P12M", "2024-10-25"),
        (1, "nm1", "P1Y", "2023-01-30"),
        (2, "nm2", "P12M", "2024-10-25"),
        (3, "nm3", "P1Y", "2024-11-22"),
        (3, "nm3", "P1Y", "2024-11-22"),
        (4, "nm4", "P18M", "2024-05-22"),
        (5, "nm5", "P19M", "2024-05-22"),
    ]
    columns = ["id", "name", "period", "date"]
    # Create a DataFrame
    df = spark.createDataFrame(data, columns)
    # Convert the date column to a date type
    df = df.withColumn("date", col("date").cast("date"))
    # Find the maximum date for each id
    max_date_df = df.groupBy("id").agg(spark_max("date").alias("max_date"))
    # Join back with the original DataFrame to retain records with the maximum date
    result_df = df.join(max_date_df, (df.id == max_date_df.id) & (df.date == max_date_df.max_date)).select(df["*"])
    # Show the result
    result_df.show()
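
    With the sample data above, result_df.show() should print one row each for ids 1, 2, 4, and 5, and both tied rows for id 3 (row order may vary):

    +---+----+------+----------+
    | id|name|period|      date|
    +---+----+------+----------+
    |  1| nm1|  P12M|2024-10-25|
    |  2| nm2|  P12M|2024-10-25|
    |  3| nm3|   P1Y|2024-11-22|
    |  3| nm3|   P1Y|2024-11-22|
    |  4| nm4|  P18M|2024-05-22|
    |  5| nm5|  P19M|2024-05-22|
    +---+----+------+----------+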
    
    1. Load data: create a DataFrame from your input data.
    2. Group by id: find the maximum date for each id using groupBy and agg.
    3. Join: join the original DataFrame back to the grouped result to keep only the rows whose date equals the maximum date for that id.
    4. Duplicate handling: the join retains all records with the maximum date, even when several rows tie (as with id = 3); an equivalent window-function approach is sketched below.
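
    Alternatively, since you mentioned using id as a partition, the same result can be expressed with a window function. This is a minimal sketch against the same df as above; rank() (unlike row_number()) assigns rank 1 to every row tied on the latest date, so ties are kept:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, rank
    # Rank rows within each id partition, newest date first
    w = Window.partitionBy("id").orderBy(col("date").desc())
    # Every row tied on the latest date gets rank 1,
    # so all ties (e.g. id = 3) survive the filter
    result_df = df.withColumn("rnk", rank().over(w)) \
        .filter(col("rnk") == 1) \
        .drop("rnk")
    result_df.show()

    Both approaches produce the same rows; the window version avoids the self-join, which can matter on large tables.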

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

