Databricks PySpark partition query

Pankaj Joshi 331 Reputation points
2024-11-23T04:47:09.25+00:00

I am writing logic in Azure Databricks PySpark.

Please see the attached input data and expected output. The id column will be used as the partition key.

For every id we need to choose the one record with the latest date, but if any id (e.g. id = 3) contains multiple records with the same latest date, then all of them should be picked.

Please advise how to achieve this.


1 answer

  1. Vinodh247 24,091 Reputation points MVP
    2024-11-23T10:15:53.31+00:00

    Hi Pankaj Joshi,

    Thanks for reaching out to Microsoft Q&A.

    To achieve the desired output in Azure Databricks using PySpark, you can use the following code. The logic finds the latest date for each id, then keeps every record that matches that date, so when the latest date is duplicated all of those records are retained.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, max as spark_max
    # Create a Spark session
    spark = SparkSession.builder.appName("LatestDateSelection").getOrCreate()
    # Input data
    data = [
        (1, "nm1", "P12M", "2024-10-25"),
        (1, "nm1", "P1Y", "2023-01-30"),
        (2, "nm2", "P12M", "2024-10-25"),
        (3, "nm3", "P1Y", "2024-11-22"),
        (3, "nm3", "P1Y", "2024-11-22"),
        (4, "nm4", "P18M", "2024-05-22"),
        (5, "nm5", "P19M", "2024-05-22"),
    ]
    columns = ["id", "name", "period", "date"]
    # Create a DataFrame
    df = spark.createDataFrame(data, columns)
    # Convert the date column to a date type
    df = df.withColumn("date", col("date").cast("date"))
    # Find the maximum date for each id
    max_date_df = df.groupBy("id").agg(spark_max("date").alias("max_date"))
    # Join back with the original DataFrame to retain records with the maximum date
    result_df = df.join(max_date_df, (df.id == max_date_df.id) & (df.date == max_date_df.max_date)).select(df["*"])
    # Show the result
    result_df.show()
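
    With the sample data above, result_df.show() should print one row each for ids 1, 2, 4, and 5, and both tied rows for id 3 (row order may vary):

    +---+----+------+----------+
    | id|name|period|      date|
    +---+----+------+----------+
    |  1| nm1|  P12M|2024-10-25|
    |  2| nm2|  P12M|2024-10-25|
    |  3| nm3|   P1Y|2024-11-22|
    |  3| nm3|   P1Y|2024-11-22|
    |  4| nm4|  P18M|2024-05-22|
    |  5| nm5|  P19M|2024-05-22|
    +---+----+------+----------+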
    
    1. Load data: create a DataFrame from your input data.
    2. Group by id: find the maximum date for each id using groupBy and agg.
    3. Join: join the original DataFrame back to the grouped result to keep only the rows whose date equals the maximum date for that id.
    4. Duplicate handling: the join retains all records with the maximum date, even when several rows tie (as with id = 3); an equivalent window-function approach is sketched below.
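
    Alternatively, since you mentioned using id as a partition, the same result can be expressed with a window function. This is a minimal sketch against the same df as above; rank() (unlike row_number()) assigns rank 1 to every row tied on the latest date, so ties are kept:

    from pyspark.sql import Window
    from pyspark.sql.functions import col, rank
    # Rank rows within each id partition, newest date first
    w = Window.partitionBy("id").orderBy(col("date").desc())
    # Every row tied on the latest date gets rank 1,
    # so all ties (e.g. id = 3) survive the filter
    result_df = df.withColumn("rnk", rank().over(w)) \
        .filter(col("rnk") == 1) \
        .drop("rnk")
    result_df.show()

    Both approaches produce the same rows; the window version avoids the self-join, which can matter on large tables.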

    Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.

