Hi Pankaj Joshi,
Thanks for reaching out to Microsoft Q&A.
To achieve the desired output in Azure Databricks using PySpark, you can use the following code. The logic finds the latest `date` for each `id` and keeps every record that matches it, so all records are retained when the latest date is duplicated.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as spark_max

# Create a Spark session
spark = SparkSession.builder.appName("LatestDateSelection").getOrCreate()

# Input data
data = [
    (1, "nm1", "P12M", "2024-10-25"),
    (1, "nm1", "P1Y", "2023-01-30"),
    (2, "nm2", "P12M", "2024-10-25"),
    (3, "nm3", "P1Y", "2024-11-22"),
    (3, "nm3", "P1Y", "2024-11-22"),
    (4, "nm4", "P18M", "2024-05-22"),
    (5, "nm5", "P19M", "2024-05-22"),
]
columns = ["id", "name", "period", "date"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Convert the date column from string to a proper date type
df = df.withColumn("date", col("date").cast("date"))

# Find the maximum date for each id
max_date_df = df.groupBy("id").agg(spark_max("date").alias("max_date"))

# Join back with the original DataFrame to retain records with the
# maximum date; select(df["*"]) drops the helper columns from max_date_df
result_df = df.join(
    max_date_df,
    (df.id == max_date_df.id) & (df.date == max_date_df.max_date),
).select(df["*"])

# Show the result
result_df.show()
```
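For reference, with the sample data above, `result_df.show()` should produce the following rows (row order may vary across runs):

```text
+---+----+------+----------+
| id|name|period|      date|
+---+----+------+----------+
|  1| nm1|  P12M|2024-10-25|
|  2| nm2|  P12M|2024-10-25|
|  3| nm3|   P1Y|2024-11-22|
|  3| nm3|   P1Y|2024-11-22|
|  4| nm4|  P18M|2024-05-22|
|  5| nm5|  P19M|2024-05-22|
+---+----+------+----------+
```

Note that both rows for `id` 3 are kept because they share the latest date.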
- Load Data: Create a DataFrame with your input data.
- Group by ID: Find the maximum `date` for each `id` using `groupBy` and `agg`.
- Join: Join the original DataFrame with the grouped result to filter rows with the latest date.
- Duplicate Handling: The join ensures that all records with the maximum date are retained, even if duplicates exist (a window-based alternative is sketched below).
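If you prefer the window-function formulation implied by the partitioning logic above, here is a minimal alternative sketch that avoids the join. It reuses the `df` built earlier; `rank` (unlike `row_number`) assigns the same rank to ties, so duplicate rows on the latest date are still kept:

```python
from pyspark.sql import Window
from pyspark.sql.functions import col, rank

# Rank rows within each id partition by date, latest first
w = Window.partitionBy("id").orderBy(col("date").desc())

# rank() gives 1 to every row tied on the latest date,
# so both duplicate rows for id 3 survive the filter
result_df = (
    df.withColumn("rnk", rank().over(w))
      .filter(col("rnk") == 1)
      .drop("rnk")
)
result_df.show()
```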
Please feel free to click the 'Upvote' (Thumbs-up) button and 'Accept as Answer'. This helps the community by allowing others with similar queries to easily find the solution.