Mastering Date Manipulation in PySpark

Time-based data processing is a critical aspect of data engineering, and PySpark provides a rich set of functions to handle dates and times efficiently.
1. Extracting Year from a Date Column
Problem: Extract the year from the column event_date.
Solution
from pyspark.sql import SparkSession
from pyspark.sql.functions import year
spark = SparkSession.builder.appName("ExtractYear").getOrCreate()
data = [("2023-04-15",), ("2022-11-30",), ("2021-08-25",)]
df = spark.createDataFrame(data, ["event_date"])
df = df.withColumn("year", year(df.event_date))
df.show()
2. Date Difference Calculation
Problem: Calculate the difference in days between start_date and end_date.
Solution
from pyspark.sql.functions import datediff, to_date
data = [("2023-01-01", "2023-02-01"), ("2023-03-15", "2023-03-20")]
df = spark.createDataFrame(data, ["start_date", "end_date"])
df = df.withColumn("start_date", to_date(df.start_date, "yyyy-MM-dd"))
df = df.withColumn("end_date", to_date(df.end_date, "yyyy-MM-dd"))
df = df.withColumn("date_diff", datediff(df.end_date, df.start_date))
df.show()
3. Filter Records Based on Date
Problem: Filter records where event_date is after 2023-06-01.
Solution
from pyspark.sql.functions import col
data = [("2023-05-15",), ("2023-07-20",), ("2023-06-05",)]
df = spark.createDataFrame(data, ["event_date"])
df_filtered = df.filter(col("event_date") > "2023-06-01")
df_filtered.show()
4. Add Days to a Date
Problem: Add 30 days to order_date.
Solution
from pyspark.sql.functions import date_add
data = [("2023-01-10",), ("2023-06-15",), ("2023-07-30",)]
df = spark.createDataFrame(data, ["order_date"])
df = df.withColumn("new_date", date_add(df.order_date, 30))
df.show()
5. Find the Maximum Date
Problem: Determine the latest date from payment_date.
Solution
from pyspark.sql.functions import max as spark_max  # alias avoids shadowing Python's built-in max
data = [("2023-02-15",), ("2023-06-25",), ("2023-01-10",)]
df = spark.createDataFrame(data, ["payment_date"])
df_max = df.agg(spark_max("payment_date").alias("max_date"))
df_max.show()
6. Truncate Date to First Day of Month
Problem: Truncate sale_date to the first day of its respective month.
Solution
from pyspark.sql.functions import trunc
data = [("2023-04-12",), ("2023-07-23",), ("2023-08-05",)]
df = spark.createDataFrame(data, ["sale_date"])
df = df.withColumn("first_day_of_month", trunc(df.sale_date, "MM"))
df.show()
7. Group by Year
Problem: Group records by year extracted from transaction_date.
Solution
data = [("2023-03-12",), ("2022-07-01",), ("2023-11-20",)]
df = spark.createDataFrame(data, ["transaction_date"])
df = df.withColumn("year", year(df.transaction_date))
df.groupBy("year").count().show()
8. Filter Records Within a Date Range
Problem: Filter records where visit_date is between 2023-01-01 and 2023-05-01.
Solution
data = [("2023-02-10",), ("2023-04-22",), ("2023-06-15",)]
df = spark.createDataFrame(data, ["visit_date"])
df_filtered = df.filter((col("visit_date") >= "2023-01-01") & (col("visit_date") <= "2023-05-01"))
df_filtered.show()
9. Extract Day of the Week
Problem: Extract the day of the week from attendance_date.
Solution
from pyspark.sql.functions import date_format
data = [("2023-05-01",), ("2023-05-06",)]
df = spark.createDataFrame(data, ["attendance_date"])
df = df.withColumn("day_of_week", date_format(df.attendance_date, "EEEE"))
df.show()
10. Check Leap Year
Problem: Identify if birth_date falls in a leap year.
Solution
from pyspark.sql.functions import year
data = [("2024-02-29",), ("1999-07-04",)]
df = spark.createDataFrame(data, ["birth_date"])
y = year(df.birth_date)
df = df.withColumn("is_leap_year", (y % 4 == 0) & ((y % 100 != 0) | (y % 400 == 0)))
df.show()
11. Convert String to Date Format
Problem: Convert arrival_time from dd-MM-yyyy to yyyy-MM-dd.
Solution
from pyspark.sql.functions import to_date
data = [("15-04-2023",), ("30-11-2022",)]
df = spark.createDataFrame(data, ["arrival_time"])
df = df.withColumn("formatted_date", to_date(df.arrival_time, "dd-MM-yyyy"))
df.show()
12. Calculate Week Number
Problem: For each shipment_date, calculate the week number.
Solution
from pyspark.sql.functions import weekofyear
data = [("2023-01-04",), ("2023-06-15",)]
df = spark.createDataFrame(data, ["shipment_date"])
df = df.withColumn("week_number", weekofyear(df.shipment_date))
df.show()
13. Find Records from the Last 7 Days
Problem: Identify all records where log_date is within the last 7 days.
Solution
from pyspark.sql.functions import current_date, date_add, col
from datetime import date, timedelta
data = [(str(date.today() - timedelta(days=d)),) for d in (2, 6, 12)]
df = spark.createDataFrame(data, ["log_date"])
df_filtered = df.filter(col("log_date") >= date_add(current_date(), -7))
df_filtered.show()
14. Format Date as String
Problem: Format booking_date as dd/MM/yyyy.
Solution
data = [("2023-09-18",), ("2023-12-01",)]
df = spark.createDataFrame(data, ["booking_date"])
df = df.withColumn("formatted_date", date_format(df.booking_date, "dd/MM/yyyy"))
df.show()
15. Find the First and Last Record by Date
Problem: Find the first and last record based on created_at.
Solution
from pyspark.sql.functions import min as spark_min, max as spark_max  # aliases avoid shadowing built-ins
data = [("2023-01-05 10:00:00",), ("2023-08-19 14:30:00",)]
df = spark.createDataFrame(data, ["created_at"])
df_min_max = df.agg(spark_min("created_at").alias("first_record"), spark_max("created_at").alias("last_record"))
df_min_max.show()
16. Difference Between Dates in Months
Problem: Calculate the difference between two dates in months.
Solution
from pyspark.sql.functions import months_between, to_date
data = [("2023-01-01", "2023-04-01")]
df = spark.createDataFrame(data, ["start_month", "end_month"])
df = df.withColumn("month_diff", months_between(to_date(df.end_month), to_date(df.start_month)))
df.show()
17. Convert UTC to Local Time
Problem: Convert utc_timestamp from UTC to IST.
Solution
from pyspark.sql.functions import from_utc_timestamp, to_timestamp
data = [("2023-06-01 10:00:00",)]
df = spark.createDataFrame(data, ["utc_timestamp"])
df = df.withColumn("ist_time", from_utc_timestamp(to_timestamp(df.utc_timestamp), "Asia/Kolkata"))
df.show()
18. Find Holidays
Problem: Check if holiday_date is a public holiday.
Solution
holiday_list = ["2023-01-01", "2023-12-25"]
data = [("2023-01-01",), ("2023-03-10",)]
df = spark.createDataFrame(data, ["holiday_date"])
df = df.withColumn("is_holiday", col("holiday_date").isin(holiday_list))
df.show()
19. Round Time to Nearest Hour
Problem: Round meeting_time to the nearest hour.
Solution
from pyspark.sql.functions import date_trunc, to_timestamp, expr
data = [("2023-05-10 09:40:00",), ("2023-05-10 09:20:00",)]
df = spark.createDataFrame(data, ["meeting_time"])
# date_trunc alone floors to the hour; adding 30 minutes first rounds to the nearest hour
df = df.withColumn("rounded_time", date_trunc("hour", to_timestamp(df.meeting_time) + expr("INTERVAL 30 MINUTES")))
df.show()
20. Extract Quarter
Problem: Extract the quarter of the year from invoice_date.
Solution
from pyspark.sql.functions import quarter
data = [("2023-02-14",), ("2023-08-09",)]
df = spark.createDataFrame(data, ["invoice_date"])
df = df.withColumn("quarter", quarter(df.invoice_date))
df.show()



