
Mastering Date Manipulation in PySpark


Time-based data processing is a critical part of data engineering, and PySpark ships a rich set of built-in functions for handling dates and timestamps efficiently. The snippets below share one SparkSession, imports carry over from one snippet to the next, and from section 7 onward each snippet assumes a DataFrame df that already contains the named column.

1. Extracting Year from a Date Column

Problem: Extract the year from the column event_date.

Solution

from pyspark.sql import SparkSession
from pyspark.sql.functions import year

spark = SparkSession.builder.appName("ExtractYear").getOrCreate()

data = [("2023-04-15",), ("2022-11-30",), ("2021-08-25",)]
df = spark.createDataFrame(data, ["event_date"])

df = df.withColumn("year", year(df.event_date))
df.show()

2. Date Difference Calculation

Problem: Calculate the difference in days between start_date and end_date.

Solution

from pyspark.sql.functions import datediff, to_date

data = [("2023-01-01", "2023-02-01"), ("2023-03-15", "2023-03-20")]
df = spark.createDataFrame(data, ["start_date", "end_date"])

df = df.withColumn("start_date", to_date(df.start_date, "yyyy-MM-dd"))
df = df.withColumn("end_date", to_date(df.end_date, "yyyy-MM-dd"))
df = df.withColumn("date_diff", datediff(df.end_date, df.start_date))

df.show()

3. Filter Records Based on Date

Problem: Filter records where event_date is after 2023-06-01.

Solution

from pyspark.sql.functions import col

data = [("2023-05-15",), ("2023-07-20",), ("2023-06-05",)]
df = spark.createDataFrame(data, ["event_date"])

df_filtered = df.filter(col("event_date") > "2023-06-01")
df_filtered.show()

4. Add Days to a Date

Problem: Add 30 days to order_date.

Solution

from pyspark.sql.functions import date_add

data = [("2023-01-10",), ("2023-06-15",), ("2023-07-30",)]
df = spark.createDataFrame(data, ["order_date"])

df = df.withColumn("new_date", date_add(df.order_date, 30))
df.show()

5. Find the Maximum Date

Problem: Determine the latest date from payment_date.

Solution

# Note: this import shadows Python's built-in max() in this scope.
from pyspark.sql.functions import max

data = [("2023-02-15",), ("2023-06-25",), ("2023-01-10",)]
df = spark.createDataFrame(data, ["payment_date"])

df_max = df.agg(max("payment_date").alias("max_date"))
df_max.show()

6. Truncate Date to First Day of Month

Problem: Truncate sale_date to the first day of its respective month.

Solution

from pyspark.sql.functions import trunc

data = [("2023-04-12",), ("2023-07-23",), ("2023-08-05",)]
df = spark.createDataFrame(data, ["sale_date"])

df = df.withColumn("first_day_of_month", trunc(df.sale_date, "MM"))
df.show()

7. Group by Year

Problem: Group records by year extracted from transaction_date.

Solution

df = df.withColumn("year", year(df.transaction_date))
df.groupBy("year").count().show()

8. Filter Records Within a Date Range

Problem: Filter records where visit_date is between 2023-01-01 and 2023-05-01.

Solution

df_filtered = df.filter((col("visit_date") >= "2023-01-01") & (col("visit_date") <= "2023-05-01"))
df_filtered.show()

9. Extract Day of the Week

Problem: Extract the day of the week from attendance_date.

Solution

from pyspark.sql.functions import date_format

df = df.withColumn("day_of_week", date_format(df.attendance_date, "EEEE"))
df.show()

10. Check Leap Year

Problem: Identify if birth_date falls in a leap year.

Solution

from pyspark.sql.functions import year

is_leap = (
    (year(df.birth_date) % 4 == 0)
    & ((year(df.birth_date) % 100 != 0) | (year(df.birth_date) % 400 == 0))
)
df = df.withColumn("is_leap_year", is_leap)
df.show()

11. Convert String to Date Format

Problem: Convert arrival_time from dd-MM-yyyy to yyyy-MM-dd.

Solution

from pyspark.sql.functions import to_date

# to_date parses the string into a DateType column, which
# Spark renders in the standard yyyy-MM-dd form.
df = df.withColumn("formatted_date", to_date(df.arrival_time, "dd-MM-yyyy"))
df.show()

12. Calculate Week Number

Problem: For each shipment_date, calculate the week number.

Solution

from pyspark.sql.functions import weekofyear

df = df.withColumn("week_number", weekofyear(df.shipment_date))
df.show()

13. Find Records from the Last 7 Days

Problem: Identify all records where log_date is within the last 7 days.

Solution

from pyspark.sql.functions import col, current_date, date_add

df_filtered = df.filter(col("log_date") >= date_add(current_date(), -7))
df_filtered.show()

14. Format Date as String

Problem: Format booking_date as dd/MM/yyyy.

Solution

df = df.withColumn("formatted_date", date_format(df.booking_date, "dd/MM/yyyy"))
df.show()

15. Find the First and Last Record by Date

Problem: Find the first and last record based on created_at.

Solution

from pyspark.sql.functions import min, max

df_min_max = df.agg(min("created_at").alias("first_record"), max("created_at").alias("last_record"))
df_min_max.show()

16. Difference Between Dates in Months

Problem: Calculate the difference between two dates in months.

Solution

from pyspark.sql.functions import months_between

df = df.withColumn("month_diff", months_between(df.end_month, df.start_month))
df.show()

17. Convert UTC to Local Time

Problem: Convert utc_timestamp from UTC to IST.

Solution

from pyspark.sql.functions import from_utc_timestamp

df = df.withColumn("ist_time", from_utc_timestamp(df.utc_timestamp, "Asia/Kolkata"))
df.show()

18. Find Holidays

Problem: Check if holiday_date is a public holiday.

Solution

from pyspark.sql.functions import col

holiday_list = ["2023-01-01", "2023-12-25"]
df = df.withColumn("is_holiday", col("holiday_date").isin(holiday_list))
df.show()

19. Round Time to Nearest Hour

Problem: Round meeting_time to the nearest hour.

Solution

from pyspark.sql.functions import date_trunc

# date_trunc floors to the start of the hour (it does not round up).
df = df.withColumn("rounded_time", date_trunc("hour", df.meeting_time))
df.show()

20. Extract Quarter

Problem: Extract the quarter of the year from invoice_date.

Solution

from pyspark.sql.functions import quarter

df = df.withColumn("quarter", quarter(df.invoice_date))
df.show()

Naveen P.N's Tech Blog