Skip to main content

Command Palette

Search for a command to run...

Mastering Date Manipulation in PySpark

Updated
โ€ข4 min read
Mastering Date Manipulation in PySpark
N

I am a Tech Enthusiast having 13+ years of experience in ๐ˆ๐“ as a ๐‚๐จ๐ง๐ฌ๐ฎ๐ฅ๐ญ๐š๐ง๐ญ, ๐‚๐จ๐ซ๐ฉ๐จ๐ซ๐š๐ญ๐ž ๐“๐ซ๐š๐ข๐ง๐ž๐ซ, ๐Œ๐ž๐ง๐ญ๐จ๐ซ, with 12+ years in training and mentoring in ๐’๐จ๐Ÿ๐ญ๐ฐ๐š๐ซ๐ž ๐„๐ง๐ ๐ข๐ง๐ž๐ž๐ซ๐ข๐ง๐ , ๐ƒ๐š๐ญ๐š ๐„๐ง๐ ๐ข๐ง๐ž๐ž๐ซ๐ข๐ง๐ , ๐“๐ž๐ฌ๐ญ ๐€๐ฎ๐ญ๐จ๐ฆ๐š๐ญ๐ข๐จ๐ง ๐š๐ง๐ ๐ƒ๐š๐ญ๐š ๐’๐œ๐ข๐ž๐ง๐œ๐ž. I have ๐’•๐’“๐’‚๐’Š๐’๐’†๐’… ๐’Ž๐’๐’“๐’† ๐’•๐’‰๐’‚๐’ 10,000+ ๐‘ฐ๐‘ป ๐‘ท๐’“๐’๐’‡๐’†๐’”๐’”๐’Š๐’๐’๐’‚๐’๐’” and ๐’„๐’๐’๐’…๐’–๐’„๐’•๐’†๐’… ๐’Ž๐’๐’“๐’† ๐’•๐’‰๐’‚๐’ 500+ ๐’•๐’“๐’‚๐’Š๐’๐’Š๐’๐’ˆ ๐’”๐’†๐’”๐’”๐’Š๐’๐’๐’” in the areas of ๐’๐จ๐Ÿ๐ญ๐ฐ๐š๐ซ๐ž ๐ƒ๐ž๐ฏ๐ž๐ฅ๐จ๐ฉ๐ฆ๐ž๐ง๐ญ, ๐ƒ๐š๐ญ๐š ๐„๐ง๐ ๐ข๐ง๐ž๐ž๐ซ๐ข๐ง๐ , ๐‚๐ฅ๐จ๐ฎ๐, ๐ƒ๐š๐ญ๐š ๐€๐ง๐š๐ฅ๐ฒ๐ฌ๐ข๐ฌ, ๐ƒ๐š๐ญ๐š ๐•๐ข๐ฌ๐ฎ๐š๐ฅ๐ข๐ณ๐š๐ญ๐ข๐จ๐ง๐ฌ, ๐€๐ซ๐ญ๐ข๐Ÿ๐ข๐œ๐ข๐š๐ฅ ๐ˆ๐ง๐ญ๐ž๐ฅ๐ฅ๐ข๐ ๐ž๐ง๐œ๐ž ๐š๐ง๐ ๐Œ๐š๐œ๐ก๐ข๐ง๐ž ๐‹๐ž๐š๐ซ๐ง๐ข๐ง๐ . I am interested in ๐ฐ๐ซ๐ข๐ญ๐ข๐ง๐  ๐›๐ฅ๐จ๐ ๐ฌ, ๐ฌ๐ก๐š๐ซ๐ข๐ง๐  ๐ญ๐ž๐œ๐ก๐ง๐ข๐œ๐š๐ฅ ๐ค๐ง๐จ๐ฐ๐ฅ๐ž๐๐ ๐ž, ๐ฌ๐จ๐ฅ๐ฏ๐ข๐ง๐  ๐ญ๐ž๐œ๐ก๐ง๐ข๐œ๐š๐ฅ ๐ข๐ฌ๐ฌ๐ฎ๐ž๐ฌ, ๐ซ๐ž๐š๐๐ข๐ง๐  ๐š๐ง๐ ๐ฅ๐ž๐š๐ซ๐ง๐ข๐ง๐  new subjects.

Time-based data processing is a critical aspect of data engineering, and PySpark provides a rich set of functions to handle date and time efficiently.

1. Extracting Year from a Date Column

Problem: Extract the year from the column event_date.

Solution

from pyspark.sql import SparkSession
from pyspark.sql.functions import year

spark = SparkSession.builder.appName("ExtractYear").getOrCreate()

data = [("2023-04-15",), ("2022-11-30",), ("2021-08-25",)]
df = spark.createDataFrame(data, ["event_date"])

df = df.withColumn("year", year(df.event_date))
df.show()

2. Date Difference Calculation

Problem: Calculate the difference in days between start_date and end_date.

Solution

from pyspark.sql.functions import datediff, to_date

data = [("2023-01-01", "2023-02-01"), ("2023-03-15", "2023-03-20")]
df = spark.createDataFrame(data, ["start_date", "end_date"])

df = df.withColumn("start_date", to_date(df.start_date, "yyyy-MM-dd"))
df = df.withColumn("end_date", to_date(df.end_date, "yyyy-MM-dd"))
df = df.withColumn("date_diff", datediff(df.end_date, df.start_date))

df.show()

3. Filter Records Based on Date

Problem: Filter records where event_date is after 2023-06-01.

Solution

from pyspark.sql.functions import col

data = [("2023-05-15",), ("2023-07-20",), ("2023-06-05",)]
df = spark.createDataFrame(data, ["event_date"])

df_filtered = df.filter(col("event_date") > "2023-06-01")
df_filtered.show()

4. Add Days to a Date

Problem: Add 30 days to order_date.

Solution

from pyspark.sql.functions import date_add

data = [("2023-01-10",), ("2023-06-15",), ("2023-07-30",)]
df = spark.createDataFrame(data, ["order_date"])

df = df.withColumn("new_date", date_add(df.order_date, 30))
df.show()

5. Find the Maximum Date

Problem: Determine the latest date from payment_date.

Solution

from pyspark.sql.functions import max

data = [("2023-02-15",), ("2023-06-25",), ("2023-01-10",)]
df = spark.createDataFrame(data, ["payment_date"])

df_max = df.agg(max("payment_date").alias("max_date"))
df_max.show()

6. Truncate Date to First Day of Month

Problem: Truncate sale_date to the first day of its respective month.

Solution

from pyspark.sql.functions import trunc

data = [("2023-04-12",), ("2023-07-23",), ("2023-08-05",)]
df = spark.createDataFrame(data, ["sale_date"])

df = df.withColumn("first_day_of_month", trunc(df.sale_date, "MM"))
df.show()

7. Group by Year

Problem: Group records by year extracted from transaction_date.

Solution

df = df.withColumn("year", year(df.transaction_date))
df.groupBy("year").count().show()

8. Filter Records Within a Date Range

Problem: Filter records where visit_date is between 2023-01-01 and 2023-05-01.

Solution

df_filtered = df.filter((col("visit_date") >= "2023-01-01") & (col("visit_date") <= "2023-05-01"))
df_filtered.show()

9. Extract Day of the Week

Problem: Extract the day of the week from attendance_date.

Solution

from pyspark.sql.functions import date_format

df = df.withColumn("day_of_week", date_format(df.attendance_date, "EEEE"))
df.show()

10. Check Leap Year

Problem: Identify if birth_date falls in a leap year.

Solution

from pyspark.sql.functions import year

df = df.withColumn("is_leap_year", (year(df.birth_date) % 4 == 0) & ((year(df.birth_date) % 100 != 0) | (year(df.birth_date) % 400 == 0)))
df.show()

11. Convert String to Date Format

Problem: Convert arrival_time from dd-MM-yyyy to yyyy-MM-dd.

Solution

from pyspark.sql.functions import to_date

df = df.withColumn("formatted_date", to_date(df.arrival_time, "dd-MM-yyyy"))
df.show()

12. Calculate Week Number

Problem: For each shipment_date, calculate the week number.

Solution

from pyspark.sql.functions import weekofyear

df = df.withColumn("week_number", weekofyear(df.shipment_date))
df.show()

13. Find Records from the Last 7 Days

Problem: Identify all records where log_date is within the last 7 days.

Solution

from pyspark.sql.functions import current_date

df_filtered = df.filter(col("log_date") >= date_add(current_date(), -7))
df_filtered.show()

14. Format Date as String

Problem: Format booking_date as dd/MM/yyyy.

Solution

df = df.withColumn("formatted_date", date_format(df.booking_date, "dd/MM/yyyy"))
df.show()

15. Find the First and Last Record by Date

Problem: Find the first and last record based on created_at.

Solution

from pyspark.sql.functions import min, max

df_min_max = df.agg(min("created_at").alias("first_record"), max("created_at").alias("last_record"))
df_min_max.show()

16. Difference Between Dates in Months

Problem: Calculate the difference between two dates in months.

Solution

from pyspark.sql.functions import months_between

df = df.withColumn("month_diff", months_between(df.end_month, df.start_month))
df.show()

17. Convert UTC to Local Time

Problem: Convert utc_timestamp from UTC to IST.

Solution

from pyspark.sql.functions import from_utc_timestamp

df = df.withColumn("ist_time", from_utc_timestamp(df.utc_timestamp, "Asia/Kolkata"))
df.show()

18. Find Holidays

Problem: Check if holiday_date is a public holiday.

Solution

pythonCopyEditholiday_list = ["2023-01-01", "2023-12-25"]
df = df.withColumn("is_holiday", col("holiday_date").isin(holiday_list))
df.show()

19. Round Time to Nearest Hour

Problem: Round meeting_time to the nearest hour.

Solution

from pyspark.sql.functions import date_trunc

df = df.withColumn("rounded_time", date_trunc("hour", df.meeting_time))
df.show()

20. Extract Quarter

Problem: Extract the quarter of the year from invoice_date.

Solution

from pyspark.sql.functions import quarter

df = df.withColumn("quarter", quarter(df.invoice_date))
df.show()

More from this blog

Naveen P.N's Tech Blog

95 posts