This program demonstrates a real-time data pipeline using Spark Structured Streaming to handle deduplication and data quality enforcement on streaming data from CSV files.

Objective

The program achieves the following:

Ingest data from a directory containing CSV files.
Deduplicate records that share a unique identifier (event_id) within the same 10-minute event-time window, keeping one row per group that carries the latest timestamp.
Enforce data quality rules by filtering and handling invalid or missing data.
Output processed data to the console for debugging and verification.

Code Walkthrough

Reading Streams

Spark Session Initialization

The program starts by initializing a SparkSession, which is the entry point for Spark Structured Streaming applications.

spark = SparkSession.builder \
    .appName("Streaming Deduplication and Quality Enforcement") \
    .master("local") \
    .getOrCreate()

Schema Definition

A custom schema is defined to specify the structure of the incoming CSV files:

CUSTOM_SCHEMA = StructType([
    StructField("event_id", StringType(), True),    # Unique event identifier
    StructField("name", StringType(), True),       # Name of the entity
    StructField("timestamp", TimestampType(), True),  # Event timestamp
    StructField("value", IntegerType(), True)      # Numeric value for the event
])
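To make the effect of the schema concrete, here is a minimal pure-Python sketch of how one CSV row might be coerced into the four typed fields. It is not Spark's parser: the helper name parse_row is hypothetical, and the rule that an unparseable field becomes None is an assumption that only loosely mirrors permissive parsing.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # layout used by the sample input


def parse_row(fields):
    """Coerce one CSV row (a list of four strings) into typed values.

    Assumption: a field that cannot be parsed becomes None, and an empty
    string is treated as a missing value.
    """
    event_id, name, raw_timestamp, raw_value = fields
    try:
        timestamp = datetime.strptime(raw_timestamp, TIMESTAMP_FORMAT)
    except ValueError:
        timestamp = None
    try:
        value = int(raw_value)
    except ValueError:
        value = None
    return {
        "event_id": event_id or None,
        "name": name or None,
        "timestamp": timestamp,
        "value": value,
    }


if __name__ == "__main__":
    print(parse_row(["3", "", "2024-12-10 10:10:00", "-10"]))
    print(parse_row(["4", "Bob", "not-a-date", "abc"]))
```

The second call shows why every field in the schema is declared nullable: bad input surfaces as missing values that the quality rules can filter later.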

Reading Streaming Data

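The program reads streaming data from a folder (../resources/dataset/data_quality/input) using the CSV format. Because the sample files begin with a header line, the header option is enabled so that line is not parsed as data:

streaming_df = spark.readStream \
    .format("csv") \
    .option("header", "true") \
    .schema(CUSTOM_SCHEMA) \
    .load("../resources/dataset/data_quality/input")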
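The file source picks up files as they appear in the input folder, so a small helper for dropping test files there can be handy. The sketch below uses only the standard library; the helper name drop_csv, the sibling staging folder, and the file name events_001.csv are all hypothetical. It writes the file elsewhere first and then renames it into place, on the assumption that the stream should never see a half-written file.

```python
import csv
import os
import tempfile
from pathlib import Path

# Sample rows taken from the Input Data section of this post.
SAMPLE_ROWS = [
    ["event_id", "name", "timestamp", "value"],
    ["1", "John", "2024-12-10 10:00:00", "50"],
    ["1", "John", "2024-12-10 10:05:00", "60"],
    ["2", "Alice", "2024-12-10 10:03:00", "30"],
    ["3", "", "2024-12-10 10:10:00", "-10"],
]


def drop_csv(input_dir, file_name, rows):
    """Write rows to a staging folder, then move the file into input_dir."""
    target_dir = Path(input_dir)
    staging_dir = target_dir.parent / "staging"  # hypothetical staging folder
    target_dir.mkdir(parents=True, exist_ok=True)
    staging_dir.mkdir(parents=True, exist_ok=True)

    staged_path = staging_dir / file_name
    with staged_path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)

    final_path = target_dir / file_name
    # A rename within one filesystem makes the file appear in a single step.
    os.replace(staged_path, final_path)
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        written = drop_csv(Path(root) / "input", "events_001.csv", SAMPLE_ROWS)
        print(written.read_text())
```

Pointing input_dir at the folder the stream is watching while the query runs should cause the new rows to show up in the next micro-batch.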

Deduplication Using Time-Based Windowing

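Deduplication is achieved by grouping records by a 10-minute tumbling window on timestamp together with event_id, then collapsing each group to a single row that carries the group's latest timestamp. Because the window is part of the grouping key, an event_id that recurs in a later window produces a separate row.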
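To see which window each sample event lands in, the following pure-Python sketch floors a timestamp to its 10-minute boundary. It assumes windows are aligned to the Unix epoch and treats the naive timestamps as UTC; the function name window_for is hypothetical and this is an illustration of the bucketing idea, not Spark's implementation.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1)  # assumption: windows are aligned to the epoch


def window_for(timestamp, width=WINDOW):
    """Return the [start, end) tumbling window that contains timestamp."""
    offset = (timestamp - EPOCH) % width  # time elapsed inside the window
    start = timestamp - offset
    return start, start + width


if __name__ == "__main__":
    samples = [
        datetime(2024, 12, 10, 10, 0, 0),
        datetime(2024, 12, 10, 10, 5, 0),
        datetime(2024, 12, 10, 10, 3, 0),
        datetime(2024, 12, 10, 10, 10, 0),
    ]
    for sample in samples:
        start, end = window_for(sample)
        print(sample, "->", start, "to", end)
```

The first three sample timestamps share the window starting at 10:00, while 10:10:00 starts the next one, because the window end is exclusive.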

Add Watermark and Grouping

streaming_df.withWatermark("timestamp", "10 minutes").groupBy(
    window(col("timestamp"), "10 minutes"),  # Group by 10-minute time windows
    col("event_id")                         # Group by event_id
)

Aggregate Functions

.agg(
    max("timestamp").alias("latest_timestamp"),  # Latest timestamp per group
    max("value").alias("value"),                # Largest value per group (equals the latest value only if values never decrease)
    first("name", ignorenulls=True).alias("name")  # First non-null name
)
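Note that each aggregate is computed independently, so the maximum value and the maximum timestamp can come from different rows. The pure-Python sketch below shows what keeping the whole latest row per group looks like, and how it differs from taking the largest value. The helper names are hypothetical and the window arithmetic assumes a width that divides 60 minutes. In Spark, one way to get similar behaviour might be to aggregate max(struct("timestamp", "value", "name")), but that is a suggestion rather than what the program above does.

```python
from datetime import datetime


def ts(text):
    """Parse the timestamp layout used by the sample input."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def latest_per_group(rows, width_minutes=10):
    """Keep the row with the greatest timestamp per (window start, event_id).

    Assumption: width_minutes divides 60, so flooring the minute is enough.
    """
    latest = {}
    for row in rows:
        stamp = row["timestamp"]
        window_start = stamp.replace(
            minute=stamp.minute - stamp.minute % width_minutes,
            second=0,
            microsecond=0,
        )
        key = (window_start, row["event_id"])
        if key not in latest or stamp > latest[key]["timestamp"]:
            latest[key] = row
    return list(latest.values())


if __name__ == "__main__":
    # Hypothetical rows where the value goes down over time.
    rows = [
        {"event_id": "1", "name": "John", "timestamp": ts("2024-12-10 10:00:00"), "value": 50},
        {"event_id": "1", "name": "John", "timestamp": ts("2024-12-10 10:05:00"), "value": 40},
    ]
    print("latest row value:", latest_per_group(rows)[0]["value"])
    print("largest value:", max(row["value"] for row in rows))
```

With the sample input from this post the two approaches agree, because the value for event_id 1 rises from 50 to 60.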

Data Quality Enforcement

Data quality is enforced by applying filters and handling null values:

quality_enforced_df = deduplicated_df \
    .filter((col("value").isNotNull()) & (col("value") > 0)) \
    .fillna({"name": "Unknown"})  # Replace null names
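The same two rules can be written out in plain Python to check them against sample rows without starting Spark. This is only a sketch of the logic; the function name enforce_quality and the dictionary row layout are assumptions made for illustration.

```python
def enforce_quality(rows):
    """Drop rows whose value is missing or not positive; default missing names."""
    cleaned = []
    for row in rows:
        value = row.get("value")
        if value is None or value <= 0:
            continue  # mirrors the filter on value
        name = row.get("name")
        cleaned.append({**row, "name": "Unknown" if name is None else name})
    return cleaned


if __name__ == "__main__":
    deduplicated = [
        {"event_id": "1", "name": "John", "value": 60},
        {"event_id": "2", "name": "Alice", "value": 30},
        {"event_id": "3", "name": None, "value": -10},
        {"event_id": "4", "name": None, "value": 5},  # hypothetical extra row
    ]
    for row in enforce_quality(deduplicated):
        print(row)
```

Event 3 is removed by the value rule before its missing name is ever filled in, which is why "Unknown" does not appear in the sample output further down.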

Writing to the Console

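The processed data is written to the console through the DataFrame's writeStream interface. Complete output mode re-emits the entire aggregated result table on every trigger, so Spark keeps all aggregation state and the watermark is not used to discard old windows: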

quality_enforced_df.writeStream \
    .format("console") \
    .outputMode("complete") \
    .start() \
    .awaitTermination()
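As a rough mental model of complete output mode, the toy simulation below keeps a running result table and emits all of it after every micro-batch. It is a simplification under stated assumptions: windows and watermarks are ignored, timestamps are compared as strings, and the name run_complete_mode is hypothetical.

```python
def run_complete_mode(batches):
    """Yield the full result table after each micro-batch.

    Each batch is a list of (event_id, timestamp, value) tuples. State is
    never discarded, which is the defining property being illustrated.
    """
    state = {}
    for batch in batches:
        for event_id, timestamp, value in batch:
            current = state.get(event_id)
            if current is None or timestamp > current[0]:
                state[event_id] = (timestamp, value)
        # Emit every row, including rows this batch did not change.
        yield sorted((event_id, *row) for event_id, row in state.items())


if __name__ == "__main__":
    batches = [
        [("1", "2024-12-10 10:00:00", 50), ("2", "2024-12-10 10:03:00", 30)],
        [("1", "2024-12-10 10:05:00", 60)],
    ]
    for number, table in enumerate(run_complete_mode(batches)):
        print("Batch", number, table)
```

The row for event_id 2 is printed again for the second batch even though nothing about it changed, which is the kind of repetition to expect on the console.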

Input Data

Input

event_id,name,timestamp,value
1,John,2024-12-10 10:00:00,50
1,John,2024-12-10 10:05:00,60
2,Alice,2024-12-10 10:03:00,30
3,,2024-12-10 10:10:00,-10

Output

+--------+-------------------+-----+-------+
|event_id|timestamp          |value|name   |
+--------+-------------------+-----+-------+
|1       |2024-12-10 10:05:00|60   |John   |
|2       |2024-12-10 10:03:00|30   |Alice  |
+--------+-------------------+-----+-------+

Complete Code

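from pyspark.sql import SparkSession
# Note: importing max from pyspark shadows Python's built-in max in this module
from pyspark.sql.functions import col, first, max, window
from pyspark.sql.types import IntegerType, StringType, StructField, StructType, TimestampType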

if __name__ == '__main__':
    # Create Spark session
    spark = SparkSession.builder \
        .appName("Streaming Deduplication and Quality Enforcement") \
        .master("local") \
        .getOrCreate()

    # Define custom schema
    CUSTOM_SCHEMA = StructType([
        StructField("event_id", StringType(), True),    # Unique event identifier
        StructField("name", StringType(), True),       # Name of the entity
        StructField("timestamp", TimestampType(), True),  # Event timestamp
        StructField("value", IntegerType(), True)      # Numeric value for the event
    ])

    # Source: Read streaming data from CSV files (each file starts with a header line)
    streaming_df = spark.readStream \
        .format("csv") \
        .option("header", "true") \
        .schema(CUSTOM_SCHEMA) \
        .load("../resources/dataset/data_quality/input")

    # Deduplicate using time-based windowing
    deduplicated_df = streaming_df.withWatermark("timestamp", "10 minutes") \
        .groupBy(
            window(col("timestamp"), "10 minutes"),  # Group by time window
            col("event_id")                         # Group by event_id
        ) \
        .agg(
            max("timestamp").alias("latest_timestamp"),  # Keep the latest timestamp
            max("value").alias("value"),                # Keep the largest value in the group
            first("name", ignorenulls=True).alias("name")  # Keep the first non-null name
        ) \
        .select(
            col("event_id"),
            col("latest_timestamp").alias("timestamp"),
            col("value"),
            col("name")
        )

    # Enforce data quality rules
    quality_enforced_df = deduplicated_df \
        .filter((col("value").isNotNull()) & (col("value") > 0)) \
        .fillna({"name": "Unknown"})  # Replace null names

    # Sink: Write the results to the console
    quality_enforced_df.writeStream \
        .format("console") \
        .outputMode("complete") \
        .start() \
        .awaitTermination()
