Techniques for Efficiently Processing Nested Schemas with Apache Spark

Updated April 24, 2025

I am a tech enthusiast with 13+ years of experience in IT as a consultant, corporate trainer, and mentor, including 12+ years of training and mentoring in software engineering, data engineering, test automation, and data science. I have trained more than 10,000 IT professionals and conducted more than 500 training sessions in the areas of software development, data engineering, cloud, data analysis, data visualization, artificial intelligence, and machine learning. I am interested in writing blogs, sharing technical knowledge, solving technical issues, and reading and learning new subjects.

Part of series: Data Engineering

Apache Spark provides powerful tools for working with complex, nested data structures. In this blog, we'll explore two different approaches to handling nested schemas in PySpark.

Let's consider a JSON dataset of customers, where each customer has an ID, a name (consisting of a first name and a last name), and a city. Here's an example of what the data might look like:

[
{"customer_id": 1, "fullname": {"firstname": "John", "lastname": "Doe"}, "city": "Bangalore"},
{"customer_id": 2, "fullname": {"firstname": "Jane", "lastname": "Doe"}, "city": "Mysore"},
{"customer_id": 3, "fullname": {"firstname": "Bob", "lastname": "Smith"}, "city": "Chennai"}
]

We can load this data into a DataFrame and apply the nested schema using either of two approaches. Here's how each one looks:

Approach 1: Using DDL Strings

ddlSchema = """
customer_id long, 
fullname struct<firstname:string,lastname:string>, 
city string
"""

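# Parentheses let the chained calls span several lines in Python.
df = (
    spark.read
    .format("json")
    .option("multiLine", True)  # the sample file is one JSON array spread over several lines
    .schema(ddlSchema)
    .load("/path/to/data.json")
)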

Approach 2: Using StructType and StructField

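from pyspark.sql.types import StructType, StructField, LongType, StringType

customer_schema = StructType([
    StructField("customer_id", LongType()),
    # A nested StructType describes the subfields of "fullname".
    StructField("fullname", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType())
    ])),
    StructField("city", StringType())
])

# Parentheses let the chained calls span several lines in Python.
df = (
    spark.read
    .format("json")
    .option("multiLine", True)  # the sample file is one JSON array spread over several lines
    .schema(customer_schema)
    .load("/path/to/data.json")
)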

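In both cases, "/path/to/data.json" should be replaced with the actual path to your JSON file. Because the sample file is a single JSON array spread over several lines, the reader needs the multiLine option set to true; by default, Spark expects one JSON object per line. The resulting DataFrame will have a nested schema that matches the structure of the data.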

Approach 1: Using DDL Strings

The first approach involves defining the schema using a DDL (Data Definition Language) string. This is a string that specifies the structure of the data in a format similar to the one used in SQL. Here's an example:

ddlSchema = "customer_id long, fullname struct<firstname:string,lastname:string>, city string"

# Parentheses let the chained calls span several lines in Python.
df = (
    spark.read
    .format("json")
    .schema(ddlSchema)
    .load("/path/to/data")
)

In this code, ddlSchema defines a schema with three fields: "customer_id", "fullname", and "city". The "fullname" field is a struct that contains two subfields: "firstname" and "lastname". Passing the string to schema(ddlSchema) tells the reader to use this schema instead of inferring one from the data.

Approach 2: Using StructType and StructField

The second approach involves defining the schema using StructType and StructField objects. It is more verbose, but the schema is an ordinary Python object, so you can build it programmatically and set per-field details such as nullability. Here's an example:

from pyspark.sql.types import StructType, StructField, LongType, StringType

customer_schema = StructType([
    StructField("customer_id", LongType()),
    # A nested StructType describes the subfields of "fullname".
    StructField("fullname", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType())
    ])),
    StructField("city", StringType())
])

# Parentheses let the chained calls span several lines in Python.
df = (
    spark.read
    .format("json")
    .schema(customer_schema)
    .load("/path/to/data")
)

In this code, customer_schema defines the same schema as before, but using StructType and StructField objects. The schema(customer_schema) method applies this schema to the data.

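Both of these approaches allow you to work with nested data in Spark. A DDL string is compact and easy to read for short schemas, while StructType and StructField suit schemas that are built or modified in code. The best one to use depends on your specific needs and the complexity of your data.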

#apache-spark #data-engineering #big-data
