Techniques for Efficiently Processing Nested Schemas with Apache Spark

Apache Spark provides powerful tools for working with complex, nested data structures. In this blog, we'll explore two different approaches to handling nested schemas in PySpark.
Let's consider a JSON dataset of customers, where each customer has an ID, a full name (a struct made up of a first name and a last name), and a city. Here's an example of what the data might look like:
[
{"customer_id": 1, "fullname": {"firstname": "John", "lastname": "Doe"}, "city": "Bangalore"},
{"customer_id": 2, "fullname": {"firstname": "Jane", "lastname": "Doe"}, "city": "Mysore"},
{"customer_id": 3, "fullname": {"firstname": "Bob", "lastname": "Smith"}, "city": "Chennai"}
]
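To follow along, the sample can be written to disk with Python's standard json module. The sketch below is an assumption about setup rather than part of the original walkthrough: the file names and the temporary directory are hypothetical. It writes the same records in two layouts, a JSON array spread over several lines (as shown above) and JSON Lines with one object per line, which is the layout Spark's JSON reader expects by default.

```python
import json
import tempfile
from pathlib import Path

# The three customer records from the example above.
customers = [
    {"customer_id": 1, "fullname": {"firstname": "John", "lastname": "Doe"}, "city": "Bangalore"},
    {"customer_id": 2, "fullname": {"firstname": "Jane", "lastname": "Doe"}, "city": "Mysore"},
    {"customer_id": 3, "fullname": {"firstname": "Bob", "lastname": "Smith"}, "city": "Chennai"},
]

# Hypothetical output location; replace with any directory Spark can read.
out_dir = Path(tempfile.mkdtemp())

# Layout 1: a single JSON array spread over several lines.
array_path = out_dir / "customers_array.json"
array_path.write_text(json.dumps(customers, indent=2), encoding="utf-8")

# Layout 2: JSON Lines, one complete object per line.
lines_path = out_dir / "customers_lines.json"
lines_path.write_text(
    "\n".join(json.dumps(record) for record in customers) + "\n",
    encoding="utf-8",
)

print(array_path)
print(lines_path)
```

The array layout needs the reader's multiLine option when loaded with Spark; the JSON Lines layout can be read with the default settings.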
We can load this data into a DataFrame and apply a nested schema in either of two ways. Here is each approach in brief; both are explained in more detail afterwards:
Approach 1: Using DDL Strings
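ddlSchema = """
    customer_id long,
    fullname struct<firstname:string,lastname:string>,
    city string
"""
# spark is an existing SparkSession.
# The parentheses let the method chain span several lines.
df = (
    spark.read
    .format("json")
    .option("multiLine", True)  # the sample file is one JSON array spanning several lines
    .schema(ddlSchema)
    .load("/path/to/data.json")
)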
Approach 2: Using StructType and StructField
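from pyspark.sql.types import StructType, StructField, LongType, StringType

customer_schema = StructType([
    StructField("customer_id", LongType()),
    StructField("fullname", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType()),
    ])),
    StructField("city", StringType()),
])
# The parentheses let the method chain span several lines.
df = (
    spark.read
    .format("json")
    .option("multiLine", True)  # the sample file is one JSON array spanning several lines
    .schema(customer_schema)
    .load("/path/to/data.json")
)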
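In both cases, "/path/to/data.json" should be replaced with the actual path to your JSON file. The resulting DataFrame has a nested schema that matches the structure of the data: fullname is a struct column, and its subfields can be selected with dot notation as fullname.firstname and fullname.lastname.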
Approach 1: Using DDL Strings
The first approach defines the schema as a DDL (Data Definition Language) string: a comma-separated list of column names and types, written the way columns are declared in a SQL CREATE TABLE statement. Nested fields use the struct<name:type,...> syntax. Here's an example:
ddlSchema = "customer_id long, fullname struct<firstname:string,lastname:string>, city string"
df = (
    spark.read
    .format("json")
    .option("multiLine", True)  # needed when a file holds a multi-line JSON array
    .schema(ddlSchema)
    .load("/path/to/data")
)
In this code, ddlSchema defines a schema with three fields: "customer_id", "fullname", and "city". The "fullname" field is a struct that contains two subfields: "firstname" and "lastname". Passing the string to schema() applies it when the data is read, so Spark does not have to scan the files to infer a schema.
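Since a DDL schema is just a string, it can also be generated from a Python description of the fields. The helper below is a hypothetical illustration, not a PySpark API: to_ddl, field_type, struct_type, and the customer_fields dictionary are names made up for this sketch. A nested dict stands for a struct, and a plain string is used as the type name as-is.

```python
def struct_type(fields):
    """Render a dict of {name: type} as a DDL struct<...> type."""
    inner = ",".join(f"{name}:{field_type(spec)}" for name, spec in fields.items())
    return f"struct<{inner}>"


def field_type(spec):
    """A dict is a nested struct; a string is already a DDL type name."""
    return struct_type(spec) if isinstance(spec, dict) else spec


def to_ddl(fields):
    """Render top-level fields as 'name type, name type, ...'."""
    return ", ".join(f"{name} {field_type(spec)}" for name, spec in fields.items())


# Hypothetical description of the customer records used in this post.
customer_fields = {
    "customer_id": "long",
    "fullname": {"firstname": "string", "lastname": "string"},
    "city": "string",
}

ddlSchema = to_ddl(customer_fields)
print(ddlSchema)
```

The resulting string can be passed to schema() like a hand-written one. Field names containing spaces or other special characters would need backtick quoting, which this sketch does not handle.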
Approach 2: Using StructType and StructField
The second approach defines the schema using StructType and StructField objects. Because the schema is an ordinary Python object, you can build it programmatically, reuse nested pieces across schemas, and attach metadata to individual fields. Here's an example:
from pyspark.sql.types import StructType, StructField, LongType, StringType

customer_schema = StructType([
    StructField("customer_id", LongType()),
    StructField("fullname", StructType([
        StructField("firstname", StringType()),
        StructField("lastname", StringType()),
    ])),
    StructField("city", StringType()),
])
df = (
    spark.read
    .format("json")
    .option("multiLine", True)  # needed when a file holds a multi-line JSON array
    .schema(customer_schema)
    .load("/path/to/data")
)
In this code, customer_schema defines the same schema as before, but using StructType and StructField objects. The schema(customer_schema) method applies this schema to the data.
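Both approaches describe the same schema and let you work with nested data in Spark. DDL strings are compact and easy to read for small, fixed schemas, while StructType and StructField are a better fit when the schema is built or modified in code. The best one to use depends on your specific needs and the complexity of your data.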



