
Processing Multi Char Delimiter Dataset using PySpark


In this blog, we will learn how to process files that use a multi-character delimiter with PySpark.

input.txt

id@|#name@|#experience
1@|#Naveen,Pn@|#11
2@|#Abdullah Madani@|#12
3@|#Vicken,Abajian@|#12
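To follow along, the sample file can be created with a few lines of plain Python:

```python
# Write the sample input file used throughout this post.
sample = """id@|#name@|#experience
1@|#Naveen,Pn@|#11
2@|#Abdullah Madani@|#12
3@|#Vicken,Abajian@|#12
"""

with open("input.txt", "w") as f:
    f.write(sample)
```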

Approach - 01

First, let's create a DataFrame by reading the file with the default CSV settings:

df = spark.read.csv(path='input.txt',header=True)
df.show(truncate=False)

As you can see, we are not getting the desired output: the default comma delimiter splits the names that contain commas, and the `@|#` tokens stay embedded in the values. Now let's specify the delimiter explicitly.
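The root of the problem is that the CSV reader's default delimiter is a comma, so the commas inside the names are treated as field separators. A quick check in plain Python (no Spark needed) shows what the reader sees:

```python
line = "1@|#Naveen,Pn@|#11"

# With the default comma delimiter, the name is split in two.
print(line.split(','))    # ['1@|#Naveen', 'Pn@|#11']

# Splitting on the real delimiter yields the intended fields.
print(line.split('@|#'))  # ['1', 'Naveen,Pn', '11']
```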

Approach - 02

df = spark.read.csv(path='input.txt', sep='@|#', header=True)

Note: Spark throws an error when we try to pass a delimiter of more than one character. (Spark 3.0 and later do accept multi-character delimiters in the CSV reader, but on earlier versions the call above fails as shown below.)

raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.IllegalArgumentException: 'Delimiter cannot be more than one character: @|#'

Approach - 03

In this approach, we will use the .text() method of the DataFrameReader class to create the DataFrame.

df = spark.read.text("input.txt")
df.show(truncate=False)

Each line in the text file becomes a record in the DataFrame with a single column named "value". To convert it into multiple columns, we will use a map transformation together with Python's split method to transform and split the column values.

from pyspark.sql.functions import col

df = spark.read.text("input.txt")
header = df.first()[0]        # "id@|#name@|#experience"
schema = header.split('@|#')  # column names: ['id', 'name', 'experience']
df.filter(col('value') != header) \
  .rdd.map(lambda e: e[0].split('@|#')) \
  .toDF(schema) \
  .show()
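The per-line transformation inside the map step is plain Python string handling, so it can be verified without a Spark session. A minimal sketch of what happens to each line:

```python
# Same logic as the rdd.map lambda above, on in-memory strings.
header = "id@|#name@|#experience"
rows = [
    "1@|#Naveen,Pn@|#11",
    "2@|#Abdullah Madani@|#12",
    "3@|#Vicken,Abajian@|#12",
]

schema = header.split('@|#')              # ['id', 'name', 'experience']
records = [r.split('@|#') for r in rows]  # one list of fields per row

for rec in records:
    print(dict(zip(schema, rec)))
```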

Naveen P.N's Tech Blog