
Processing Multi Char Delimiter Dataset using PySpark


In this blog, we will learn how to process files that use a multi-character delimiter with PySpark.

input.txt

id@|#name@|#experience
1@|#Naveen,Pn@|#11
2@|#Abdullah Madani@|#12
3@|#Vicken,Abajian@|#12
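To follow along, the sample file can be created with a few lines of plain Python:

```python
# Write the sample input file used throughout this post.
sample = """id@|#name@|#experience
1@|#Naveen,Pn@|#11
2@|#Abdullah Madani@|#12
3@|#Vicken,Abajian@|#12
"""

with open("input.txt", "w") as f:
    f.write(sample)
```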

Approach - 01

First, let's create a DataFrame by reading the file with the default CSV settings:

df = spark.read.csv(path='input.txt',header=True)
df.show(truncate=False)

As you can see, we are not getting the desired output: the default comma delimiter splits the names that contain commas, and the `@|#` tokens stay embedded in the values. Now let's specify the delimiter explicitly.
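The root of the problem is that the CSV reader's default delimiter is a comma, so the commas inside the names are treated as field separators. A quick check in plain Python (no Spark needed) shows what the reader sees:

```python
line = "1@|#Naveen,Pn@|#11"

# With the default comma delimiter, the name is split in two.
print(line.split(','))    # ['1@|#Naveen', 'Pn@|#11']

# Splitting on the real delimiter yields the intended fields.
print(line.split('@|#'))  # ['1', 'Naveen,Pn', '11']
```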

Approach - 02

df = spark.read.csv(path='input.txt', sep='@|#', header=True)

Note: Spark throws an error when we try to pass a delimiter of more than one character. (Spark 3.0 and later do accept multi-character delimiters in the CSV reader, but on earlier versions the call above fails as shown below.)

raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.IllegalArgumentException: 'Delimiter cannot be more than one character: @|#'

Approach - 03

In this approach, we will use the .text() method of the DataFrameReader class to create the DataFrame.

df = spark.read.text("input.txt")
df.show(truncate=False)

Each line in the text file becomes a record in the DataFrame with a single column named "value". To convert it into multiple columns, we will use a map transformation together with Python's split method to transform and split the column values.

from pyspark.sql.functions import col

df = spark.read.text("input.txt")
header = df.first()[0]        # "id@|#name@|#experience"
schema = header.split('@|#')  # column names: ['id', 'name', 'experience']
df.filter(col('value') != header) \
  .rdd.map(lambda e: e[0].split('@|#')) \
  .toDF(schema) \
  .show()
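The per-line transformation inside the map step is plain Python string handling, so it can be verified without a Spark session. A minimal sketch of what happens to each line:

```python
# Same logic as the rdd.map lambda above, on in-memory strings.
header = "id@|#name@|#experience"
rows = [
    "1@|#Naveen,Pn@|#11",
    "2@|#Abdullah Madani@|#12",
    "3@|#Vicken,Abajian@|#12",
]

schema = header.split('@|#')              # ['id', 'name', 'experience']
records = [r.split('@|#') for r in rows]  # one list of fields per row

for rec in records:
    print(dict(zip(schema, rec)))
```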

Naveen P.N's Tech Blog