Processing a Multi-Character Delimiter Dataset Using PySpark

In this blog, we will learn how to process files that use a multi-character delimiter with PySpark.
input.txt
id@|#name@|#experience
1@|#Naveen,Pn@|#11
2@|#Abdullah Madani@|#12
3@|#Vicken,Abajian@|#12
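For a reproducible walkthrough, the sample file can be created locally first (a small setup sketch; the filename input.txt matches the listing above):

```python
# Write the sample dataset used throughout this post to input.txt.
sample = """id@|#name@|#experience
1@|#Naveen,Pn@|#11
2@|#Abdullah Madani@|#12
3@|#Vicken,Abajian@|#12
"""

with open("input.txt", "w") as f:
    f.write(sample)
```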
Approach - 01
Now let's create a DataFrame
df = spark.read.csv(path='input.txt',header=True)
df.show(truncate=False)
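The default CSV delimiter is a comma, and the name field itself contains commas, so the rows split at the wrong places. A plain-Python illustration of the collision (outside Spark):

```python
line = "1@|#Naveen,Pn@|#11"

# Spark's CSV reader splits on ',' by default, so the embedded comma
# in the name field produces the wrong columns:
print(line.split(','))    # ['1@|#Naveen', 'Pn@|#11']
```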
As you can see, we are not getting the desired output. Now let's specify the delimiter explicitly.
Approach - 02
df = spark.read.csv(path='input.txt', sep='@|#', header=True)
Note: Spark throws an error when we try to pass a delimiter of more than one character (in Spark 2.x; Spark 3.0 and later accept multi-character delimiters).
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.IllegalArgumentException: 'Delimiter cannot be more than one character: @|#'
Approach - 03
In this approach, we will use the .text() method of the DataFrameReader class to create the DataFrame.
df = spark.read.text("input.txt")
df.show(truncate=False)

Each line of the text file becomes one record in the DataFrame, with a single column named “value”. To convert it into multiple columns, we will use the map transformation together with the split method to split each line on the delimiter.
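The split step works because Python's str.split accepts a multi-character separator, so no regex is needed here. A plain-Python illustration of the per-record transformation:

```python
# One raw record as read by spark.read.text()
record = "1@|#Naveen,Pn@|#11"

# str.split treats '@|#' as a literal multi-character separator
fields = record.split('@|#')
print(fields)   # ['1', 'Naveen,Pn', '11']
```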
from pyspark.sql.functions import col

df = spark.read.text("input.txt")
header = df.first()[0]          # 'id@|#name@|#experience'
schema = header.split('@|#')    # ['id', 'name', 'experience']

df.filter(col('value') != header) \
  .rdd.map(lambda e: e[0].split('@|#')) \
  .toDF(schema) \
  .show(truncate=False)
