String Manipulation in PySpark!

Updated April 17, 2025


I am a tech enthusiast with 13+ years of experience in IT as a consultant, corporate trainer, and mentor, with 12+ years in training and mentoring in software engineering, data engineering, test automation, and data science. I have trained more than 10,000 IT professionals and conducted more than 500 training sessions in the areas of software development, data engineering, cloud, data analysis, data visualization, artificial intelligence, and machine learning. I am interested in writing blogs, sharing technical knowledge, solving technical issues, and reading and learning new subjects.

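In the world of data processing and analysis, data cleanliness is paramount. That's where PySpark's trim, ltrim, and rtrim functions come into play! They're your trusty allies for tidying up strings in DataFrames: trim removes spaces from both ends of a string, ltrim removes them from the start, and rtrim removes them from the end.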

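from pyspark.sql import SparkSession

# Create (or reuse) a Spark session for the examples
spark = SparkSession.builder.appName("SparkDemoApp").getOrCreate()
# Sample values with one leading and one trailing space each
data = [(" Java ",), (" Scala ",), (" Python ",)]
df = spark.createDataFrame(data, ["languages"])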

Using trim()

Trim leading and trailing spaces

from pyspark.sql.functions import trim, col
df = df.withColumn("cleaned_data", trim(col("languages")))
df.show()

Using ltrim()

Trim leading spaces

from pyspark.sql.functions import ltrim, col
df = df.withColumn("cleaned_data", ltrim(col("languages")))
df.show()

Using rtrim()

Trim trailing spaces

from pyspark.sql.functions import rtrim, col
df = df.withColumn("cleaned_data", rtrim(col("languages")))
df.show()

Do you want to connect with me? I have started mentoring for careers and interviews at topmate.io/naveenpn

#apache-spark #big-data #data-analytics


Naveen P.N's Tech Blog
