How to Trim a String Column in PySpark DataFrame?

How to Trim a String Column in PySpark DataFrame?

Trimming a string refers to removing leading and trailing whitespace from the string. In PySpark, the `trim` function from the `pyspark.sql.functions` module is used to trim string columns in a DataFrame. You can use the `trim`, `ltrim` (to remove left whitespace), and `rtrim` (to remove right whitespace) functions depending on your needs.

Using the `trim` Function in PySpark

Here’s how you can use the `trim` function to remove leading and trailing whitespace from a string column in a PySpark DataFrame:

Example


from pyspark.sql import SparkSession
from pyspark.sql.functions import trim

# Initialize SparkSession
spark = SparkSession.builder.master("local[*]").appName("TrimExample").getOrCreate()

# Sample Data
data = [("   John Doe   ",), ("   Jane Doe   ",), ("Alice   ",), ("   Bob",)]
columns = ["name"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show original DataFrame
df.show(truncate=False)

# Use trim function
trimmed_df = df.withColumn("trimmed_name", trim(df["name"]))

# Show DataFrame with trimmed column
trimmed_df.show(truncate=False)

Output


+------------+
|name        |
+------------+
|   John Doe   |
|   Jane Doe   |
|Alice         |
|   Bob        |
+------------+

+------------+------------+
|name        |trimmed_name|
+------------+------------+
|   John Doe   |John Doe    |
|   Jane Doe   |Jane Doe    |
|Alice         |Alice       |
|   Bob        |Bob         |
+------------+------------+

In the above example, we initialized a `SparkSession` and created a DataFrame with a single column containing strings with leading and trailing spaces. We then applied the `trim` function to remove these spaces and created a new column named `trimmed_name` with the trimmed strings.

Other Trim Functions

As mentioned earlier, PySpark also provides `ltrim` and `rtrim` functions to remove leading and trailing whitespace respectively.

Example of `ltrim`


from pyspark.sql.functions import ltrim

# Use ltrim function
ltrim_df = df.withColumn("ltrimmed_name", ltrim(df["name"]))

# Show DataFrame with left-trimmed column
ltrim_df.show(truncate=False)

Example of `rtrim`


from pyspark.sql.functions import rtrim

# Use rtrim function
rtrim_df = df.withColumn("rtrimmed_name", rtrim(df["name"]))

# Show DataFrame with right-trimmed column
rtrim_df.show(truncate=False)

Output of `ltrim`


+------------+-------------+
|name        |ltrimmed_name|
+------------+-------------+
|   John Doe   |John Doe   |
|   Jane Doe   |Jane Doe   |
|Alice         |Alice      |
|   Bob        |Bob        |
+------------+-------------+

Output of `rtrim`


+------------+-------------+
|name        |rtrimmed_name|
+------------+-------------+
|   John Doe   |   John Doe|
|   Jane Doe   |   Jane Doe|
|Alice         |Alice      |
|   Bob        |   Bob     |
+------------+-------------+

By using `ltrim` and `rtrim`, you can control which side of the string you want to trim the whitespace from.

With these examples, you should now be able to effectively trim string columns in a PySpark DataFrame using the `trim`, `ltrim`, and `rtrim` functions.

About Editorial Team

Our Editorial Team is made up of tech enthusiasts who are highly skilled in Apache Spark, PySpark, and Machine Learning. They are also proficient in Python, Pandas, R, Hive, PostgreSQL, Snowflake, and Databricks. They aren't just experts; they are passionate teachers. They are dedicated to making complex data concepts easy to understand through engaging and simple tutorials with examples.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top