How to Trim a String Column in PySpark DataFrame?
Trimming a string refers to removing leading and trailing whitespace from the string. In PySpark, the `trim` function from the `pyspark.sql.functions` module is used to trim string columns in a DataFrame. You can use the `trim`, `ltrim` (to remove left whitespace), and `rtrim` (to remove right whitespace) functions depending on your needs.
Using the `trim` Function in PySpark
Here’s how you can use the `trim` function to remove leading and trailing whitespace from a string column in a PySpark DataFrame:
Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim
# Initialize SparkSession
spark = SparkSession.builder.master("local[*]").appName("TrimExample").getOrCreate()
# Sample Data
data = [(" John Doe ",), (" Jane Doe ",), ("Alice ",), (" Bob",)]
columns = ["name"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show original DataFrame
df.show(truncate=False)
# Use trim function
trimmed_df = df.withColumn("trimmed_name", trim(df["name"]))
# Show DataFrame with trimmed column
trimmed_df.show(truncate=False)
Output
+------------+
|name |
+------------+
| John Doe |
| Jane Doe |
|Alice |
| Bob |
+------------+
+------------+------------+
|name |trimmed_name|
+------------+------------+
| John Doe |John Doe |
| Jane Doe |Jane Doe |
|Alice |Alice |
| Bob |Bob |
+------------+------------+
In the above example, we initialized a `SparkSession` and created a DataFrame with a single column containing strings with leading and trailing spaces. We then applied the `trim` function to remove these spaces and created a new column named `trimmed_name` with the trimmed strings.
Other Trim Functions
As mentioned earlier, PySpark also provides `ltrim` and `rtrim` functions to remove leading and trailing whitespace respectively.
Example of `ltrim`
from pyspark.sql.functions import ltrim
# Use ltrim function
ltrim_df = df.withColumn("ltrimmed_name", ltrim(df["name"]))
# Show DataFrame with left-trimmed column
ltrim_df.show(truncate=False)
Example of `rtrim`
from pyspark.sql.functions import rtrim
# Use rtrim function
rtrim_df = df.withColumn("rtrimmed_name", rtrim(df["name"]))
# Show DataFrame with right-trimmed column
rtrim_df.show(truncate=False)
Output of `ltrim`
+------------+-------------+
|name |ltrimmed_name|
+------------+-------------+
| John Doe |John Doe |
| Jane Doe |Jane Doe |
|Alice |Alice |
| Bob |Bob |
+------------+-------------+
Output of `rtrim`
+------------+-------------+
|name |rtrimmed_name|
+------------+-------------+
| John Doe | John Doe|
| Jane Doe | Jane Doe|
|Alice |Alice |
| Bob | Bob |
+------------+-------------+
By using `ltrim` and `rtrim`, you can control which side of the string you want to trim the whitespace from.
With these examples, you should now be able to effectively trim string columns in a PySpark DataFrame using the `trim`, `ltrim`, and `rtrim` functions.