How to Filter a DataFrame When Values Match Part of a String in PySpark?

Filtering a DataFrame based on a partial string match is a common task when working with data in PySpark. You can achieve this using the `filter` or `where` methods along with the `like` function provided by PySpark’s SQL functions module.

Contents hide

1 Example

1.1 Step-by-Step Explanation

1.2 Using .filter() with .like()

1.3 Using .where() with .like()

2 About Editorial Team

3 You Might Also Like:

Example

Let’s consider a DataFrame containing information about various products, and filter the DataFrame to include only those products whose names contain a specific substring.

Step-by-Step Explanation

First, we need to create a sample DataFrame:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.appName("FilterDataFrameExample").getOrCreate()

# Sample data
data = [
    ("Apple iPhone 12", 799),
    ("Samsung Galaxy S21", 999),
    ("Google Pixel 5", 699),
    ("Apple MacBook Pro", 1299),
    ("Microsoft Surface", 899)
]

# Create DataFrame
columns = ["Product_Name", "Price"]
df = spark.createDataFrame(data, schema=columns)

# Display the DataFrame
df.show()


+-------------------+-----+
|        Product_Name|Price|
+-------------------+-----+
|      Apple iPhone 12|  799|
|    Samsung Galaxy S21|  999|
|        Google Pixel 5|  699|
|Apple MacBook Pro| 1299|
|   Microsoft Surface|  899|
+-------------------+-----+

Next, we will filter the DataFrame to include only those products whose names contain the substring ‘Apple’.

Using .filter() with .like()

Here’s how you can perform the filtering:


# Filter DataFrame where Product_Name contains 'Apple'
filtered_df = df.filter(col("Product_Name").like("%Apple%"))

# Display the DataFrame
filtered_df.show()


+-------------------+-----+
|        Product_Name|Price|
+-------------------+-----+
|      Apple iPhone 12|  799|
|Apple MacBook Pro| 1299|
+-------------------+-----+

In the above snippet, `df.filter(col(“Product_Name”).like(“%Apple%”))` filters the DataFrame to include only the rows where `Product_Name` contains the substring ‘Apple’. The percent signs `%` are wildcards that match zero or more characters.

Using .where() with .like()

You can achieve the same result using the `where` method, as shown below:


# Filter DataFrame using .where() method
filtered_df = df.where(col("Product_Name").like("%Apple%"))

# Display the DataFrame
filtered_df.show()


+-------------------+-----+
|        Product_Name|Price|
+-------------------+-----+
|      Apple iPhone 12|  799|
|Apple MacBook Pro| 1299|
+-------------------+-----+

Both `filter` and `where` methods can be used interchangeably for filtering DataFrames in PySpark. This example demonstrates how to filter rows based on partial string matching using the `like` function.

About Editorial Team

Our Editorial Team is made up of tech enthusiasts who are highly skilled in Apache Spark, PySpark, and Machine Learning. They are also proficient in Python, Pandas, R, Hive, PostgreSQL, Snowflake, and Databricks. They aren't just experts; they are passionate teachers. They are dedicated to making complex data concepts easy to understand through engaging and simple tutorials with examples.

Example

Step-by-Step Explanation

Using .filter() with .like()

Using .where() with .like()

About Editorial Team

You Might Also Like:

Leave a Comment Cancel Reply