How to Filter a DataFrame When Values Match Part of a String in PySpark?

Filtering a DataFrame based on a partial string match is a common task when working with data in PySpark. You can achieve this using the `filter` or `where` methods together with the `like` method available on a PySpark `Column` (the SQL `LIKE` operator).

Example

Let’s consider a DataFrame containing information about various products, and filter the DataFrame to include only those products whose names contain a specific substring.

Step-by-Step Explanation

First, we need to create a sample DataFrame:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.appName("FilterDataFrameExample").getOrCreate()

# Sample data
data = [
    ("Apple iPhone 12", 799),
    ("Samsung Galaxy S21", 999),
    ("Google Pixel 5", 699),
    ("Apple MacBook Pro", 1299),
    ("Microsoft Surface", 899)
]

# Create DataFrame
columns = ["Product_Name", "Price"]
df = spark.createDataFrame(data, schema=columns)

# Display the DataFrame
df.show()

+------------------+-----+
|      Product_Name|Price|
+------------------+-----+
|   Apple iPhone 12|  799|
|Samsung Galaxy S21|  999|
|    Google Pixel 5|  699|
| Apple MacBook Pro| 1299|
| Microsoft Surface|  899|
+------------------+-----+

Next, we will filter the DataFrame to include only those products whose names contain the substring ‘Apple’.

Using .filter() with .like()

Here’s how you can perform the filtering:


# Filter DataFrame where Product_Name contains 'Apple'
filtered_df = df.filter(col("Product_Name").like("%Apple%"))

# Display the DataFrame
filtered_df.show()

+-----------------+-----+
|     Product_Name|Price|
+-----------------+-----+
|  Apple iPhone 12|  799|
|Apple MacBook Pro| 1299|
+-----------------+-----+

In the above snippet, `df.filter(col("Product_Name").like("%Apple%"))` filters the DataFrame to include only the rows where `Product_Name` contains the substring 'Apple'. The percent signs `%` are wildcards that match zero or more characters.

Using .where() with .like()

You can achieve the same result using the `where` method, as shown below:


# Filter DataFrame using .where() method
filtered_df = df.where(col("Product_Name").like("%Apple%"))

# Display the DataFrame
filtered_df.show()

+-----------------+-----+
|     Product_Name|Price|
+-----------------+-----+
|  Apple iPhone 12|  799|
|Apple MacBook Pro| 1299|
+-----------------+-----+

Both `filter` and `where` methods can be used interchangeably for filtering DataFrames in PySpark. This example demonstrates how to filter rows based on partial string matching using the `like` function.
