Filtering a DataFrame based on a partial string match is a common task when working with data in PySpark. You can achieve this using the `filter` or `where` methods along with the `like` function provided by PySpark’s SQL functions module.
Example
Let’s consider a DataFrame containing information about various products, and filter the DataFrame to include only those products whose names contain a specific substring.
Step-by-Step Explanation
First, we need to create a sample DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize Spark session
spark = SparkSession.builder.appName("FilterDataFrameExample").getOrCreate()
# Sample data
data = [
("Apple iPhone 12", 799),
("Samsung Galaxy S21", 999),
("Google Pixel 5", 699),
("Apple MacBook Pro", 1299),
("Microsoft Surface", 899)
]
# Create DataFrame
columns = ["Product_Name", "Price"]
df = spark.createDataFrame(data, schema=columns)
# Display the DataFrame
df.show()
+-------------------+-----+
| Product_Name|Price|
+-------------------+-----+
| Apple iPhone 12| 799|
| Samsung Galaxy S21| 999|
| Google Pixel 5| 699|
|Apple MacBook Pro| 1299|
| Microsoft Surface| 899|
+-------------------+-----+
Next, we will filter the DataFrame to include only those products whose names contain the substring ‘Apple’.
Using .filter() with .like()
Here’s how you can perform the filtering:
# Filter DataFrame where Product_Name contains 'Apple'
filtered_df = df.filter(col("Product_Name").like("%Apple%"))
# Display the DataFrame
filtered_df.show()
+-------------------+-----+
| Product_Name|Price|
+-------------------+-----+
| Apple iPhone 12| 799|
|Apple MacBook Pro| 1299|
+-------------------+-----+
In the above snippet, `df.filter(col(“Product_Name”).like(“%Apple%”))` filters the DataFrame to include only the rows where `Product_Name` contains the substring ‘Apple’. The percent signs `%` are wildcards that match zero or more characters.
Using .where() with .like()
You can achieve the same result using the `where` method, as shown below:
# Filter DataFrame using .where() method
filtered_df = df.where(col("Product_Name").like("%Apple%"))
# Display the DataFrame
filtered_df.show()
+-------------------+-----+
| Product_Name|Price|
+-------------------+-----+
| Apple iPhone 12| 799|
|Apple MacBook Pro| 1299|
+-------------------+-----+
Both `filter` and `where` methods can be used interchangeably for filtering DataFrames in PySpark. This example demonstrates how to filter rows based on partial string matching using the `like` function.