When you need to filter data (i.e., select rows that satisfy a given condition) in Spark, you commonly use the `select` and `where` (or `filter`) operations. These operations allow you to retrieve specific columns and rows that meet your criteria. Below, we will cover examples using PySpark.
Using Select and Where/Filter in PySpark
Let’s start by creating a simple DataFrame in PySpark to demonstrate these operations.
Example DataFrame Creation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample data
data = [
(1, "Alice", 29),
(2, "Bob", 22),
(3, "Cathy", 25),
(4, "David", 22)
]
# Define schema
columns = ["id", "name", "age"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()
Output of `df.show()`:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Alice| 29|
| 2| Bob| 22|
| 3|Cathy| 25|
| 4|David| 22|
+---+-----+---+
Select Specific Columns
# Select specific columns
selected_columns_df = df.select("name", "age")
selected_columns_df.show()
Output of `selected_columns_df.show()`:
+-----+---+
| name|age|
+-----+---+
|Alice| 29|
| Bob| 22|
|Cathy| 25|
|David| 22|
+-----+---+
Using Where (or Filter) to Filter Rows
Filter rows where age is greater than 22.
# Filter rows
filtered_df = df.where(col("age") > 22)
filtered_df.show()
Output of `filtered_df.show()`:
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Alice| 29|
| 3|Cathy| 25|
+---+-----+---+
Combining Select and Where Operations
Select the ‘name’ column from rows where ‘age’ is greater than 22.
# Combine select and filter operations
selected_filtered_df = df.select("name").where(col("age") > 22)
selected_filtered_df.show()
Output of `selected_filtered_df.show()`:
+-----+
| name|
+-----+
|Alice|
|Cathy|
+-----+
Conclusion
Using the `select` and `where` (or `filter`) operations in Apache Spark allows you to manipulate and query your data efficiently. These operations are essential when you’re dealing with large datasets and need to retrieve specific columns and/or rows that meet certain criteria.