Filtering a Spark DataFrame by date can be done using various methods depending on the date format and the library you’re using. Here, we’ll discuss how to filter a DataFrame by date in PySpark, which is a commonly used language among Spark users. We’ll cover filtering based on exact date matches as well as ranges, such as filtering all records within a specific year, month, or date range.
Filtering a DataFrame by Date in PySpark
Sample DataFrame
First, let’s create a sample DataFrame in PySpark with some date data:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from datetime import datetime
# Initialize Spark session
spark = SparkSession.builder.appName("DateFilterExample").getOrCreate()
# Sample data
data = [
(1, "2023-01-01"),
(2, "2022-12-01"),
(3, "2021-06-15"),
(4, "2023-05-16"),
(5, "2020-07-24"),
]
# Create DataFrame
df = spark.createDataFrame(data, ["id", "date"])
df.show()
+---+----------+
| id| date|
+---+----------+
| 1|2023-01-01|
| 2|2022-12-01|
| 3|2021-06-15|
| 4|2023-05-16|
| 5|2020-07-24|
+---+----------+
Filtering by Exact Date
To filter the DataFrame for records that match a specific date:
# Filter by specific date
filtered_df = df.filter(col('date') == '2023-01-01')
filtered_df.show()
+---+----------+
| id| date|
+---+----------+
| 1|2023-01-01|
+---+----------+
Filtering by Date Range
To filter the DataFrame for a range of dates:
# Filter by date range
start_date = '2022-01-01'
end_date = '2022-12-31'
filtered_df = df.filter((col('date') >= start_date) & (col('date') <= end_date))
filtered_df.show()
+---+----------+
| id| date|
+---+----------+
| 2|2022-12-01|
+---+----------+
Filtering by Year or Month
To filter the DataFrame for a specific year or month, you can use the `year` and `month` functions from `pyspark.sql.functions`:
from pyspark.sql.functions import year, month
# Filter by specific year
filtered_df = df.filter(year(col('date')) == 2023)
filtered_df.show()
+---+----------+
| id| date|
+---+----------+
| 1|2023-01-01|
| 4|2023-05-16|
+---+----------+
# Filter by specific month
filtered_df = df.filter(month(col('date')) == 12)
filtered_df.show()
+---+----------+
| id| date|
+---+----------+
| 2|2022-12-01|
+---+----------+
By using these methods, you can easily filter Spark DataFrames by date, whether you’re looking for exact matches, specific ranges, or particular years or months.