How to Filter a Spark DataFrame by Date?

Filtering a Spark DataFrame by date can be done in several ways depending on the date format and the API you’re using. Here, we’ll discuss how to filter a DataFrame by date in PySpark, Spark’s Python API and one of the most widely used ways to work with Spark. We’ll cover filtering based on exact date matches as well as ranges, such as filtering all records within a specific year, month, or date range.

Filtering a DataFrame by Date in PySpark

Sample DataFrame

First, let’s create a sample DataFrame in PySpark with some date data:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.appName("DateFilterExample").getOrCreate()

# Sample data
data = [
    (1, "2023-01-01"),
    (2, "2022-12-01"),
    (3, "2021-06-15"),
    (4, "2023-05-16"),
    (5, "2020-07-24"),
]

# Create DataFrame
df = spark.createDataFrame(data, ["id", "date"])

df.show()

+---+----------+
| id|      date|
+---+----------+
|  1|2023-01-01|
|  2|2022-12-01|
|  3|2021-06-15|
|  4|2023-05-16|
|  5|2020-07-24|
+---+----------+

Filtering by Exact Date

To filter the DataFrame for records that match a specific date:


# Filter by specific date
filtered_df = df.filter(col('date') == '2023-01-01')
filtered_df.show()

+---+----------+
| id|      date|
+---+----------+
|  1|2023-01-01|
+---+----------+

Filtering by Date Range

To filter the DataFrame for a range of dates:


# Filter by date range
start_date = '2022-01-01'
end_date = '2022-12-31'

filtered_df = df.filter((col('date') >= start_date) & (col('date') <= end_date))
filtered_df.show()

+---+----------+
| id|      date|
+---+----------+
|  2|2022-12-01|
+---+----------+

Filtering by Year or Month

To filter the DataFrame for a specific year or month, you can use the `year` and `month` functions from `pyspark.sql.functions`:


from pyspark.sql.functions import year, month

# Filter by specific year
filtered_df = df.filter(year(col('date')) == 2023)
filtered_df.show()

+---+----------+
| id|      date|
+---+----------+
|  1|2023-01-01|
|  4|2023-05-16|
+---+----------+

# Filter by specific month
filtered_df = df.filter(month(col('date')) == 12)
filtered_df.show()

+---+----------+
| id|      date|
+---+----------+
|  2|2022-12-01|
+---+----------+

By using these methods, you can easily filter Spark DataFrames by date, whether you’re looking for exact matches, specific ranges, or particular years or months.
