PySpark Convert Timestamp to Date

Working with date and time data is a common task in data analysis and processing. Apache Spark, through its PySpark interface, provides comprehensive support for dealing with temporal data types. One such operation that often arises is the need to convert timestamp data to date type. This conversion can be crucial when you want to truncate time information and focus solely on the date. In this guide, we will explore various methods to convert a timestamp to a date in PySpark.

Understanding Timestamp and Date Types in PySpark

Before diving into the conversion process, it’s important to understand the distinction between timestamp and date types in PySpark. A timestamp type includes both date and time information down to the microsecond, while the date type contains only year, month, and day components.

In PySpark, timestamp and date are represented as ‘TimestampType’ and ‘DateType’, respectively. When working with these types, Spark provides built-in functions to facilitate the conversion and manipulation of temporal data.
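
As a quick sanity check, you can inspect these type objects directly; every Spark data type exposes a `simpleString()` method that returns the name it carries inside a schema:

from pyspark.sql.types import TimestampType, DateType

print(TimestampType().simpleString())  # timestamp
print(DateType().simpleString())       # date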

Setting Up the Spark Session

Before running any PySpark code, we must first establish a Spark session. If you haven’t already installed PySpark, you can do so using pip:

pip install pyspark

Then, start your Spark session with the following Python code:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("TimestampToDate") \
    .getOrCreate()

Creating a DataFrame with Timestamps

For demonstration purposes, let’s create a DataFrame with a column containing timestamps:

from datetime import datetime
from pyspark.sql.types import StructType, StructField, TimestampType

# Define schema for our DataFrame
schema = StructType([
    StructField("datetime", TimestampType())
])

# Create a DataFrame with timestamps; TimestampType expects Python
# datetime objects rather than strings
timestamps = [
    (datetime(2021, 3, 20, 15, 30, 0),),
    (datetime(2022, 7, 15, 20, 45, 30),),
]
df = spark.createDataFrame(timestamps, schema)

df.show(truncate=False)

The output will display the DataFrame containing a ‘datetime’ column with timestamp values:

+-------------------+
|datetime           |
+-------------------+
|2021-03-20 15:30:00|
|2022-07-15 20:45:30|
+-------------------+
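
A quick call to `printSchema` confirms that the column really is a timestamp rather than a string:

df.printSchema()
# root
#  |-- datetime: timestamp (nullable = true)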

Converting Timestamp to Date

Now let’s explore different methods for converting the timestamp column to a date type column.

Using `to_date` Function

PySpark provides a built-in function `to_date` that can be used to convert a timestamp column to date:

from pyspark.sql.functions import to_date

# Convert 'datetime' to a Date type column
df_with_date = df.withColumn("date", to_date(df.datetime))

df_with_date.show(truncate=False)

The resulting DataFrame will have an additional ‘date’ column of type DateType:

+-------------------+----------+
|datetime           |date      |
+-------------------+----------+
|2021-03-20 15:30:00|2021-03-20|
|2022-07-15 20:45:30|2022-07-15|
+-------------------+----------+
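
One practical benefit of a proper DateType column is that you can compare it against date literals directly; Spark casts the string for you. A small sketch:

# Filter rows that fall on a specific date (the string literal is cast to a date)
df_with_date.filter(df_with_date.date == "2021-03-20").show(truncate=False)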

Using `date_format` Function

Another way to extract the date component from a timestamp is by using the `date_format` function. It allows us to specify the format of the output string:

from pyspark.sql.functions import date_format

# Format 'datetime' as a string with only the date
df_with_date_str = df.withColumn("date_str", date_format(df.datetime, "yyyy-MM-dd"))

df_with_date_str.show(truncate=False)

This outputs:

+-------------------+----------+
|datetime           |date_str  |
+-------------------+----------+
|2021-03-20 15:30:00|2021-03-20|
|2022-07-15 20:45:30|2022-07-15|
+-------------------+----------+
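
Note that the new column holds strings, not dates, as a quick look at the column types confirms:

print(df_with_date_str.dtypes)
# [('datetime', 'timestamp'), ('date_str', 'string')]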

Converting Formatted String Back to Date

Since `date_format` returns a string, we may want to convert it back to a DateType column:

df_with_date_str = df_with_date_str.withColumn("date", to_date(df_with_date_str.date_str))
df_with_date_str.show(truncate=False)

The DataFrame now includes a properly typed ‘date’ column:

+-------------------+----------+----------+
|datetime           |date_str  |date      |
+-------------------+----------+----------+
|2021-03-20 15:30:00|2021-03-20|2021-03-20|
|2022-07-15 20:45:30|2022-07-15|2022-07-15|
+-------------------+----------+----------+

Using SQL Expressions

We can also perform the conversion with a Spark SQL query. First, we’ll need to register the DataFrame as a temporary view:

df.createOrReplaceTempView("timestamps_table")

sql_query = "SELECT datetime, to_date(datetime) as date FROM timestamps_table"
df_with_date_sql = spark.sql(sql_query)

df_with_date_sql.show(truncate=False)

The SQL approach yields the same result:

+-------------------+----------+
|datetime           |date      |
+-------------------+----------+
|2021-03-20 15:30:00|2021-03-20|
|2022-07-15 20:45:30|2022-07-15|
+-------------------+----------+
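
If you prefer SQL syntax but don’t want to register a view, `selectExpr` accepts the same expression inline:

# Equivalent one-liner using a SQL expression on the DataFrame itself
df.selectExpr("datetime", "to_date(datetime) AS date").show(truncate=False)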

Handling Different Timestamp Formats

If your DataFrame’s timestamp column comes in a different format than the standard one recognized by PySpark, you might need to parse it using the `unix_timestamp` function along with `to_date`:

from pyspark.sql.functions import unix_timestamp
from pyspark.sql.types import StringType

# The raw values arrive as strings, so the column must be StringType
string_schema = StructType([
    StructField("datetime", StringType())
])

# Let's assume we have timestamps in a non-standard format
non_standard_timestamps = [("20/03/2021 15:30:00",), ("15/07/2022 20:45:30",)]
df_non_standard = spark.createDataFrame(non_standard_timestamps, string_schema)

# Parse the non-standard format to Unix time, cast it to a timestamp,
# and truncate to a date
df_non_standard_date = df_non_standard.withColumn(
    "date",
    to_date(unix_timestamp(df_non_standard.datetime, "dd/MM/yyyy HH:mm:ss").cast("timestamp"))
)

df_non_standard_date.show(truncate=False)

The output will be as expected:

+-------------------+----------+
|datetime           |date      |
+-------------------+----------+
|20/03/2021 15:30:00|2021-03-20|
|15/07/2022 20:45:30|2022-07-15|
+-------------------+----------+
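
On Spark 2.2 and later, `to_timestamp` (and `to_date` itself, which also accepts a format argument) offers a more direct route than the `unix_timestamp` round trip. A minimal sketch, assuming Spark 2.2+:

from pyspark.sql.functions import to_timestamp

# Parse the custom pattern straight to a timestamp, then truncate to a date
df_alt = df_non_standard.withColumn(
    "date",
    to_date(to_timestamp(df_non_standard.datetime, "dd/MM/yyyy HH:mm:ss"))
)
df_alt.show(truncate=False)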

Conclusion

Converting timestamps to dates is an essential task in PySpark. We covered several methods to achieve this conversion, from using built-in functions like `to_date` and `date_format` to employing SQL expressions. It’s important to select the method that best fits the context of your data and the specific needs of your PySpark application. As we have seen, PySpark provides the flexibility to handle standard and non-standard formats, making it a powerful tool for temporal data manipulation.

By carefully managing and understanding the conversion between timestamp and date types, you can ensure that your data pipelines are accurate, efficient, and ready for further analysis or reporting.
