When working with date and time data in Apache Spark, there are often scenarios where you might need to truncate this data to a coarser granularity. For instance, you may want to analyze data at the level of years, quarters, or days, without considering the more precise time information like hours, minutes, or seconds. This is where Spark’s date functions come into play, allowing you to truncate dates and times accordingly. In this comprehensive guide, we will cover everything you need to know about truncating dates and times in Apache Spark using Scala.
Understanding Spark Date and Timestamp Types
Apache Spark supports two primary types for dealing with dates and times:
- DateType: Represents a date without time information (e.g., 2021-12-31).
- TimestampType: Represents a date with time information, including seconds and even microsecond precision (e.g., 2021-12-31 23:59:59.123456).
It’s important to distinguish between these types because the functions you’ll work with may behave differently depending on which type of data you’re manipulating.
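To make the distinction concrete, here is a minimal sketch (assuming a SparkSession is already in scope as spark, as in spark-shell) that parses the same string into both types:
import org.apache.spark.sql.functions.{to_date, to_timestamp}
import spark.implicits._

// Parse the same string into a DateType and a TimestampType column
val typed = Seq("2021-12-31 23:59:59").toDF("raw")
  .withColumn("as_date", to_date($"raw"))           // DateType: 2021-12-31
  .withColumn("as_timestamp", to_timestamp($"raw")) // TimestampType: 2021-12-31 23:59:59

typed.printSchema()
Calling printSchema() should report as_date as date and as_timestamp as timestamp, which is exactly the distinction that matters for the truncation functions below.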
Using DataFrame API for Date Truncation
Spark SQL provides a DataFrame API that can be used to manipulate structured data, including date and time fields. The most common functions for date truncation in this API are trunc() and date_trunc().
The trunc() Function
The trunc() function truncates a date to the specified unit of time (e.g., year, month). It takes two parameters: the date or timestamp column and a string specifying the level of truncation desired.
import org.apache.spark.sql.functions.trunc
import spark.implicits._

val df = Seq("2021-12-15 10:15:00", "2022-01-01 11:00:00").toDF("datetime")

// trunc(column, format): the column comes first, then the truncation unit
df.select($"datetime", trunc($"datetime", "year"), trunc($"datetime", "month")).show(false)
Here’s the output you’d expect:
+-------------------+--------------------+----------------------+
|datetime |trunc(datetime,year)|trunc(datetime,month) |
+-------------------+--------------------+----------------------+
|2021-12-15 10:15:00|2021-01-01 |2021-12-01 |
|2022-01-01 11:00:00|2022-01-01 |2022-01-01 |
+-------------------+--------------------+----------------------+
As shown, the trunc() function resets the date to the first day of the specified period (the beginning of the year or month in this case).
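If you want to confirm the result type, a quick check like this (a sketch reusing the df defined above; exact nullability may vary) shows that trunc() produces DateType columns:
df.select(
  trunc($"datetime", "year").as("year_trunc"),
  trunc($"datetime", "month").as("month_trunc")
).printSchema()
// root
//  |-- year_trunc: date (nullable = true)
//  |-- month_trunc: date (nullable = true)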
The date_trunc() Function
On the other hand, date_trunc() provides similar functionality but extends it to timestamps, allowing you to truncate to various levels of time precision such as second, minute, hour, day, week, month, quarter, and year.
import org.apache.spark.sql.functions.date_trunc

// date_trunc(format, column): note the arguments are in the opposite order from trunc()
df.select($"datetime",
  date_trunc("year", $"datetime").as("year_trunc"),
  date_trunc("month", $"datetime").as("month_trunc")).show(false)
Here’s the output you’d expect:
+-------------------+-------------------+-------------------+
|datetime |year_trunc |month_trunc |
+-------------------+-------------------+-------------------+
|2021-12-15 10:15:00|2021-01-01 00:00:00|2021-12-01 00:00:00|
|2022-01-01 11:00:00|2022-01-01 00:00:00|2022-01-01 00:00:00|
+-------------------+-------------------+-------------------+
Notice that unlike the trunc() function, which returns only the date component, date_trunc() returns a full timestamp with the time portion zeroed out below the chosen level of truncation.
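To see the finer-grained units in action, a sketch along these lines (reusing the same df; the exact set of supported unit names can vary slightly across Spark versions) truncates the same timestamps to the hour, day, and quarter:
df.select($"datetime",
  date_trunc("hour", $"datetime").as("hour_trunc"),
  date_trunc("day", $"datetime").as("day_trunc"),
  date_trunc("quarter", $"datetime").as("quarter_trunc")).show(false)

// For 2021-12-15 10:15:00 this yields 2021-12-15 10:00:00 (hour),
// 2021-12-15 00:00:00 (day), and 2021-10-01 00:00:00 (quarter).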
Using Spark SQL Expressions
These date functions can also be used within Spark SQL expressions if you prefer to work with SQL syntax over the DataFrame API.
Truncating Dates in Spark SQL
The TRUNC() function can be used in Spark SQL to achieve date truncation:
df.createOrReplaceTempView("dates")
val sqlResult = spark.sql("SELECT datetime, TRUNC(datetime, 'YEAR') AS year_trunc, TRUNC(datetime, 'MONTH') AS month_trunc FROM dates")
sqlResult.show(false)
Output:
+-------------------+-------------------+-------------------+
|datetime |year_trunc |month_trunc |
+-------------------+-------------------+-------------------+
|2021-12-15 10:15:00|2021-01-01 |2021-12-01 |
|2022-01-01 11:00:00|2022-01-01 |2022-01-01 |
+-------------------+-------------------+-------------------+
Here, we’re doing essentially the same thing we did with the DataFrame API, but using SQL syntax. We define a temporary view and then use a SELECT statement to return the truncated dates.
Truncating Timestamps in Spark SQL
To truncate timestamps with Spark SQL, you use the DATE_TRUNC() function:
val sqlTimestampResult = spark.sql("SELECT datetime, DATE_TRUNC('YEAR', datetime) AS year_trunc, DATE_TRUNC('MONTH', datetime) AS month_trunc FROM dates")
sqlTimestampResult.show(false)
Output:
+-------------------+-------------------+-------------------+
|datetime |year_trunc |month_trunc |
+-------------------+-------------------+-------------------+
|2021-12-15 10:15:00|2021-01-01 00:00:00|2021-12-01 00:00:00|
|2022-01-01 11:00:00|2022-01-01 00:00:00|2022-01-01 00:00:00|
+-------------------+-------------------+-------------------+
The DATE_TRUNC() function in Spark SQL behaves the same as date_trunc() in the DataFrame API; here it is used within the SQL context to truncate the datetime field.
Considerations and Best Practices
When truncating dates and times in Spark, there are a few considerations and best practices to keep in mind:
- Time Zone Awareness: By default, Apache Spark uses the JVM’s time zone settings (via the spark.sql.session.timeZone configuration). Be aware of this when working with timestamps, and consider setting the time zone explicitly if needed to avoid unexpected behavior.
- Data Types: Ensure that the column you’re working with is of the correct data type (DateType or TimestampType). If necessary, use the to_date() or to_timestamp() functions to convert it before applying truncation functions.
- Performance: For large datasets, consider whether you need the truncated dates as new columns in your DataFrame, or if you can simply group by the truncated value for aggregation. Adding unnecessary columns can increase memory usage and processing time (see the sketch after this list).
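The following sketch pulls these points together. It assumes the df defined earlier and a SparkSession named spark; the UTC time zone and monthly grouping are illustrative choices, not requirements:
// Pin the session time zone so timestamp truncation is not affected by the JVM default
spark.conf.set("spark.sql.session.timeZone", "UTC")

import org.apache.spark.sql.functions.{col, date_trunc, to_timestamp}

// Make sure the column is a TimestampType before truncating
val events = df.withColumn("ts", to_timestamp(col("datetime")))

// Group directly on the truncated value instead of materializing an extra column
events
  .groupBy(date_trunc("month", col("ts")).as("month"))
  .count()
  .show(false)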
In summary, Spark provides powerful functions for working with date and time data, including truncating dates to various levels of granularity. By understanding and leveraging the trunc() and date_trunc() functions, as well as their SQL equivalents, you can perform a wide range of date and time manipulations efficiently in your data processing workflows. Whether you’re working with the DataFrame API or Spark SQL, these functions are essential tools for any data engineer or analyst working with temporal data in Apache Spark.