Working with UNIX Timestamps in Spark SQL

Apache Spark is a powerful open-source distributed computing system that provides an easy-to-use and flexible API for big data processing. One common task when working with big data is handling date and time values, which often includes working with UNIX timestamps. In this article, we will delve into how we can work with UNIX timestamps using Spark SQL and Scala.

Understanding UNIX Timestamps

UNIX timestamps are widely used in computing to track time. A UNIX timestamp is simply the number of seconds that have elapsed since the 'epoch', January 1, 1970, at 00:00:00 UTC. They are useful for representing a point in time as a single integer. Note, however, that they do not account for leap seconds, and because they are defined relative to UTC they carry no time-zone information of their own.
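
To make the definition concrete, the same arithmetic can be checked outside Spark with the standard java.time API (a minimal sketch; the specific timestamp value is only an example):

import java.time.{Instant, ZoneId}

// 1617235200 seconds after 1970-01-01T00:00:00Z
val instant = Instant.ofEpochSecond(1617235200L)
println(instant) // 2021-04-01T00:00:00Z

// The same instant rendered in a specific time zone
println(instant.atZone(ZoneId.of("America/New_York"))) // 2021-03-31T20:00-04:00[America/New_York]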

Setting Up Spark Session

Firstly, we need to set up our Spark Session, which is the entry point to programming with the Spark API:


import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("UnixTimestamps")
  .master("local[*]") // Using local mode for the purpose of this illustration
  .getOrCreate()

// Pin the session time zone to UTC so the outputs shown below are reproducible;
// by default Spark renders timestamps in the JVM's local time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")

import spark.implicits._

Working with UNIX Timestamps in Spark SQL

Converting UNIX Timestamps to Human-Readable Dates

One of the most common operations performed on UNIX timestamps is converting them to a human-readable date format. We can do this using the built-in Spark SQL function `from_unixtime`:


import org.apache.spark.sql.functions._

val data = Seq(1617235200, 1617321600, 1617408000) // Sample UNIX timestamps: midnight UTC on April 1, 2, and 3, 2021
val df = data.toDF("timestamp")

val readableDF = df.select(
  $"timestamp",
  from_unixtime($"timestamp").as("readable_date")
)

readableDF.show(false)

// Output:
// +----------+-------------------+
// |timestamp |readable_date      |
// +----------+-------------------+
// |1617266400|2021-04-01 00:00:00|
// |1617352800|2021-04-02 00:00:00|
// |1617439200|2021-04-03 00:00:00|
// +----------+-------------------+
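
The same conversion can be expressed as a literal SQL query. The sketch below assumes the DataFrame has been registered as a temporary view; the view name events is chosen purely for illustration:

// Register the DataFrame so it can be queried with SQL
df.createOrReplaceTempView("events")

spark.sql(
  """
    |SELECT timestamp,
    |       from_unixtime(timestamp) AS readable_date
    |FROM events
  """.stripMargin
).show(false)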

Working with Time Zones

Because a UNIX timestamp identifies an absolute instant relative to UTC, we often need to display it in a specific time zone, which `from_utc_timestamp` handles:


val timeZone = "America/New_York"
val timezoneDF = readableDF.select(
  $"timestamp",
  from_unixtime($"timestamp", "yyyy-MM-dd HH:mm:ss").as("utc_time"),
  from_utc_timestamp(from_unixtime($"timestamp"), timeZone).as("ny_time")
)

timezoneDF.show(false)

// Output:
// +----------+-------------------+-------------------+
// |timestamp |utc_time           |ny_time            |
// +----------+-------------------+-------------------+
// |1617266400|2021-04-01 00:00:00|2021-03-31 20:00:00|
// |1617352800|2021-04-02 00:00:00|2021-04-01 20:00:00|
// |1617439200|2021-04-03 00:00:00|2021-04-02 20:00:00|
// +----------+-------------------+-------------------+
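
For the reverse direction, Spark also provides to_utc_timestamp, which interprets a timestamp as wall-clock time in the given zone and shifts it back to UTC. A small sketch reusing the ny_time column from above:

// Interpret ny_time as New York wall-clock time and convert it back to UTC
val backToUtcDF = timezoneDF.select(
  $"ny_time",
  to_utc_timestamp($"ny_time", timeZone).as("back_to_utc")
)

backToUtcDF.show(false)

This should round-trip the New York values back to the original UTC instants.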

Converting Dates to UNIX Timestamps

We can also go the other way, converting human-readable dates to UNIX timestamps:


import org.apache.spark.sql.functions.unix_timestamp

val dateDF = Seq("2021-04-01 00:00:00", "2021-04-02 12:34:56", "2021-04-03 23:59:59").toDF("readable_date")

val timestampDF = dateDF.select(
  $"readable_date",
  unix_timestamp($"readable_date").as("timestamp")
)

timestampDF.show(false)

// Output:
// +-------------------+----------+
// |readable_date      |timestamp |
// +-------------------+----------+
// |2021-04-01 00:00:00|1617235200|
// |2021-04-02 12:34:56|1617366896|
// |2021-04-03 23:59:59|1617494399|
// +-------------------+----------+
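
If your input strings do not follow the default yyyy-MM-dd HH:mm:ss layout, unix_timestamp also accepts an explicit pattern as its second argument. A brief sketch using a hypothetical dd/MM/yyyy format:

val customDF = Seq("01/04/2021 00:00:00", "02/04/2021 12:34:56").toDF("readable_date")

val customTimestampDF = customDF.select(
  $"readable_date",
  unix_timestamp($"readable_date", "dd/MM/yyyy HH:mm:ss").as("timestamp")
)

customTimestampDF.show(false)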

Complex Date and Time Operations

Adding and Subtracting Time

Because a UNIX timestamp is just a count of seconds, you can add or subtract time with plain arithmetic; for example, 3600 * 24 = 86,400 seconds shifts the value by one day:


val offsetDF = df.select(
  $"timestamp",
  ($"timestamp" + 3600 * 24).as("timestamp_plus_one_day"),  // Adding one day (86,400 seconds)
  ($"timestamp" - 3600 * 24).as("timestamp_minus_one_day")  // Subtracting one day
)

offsetDF.show(false)

// Output:
// +----------+----------------------+-----------------------+
// |timestamp |timestamp_plus_one_day|timestamp_minus_one_day|
// +----------+----------------------+-----------------------+
// |1617235200|1617321600            |1617148800             |
// |1617321600|1617408000            |1617235200             |
// |1617408000|1617494400            |1617321600             |
// +----------+----------------------+-----------------------+
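
If you would rather work with a proper timestamp column than with raw seconds, one option (a sketch, not the only approach) is to cast the numeric column to timestamp, which Spark interprets as seconds since the epoch, and then use interval arithmetic:

// Casting a numeric column to timestamp treats the value as seconds since the epoch
val tsDF = df.withColumn("ts", $"timestamp".cast("timestamp"))

val intervalDF = tsDF.select(
  $"ts",
  expr("ts + INTERVAL 1 DAY").as("ts_plus_one_day"),
  expr("ts - INTERVAL 1 DAY").as("ts_minus_one_day")
)

intervalDF.show(false)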

Extracting Date Parts

You might want to extract specific parts of a date, such as the year, month, or day, from a UNIX timestamp. Spark provides extraction functions such as year, month, dayofmonth, and hour for this:


val datePartsDF = readableDF.select(
  $"readable_date",
  year(from_unixtime($"timestamp")).as("year"),
  month(from_unixtime($"timestamp")).as("month"),
  dayofmonth(from_unixtime($"timestamp")).as("day"),
  hour(from_unixtime($"timestamp")).as("hour")
)

datePartsDF.show(false)

// Output:
// +-------------------+----+-----+---+----+
// |readable_date      |year|month|day|hour|
// +-------------------+----+-----+---+----+
// |2021-04-01 00:00:00|2021|4    |1  |0   |
// |2021-04-02 00:00:00|2021|4    |2  |0   |
// |2021-04-03 00:00:00|2021|4    |3  |0   |
// +-------------------+----+-----+---+----+
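
Spark ships with further extraction helpers in the same family, such as dayofweek, dayofyear, quarter, and weekofyear; a short sketch:

val moreDatePartsDF = readableDF.select(
  $"readable_date",
  dayofweek(from_unixtime($"timestamp")).as("day_of_week"),   // 1 = Sunday, ..., 7 = Saturday
  dayofyear(from_unixtime($"timestamp")).as("day_of_year"),
  quarter(from_unixtime($"timestamp")).as("quarter"),
  weekofyear(from_unixtime($"timestamp")).as("week_of_year")
)

moreDatePartsDF.show(false)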

Conclusion

UNIX timestamps are a fundamental aspect of working with time in computers and databases, and Apache Spark SQL offers robust functions to handle them. By understanding how to convert timestamps to human-readable forms, adjust for time zones, perform date arithmetic, and extract date parts, you can deftly manipulate time-based information in your big data workloads. The straightforward syntax of Spark SQL and Scala, as demonstrated, makes Spark an invaluable tool for these kinds of temporal data operations.

Remember to always consider the context of your data, including factors like time zones, daylight saving changes, and leap seconds, which may affect how you interpret UNIX timestamps in your application. With Spark SQL, you are equipped with a versatile set of tools to ensure accurate and efficient time-related data processing. Happy data wrangling!
