Extracting the year, month, and day from a `TimestampType` column in an Apache Spark DataFrame is handled by built-in Spark SQL functions. Below are explanations and examples in PySpark, Scala, and Java.
### Using PySpark
In PySpark, the `year()`, `month()`, and `dayofmonth()` functions from `pyspark.sql.functions` extract the year, month, and day from a `TimestampType` column.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth, to_timestamp

# Initialize SparkSession
spark = SparkSession.builder.master("local").appName("ExtractDatetime").getOrCreate()

# Sample data (timestamps as strings)
data = [("2023-10-05 12:30:00",), ("2022-06-01 08:45:00",)]
columns = ["timestamp"]

# Create DataFrame and cast the string column to TimestampType
df = spark.createDataFrame(data, columns) \
    .withColumn("timestamp", to_timestamp("timestamp"))

# Extract year, month, and day
df = df.withColumn("year", year(df["timestamp"])) \
       .withColumn("month", month(df["timestamp"])) \
       .withColumn("day", dayofmonth(df["timestamp"]))

df.show()
```
### Using Scala
In Scala, we use similar functions provided by the `org.apache.spark.sql.functions` package to extract the year, month, and day.
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{year, month, dayofmonth, to_timestamp}

// Initialize SparkSession
val spark = SparkSession.builder.master("local").appName("ExtractDatetime").getOrCreate()
import spark.implicits._

// Sample data (timestamps as strings)
val data = Seq("2023-10-05 12:30:00", "2022-06-01 08:45:00")

// Create DataFrame and cast the string column to TimestampType
val df = data.toDF("timestamp")
  .withColumn("timestamp", to_timestamp($"timestamp"))

// Extract year, month, and day
val resultDF = df.withColumn("year", year($"timestamp"))
  .withColumn("month", month($"timestamp"))
  .withColumn("day", dayofmonth($"timestamp"))

resultDF.show()
```
### Using Java
In Java, the process is slightly more verbose, but the methodology remains the same. We use the `org.apache.spark.sql.functions` package.
```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;

public class ExtractDatetime {
    public static void main(String[] args) {
        // Initialize SparkSession
        SparkSession spark = SparkSession.builder().master("local").appName("ExtractDatetime").getOrCreate();

        // Sample data (timestamps as strings)
        List<Row> data = Arrays.asList(
            RowFactory.create("2023-10-05 12:30:00"),
            RowFactory.create("2022-06-01 08:45:00")
        );
        StructType schema = new StructType(new StructField[] {
            new StructField("timestamp", DataTypes.StringType, false, Metadata.empty())
        });

        // Create DataFrame and cast the string column to TimestampType
        Dataset<Row> df = spark.createDataFrame(data, schema)
            .withColumn("timestamp", to_timestamp(col("timestamp")));

        // Extract year, month, and day
        df = df.withColumn("year", year(col("timestamp")))
               .withColumn("month", month(col("timestamp")))
               .withColumn("day", dayofmonth(col("timestamp")));

        df.show();
    }
}
```
In each example, we created a DataFrame with sample timestamp data and used the Spark SQL functions `year()`, `month()`, and `dayofmonth()` to add the extracted date parts as new columns. Note that these functions also accept string columns, which Spark implicitly casts to dates, but casting explicitly with `to_timestamp()` keeps the schema unambiguous and surfaces parse failures (as nulls) at a single, predictable point.