How to Convert PySpark String to Date Format?

To convert a string to a date format in PySpark, you typically use the `to_date` or `to_timestamp` functions available in the `pyspark.sql.functions` module. Here’s how you can do it:

Method 1: Using the `to_date` function

The `to_date` function converts a string column to Spark's `DateType`; the result holds only the date, with no time information.

Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

# Initialize SparkSession
spark = SparkSession.builder.appName("String to Date Conversion").getOrCreate()

# Sample data
data = [("2023-10-01",), ("2021-05-12",), ("2019-07-25",)]
columns = ["date_string"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Convert string to date
df = df.withColumn("date", to_date(df["date_string"], "yyyy-MM-dd"))

# Show the DataFrame
df.show()

+-----------+----------+
|date_string|      date|
+-----------+----------+
| 2023-10-01|2023-10-01|
| 2021-05-12|2021-05-12|
| 2019-07-25|2019-07-25|
+-----------+----------+
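
The second argument to `to_date` is a Spark datetime pattern describing how the input strings are laid out. If your strings use a different layout, only the pattern changes. Here is a minimal sketch with made-up day/month/year values, reusing the SparkSession created above:

# Hypothetical sample data in day/month/year order
data_dmy = [("01/10/2023",), ("12/05/2021",), ("25/07/2019",)]
df_dmy = spark.createDataFrame(data_dmy, ["date_string"])

# The pattern must match the input layout, here dd/MM/yyyy
df_dmy = df_dmy.withColumn("date", to_date(df_dmy["date_string"], "dd/MM/yyyy"))
df_dmy.show()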

Method 2: Using the `to_timestamp` function

The `to_timestamp` function converts a string column to Spark's `TimestampType`, which stores both the date and the time.

Example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

# Initialize SparkSession
spark = SparkSession.builder.appName("String to Timestamp Conversion").getOrCreate()

# Sample data
data = [("2023-10-01 12:45:30",), ("2021-05-12 04:23:50",), ("2019-07-25 19:30:00",)]
columns = ["timestamp_string"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Convert string to timestamp
df = df.withColumn("timestamp", to_timestamp(df["timestamp_string"], "yyyy-MM-dd HH:mm:ss"))

# Show the DataFrame
df.show()

+-------------------+-------------------+
|   timestamp_string|          timestamp|
+-------------------+-------------------+
|2023-10-01 12:45:30|2023-10-01 12:45:30|
|2021-05-12 04:23:50|2021-05-12 04:23:50|
|2019-07-25 19:30:00|2019-07-25 19:30:00|
+-------------------+-------------------+
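
Although the displayed values look identical to the input strings, the column type has changed. You can confirm this by inspecting the schema; the comment lines show the output Spark prints for the DataFrame built above:

# Verify that the new column is a real timestamp rather than a string
df.printSchema()
# root
#  |-- timestamp_string: string (nullable = true)
#  |-- timestamp: timestamp (nullable = true)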

Both methods are useful: use `to_date` when you only need the date, and `to_timestamp` when you need both the date and the time.
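
If your strings contain a time component but you only need the calendar date, you can pass the full pattern to `to_date`, which parses the value and drops the time part. A minimal sketch, reusing the DataFrame from Method 2:

from pyspark.sql.functions import to_date

# Parse the full string but keep only the date portion
df = df.withColumn("date_only", to_date(df["timestamp_string"], "yyyy-MM-dd HH:mm:ss"))
df.select("timestamp_string", "date_only").show()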
