To convert a string column to a date in PySpark, use the `to_date` or `to_timestamp` function from the `pyspark.sql.functions` module. Both take the column to convert and an optional format pattern:
Method 1: Using the `to_date` function
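The `to_date` function converts a string column to a date type, which holds a calendar date with no time-of-day component. Its second argument is the pattern that describes how the input strings are laid out.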
Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date
# Initialize SparkSession
spark = SparkSession.builder.appName("String to Date Conversion").getOrCreate()
# Sample data
data = [("2023-10-01",), ("2021-05-12",), ("2019-07-25",)]
columns = ["date_string"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Convert string to date
df = df.withColumn("date", to_date(df["date_string"], "yyyy-MM-dd"))
# Show the DataFrame
df.show()
+-----------+----------+
|date_string|      date|
+-----------+----------+
| 2023-10-01|2023-10-01|
| 2021-05-12|2021-05-12|
| 2019-07-25|2019-07-25|
+-----------+----------+
Method 2: Using the `to_timestamp` function
The `to_timestamp` function converts a string to a timestamp type, which includes both date and time information.
Example:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp
# Initialize SparkSession
spark = SparkSession.builder.appName("String to Timestamp Conversion").getOrCreate()
# Sample data
data = [("2023-10-01 12:45:30",), ("2021-05-12 04:23:50",), ("2019-07-25 19:30:00",)]
columns = ["timestamp_string"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Convert string to timestamp
df = df.withColumn("timestamp", to_timestamp(df["timestamp_string"], "yyyy-MM-dd HH:mm:ss"))
# Show the DataFrame
df.show()
+-------------------+-------------------+
|   timestamp_string|          timestamp|
+-------------------+-------------------+
|2023-10-01 12:45:30|2023-10-01 12:45:30|
|2021-05-12 04:23:50|2021-05-12 04:23:50|
|2019-07-25 19:30:00|2019-07-25 19:30:00|
+-------------------+-------------------+
Use `to_date` when you need only the calendar date, and `to_timestamp` when you also need the time of day.