To change a DataFrame column from String to Double in PySpark, use the `withColumn` method together with the `cast` method on a column (accessed through the `col` function from the `pyspark.sql.functions` module). This lets you transform the data type of a specific column. Below is a detailed explanation and an example to clarify the process.
Example: Changing a DataFrame Column from String to Double in PySpark
First, let’s create a sample DataFrame in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Initialize a Spark Session
spark = SparkSession.builder.appName("ChangeColumnType").getOrCreate()
# Sample data
data = [("1", "Alice", "2000.5"), ("2", "Bob", "1500.0"), ("3", "Catherine", "2500.75")]
# Create DataFrame
df = spark.createDataFrame(data, ["id", "name", "salary"])
# Show the DataFrame
df.show()
+---+---------+-------+
| id|     name| salary|
+---+---------+-------+
|  1|    Alice| 2000.5|
|  2|      Bob| 1500.0|
|  3|Catherine|2500.75|
+---+---------+-------+
In the DataFrame above, the `salary` column is of type String. To change this column to Double, combine the `withColumn` method with `cast`:
# Change 'salary' column from String to Double
df = df.withColumn("salary", col("salary").cast("double"))
# Show the modified DataFrame schema and data
df.printSchema()
df.show()
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = true)

+---+---------+-------+
| id|     name| salary|
+---+---------+-------+
|  1|    Alice| 2000.5|
|  2|      Bob| 1500.0|
|  3|Catherine|2500.75|
+---+---------+-------+
In this example, you can see that the `salary` column has been successfully changed from String to Double. The schema also reflects this change.
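One behavior worth knowing: casting a String column to Double does not raise an error on values that cannot be parsed; Spark (in its default, non-ANSI mode) returns null for those rows instead. A plain-Python sketch of that semantics (the `cast_to_double` helper below is illustrative only, not part of the PySpark API):

```python
def cast_to_double(s):
    """Mimic Spark's default string-to-double cast: None on failure, no exception."""
    if s is None:
        return None
    try:
        # Spark also tolerates surrounding whitespace when casting to numeric types
        return float(s.strip())
    except ValueError:
        return None

print([cast_to_double(s) for s in ["2000.5", " 1500.0 ", "N/A", None]])
# → [2000.5, 1500.0, None, None]
```

Because failures surface as nulls rather than errors, it is worth checking for unexpected nulls (e.g. with `df.filter(col("salary").isNull())`) after casting a column whose contents you do not fully trust.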
Steps Recap:
- Read data into a DataFrame.
- Use the `withColumn` method to transform the column.
- Apply the `cast` function to change the data type of the target column.
This approach ensures that the data type is consistently changed throughout the DataFrame, making it ready for subsequent operations that require numeric types.