How Do You Change a DataFrame Column from String to Double in PySpark?

To change a DataFrame column from String to Double in PySpark, you can use the `withColumn` method together with the `cast` method on a column (referenced via the `col` function from the `pyspark.sql.functions` module). This lets you transform the data type of a specific column. Below is a detailed explanation and an example to clarify the process.

Example: Changing a DataFrame Column from String to Double in PySpark

First, let’s create a sample DataFrame in PySpark:


```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize a Spark Session
spark = SparkSession.builder.appName("ChangeColumnType").getOrCreate()

# Sample data
data = [("1", "Alice", "2000.5"), ("2", "Bob", "1500.0"), ("3", "Catherine", "2500.75")]

# Create DataFrame
df = spark.createDataFrame(data, ["id", "name", "salary"])

# Show the DataFrame
df.show()
```

```
+---+---------+-------+
| id|     name| salary|
+---+---------+-------+
|  1|    Alice| 2000.5|
|  2|      Bob| 1500.0|
|  3|Catherine|2500.75|
+---+---------+-------+
```

In the DataFrame above, the `salary` column is of type String. To change it to Double, combine `withColumn` with the column's `cast` method:


```python
# Change 'salary' column from String to Double
df = df.withColumn("salary", col("salary").cast("double"))

# Show the modified DataFrame schema and data
df.printSchema()
df.show()
```

```
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = true)

+---+---------+-------+
| id|     name| salary|
+---+---------+-------+
|  1|    Alice| 2000.5|
|  2|      Bob| 1500.0|
|  3|Catherine|2500.75|
+---+---------+-------+
```

In this example, you can see that the `salary` column has been successfully changed from String to Double. The schema also reflects this change.
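
Note that `cast` does not throw an error when a string cannot be parsed as a number; the offending value simply becomes null. The snippet below is a minimal sketch illustrating this behavior (the `bad_df` DataFrame is a hypothetical example, reusing the `spark` session from above):

```python
# A DataFrame containing a value that cannot be parsed as a number
bad_df = spark.createDataFrame([("1", "1000.0"), ("2", "n/a")], ["id", "salary"])

# "n/a" cannot be parsed as a double, so the cast yields null for that row
bad_df = bad_df.withColumn("salary", col("salary").cast("double"))
bad_df.show()  # the salary for id 2 is null
```

If nulls after casting would indicate bad data, it is worth checking for them (for example, with a `filter` on `isNull`) before moving on.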

Steps Recap:

  1. Create or load the data as a DataFrame.
  2. Use the `withColumn` method to replace the target column.
  3. Call `cast` on the column to change its data type.
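
Step 3 can be expressed in a few equivalent ways. As a sketch, assuming the same `df` as above, you can pass a `DoubleType` instance instead of the string `"double"`, or write the cast in SQL syntax with `selectExpr`:

```python
from pyspark.sql.types import DoubleType

# Equivalent: pass a DataType instance instead of the string "double"
df = df.withColumn("salary", df["salary"].cast(DoubleType()))

# Equivalent: express the cast as a SQL expression
df = df.selectExpr("id", "name", "CAST(salary AS double) AS salary")
```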

This approach ensures that the data type is consistently changed throughout the DataFrame, making it ready for subsequent operations that require numeric types.
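
For instance, once `salary` is a Double, aggregations return proper numeric results. A minimal sketch using the `df` from above:

```python
from pyspark.sql.functions import avg

# Compute the average salary as a numeric mean
df.select(avg("salary").alias("avg_salary")).show()
# For the sample data: (2000.5 + 1500.0 + 2500.75) / 3 ≈ 2000.42
```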
