How Can I Change Column Types in Spark SQL’s DataFrame?

In Spark SQL, you can change a DataFrame column's type using the `withColumn` method combined with the `cast` function. This is handy when you need to ensure that column types are appropriate for your analysis or processing. Below are examples in both PySpark and Scala.

Changing Column Types in PySpark

Here’s an example that demonstrates how to change the type of a column from String to Integer.


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create Spark session
spark = SparkSession.builder.appName("ChangeColumnType").getOrCreate()

# Sample data
data = [("James", "34"), ("Michael", "56"), ("Robert", "12")]
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Print schema before type conversion
df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)

# Convert 'Age' column from String to Integer
df = df.withColumn("Age", col("Age").cast("int"))

# Print schema after type conversion
df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)

Displaying the data confirms the conversion:


df.show()

+-------+---+
|   Name|Age|
+-------+---+
|  James| 34|
|Michael| 56|
| Robert| 12|
+-------+---+

Changing Column Types in Scala

Below is a similar example in Scala, where we change the column type of “Age” from String to Integer.


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("ChangeColumnType").getOrCreate()

// Sample Data
val data = Seq(("James", "34"), ("Michael", "56"), ("Robert", "12"))
val columns = Seq("Name", "Age")

// Create DataFrame
import spark.implicits._
val df = data.toDF(columns: _*)

// Print schema before type conversion
df.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)

// Convert 'Age' column from String to Integer
val df2 = df.withColumn("Age", col("Age").cast("int"))

// Print schema after type conversion
df2.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)

Displaying the data confirms the conversion:


df2.show()

+-------+---+
|   Name|Age|
+-------+---+
|  James| 34|
|Michael| 56|
| Robert| 12|
+-------+---+

In both examples above, we first create a DataFrame with sample data. Before converting the column type, we print the schema to observe the initial types. We then use the `withColumn` method along with `cast` to change the type of the ‘Age’ column from String to Integer. Finally, we print the schema and display the data to confirm the change.
