How to Concatenate Columns in an Apache Spark DataFrame?

Concatenating columns in an Apache Spark DataFrame can be done using various methods depending on the programming language you are using. Here, I’ll illustrate how to concatenate columns using PySpark and Scala. These examples will show you how to combine two or more columns into a new single column.

Using PySpark

In PySpark, you can use the `concat` function available in the `pyspark.sql.functions` module. You may also need the `lit` function if you want to add a delimiter, such as a comma or a space, between the column values.

Example


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col, lit

# Create a Spark session
spark = SparkSession.builder.appName("ConcatenateColumns").getOrCreate()

# Sample DataFrame
data = [("John", "Doe"), ("Jane", "Smith"), ("Michael", "Brown")]
columns = ["First_Name", "Last_Name"]

df = spark.createDataFrame(data, columns)

# Concatenate Columns
df = df.withColumn("Full_Name", concat(col("First_Name"), lit(" "), col("Last_Name")))

df.show()

Output


+----------+---------+-------------+
|First_Name|Last_Name|    Full_Name|
+----------+---------+-------------+
|      John|      Doe|     John Doe|
|      Jane|    Smith|   Jane Smith|
|   Michael|    Brown|Michael Brown|
+----------+---------+-------------+

Using Scala

In Scala, you can concatenate columns using the `concat` function from `org.apache.spark.sql.functions`. After importing `spark.implicits._`, you can use the `$` notation to reference column names.

Example


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("ConcatenateColumns").getOrCreate()

import spark.implicits._

// Sample DataFrame
val data = Seq(("John", "Doe"), ("Jane", "Smith"), ("Michael", "Brown"))
val df = data.toDF("First_Name", "Last_Name")

// Concatenate Columns
val dfWithFullName = df.withColumn("Full_Name", concat($"First_Name", lit(" "), $"Last_Name"))

dfWithFullName.show()

Output


+----------+---------+-------------+
|First_Name|Last_Name|    Full_Name|
+----------+---------+-------------+
|      John|      Doe|     John Doe|
|      Jane|    Smith|   Jane Smith|
|   Michael|    Brown|Michael Brown|
+----------+---------+-------------+

These examples demonstrate how to concatenate columns in a DataFrame using PySpark and Scala. The approaches are quite similar, with minimal syntax differences between the two languages.

