How to Concatenate Columns in an Apache Spark DataFrame?

You can concatenate columns in an Apache Spark DataFrame with the built-in `concat` function, and the call looks almost the same in every language Spark supports. Here, I’ll illustrate how to concatenate columns using PySpark and Scala. These examples show how to combine two or more columns into a single new column.

Using PySpark

In PySpark, you can use the `concat` function from the `pyspark.sql.functions` module. If you want a delimiter such as a comma or a space between the column values, you also need the `lit` function, which wraps a literal value as a column so that it can be passed to `concat`.

Example


from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, col, lit

# Create a Spark session
spark = SparkSession.builder.appName("ConcatenateColumns").getOrCreate()

# Sample DataFrame
data = [("John", "Doe"), ("Jane", "Smith"), ("Michael", "Brown")]
columns = ["First_Name", "Last_Name"]

df = spark.createDataFrame(data, columns)

# Concatenate Columns
df = df.withColumn("Full_Name", concat(col("First_Name"), lit(" "), col("Last_Name")))

df.show()

Output


+----------+---------+-------------+
|First_Name|Last_Name|    Full_Name|
+----------+---------+-------------+
|      John|      Doe|     John Doe|
|      Jane|    Smith|   Jane Smith|
|   Michael|    Brown|Michael Brown|
+----------+---------+-------------+
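
One detail the example above does not exercise is missing data. Spark's `concat` returns null for a row if any of its inputs is null, while the related `concat_ws` function (which in PySpark would be called as `concat_ws(" ", col("First_Name"), col("Last_Name"))`) takes the separator as its first argument and skips null inputs. The sketch below is a plain-Python model of that row-level behaviour, not Spark code: the helper names `concat_model` and `concat_ws_model` are made up for illustration, and `None` stands in for a null column value.

```python
from typing import Optional


def concat_model(*values: Optional[str]) -> Optional[str]:
    """Hypothetical model of concat: any null (None) input makes the result null."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def concat_ws_model(sep: str, *values: Optional[str]) -> str:
    """Hypothetical model of concat_ws: null inputs are skipped, the rest joined with sep."""
    return sep.join(v for v in values if v is not None)


# Sample rows; the second one has a missing last name
rows = [("John", "Doe"), ("Jane", None)]

for first, last in rows:
    print(concat_model(first, " ", last), "|", concat_ws_model(" ", first, last))
# prints:
# John Doe | John Doe
# None | Jane
```

If your columns can contain nulls and you would rather keep the partial value than lose the whole result, `concat_ws` is usually the more convenient choice.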

Using Scala

In Scala, you can concatenate columns using the `concat` function from `org.apache.spark.sql.functions`. After importing `spark.implicits._`, you can use the `$` notation to reference columns by name.

Example


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Create a Spark session
val spark = SparkSession.builder.appName("ConcatenateColumns").getOrCreate()

// Enables the $"column" syntax and the toDF method on local collections
import spark.implicits._

// Sample DataFrame
val data = Seq(("John", "Doe"), ("Jane", "Smith"), ("Michael", "Brown"))
val df = data.toDF("First_Name", "Last_Name")

// Concatenate Columns
val dfWithFullName = df.withColumn("Full_Name", concat($"First_Name", lit(" "), $"Last_Name"))

dfWithFullName.show()

Output


+----------+---------+-------------+
|First_Name|Last_Name|    Full_Name|
+----------+---------+-------------+
|      John|      Doe|     John Doe|
|      Jane|    Smith|   Jane Smith|
|   Michael|    Brown|Michael Brown|
+----------+---------+-------------+

These examples demonstrate how to concatenate columns in a DataFrame using PySpark and Scala. The approaches are quite similar, with minimal syntax differences between the two languages.
