In PySpark, the `explode` function transforms a column containing arrays or maps into multiple rows: each array element, or each map key-value pair, becomes its own row. When you need to explode multiple columns, you apply the `explode` operation to each column in turn. The walkthrough below shows how to do this on a DataFrame with two array columns.
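Although the walkthrough focuses on array columns, the map case works analogously: `explode` on a `MapType` column yields one row per entry, with `key` and `value` columns. A minimal sketch (the `props` column name is invented for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.appName("MapExplodeSketch").getOrCreate()
# 'props' is a hypothetical map column
map_df = spark.createDataFrame([(1, {"x": 10, "y": 20})], ["id", "props"])
# Exploding a map produces 'key' and 'value' columns, one row per entry
map_df.select("id", explode("props")).show()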
Exploding Multiple Columns in a PySpark DataFrame
Let’s start by creating a sample PySpark DataFrame for our example. It has an `id` column plus two array columns, `letters` and `numbers`, and we want to explode both array columns.
Step 1: Creating the Sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
# Create Spark session
spark = SparkSession.builder.appName("ExplodeExample").getOrCreate()
# Sample data
data = [
(1, ["a", "b", "c"], [1, 2, 3]),
(2, ["d", "e"], [4, 5]),
(3, ["f"], [6])
]
# Creating DataFrame
df = spark.createDataFrame(data, ["id", "letters", "numbers"])
df.show(truncate=False)
+---+---------+---------+
|id |letters |numbers |
+---+---------+---------+
|1 |[a, b, c]|[1, 2, 3]|
|2 |[d, e] |[4, 5] |
|3 |[f] |[6] |
+---+---------+---------+
Step 2: Exploding the Columns Sequentially
In this step, we apply the `explode` function to both columns. Spark allows only one generator function (such as `explode`) per projection, so the two columns must be exploded in separate transformations; attempting both in a single `select` raises an error.
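To see why, here is a sketch of the single-`select` attempt that Spark rejects (the exact error message varies by Spark version); it reuses `df` from Step 1:
from pyspark.sql.functions import explode
from pyspark.sql.utils import AnalysisException
try:
    # Two generators in one projection are not allowed
    df.select("id", explode("letters"), explode("numbers"))
except AnalysisException as e:
    print(e)  # e.g. "Only one generator allowed per select clause"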
# Exploding the 'letters' column
df_exploded_letters = df.withColumn("letter", explode("letters"))
df_exploded_letters.show(truncate=False)
+---+---------+---------+------+
|id |letters |numbers |letter|
+---+---------+---------+------+
|1 |[a, b, c]|[1, 2, 3]|a |
|1 |[a, b, c]|[1, 2, 3]|b |
|1 |[a, b, c]|[1, 2, 3]|c |
|2 |[d, e] |[4, 5] |d |
|2 |[d, e] |[4, 5] |e |
|3 |[f] |[6] |f |
+---+---------+---------+------+
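A caveat before moving on: `explode` silently drops rows whose array is null or empty. If such rows must survive (with a null in the generated column), `explode_outer` is the variant to use. A brief sketch with hypothetical data, reusing the `spark` session from Step 1:
from pyspark.sql.functions import explode_outer
# Hypothetical data: id=2 has an empty array
df_empty = spark.createDataFrame([(1, ["a"]), (2, [])], ["id", "letters"])
# explode would drop id=2 entirely; explode_outer keeps it with letter = null
df_empty.withColumn("letter", explode_outer("letters")).show()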
# Exploding the 'numbers' column
df_exploded_both = df_exploded_letters.withColumn("number", explode("numbers"))
df_exploded_both.show(truncate=False)
+---+---------+---------+------+------+
|id |letters |numbers |letter|number|
+---+---------+---------+------+------+
|1 |[a, b, c]|[1, 2, 3]|a |1 |
|1 |[a, b, c]|[1, 2, 3]|a |2 |
|1 |[a, b, c]|[1, 2, 3]|a |3 |
|1 |[a, b, c]|[1, 2, 3]|b |1 |
|1 |[a, b, c]|[1, 2, 3]|b |2 |
|1 |[a, b, c]|[1, 2, 3]|b |3 |
|1 |[a, b, c]|[1, 2, 3]|c |1 |
|1 |[a, b, c]|[1, 2, 3]|c |2 |
|1 |[a, b, c]|[1, 2, 3]|c |3 |
|2 |[d, e] |[4, 5] |d |4 |
|2 |[d, e] |[4, 5] |d |5 |
|2 |[d, e] |[4, 5] |e |4 |
|2 |[d, e] |[4, 5] |e |5 |
|3 |[f] |[6] |f |6 |
+---+---------+---------+------+------+
As you can see, the resulting DataFrame `df_exploded_both` is created by exploding both the `letters` and `numbers` columns. For each `id`, the number of output rows is the product of the lengths of its two arrays: sequential explodes pair every letter with every number, i.e. a per-row Cartesian product.
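If this cross product is not what you want and the arrays should instead be paired positionally (`a` with `1`, `b` with `2`, and so on), one common alternative is to zip the arrays and explode once. A sketch, assuming a Spark version with `arrays_zip` (2.4+), where the struct field names follow the input column names:
from pyspark.sql.functions import arrays_zip, col, explode
# Zip the two arrays element by element, then explode the structs once
df_zipped = (df
    .withColumn("zipped", explode(arrays_zip("letters", "numbers")))
    .select("id",
            col("zipped.letters").alias("letter"),
            col("zipped.numbers").alias("number")))
df_zipped.show()  # one row per position: (1, a, 1), (1, b, 2), ...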
Conclusion
Exploding multiple columns in a PySpark DataFrame comes down to applying the `explode` function to each column in sequence, one generator per transformation. By following the steps above, you can transform array or map columns into multiple rows in your DataFrame.
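For completeness, the two steps can also be chained into a single expression, dropping the source array columns at the end; this is just the walkthrough above written compactly, reusing `df` and the `explode` import from Step 1:
# Chain both explodes and drop the original array columns
df_result = (df
    .withColumn("letter", explode("letters"))
    .withColumn("number", explode("numbers"))
    .drop("letters", "numbers"))
df_result.show()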