How to Transpose Column to Row in Apache Spark?

Transposing columns to rows, or vice versa, is a common data manipulation task. While Apache Spark doesn’t have built-in functions specifically for transposing data, you can achieve this through a combination of existing functions. Let’s look at how to achieve this using PySpark.

Example: Transpose Column to Row in PySpark

Suppose you have the following DataFrame:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

# Initialize Spark session
spark = SparkSession.builder.appName("TransposeExample").getOrCreate()

# Sample data
data = [
    ("Alice", "Math", 85),
    ("Alice", "English", 78),
    ("Bob", "Math", 56),
    ("Bob", "English", 92)
]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Subject", "Score"])

# Show original DataFrame
df.show()

+-----+-------+-----+
| Name|Subject|Score|
+-----+-------+-----+
|Alice|   Math|   85|
|Alice|English|   78|
|  Bob|   Math|   56|
|  Bob|English|   92|
+-----+-------+-----+

Step-by-step Transformation

We want to transpose the data so that subjects become columns and the scores appear under them for each student:

1. Collect distinct subjects


subjects = [row["Subject"] for row in df.select("Subject").distinct().collect()]

print(subjects)

['Math', 'English']

Note that `distinct()` does not guarantee an order, so the list may also come back as `['English', 'Math']`.

2. Generate pivoted DataFrame

Use `groupBy` along with the `pivot` function:


# Pivot the Subject values into columns; max() simply picks the single
# score present for each (Name, Subject) pair
pivoted_df = df.groupBy("Name").pivot("Subject").max("Score")

# Show the transposed DataFrame
pivoted_df.show()

+-----+-------+-------+
| Name|English|   Math|
+-----+-------+-------+
|  Bob|     92|     56|
|Alice|     78|     85|
+-----+-------+-------+

You now have the subjects as columns and the scores as values for each student.

Conclusion

Although Spark doesn’t have direct functions for transposing data, you can achieve the same results using `groupBy`, `pivot`, and aggregate functions. This example in PySpark demonstrates how you can transform rows into columns effectively using these techniques. The same approach can be adapted for Scala, Java, or other programming languages supported by Spark.

