How to Pivot a String Column on PySpark DataFrame?

Pivoting a string column on a PySpark DataFrame means transforming the unique values of that column into separate columns. It is commonly used to reshape data: observations spread across multiple rows are collapsed into a single row per entity, with one column per feature. Below is an example of how to achieve this in PySpark.

Pivoting a String Column using PySpark

Let’s start with a sample DataFrame:


from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("PivotExample").getOrCreate()

# Sample Data
data = [("A", "category1", 10),
        ("A", "category2", 20),
        ("B", "category1", 30),
        ("B", "category3", 40)]

# Create DataFrame
df = spark.createDataFrame(data, ["id", "category", "value"])

# Show sample DataFrame
df.show()

+---+---------+-----+
| id| category|value|
+---+---------+-----+
|  A|category1|   10|
|  A|category2|   20|
|  B|category1|   30|
|  B|category3|   40|
+---+---------+-----+

We aim to pivot the `category` column, turning its unique values into separate columns. Here’s how to do it:


# Perform Pivot
pivot_df = df.groupBy("id").pivot("category").sum("value")

# Show Pivoted DataFrame
pivot_df.show()

+---+---------+---------+---------+
| id|category1|category2|category3|
+---+---------+---------+---------+
|  B|       30|     null|       40|
|  A|       10|       20|     null|
+---+---------+---------+---------+

Here is a step-by-step breakdown of the process:

Step 1: Create DataFrame

We start by initializing a Spark session and creating a sample DataFrame with columns “id”, “category”, and “value”. In this case, “id” is the identifier, “category” is the column we want to pivot, and “value” is the measure to aggregate.

Step 2: Perform Pivot

We use the `groupBy("id")` method to group the data by the "id" column, then call `pivot("category")` to turn the unique values of "category" into columns. Finally, `sum("value")` aggregates the values for each id/category pair.

Step 3: Display Pivoted DataFrame

Finally, we display the pivoted DataFrame, where each unique value in the “category” column becomes a new column, with aggregated values as cell contents.

To summarize, pivoting a string column on a PySpark DataFrame involves grouping by a key column, pivoting the desired column, and applying an aggregation function (like sum) to fill in the new columns. This technique is especially useful for reshaping data to suit various analytical needs.
