Pivoting a string column on a PySpark DataFrame means transforming the unique values of that column into multiple new columns. It is commonly used to reshape data in which an entity's observations are spread across multiple rows into a single row per entity, with one column per feature. Below is an example of how to achieve this in PySpark.
Pivoting a String Column using PySpark
Let’s start with a sample DataFrame:
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.appName("PivotExample").getOrCreate()
# Sample Data
data = [("A", "category1", 10),
("A", "category2", 20),
("B", "category1", 30),
("B", "category3", 40)]
# Create DataFrame
df = spark.createDataFrame(data, ["id", "category", "value"])
# Show sample DataFrame
df.show()
+---+---------+-----+
| id| category|value|
+---+---------+-----+
| A|category1| 10|
| A|category2| 20|
| B|category1| 30|
| B|category3| 40|
+---+---------+-----+
We aim to pivot the `category` column, turning its unique values into separate columns. Here’s how to do it:
# Perform Pivot
pivot_df = df.groupBy("id").pivot("category").sum("value")
# Show Pivoted DataFrame
pivot_df.show()
+---+---------+---------+---------+
| id|category1|category2|category3|
+---+---------+---------+---------+
| B| 30| null| 40|
| A| 10| 20| null|
+---+---------+---------+---------+
Here is a step-by-step breakdown of the process:
Step 1: Create DataFrame
We start by initializing a Spark session and creating a sample DataFrame with columns `id`, `category`, and `value`. In this case, `id` is the identifier, `category` is the column we want to pivot, and `value` is the measure to aggregate.
Step 2: Perform Pivot
We use the `groupBy("id")` method to group the data by the `id` column and then call the `pivot("category")` method to pivot the `category` column. Finally, we apply the `sum("value")` aggregation to sum the values for each id/category combination.
Step 3: Display Pivoted DataFrame
Finally, we display the pivoted DataFrame, where each unique value in the `category` column has become a new column, with the aggregated values as cell contents.
To summarize, pivoting a string column on a PySpark DataFrame involves grouping by a key column, pivoting the desired column, and applying an aggregation function (like sum) to fill in the new columns. This technique is especially useful for reshaping data to suit various analytical needs.
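To make the reshaping itself concrete, here is a pure-Python sketch (not PySpark) of what `groupBy().pivot().sum()` computes on the sample data, with a hypothetical helper `pivot_sum`:

```python
def pivot_sum(rows):
    """Group rows by id, turn each category into a key, and sum
    values -- the same reshaping groupBy().pivot().sum() performs."""
    result = {}
    for id_, category, value in rows:
        per_id = result.setdefault(id_, {})
        per_id[category] = per_id.get(category, 0) + value
    return result

rows = [("A", "category1", 10),
        ("A", "category2", 20),
        ("B", "category1", 30),
        ("B", "category3", 40)]

pivoted = pivot_sum(rows)
# Categories absent for an id (e.g. category3 for "A") are simply
# missing here; Spark represents them as null cells instead.
```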