How to Perform a Cumulative Sum by Group in a PySpark DataFrame?

To perform a cumulative sum by group in a PySpark DataFrame, we can use a `Window` specification together with the `sum()` aggregate function (the PySpark DataFrame API has no built-in `cumsum()`). This lets us partition the data by a grouping column and compute the running total within each group. Below is an example demonstrating how this is done in PySpark.

Step-by-Step Process

1. Setup and Sample Data

First, let’s import the necessary libraries and create a sample DataFrame.


from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Initialize Spark session
spark = SparkSession.builder.appName("Cumulative Sum by Group").getOrCreate()

# Create sample data
data = [
    ("A", 1),
    ("A", 2),
    ("A", 3),
    ("B", 2),
    ("B", 3),
    ("B", 4)
]

# Create DataFrame
df = spark.createDataFrame(data, ["Group", "Value"])

df.show()

+-----+-----+
|Group|Value|
+-----+-----+
|    A|    1|
|    A|    2|
|    A|    3|
|    B|    2|
|    B|    3|
|    B|    4|
+-----+-----+

2. Define the Window Spec

Next, we define a window specification that partitions the DataFrame by the "Group" column and orders the rows within each group. An ordering column is required, since a running total is only meaningful over an ordered sequence; here we order by "Value".


# Define the window spec: partition by group, order by value, and frame
# the window from the start of the partition up to the current row
window_spec = (
    Window.partitionBy("Group")
    .orderBy("Value")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
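
A note on the frame clause: if you omit `rowsBetween` and specify only `orderBy`, Spark defaults to a range-based frame, in which rows with equal ordering values are treated as peers and receive the same running total. The sketch below, using a hypothetical group "C" with duplicate values, illustrates the difference.

# Hypothetical data with duplicate values to show the frame difference
dup_df = spark.createDataFrame([("C", 5), ("C", 5)], ["Group", "Value"])

# ROWS frame: each physical row advances the running total -> 5, 10
rows_spec = (
    Window.partitionBy("Group")
    .orderBy("Value")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

# RANGE frame (Spark's default when only orderBy is given): tied rows
# are peers and share one sum -> 10, 10
range_spec = (
    Window.partitionBy("Group")
    .orderBy("Value")
    .rangeBetween(Window.unboundedPreceding, Window.currentRow)
)

dup_df.select(
    "Group",
    "Value",
    F.sum("Value").over(rows_spec).alias("rows_cumsum"),
    F.sum("Value").over(range_spec).alias("range_cumsum"),
).show()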

3. Apply Cumulative Sum Function

We then use the `F.sum()` function over the defined window to calculate the cumulative sum by group.


# Apply cumulative sum function
df_with_cumsum = df.withColumn("Cumulative_Sum", F.sum("Value").over(window_spec))

df_with_cumsum.show()

+-----+-----+--------------+
|Group|Value|Cumulative_Sum|
+-----+-----+--------------+
|    A|    1|             1|
|    A|    2|             3|
|    A|    3|             6|
|    B|    2|             2|
|    B|    3|             5|
|    B|    4|             9|
+-----+-----+--------------+
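
If you prefer SQL syntax, the same window aggregate can be written in Spark SQL. Below is a minimal equivalent sketch; the temporary view name `t` is arbitrary.

# Register the DataFrame as a temporary view and use a SQL window function
df.createOrReplaceTempView("t")

spark.sql("""
    SELECT `Group`, `Value`,
           SUM(`Value`) OVER (
               PARTITION BY `Group`
               ORDER BY `Value`
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS Cumulative_Sum
    FROM t
""").show()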

Explanation

1. `Window.partitionBy("Group").orderBy("Value")`: Creates a window specification that partitions the DataFrame by the "Group" column and orders the rows within each group by the "Value" column. The `rowsBetween(Window.unboundedPreceding, Window.currentRow)` clause frames the window from the first row of the partition up to the current row, so each row sees a true running total.

2. `F.sum("Value").over(window_spec)`: Calculates the cumulative sum of the "Value" column over the specified window.

3. `withColumn("Cumulative_Sum", ...)`: Adds the result as a new column called "Cumulative_Sum" to the DataFrame.

This approach ensures that the cumulative sum is calculated correctly for each group individually, based on the order of values within the group.
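
As an aside, a pandas-style `cumsum()` is available through the pandas-on-Spark API. Below is a minimal sketch, assuming Spark 3.2 or later (where `DataFrame.pandas_api()` exists); note that it follows the frame's existing row order rather than an explicit `orderBy`.

# Convert to a pandas-on-Spark DataFrame (Spark 3.2+)
psdf = df.pandas_api()

# Running total per group, following the frame's current row order
psdf["Cumulative_Sum"] = psdf.groupby("Group")["Value"].cumsum()
print(psdf.sort_values(["Group", "Value"]))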
