How to Add a Column with Sum as a New Column in PySpark DataFrame?

To add a column with a sum as a new column in a PySpark DataFrame, you can use various methods such as the `withColumn` function, the `select` function, or using SQL expressions after creating a temporary view. Below are detailed explanations and code examples for each method.

Method 1: Using withColumn()

The `withColumn` method allows you to add a column to a DataFrame by specifying the column name and the expression to populate the column’s values. In this case, we’ll use the `sum` function to calculate the sum of values in an existing column.


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum

# Initialize Spark Session
spark = SparkSession.builder.appName("AddColumnWithSum").getOrCreate()

# Sample Data
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
columns = ["Name", "Value"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Calculate Sum
total_sum = df.agg(_sum(col("Value")).alias("TotalSum")).collect()[0]["TotalSum"]

# Add Sum as New Column
df_with_sum = df.withColumn("SumColumn", col("Value") + total_sum)

# Show DataFrame
df_with_sum.show()

+-----+-----+---------+
| Name|Value|SumColumn|
+-----+-----+---------+
|Alice|    1|        7|
|  Bob|    2|        8|
|Cathy|    3|        9|
+-----+-----+---------+

Method 2: Using select()

The `select` method allows you to specify explicit columns and their expressions. You can use this method to include the summed values in the resulting DataFrame.


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum, lit

# Initialize Spark Session
spark = SparkSession.builder.appName("AddColumnWithSum").getOrCreate()

# Sample Data
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
columns = ["Name", "Value"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Calculate Sum
total_sum = df.agg(_sum(col("Value")).alias("TotalSum")).collect()[0]["TotalSum"]

# Add Sum as New Column
df_with_sum = df.select(col("Name"), col("Value"), (col("Value") + lit(total_sum)).alias("SumColumn"))

# Show DataFrame
df_with_sum.show()

+-----+-----+---------+
| Name|Value|SumColumn|
+-----+-----+---------+
|Alice|    1|        7|
|  Bob|    2|        8|
|Cathy|    3|        9|
+-----+-----+---------+

Method 3: Using SQL Queries

Another method is to create a temporary view of the DataFrame and use SQL queries to perform the calculations and add the new column.


from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as _sum

# Initialize Spark Session
spark = SparkSession.builder.appName("AddColumnWithSum").getOrCreate()

# Sample Data
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
columns = ["Name", "Value"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Create Temporary View
df.createOrReplaceTempView("sample_table")

# Calculate Sum and Add as New Column
df_with_sum = spark.sql("""
SELECT Name, Value, (Value + (SELECT SUM(Value) FROM sample_table)) as SumColumn 
FROM sample_table
""")

# Show DataFrame
df_with_sum.show()

+-----+-----+---------+
| Name|Value|SumColumn|
+-----+-----+---------+
|Alice|    1|        7|
|  Bob|    2|        8|
|Cathy|    3|        9|
+-----+-----+---------+

Each of these methods allows you to effectively add a column with summed values to a PySpark DataFrame. The choice of method can depend on your specific use case and preference for either DataFrame API or SQL queries.

About Editorial Team

Our Editorial Team is made up of tech enthusiasts deeply skilled in Apache Spark, PySpark, and Machine Learning, alongside proficiency in Pandas, R, Hive, PostgreSQL, Snowflake, and Databricks. They're not just experts; they're passionate educators, dedicated to demystifying complex data concepts through engaging and easy-to-understand tutorials.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top