How to Apply UDFs on Grouped Data in PySpark: A Step-by-Step Python Example

User-Defined Functions (UDFs) in PySpark let you run custom Python logic inside Spark transformations. Applying a UDF to grouped data involves a few steps: defining the function, wrapping it as a UDF, and applying it to the aggregated columns. Below is a step-by-step example in Python using PySpark.

Step 1: Setting Up PySpark

First, you’ll need to set up the Spark session:


from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("Apply UDF on Grouped Data") \
    .getOrCreate()
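
If you are running this locally rather than submitting to a cluster, you may want to pin the master explicitly; a minimal variant of the builder call (optional, not required for the rest of the example):


# Local variant: use all cores on the current machine
spark = SparkSession.builder \
    .appName("Apply UDF on Grouped Data") \
    .master("local[*]") \
    .getOrCreate()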

Step 2: Sample Data

Next, we create a sample DataFrame:


from pyspark.sql import Row

data = [
    Row(group="A", value=10),
    Row(group="A", value=20),
    Row(group="B", value=30),
    Row(group="B", value=40)
]

df = spark.createDataFrame(data)
df.show()

The code above will output:


+-----+-----+
|group|value|
+-----+-----+
|    A|   10|
|    A|   20|
|    B|   30|
|    B|   40|
+-----+-----+

Step 3: Defining the UDF

Now, define the UDF. For this example, let’s create a UDF that calculates the square of the sum of values in each group:


from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Define the Python function
def square_sum(values):
    # 'values' arrives as the list collected for each group
    return sum(values) ** 2

# Wrap the function as a Spark UDF, declaring its return type
square_sum_udf = udf(square_sum, IntegerType())
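
You can sanity-check the plain Python function before involving Spark. Also note that IntegerType holds 32-bit values, so if your sums can get large, a LongType variant (shown here as an assumption, not part of the original example) is safer:


from pyspark.sql.types import LongType

# Quick local check of the plain Python function
assert square_sum([10, 20]) == 900

# Assumed variant: declare a 64-bit return type so the squared sum
# cannot overflow the 32-bit range of IntegerType
square_sum_long_udf = udf(square_sum, LongType())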

Step 4: Applying the UDF to Grouped Data

Group the DataFrame by the “group” column and apply the UDF:


from pyspark.sql.functions import collect_list

# Group by 'group' and collect 'value' into a list
grouped_df = df.groupBy("group").agg(collect_list("value").alias("values"))

# Apply the UDF
result_df = grouped_df.withColumn("square_sum", square_sum_udf("values"))

result_df.show()

The code above will output the following (the order of the group rows may vary):


+-----+--------+----------+
|group|  values|square_sum|
+-----+--------+----------+
|    B|[30, 40]|      4900|
|    A|[10, 20]|       900|
+-----+--------+----------+
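
As an aside, this particular computation does not strictly need a UDF: built-in aggregate functions can express it directly and avoid the Python serialization round-trip. A minimal sketch of that alternative:


from pyspark.sql.functions import sum as sum_, col

# Same result with built-in functions only: sum per group, then square
native_df = (
    df.groupBy("group")
      .agg(sum_("value").alias("total"))
      .withColumn("square_sum", col("total") * col("total"))
)
native_df.show()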

Conclusion

In this example, you learned how to apply UDFs on grouped data in PySpark. By following these steps (defining a Python function, wrapping it as a UDF, collecting each group's values with collect_list, and applying the UDF to the aggregated column), you can perform customized per-group transformations on your data.
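
If your per-group logic is more involved than a single aggregate, Spark 3.0+ also supports grouped-map pandas UDFs through DataFrame.groupBy(...).applyInPandas, which hands each group to your function as a pandas DataFrame. A minimal sketch, assuming pandas and pyarrow are installed:


import pandas as pd

def square_sum_per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives one group's rows as a pandas DataFrame
    total = int(pdf["value"].sum())
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "square_sum": [total ** 2]})

pandas_result = df.groupBy("group").applyInPandas(
    square_sum_per_group, schema="group string, square_sum long"
)
pandas_result.show()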
