User-Defined Functions (UDFs) in PySpark let you apply custom Python logic inside Spark transformations. Applying a UDF to grouped data involves a few steps: defining the function, wrapping it as a UDF, and applying it to the aggregated columns. Below is a step-by-step example in Python using PySpark.
Step 1: Setting Up PySpark
First, you’ll need to set up the Spark session:
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder \
    .appName("Apply UDF on Grouped Data") \
    .getOrCreate()
Step 2: Sample Data
Next, we create a sample DataFrame:
from pyspark.sql import Row
data = [
    Row(group="A", value=10),
    Row(group="A", value=20),
    Row(group="B", value=30),
    Row(group="B", value=40),
]
df = spark.createDataFrame(data)
df.show()
The code above will output:
+-----+-----+
|group|value|
+-----+-----+
| A| 10|
| A| 20|
| B| 30|
| B| 40|
+-----+-----+
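If you want to confirm the column types Spark inferred, printSchema shows that Python ints become a long column:
df.printSchema()
# root
#  |-- group: string (nullable = true)
#  |-- value: long (nullable = true)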
Step 3: Defining the UDF
Now, define the UDF. For this example, let’s create a UDF that calculates the square of the sum of values in each group:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
# Define a plain Python function
def square_sum(values):
    return sum(values) ** 2

# Wrap it as a UDF (spark.udf.register is only needed if you want to call it from SQL)
square_sum_udf = udf(square_sum, IntegerType())
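Because square_sum is an ordinary Python function, you can sanity-check it on the driver before wiring it into Spark. One caveat worth noting: IntegerType is 32-bit, so if a group's squared sum could exceed roughly 2.1 billion, declare LongType instead.
# Quick driver-side check: (10 + 20) ** 2
print(square_sum([10, 20]))  # 900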
Step 4: Applying the UDF to Grouped Data
A standard UDF operates on one row at a time, so we first collect each group’s values into a list column with collect_list, then apply the UDF to that column:
from pyspark.sql.functions import collect_list
# Group by 'group' and collect 'value' into a list
grouped_df = df.groupBy("group").agg(collect_list("value").alias("values"))
# Apply the UDF
result_df = grouped_df.withColumn("square_sum", square_sum_udf("values"))
result_df.show()
The code above will output:
+-----+--------+----------+
|group| values|square_sum|
+-----+--------+----------+
| B|[30, 40]| 4900|
| A|[10, 20]| 900|
+-----+--------+----------+
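One caveat: collect_list materializes all of a group’s values in memory, which can be expensive for large groups. As a minimal sketch of an alternative (assuming pandas and PyArrow are installed; the function and variable names here are illustrative), the same result can be computed with applyInPandas, where Spark passes each group to your function as a pandas DataFrame:
import pandas as pd

def square_sum_pandas(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each call receives all rows of one group as a pandas DataFrame
    return pd.DataFrame({
        "group": [pdf["group"].iloc[0]],
        "square_sum": [int(pdf["value"].sum()) ** 2],
    })

alt_df = df.groupBy("group").applyInPandas(
    square_sum_pandas, schema="group string, square_sum long"
)
alt_df.show()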
Conclusion
In this example, you learned how to apply a UDF to grouped data in PySpark: define a function, wrap it as a UDF, group the DataFrame, and apply the UDF to the aggregated column. This pattern lets you run custom transformations that the built-in aggregate functions don’t cover.