Flattening a struct in a Spark DataFrame refers to converting the nested fields of a struct into individual columns. This can be particularly useful when dealing with deeply nested JSON data, where you want to work with a flat schema. Below, I will show you how to flatten a struct in a Spark DataFrame using PySpark.
Flattening a Struct in a Spark DataFrame
Example using PySpark
Let’s assume you have the following nested DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Create Spark session
spark = SparkSession.builder.appName("Flatten Struct Example").getOrCreate()
# Sample nested data
data = [
(1, {"name": "John", "age": 30}),
(2, {"name": "Jane", "age": 25})
]
# Define schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
StructField("id", IntegerType(), True),
StructField("info", StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), True)
]), True)
])
# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show()
The output of the above code will be:
+---+------------+
| id| info|
+---+------------+
| 1|{John, 30} |
| 2|{Jane, 25} |
+---+------------+
Now, let’s flatten the `info` struct into separate columns.
# Flatten the struct
flat_df = df.select(
col("id"),
col("info.name").alias("name"),
col("info.age").alias("age")
)
flat_df.show()
The output of the flattened DataFrame will be:
+---+----+---+
| id|name|age|
+---+----+---+
| 1|John| 30|
| 2|Jane| 25|
+---+----+---+
Explanation
1. First, we create a Spark session and define some sample nested data.
2. We define the schema to include a nested struct.
3. We create the original nested DataFrame using the sample data and schema.
4. We use the `select` method along with column expressions (`col(“info.name”).alias(“name”)` and `col(“info.age”).alias(“age”)`) to flatten the structure into individual columns.
Considerations
It’s worth noting that for deeply nested structures, you may need to recursively flatten each level. The above example assumes a single level of nesting.