How to Flatten a Struct in a Spark DataFrame?

Flattening a struct in a Spark DataFrame refers to converting the nested fields of a struct into individual columns. This can be particularly useful when dealing with deeply nested JSON data, where you want to work with a flat schema. Below, I will show you how to flatten a struct in a Spark DataFrame using PySpark.

Flattening a Struct in a Spark DataFrame

Example using PySpark

Let’s assume you have the following nested DataFrame:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create Spark session
spark = SparkSession.builder.appName("Flatten Struct Example").getOrCreate()

# Sample nested data
data = [
    (1, {"name": "John", "age": 30}),
    (2, {"name": "Jane", "age": 25})
]

# Define schema with a nested struct
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("info", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ]), True)
])

# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show()

The output of the above code will be:


+---+----------+
| id|      info|
+---+----------+
|  1|{John, 30}|
|  2|{Jane, 25}|
+---+----------+
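
Before flattening, you can confirm the nested layout with `printSchema()`:


df.printSchema()

For the schema defined above, this should print something like:


root
 |-- id: integer (nullable = true)
 |-- info: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: integer (nullable = true)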

Now, let’s flatten the `info` struct into separate columns.


# Flatten the struct
flat_df = df.select(
    col("id"),
    col("info.name").alias("name"),
    col("info.age").alias("age")
)

flat_df.show()

The output of the flattened DataFrame will be:


+---+----+---+
| id|name|age|
+---+----+---+
|  1|John| 30|
|  2|Jane| 25|
+---+----+---+

Explanation

1. First, we create a Spark session and define some sample nested data.
2. We define the schema to include a nested struct.
3. We create the original nested DataFrame using the sample data and schema.
4. We use the `select` method along with column expressions (`col("info.name").alias("name")` and `col("info.age").alias("age")`) to flatten the structure into individual columns.
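
As an alternative to listing each field explicitly, Spark also supports expanding every field of a struct with the `*` wildcard, which yields the same `id`, `name`, and `age` columns:


# Equivalent shorthand: expand all fields of the `info` struct
flat_df = df.select("id", "info.*")
flat_df.show()

This shorthand is convenient when the struct has many fields, though you lose the ability to rename individual columns inline as `alias` allows.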

Considerations

It’s worth noting that for deeply nested structures, you may need to recursively flatten each level. The above example assumes a single level of nesting.
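
If you do need to handle arbitrary nesting, one possible approach is a small helper that walks the DataFrame's schema and expands struct columns level by level until none remain. The sketch below is illustrative only: the function name `flatten_struct_columns` and the underscore-joined column naming are assumptions, not a built-in Spark API.


from pyspark.sql.types import StructType
from pyspark.sql.functions import col

def flatten_struct_columns(df):
    # Build a flat list of column expressions for the current nesting level
    flat_cols = []
    found_struct = False
    for field in df.schema.fields:
        if isinstance(field.dataType, StructType):
            found_struct = True
            # Promote each child field to a top-level column, e.g. info.name -> info_name
            for child in field.dataType.fields:
                flat_cols.append(
                    col(f"{field.name}.{child.name}").alias(f"{field.name}_{child.name}")
                )
        else:
            flat_cols.append(col(field.name))
    flattened = df.select(flat_cols)
    # Recurse while any struct columns remain (handles deeper nesting)
    return flatten_struct_columns(flattened) if found_struct else flattened

fully_flat_df = flatten_struct_columns(df)
fully_flat_df.show()

Applied to the example DataFrame, this would produce columns named `id`, `info_name`, and `info_age` rather than `name` and `age`; adjust the naming convention to suit your data.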
