How to Flatten a Struct in a Spark DataFrame?

Flattening a struct in a Spark DataFrame refers to converting the nested fields of a struct into individual columns. This can be particularly useful when dealing with deeply nested JSON data, where you want to work with a flat schema. Below, I will show you how to flatten a struct in a Spark DataFrame using PySpark.

Flattening a Struct in a Spark DataFrame

Example using PySpark

Let’s assume you have the following nested DataFrame:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create Spark session
spark = SparkSession.builder.appName("Flatten Struct Example").getOrCreate()

# Sample nested data
data = [
    (1, {"name": "John", "age": 30}),
    (2, {"name": "Jane", "age": 25})
]

# Define schema with a nested struct
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("info", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)
    ]), True)
])

# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show()

The output of the above code will be:


+---+----------+
| id|      info|
+---+----------+
|  1|{John, 30}|
|  2|{Jane, 25}|
+---+----------+
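
Before flattening, you can confirm the nested layout with `printSchema()`:


df.printSchema()

For the schema defined above, this should print something like:


root
 |-- id: integer (nullable = true)
 |-- info: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- age: integer (nullable = true)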

Now, let’s flatten the `info` struct into separate columns.


# Flatten the struct
flat_df = df.select(
    col("id"),
    col("info.name").alias("name"),
    col("info.age").alias("age")
)

flat_df.show()

The output of the flattened DataFrame will be:


+---+----+---+
| id|name|age|
+---+----+---+
|  1|John| 30|
|  2|Jane| 25|
+---+----+---+

Explanation

1. First, we create a Spark session and define some sample nested data.
2. We define the schema to include a nested struct.
3. We create the original nested DataFrame using the sample data and schema.
4. We use the `select` method along with column expressions (`col("info.name").alias("name")` and `col("info.age").alias("age")`) to flatten the structure into individual columns.
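
As an alternative to listing each field explicitly, Spark also supports expanding every field of a struct with the `*` wildcard, which yields the same `id`, `name`, and `age` columns:


# Equivalent shorthand: expand all fields of the `info` struct
flat_df = df.select("id", "info.*")
flat_df.show()

This shorthand is convenient when the struct has many fields, though you lose the ability to rename individual columns inline as `alias` allows.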

Considerations

It’s worth noting that for deeply nested structures, you may need to recursively flatten each level. The above example assumes a single level of nesting.
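
If you do need to handle arbitrary nesting, one possible approach is a small helper that walks the DataFrame's schema and expands struct columns level by level until none remain. The sketch below is illustrative only: the function name `flatten_struct_columns` and the underscore-joined column naming are assumptions, not a built-in Spark API.


from pyspark.sql.types import StructType
from pyspark.sql.functions import col

def flatten_struct_columns(df):
    # Build a flat list of column expressions for the current nesting level
    flat_cols = []
    found_struct = False
    for field in df.schema.fields:
        if isinstance(field.dataType, StructType):
            found_struct = True
            # Promote each child field to a top-level column, e.g. info.name -> info_name
            for child in field.dataType.fields:
                flat_cols.append(
                    col(f"{field.name}.{child.name}").alias(f"{field.name}_{child.name}")
                )
        else:
            flat_cols.append(col(field.name))
    flattened = df.select(flat_cols)
    # Recurse while any struct columns remain (handles deeper nesting)
    return flatten_struct_columns(flattened) if found_struct else flattened

fully_flat_df = flatten_struct_columns(df)
fully_flat_df.show()

Applied to the example DataFrame, this would produce columns named `id`, `info_name`, and `info_age` rather than `name` and `age`; adjust the naming convention to suit your data.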
