Selecting Nested Struct Columns in PySpark

One of the features that makes PySpark stand out is its ability to handle complex, nested data structures, such as JSON files, through DataFrame APIs. In this guide, we will explore how to select nested struct columns in PySpark, which is a common scenario when dealing with semi-structured data.

Understanding PySpark DataFrames

Before diving into the selection of nested struct columns, it’s essential to have a good understanding of PySpark DataFrames. A DataFrame in PySpark is a distributed collection of rows under named columns, which is conceptually equivalent to a table in a relational database but with richer optimizations. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs (Resilient Distributed Datasets).

Nested struct fields in DataFrames are akin to columns within columns, which is a structure reminiscent of JSON objects. Each struct represents a complex column that can have its own subfields, sometimes with additional nesting. Handling this kind of data requires specific methods to effectively extract and manipulate these nested fields.

Creating a DataFrame with Nested Struct Columns

Setting Up the SparkSession

The first step when working with PySpark is to initialize a SparkSession. The SparkSession is the entry point to programming Spark with the DataFrame API. Here’s how you can start one:


from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Nested Columns Example") \
    .getOrCreate()

Defining a DataFrame with Nested Structs

Let’s create a DataFrame with nested struct columns to use in our examples. We will define a simple schema that mimics a user record with nested details about the user’s address.


from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define a schema with nested struct
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True)
    ]), True)
])

# Create data with nested structure
data = [("James", 34, ("1st Avenue", "New York")),
        ("Anna", 23, ("2nd Avenue", "San Francisco")),
        ("Jeff", 45, ("3rd Avenue", "Los Angeles"))]

# Create a DataFrame using the schema defined
df = spark.createDataFrame(data, schema)

df.show()

Running the above code produces the following output, showing the DataFrame with its nested “address” struct column:


+-----+---+---------------------------+
| name|age|                    address|
+-----+---+---------------------------+
|James| 34|     {1st Avenue, New York}|
| Anna| 23|{2nd Avenue, San Francisco}|
| Jeff| 45|  {3rd Avenue, Los Angeles}|
+-----+---+---------------------------+

Selecting and Accessing Nested Struct Columns

Selecting The Entire Struct Column

You can select an entire nested struct column just like any other column. Here’s an example:


# Selecting the entire nested 'address' column
df.select("address").show()

The output of the above code snippet will be:


+---------------------------+
|                    address|
+---------------------------+
|     {1st Avenue, New York}|
|{2nd Avenue, San Francisco}|
|  {3rd Avenue, Los Angeles}|
+---------------------------+

Selecting Specific Subfields of a Struct Column

More often, one needs to select specific fields within a nested struct. You can do this by referencing the subfields with a dot (.) as shown below:


# Selecting 'street' subfield within the 'address' struct column
df.select("address.street").show()

The output will look like this:


+----------+
|    street|
+----------+
|1st Avenue|
|2nd Avenue|
|3rd Avenue|
+----------+

Alias for Nested Columns

Sometimes it is useful to alias nested columns, especially for clarity or when you want to flatten the structure. Use the `alias` method to rename the columns in the output:


# Selecting and aliasing 'city' subfield within the 'address' struct column
df.select(df["address.city"].alias("city")).show()

This would produce the following output:


+-------------+
|         city|
+-------------+
|     New York|
|San Francisco|
|  Los Angeles|
+-------------+

Complex Nested Structures

Dealing With Multiple Levels of Nesting

Sometimes you might have multiple levels of nested struct columns. In this scenario, you can continue to use dot notation to navigate down the levels. Assume we have a DataFrame `df_nested` with a more complex structure like this:


# Assume 'df_nested' has the following complex nested structure with 'contact' nested within 'address'
# Defined schema for illustration purposes:
complex_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("address", StructType([
        StructField("street", StringType(), True),
        StructField("city", StringType(), True),
        StructField("contact", StructType([
            StructField("phone", StringType(), True),
            StructField("email", StringType(), True)
        ]), True)
    ]), True)
])

If you want to select the ‘phone’ field within the ‘contact’ struct nested inside the ‘address’ struct, you would do the following:


# Navigating through multiple levels and selecting 'phone'
df_nested.select("address.contact.phone").show()

# Output placeholder, as 'df_nested' is only conceptual here.

If this code were run, an output might look like this, assuming appropriate data in `df_nested`:


+------------+
|       phone|
+------------+
|123-456-7890|
|321-654-0987|
|456-123-8765|
+------------+

Handling Nested Data in Practice

Exploding Nested Arrays

When working with arrays inside struct columns, you might need to “explode” them into multiple rows. This can be achieved using the `explode` function:


from pyspark.sql.functions import explode

# Assuming 'df_array' contains an 'interests' array inside the 'details' struct.
# The explode function is used here for illustration:

df_array.select(explode(df_array["details.interests"]).alias("interest")).show()

# You would see an output displaying each interest in a separate row.

Using `selectExpr` for More Complex Selections

For more complex selections or to execute SQL expressions on DataFrame columns, you can use the `selectExpr` method. It can be particularly handy for accessing nested structs:


# Using `selectExpr` to access nested struct fields
df.selectExpr("address.street as street", "address.city as city").show()

The above snippet would produce a flattened structure with the street and city as separate columns:


+----------+-------------+
|    street|         city|
+----------+-------------+
|1st Avenue|     New York|
|2nd Avenue|San Francisco|
|3rd Avenue|  Los Angeles|
+----------+-------------+

Conclusion

Working with nested struct columns in PySpark can seem daunting at first, but it’s a powerful feature that, when mastered, can enable you to manipulate and analyze complex data structures effectively. With a strong grasp on PySpark’s DataFrame API and the use of structured columns, you can unlock insightful analysis and derive meaningful interpretations from intricate datasets.

We have covered initializing a SparkSession, creating DataFrames with nested structures, and various methods to select and manipulate these nested fields. This knowledge should provide a solid foundation for handling nested data structures in your PySpark applications.
