Parsing a column of JSON strings in PySpark means converting each JSON string in a DataFrame column into a structured (struct) column. PySpark provides the `from_json` function for exactly this purpose: given a column of JSON strings and a schema, it returns a struct column whose fields can then be queried directly. Below, we walk through the process step by step with a worked example.
Step-by-Step Process
1. Import Required Libraries
First, you need to import the necessary libraries. We will use PySpark’s SQL module to work with DataFrames and the `from_json` function to parse the JSON strings.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
2. Initialize Spark Session
Next, create a Spark session to work with Spark DataFrames.
spark = SparkSession.builder \
    .appName("Parse JSON Column") \
    .getOrCreate()
3. Sample DataFrame with JSON Column
Let’s create a sample DataFrame containing JSON strings that we will parse.
data = [
    ('{"name": "Alice", "age": 30}',),
    ('{"name": "Bob", "age": 25}',),
    ('{"name": "Charlie", "age": 35}',)
]
df = spark.createDataFrame(data, ["json_string"])
df.show(truncate=False)
+------------------------------+
|json_string                   |
+------------------------------+
|{"name": "Alice", "age": 30}  |
|{"name": "Bob", "age": 25}    |
|{"name": "Charlie", "age": 35}|
+------------------------------+
4. Define Schema for JSON String
Define the schema of the JSON data that will be used to parse the JSON strings.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
5. Parse JSON Strings
Use the `from_json` function to parse the JSON strings and create a new column with a structured format.
df_parsed = df.withColumn("parsed_json", from_json(col("json_string"), schema))
df_parsed.show(truncate=False)
+------------------------------+-------------+
|json_string                   |parsed_json  |
+------------------------------+-------------+
|{"name": "Alice", "age": 30}  |{Alice, 30}  |
|{"name": "Bob", "age": 25}    |{Bob, 25}    |
|{"name": "Charlie", "age": 35}|{Charlie, 35}|
+------------------------------+-------------+
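Note that `from_json` does not raise an error on malformed input: rows whose strings cannot be parsed against the schema simply produce null in the struct column. Conceptually, the per-row behavior resembles this plain-Python sketch (the function name is illustrative, not part of PySpark):

```python
import json

def parse_like_from_json(s, field_names):
    """Illustrative only: mimic from_json's per-row behavior,
    returning None when the string cannot be parsed."""
    try:
        obj = json.loads(s)
        if not isinstance(obj, dict):
            return None
        # Missing keys become None, matching nullable struct fields
        return tuple(obj.get(f) for f in field_names)
    except ValueError:
        return None

print(parse_like_from_json('{"name": "Alice", "age": 30}', ["name", "age"]))  # ('Alice', 30)
print(parse_like_from_json('not valid json', ["name", "age"]))                # None
```

This null-on-failure behavior means you can filter for `parsed_json IS NULL` after parsing to find and inspect bad records.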
6. Extract Fields from Parsed JSON
Finally, extract the individual fields from the parsed JSON column for easier access and analysis.
df_final = df_parsed.select(
    col("json_string"),
    col("parsed_json.name").alias("name"),
    col("parsed_json.age").alias("age")
)
df_final.show(truncate=False)
+------------------------------+-------+---+
|json_string                   |name   |age|
+------------------------------+-------+---+
|{"name": "Alice", "age": 30}  |Alice  |30 |
|{"name": "Bob", "age": 25}    |Bob    |25 |
|{"name": "Charlie", "age": 35}|Charlie|35 |
+------------------------------+-------+---+
Conclusion
In this process, we started with a DataFrame containing JSON strings and parsed these strings into a structured format using PySpark’s `from_json` function. We defined a schema for the JSON data, applied the parsing function, and finally extracted individual fields for easier access. This method can be particularly useful when dealing with nested JSON structures in your data processing pipelines.