How to Convert Spark RDD to DataFrame in Python?

Converting a Spark RDD (Resilient Distributed Dataset) to a DataFrame is a common task in data processing. DataFrames benefit from Spark’s Catalyst query optimizer and offer a richer, SQL-like API for data manipulation. Here’s how you can achieve this in PySpark:

Converting Spark RDD to DataFrame in Python

First, ensure you’ve imported the necessary libraries and initialized a Spark session:


from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.appName("RDDtoDataFrame").getOrCreate()

Let’s consider a simple example where we have an RDD containing data about persons:


# Example RDD
data = [("James", "Smith", "USA", 25),
        ("Michael", "Rose", "USA", 30),
        ("Robert", "Williams", "USA", 45)]

# Create an RDD from the data
rdd = spark.sparkContext.parallelize(data)

Next, define a schema for the DataFrame:


from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the schema
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("country", StringType(), True),
    StructField("age", IntegerType(), True)
])

Now, convert the RDD to a DataFrame using the defined schema:


# Convert RDD to DataFrame
df = spark.createDataFrame(rdd, schema)
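
If you prefer a more compact form, PySpark also attaches a toDF method to RDDs once a SparkSession is active; it simply delegates to spark.createDataFrame. A minimal sketch using the rdd and schema defined above:


# Equivalent shortcut to spark.createDataFrame(rdd, schema)
df = rdd.toDF(schema)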

Display the DataFrame and its schema to verify the conversion:


# Show the DataFrame
df.show()

# Print the schema of the DataFrame
df.printSchema()

The output of the above code will be:


+---------+--------+-------+---+
|firstname|lastname|country|age|
+---------+--------+-------+---+
|    James|   Smith|    USA| 25|
|  Michael|    Rose|    USA| 30|
|   Robert|Williams|    USA| 45|
+---------+--------+-------+---+

root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- country: string (nullable = true)
 |-- age: integer (nullable = true)
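
To see the richer DataFrame API mentioned in the introduction, here is a small follow-up query; the filter threshold is an arbitrary illustration, and the column names come from the schema above:


# Filter rows and project columns using the DataFrame API
df.filter(df.age > 28).select("firstname", "age").show()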

Another Approach: Using RDD of Row Objects

Another way to convert an RDD to a DataFrame is by using an RDD of Row objects. This method is useful when you want Spark to infer the schema from the data rather than define it up front:


from pyspark.sql import Row

# Create an RDD of Row objects
rdd = spark.sparkContext.parallelize([
    Row(firstname="James", lastname="Smith", country="USA", age=25),
    Row(firstname="Michael", lastname="Rose", country="USA", age=30),
    Row(firstname="Robert", lastname="Williams", country="USA", age=45)
])

# Convert RDD to DataFrame
df = spark.createDataFrame(rdd)

# Show the DataFrame
df.show()

# Print the schema of the DataFrame
df.printSchema()

The output is almost identical to the first approach, except for the inferred type of the age column:


+---------+--------+-------+---+
|firstname|lastname|country|age|
+---------+--------+-------+---+
|    James|   Smith|    USA| 25|
|  Michael|    Rose|    USA| 30|
|   Robert|Williams|    USA| 45|
+---------+--------+-------+---+

root
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- country: string (nullable = true)
 |-- age: long (nullable = true)

Note that schema inference maps Python integers to LongType, so the age column comes out as long rather than int. If you want to enforce specific types, pass an explicit schema as in the first approach.
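
For example, you can either cast the inferred column afterwards or pass the explicit schema from the first approach alongside the Row objects. Both are minimal sketches assuming rdd and schema from the earlier examples are still in scope:


from pyspark.sql.functions import col

# Option 1: cast the inferred long column to int after the fact
df = df.withColumn("age", col("age").cast("int"))

# Option 2: supply the explicit schema when creating the DataFrame,
# so no inference takes place
df = spark.createDataFrame(rdd, schema)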
