How to Create an Empty DataFrame with a Specified Schema in Apache Spark?

Creating an empty DataFrame with a specified schema in Apache Spark is straightforward and works from any of Spark's language APIs, including Python (PySpark), Scala, and Java. The examples below cover PySpark and Scala.

PySpark

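In PySpark, you define the schema with the `StructType` and `StructField` classes, where each `StructField` takes a column name, a data type, and a nullable flag. You then pass an empty list of rows together with that schema to `spark.createDataFrame`.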


```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize Spark session
spark = SparkSession.builder.appName("CreateEmptyDataFrame").getOrCreate()

# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

# Create empty DataFrame with specified schema
empty_df = spark.createDataFrame([], schema)

# Show the schema of the DataFrame
empty_df.printSchema()
```

Output:

```
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
```

Scala

In Scala, you define the schema the same way and pass an empty `RDD[Row]` together with the schema to `createDataFrame`. Below is an example:


```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Initialize Spark session
val spark = SparkSession.builder.appName("CreateEmptyDataFrame").getOrCreate()

// Define schema
val schema = StructType(Array(
    StructField("Name", StringType, nullable = true),
    StructField("Age", IntegerType, nullable = true)
))

// Create empty DataFrame with specified schema
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// Show the schema of the DataFrame
emptyDF.printSchema()
```

Output:

```
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
```

By specifying the schema in this way, you ensure that the DataFrame has the desired structure even though it’s initially empty. This method is particularly useful for scenarios where the schema needs to be known and enforced at the start of data processing.
