When creating a Spark DataFrame, sometimes the schema cannot be inferred automatically, especially when the data is in a complex format. In such cases, you can explicitly define the schema using `StructType` and `StructField`. This approach allows for greater control over the data types and structure of your DataFrame.
Creating a Spark DataFrame with Explicit Schema in PySpark
First, we’ll go through an example in PySpark where we create a DataFrame with a predefined schema.
Let’s assume we have the following JSON data:
{"name": "John", "age": 30, "city": "New York"}
{"name": "Doe", "age": "25", "city": "Los Angeles"}
Here’s how you can define the schema and create the DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Initialize Spark session
spark = SparkSession.builder.master("local[1]").appName('SparkByExamples.com').getOrCreate()
# Define the schema
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])
# Sample data matching the JSON records above
json_data = [
    {"name": "John", "age": 30, "city": "New York"},
    {"name": "Doe", "age": 25, "city": "Los Angeles"}
]
# Create DataFrame
df = spark.createDataFrame(json_data, schema)
# Show the DataFrame
df.show()
Output:
+----+---+-----------+
|name|age|       city|
+----+---+-----------+
|John| 30|   New York|
| Doe| 25|Los Angeles|
+----+---+-----------+
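To confirm that the explicit schema was applied rather than an inferred one, you can print it. A minimal check using PySpark's printSchema() on the DataFrame created above:
# Print the schema to verify the declared types were applied
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)
#  |-- city: string (nullable = true)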
Creating a Spark DataFrame with Explicit Schema in Scala
Similarly, in Scala, you can create a DataFrame with a predefined schema as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
// Initialize Spark session
val spark = SparkSession.builder().appName("SparkExample").master("local[1]").getOrCreate()
// Define the schema
val schema = StructType(Array(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("city", StringType, true)
))
// Sample JSON records as strings
val jsonData = Seq(
  """{"name": "John", "age": 30, "city": "New York"}""",
  """{"name": "Doe", "age": 25, "city": "Los Angeles"}"""
)
// Read the JSON strings with the explicit schema
import spark.implicits._
val df = spark.read.schema(schema).json(jsonData.toDS())
// Show the DataFrame
df.show()
Output:
+----+---+-----------+
|name|age|       city|
+----+---+-----------+
|John| 30|   New York|
| Doe| 25|Los Angeles|
+----+---+-----------+
By explicitly defining the schema, we ensure that the data types and structure are correctly interpreted by Spark, even if they cannot be inferred directly from the source data.
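For the complex data mentioned in the introduction, StructType and StructField can also be nested to describe structured fields. The following is a minimal, hypothetical sketch: the address field and its sub-fields are illustrative assumptions, not part of the example above, and it reuses the spark session created earlier.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Hypothetical nested schema: "address" is a struct with its own fields
nested_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("zip", StringType(), True)
    ]), True)
])
nested_data = [
    {"name": "John", "age": 30, "address": {"city": "New York", "zip": "10001"}}
]
nested_df = spark.createDataFrame(nested_data, nested_schema)
nested_df.show(truncate=False)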