How to Provide Schema While Reading a CSV as DataFrame in Scala Spark?

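To provide a schema while reading a CSV file as a DataFrame in Scala Spark, define it with the `StructType` and `StructField` classes and pass it to the reader. An explicit schema fixes the column names and data types up front, instead of leaving every column as a string. Note that in the reader's default mode, a value that does not match its declared type is read as null rather than rejected. The steps are shown below: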

Providing Schema While Reading a CSV as DataFrame in Scala Spark

Here’s an example demonstrating how to provide a schema while reading a CSV file using Scala in Spark:


import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val spark = SparkSession.builder()
  .appName("CSVWithSchemaExample")
  .config("spark.master", "local")
  .getOrCreate()

// Define the schema
val schema = StructType(Array(
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("City", StringType, true)
))

// Read CSV with custom schema
val df: DataFrame = spark.read
  .format("csv")
  .option("header", "true")
  .schema(schema)
  .load("path_to_your_csv_file.csv")

df.show()

Explanation:

Let’s break down the code:

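Importing necessary libraries: Import `SparkSession` and `DataFrame`, plus the classes from `org.apache.spark.sql.types` that are used to define the schema.
Creating SparkSession: Create a `SparkSession` object, which is the entry point for reading data and executing Spark queries. Setting `spark.master` to `local` runs Spark on the local machine.
Defining the schema: Use `StructType` and `StructField` to define the schema for the CSV file. Each `StructField` takes the column name, the data type, and a flag that says whether the column may contain nulls (`true` here). In this example:
- `Name` is of `StringType`
- `Age` is of `IntegerType`
- `City` is of `StringType`
Reading CSV: Use `spark.read.format("csv")` to specify that we are reading a CSV file. The `.option("header", "true")` setting tells Spark that the first row is a header, so it is not read as data. `.schema(schema)` applies the defined schema, and with an explicit schema the DataFrame's column names come from the schema. `.load(...)` reads the file at the given path.
Displaying DataFrame: Use `df.show()` to print the first rows of the DataFrame (20 by default).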
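
As a variation on the schema step above, `DataFrameReader.schema` also accepts a DDL-formatted string (Spark 2.3 and later), which can be shorter to write than a `StructType`. The following is a minimal sketch under stated assumptions: the temporary file, its sample rows, and the names `csvPath`, `ddlSchema`, and `ddlDf` are hypothetical and exist only to make the example self-contained.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("CSVWithDdlSchemaExample")
  .master("local[*]")
  .getOrCreate()

// Hypothetical sample data, written to a temporary file so the sketch runs on its own
val csvPath = Files.createTempFile("people", ".csv")
Files.write(csvPath, "Name,Age,City\nJohn,25,York\nJane,30,Paris\n".getBytes("UTF-8"))

// The same three columns as the StructType example, written as a DDL string
val ddlSchema = "Name STRING, Age INT, City STRING"

val ddlDf = spark.read
  .format("csv")
  .option("header", "true")
  .schema(ddlSchema)
  .load(csvPath.toString)

// Print the column names and types that Spark applied
ddlDf.printSchema()
```

`printSchema()` should list `Age` as an integer column, which is a quick way to check that the schema was applied as intended. A `StructType` remains the better fit when the schema is built programmatically or reused across several readers.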

Expected Output:


+-----+---+------+
| Name|Age|  City|
+-----+---+------+
| John| 25|  York|
| Jane| 30| Paris|
|  Doe| 35|Berlin|
+-----+---+------+

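The output above shows the rows of the CSV file in tabular form, with the column names and data types taken from the supplied schema. Calling `df.printSchema()` confirms that `Age` is an integer column rather than a string.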
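
It can help to see what happens when a file does not match the schema. The CSV reader's default parse mode is `PERMISSIVE`, in which a value that cannot be converted to the declared type becomes null. Setting the `mode` option to `FAILFAST` makes the read throw an exception instead. The sketch below assumes a hypothetical file whose second data row has a non-numeric `Age`; the file contents and the names `badCsvPath`, `permissiveDf`, `nullAges`, and `failFastResult` are illustrative only.

```scala
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("CSVSchemaMismatchExample")
  .master("local[*]")
  .getOrCreate()

val schema = StructType(Array(
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("City", StringType, true)
))

// Hypothetical sample data: "thirty" cannot be parsed as an integer
val badCsvPath = Files.createTempFile("people_bad", ".csv")
Files.write(badCsvPath, "Name,Age,City\nJohn,25,York\nJane,thirty,Paris\n".getBytes("UTF-8"))

// Default mode (PERMISSIVE): the unparseable Age is read as null
val permissiveDf = spark.read
  .format("csv")
  .option("header", "true")
  .schema(schema)
  .load(badCsvPath.toString)
val nullAges = permissiveDf.filter(permissiveDf("Age").isNull).count()

// FAILFAST: the same file causes an exception once the rows are actually parsed
val failFastResult = Try(
  spark.read
    .format("csv")
    .option("header", "true")
    .option("mode", "FAILFAST")
    .schema(schema)
    .load(badCsvPath.toString)
    .collect()
)

println(s"Rows with null Age in PERMISSIVE mode: $nullAges")
println(s"FAILFAST read failed: ${failFastResult.isFailure}")
```

Because Spark evaluates lazily, the `FAILFAST` error only appears when an action such as `collect()` forces the rows to be parsed. Which mode is appropriate depends on whether bad rows should be tolerated or should stop the job.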
