How to Provide Schema While Reading a CSV as DataFrame in Scala Spark?

To provide a schema while reading a CSV file as a DataFrame in Scala Spark, you can use the `StructType` and `StructField` classes. This lets you specify column names and data types explicitly, and enforce data integrity, instead of relying on schema inference. Here is how to achieve this:

Here’s an example demonstrating how to provide a schema while reading a CSV file using Scala in Spark:


import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Create a SparkSession, the entry point for reading data
val spark = SparkSession.builder()
  .appName("CSVWithSchemaExample")
  .master("local[*]")
  .getOrCreate()

// Define the schema
val schema = StructType(Array(
  StructField("Name", StringType, true),
  StructField("Age", IntegerType, true),
  StructField("City", StringType, true)
))

// Read CSV with custom schema
val df: DataFrame = spark.read
  .format("csv")
  .option("header", "true")
  .schema(schema)
  .load("path_to_your_csv_file.csv")

df.show()

Explanation:

Let’s break down the code:

  1. Importing necessary libraries: Import libraries required to work with DataFrame and define schema.
  2. Creating SparkSession: Create a SparkSession object which is the entry point for reading data and executing Spark queries.
  3. Defining the schema: Use `StructType` and `StructField` to define the schema for the CSV file. Here:
    • `Name` is of `StringType`
    • `Age` is of `IntegerType`
    • `City` is of `StringType`
  4. Reading CSV: Use `spark.read.format("csv")` to specify that we are reading a CSV file. The `.option("header", "true")` ensures that the first row is used as column headers, and `.schema(schema)` applies the defined schema instead of inferring one.
  5. Displaying DataFrame: Use `df.show()` to display the content of the DataFrame.
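As a shorthand, the same schema can also be expressed as a DDL-formatted string, which Spark parses into an equivalent `StructType`. A minimal sketch (the file path is the same placeholder used above):

```scala
// Equivalent schema expressed as a DDL string; Spark parses it
// into the same StructType as the explicit definition above.
val dfDdl: DataFrame = spark.read
  .format("csv")
  .option("header", "true")
  .schema("Name STRING, Age INT, City STRING")
  .load("path_to_your_csv_file.csv")
```

The string form is convenient for short schemas; the `StructType` form is preferable when you need per-field settings such as nullability or metadata.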

Expected Output:


+-----+---+------+
| Name|Age|  City|
+-----+---+------+
| John| 25|  York|
| Jane| 30| Paris|
|  Doe| 35|Berlin|
+-----+---+------+

In the output above, the CSV data is displayed in tabular form with the defined schema enforced: `Age` is read as an integer, and `Name` and `City` keep their declared string types.
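Schema enforcement also interacts with Spark's CSV parse modes. By default (`PERMISSIVE`), a row that does not match the schema gets `null` in the offending columns; `FAILFAST` aborts on the first malformed record. A sketch reusing the session, schema, and placeholder path from above:

```scala
// FAILFAST: throw an exception on the first row that violates
// the schema, e.g. a non-numeric value in the Age column.
val strictDf: DataFrame = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "FAILFAST")
  .schema(schema)
  .load("path_to_your_csv_file.csv")
```

Use `FAILFAST` when bad input should surface immediately, and the default `PERMISSIVE` mode when you prefer to load everything and filter out nulls afterwards.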
