How Do I Skip a Header from CSV Files in Spark?

Skipping headers from CSV files when reading them in Apache Spark is a common requirement, as headers typically contain column names that should not be processed as data. You can skip the header row using Spark’s DataFrame API. Below are approaches in PySpark, Scala, and Java to skip headers when reading CSV files.

Using PySpark

In PySpark, you can use the `option` method to set `header` to `"true"`, which tells Spark to use the first row as column names rather than treating it as data.


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("SkipHeaderExample").getOrCreate()

# Read CSV file with header
df = spark.read.option("header", "true").csv("/path/to/csvfile.csv")

# Show DataFrame schema and data
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)

+----+---+------+
|name|age|  city|
+----+---+------+
|John| 30|   NYC|
|Anna| 25|London|
|Mike| 35|    SF|
+----+---+------+
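
With only the `header` option set, every column is read as a string, as the schema above shows. As a minimal variation (assuming the same hypothetical file path), you can also set `inferSchema` to `"true"` so Spark infers column types from the data:

# Read CSV: the first row supplies column names, and types are inferred from the data
df = (
    spark.read
    .option("header", "true")       # use the first row as column names, not data
    .option("inferSchema", "true")  # extra pass over the file to infer column types
    .csv("/path/to/csvfile.csv")
)

# age would now be inferred as an integer instead of a string
df.printSchema()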

Using Scala

In Scala, you achieve the same result by setting the `header` option to `true` when reading the CSV file.


import org.apache.spark.sql.SparkSession

// Initialize Spark session
val spark = SparkSession.builder.appName("SkipHeaderExample").getOrCreate()

// Read CSV file with header
val df = spark.read.option("header", "true").csv("/path/to/csvfile.csv")

// Display DataFrame schema and data
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)

+----+---+------+
|name|age|  city|
+----+---+------+
|John| 30|   NYC|
|Anna| 25|London|
|Mike| 35|    SF|
+----+---+------+

Using Java

In Java, you can also skip the header by using the `header` option. Here’s how you can do it:


import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SkipHeaderExample {
    public static void main(String[] args) {
        // Initialize Spark session
        SparkSession spark = SparkSession.builder().appName("SkipHeaderExample").getOrCreate();

        // Read CSV file with header
        Dataset<Row> df = spark.read().option("header", "true").csv("/path/to/csvfile.csv");

        // Display DataFrame schema and data
        df.printSchema();
        df.show();
    }
}

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)

+----+---+------+
|name|age|  city|
+----+---+------+
|John| 30|   NYC|
|Anna| 25|London|
|Mike| 35|    SF|
+----+---+------+

Explanation

By setting `option("header", "true")`, Spark treats the first line of the file as column names for the DataFrame instead of returning it as a data row. This option is available in the DataFrame API in all supported languages, and it is the simplest and most common way to skip the header row in CSV files.
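
For comparison, here is a minimal PySpark sketch (reusing the hypothetical file path from the examples above) of what happens when the option is omitted: Spark generates column names `_c0`, `_c1`, and so on, and the header text comes back as an ordinary data row.

# Reading without the header option: Spark invents column names (_c0, _c1, ...)
# and the header line is returned as the first data row.
df_no_header = spark.read.csv("/path/to/csvfile.csv")
df_no_header.show()
# +----+---+------+
# | _c0|_c1|   _c2|
# +----+---+------+
# |name|age|  city|
# |John| 30|   NYC|
# |Anna| 25|London|
# |Mike| 35|    SF|
# +----+---+------+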
