Skipping headers from CSV files when reading them in Apache Spark is a common requirement, as headers typically contain column names that should not be processed as data. You can skip the header row using Spark’s DataFrame API. Below are approaches in PySpark, Scala, and Java to skip headers when reading CSV files.
Using PySpark
In PySpark, set the `header` option to `"true"` when reading the file. Spark then treats the first row as column names rather than data, so it never appears in the resulting DataFrame.
```python
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("SkipHeaderExample").getOrCreate()

# Read CSV file with header
df = spark.read.option("header", "true").csv("/path/to/csvfile.csv")

# Show DataFrame schema and data
df.printSchema()
df.show()
```

Output:

```
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)

+----+---+------+
|name|age|  city|
+----+---+------+
|John| 30|   NYC|
|Anna| 25|London|
|Mike| 35|    SF|
+----+---+------+
```
Using Scala
In Scala, the approach is the same: set the `header` option to `true` while reading the CSV file.
```scala
import org.apache.spark.sql.SparkSession

// Initialize Spark session
val spark = SparkSession.builder.appName("SkipHeaderExample").getOrCreate()

// Read CSV file with header
val df = spark.read.option("header", "true").csv("/path/to/csvfile.csv")

// Display DataFrame schema and data
df.printSchema()
df.show()
```

Output:

```
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)

+----+---+------+
|name|age|  city|
+----+---+------+
|John| 30|   NYC|
|Anna| 25|London|
|Mike| 35|    SF|
+----+---+------+
```
Using Java
In Java, you can also skip the header by using the `header` option. Here’s how you can do it:
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SkipHeaderExample {
    public static void main(String[] args) {
        // Initialize Spark session
        SparkSession spark = SparkSession.builder().appName("SkipHeaderExample").getOrCreate();

        // Read CSV file with header
        Dataset<Row> df = spark.read().option("header", "true").csv("/path/to/csvfile.csv");

        // Display DataFrame schema and data
        df.printSchema();
        df.show();
    }
}
```

Output:

```
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- city: string (nullable = true)

+----+---+------+
|name|age|  city|
+----+---+------+
|John| 30|   NYC|
|Anna| 25|London|
|Mike| 35|    SF|
+----+---+------+
```
Explanation
By setting `option("header", "true")`, Spark treats the first line of each file as column names for the DataFrame, so that line never shows up in the data. This option is available in the DataFrame API in all supported languages and is the idiomatic way to skip header rows in CSV files. Note that, as the schemas above show, every column is still read as a string unless you supply an explicit schema or enable schema inference.