Spark: How to Load a CSV File into an RDD

Apache Spark is a powerful open-source distributed computing framework that enables efficient and scalable data processing. One of its core abstractions is Resilient Distributed Datasets (RDDs), which are fault-tolerant, parallel data structures used for processing data in a distributed manner.

In this tutorial, we will walk you through the process of loading a CSV file into an RDD using Apache Spark. RDDs are the fundamental building blocks for data manipulation in Spark, making it essential to know how to load data into them.

Before we begin, ensure that you have Apache Spark installed and configured on your system. You can download and install Spark from the official website.

Loading the CSV File

Assuming you have a CSV file named “data.csv” in your local filesystem, you can load it into an RDD using the textFile method provided by SparkContext:

import org.apache.spark.rdd.RDD

val csvFile = "data.csv"  // Replace with the path to your CSV file
val csvRDD: RDD[String] = sc.textFile(csvFile)
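
The snippet above assumes sc is an existing SparkContext, as it is in the spark-shell. If you are writing a standalone application instead, a minimal sketch for obtaining one through a SparkSession might look like this (the application name and master are placeholders):

import org.apache.spark.sql.SparkSession

// Build a local SparkSession and take its underlying SparkContext
val spark = SparkSession.builder()
  .appName("LoadCSVIntoRDD")   // placeholder application name
  .master("local[*]")          // run locally using all available cores
  .getOrCreate()

val sc = spark.sparkContext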

You now have your CSV data in csvRDD. To perform further operations on this data, you can use the various RDD transformations and actions provided by Spark. For example, you can split each CSV line into an array of values:

val csvData = csvRDD.map(line => line.split("\\|"))  // Assuming CSV file uses pipe (|) as a delimiter

You can then perform various operations on csvData, such as filtering, aggregating, or analyzing your data using Spark’s RDD API.
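
For instance, here is a minimal sketch of a filter followed by an action, assuming a pipe-delimited file whose third column holds a gender value (as in the full example later in this post):

// Keep only rows that have at least three columns and whose third column is "Male"
val maleRows = csvData.filter(row => row.length >= 3 && row(2).trim == "Male")

// count() is an action, so it triggers the actual computation
println(s"Male records: ${maleRows.count()}")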

Number of records in the CSV file

To get the number of records, we can use the count() method. Here is an example:

csvRDD.count() // Number of items in this RDD

How to skip the header of a CSV file

If your CSV file contains a header row and you want to skip it in subsequent operations, first extract the header using the first() method, then remove it with filter():

    val header: String = csvRDD.first()
    val filteredData = csvRDD.filter(_ != header).map(line => line.split("\\|"))
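
As a quick check (a minimal sketch, continuing with the pipe-delimited example above), you can confirm that the header row is gone and inspect a few parsed rows:

    // One fewer record than csvRDD, since the header line was filtered out
    println(s"Records without header: ${filteredData.count()}")

    // Print the first few parsed rows as comma-joined strings
    filteredData.take(3).foreach(row => println(row.mkString(", ")))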

Read Multiple CSV Files into a Single RDD

If you have multiple CSV files and want to read the contents of all these files into a single RDD, you can pass a comma-separated list of file paths to the textFile() method. Here’s how you can do it in Scala:

    // Comma-separated paths to your CSV files
    val csvFilePath = "E:/data/file1.csv,E:/data/file2.csv"

    // Load both CSV files into a single RDD of Strings
    val csvData = spark.sparkContext.textFile(csvFilePath)

Read all CSV files in a directory into a single RDD

In Apache Spark, if you want to read all CSV files in a directory or folder, you can simply pass the directory path to the textFile() method. Spark will automatically read all the files in that directory into a single RDD. Here’s how you can do it in Scala:

    // Path to the directory containing your CSV files
    val csvFilePath = "E:/data/"

    // Load all CSV files in the directory into a single RDD of Strings
    val csvData = spark.sparkContext.textFile(csvFilePath)

Read all CSV files from multiple directories into a single RDD

To read all CSV files in multiple directories into a single RDD using the textFile() method in Apache Spark, you can provide a comma-separated list of directory paths as an argument to textFile(). Spark will read all the files in those directories and create a single RDD. Here’s how you can do it in Scala:

// Comma-separated list of directory paths
val csvDir = "E:/data1/,E:/data2/"

// Load all files from both directories into a single RDD of Strings
val csvData = spark.sparkContext.textFile(csvDir)

Read all text files that match a pattern into a single RDD

To read all text files in multiple directories, matching a pattern for file names, into a single RDD in Apache Spark using Scala, you can pass a file name pattern to the textFile method. textFile reads all files matching the specified pattern and creates an RDD where each record is a single line from the matched files. Here’s how you can do it:

val files = "E:/data1/data-file[0-2].txt,E:/data2/file*"

val csvData = spark.sparkContext.textFile(files)

In this code:

  • files is a comma-separated string containing two file path patterns.
  • The first pattern, "E:/data1/data-file[0-2].txt", uses square brackets to match files named “data-file0.txt,” “data-file1.txt,” and “data-file2.txt”.
  • The second pattern, "E:/data2/file*", uses an asterisk (*) as a wildcard to match any file in the “data2” directory whose name starts with “file”.

Load CSV file into RDD: Example with output

import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD

object LoadCSVIntoRDD {
  def main(args: Array[String]): Unit = {
    // Create a SparkSession
    val spark = SparkSession.builder()
      .appName("LoadCSVIntoRDD")
      .master("local")
      .getOrCreate()

    // Set the path to your CSV file
    val csvFilePath = "E:/sparktpoint/data/file.csv"

    // Load the CSV file as an RDD of Strings
    val csvData = spark.sparkContext.textFile(csvFilePath)

    println("Number of records including header:"+csvData.count())

    val header: String = csvData.first()
    // Filter out the header and split each line by a | (assuming CSV format)
    val filteredData = csvData.filter(_ != header).map(line => line.split("\\|"))

    // Filter records where age > 35
    val filteredAgeData = filteredData.filter(row => row.length == 4 && row(3).trim.toInt > 35)

    // Print header
    println(header)
    // Print first name, last name, gender, and age for filtered records
    filteredAgeData.foreach { row =>
      if (row.length == 4) {
        val firstName = row(0).trim
        val lastName = row(1).trim
        val gender = row(2).trim
        val age = row(3).trim.toInt
        println(s"$firstName|$lastName|$gender|$age")
      }
    }
    spark.stop()
  }
}
/*
Input
--------
First_Name|Last_Name|Gender|Age
John|Doe|Male|28
Jane|Smith|Female|32
Michael|Johnson|Male|45
Emily|Williams|Female|27
William|Brown|Male|38
Olivia|Jones|Female|29
David|Davis|Male|55
Sophia|Miller|Female|22
James|Anderson|Male|41
Ava|White|Female|34

Output
-------
Number of records including header:11
First_Name|Last_Name|Gender|Age
Michael|Johnson|Male|45
William|Brown|Male|38
David|Davis|Male|55
James|Anderson|Male|41

*/

Conclusion

In conclusion, Apache Spark provides powerful tools for efficiently processing and analyzing large-scale data. Reading text files into a single RDD, whether they are located in multiple directories or match a specific pattern, is a common operation in Spark that allows you to consolidate and manipulate data for various data processing tasks. By leveraging Spark’s capabilities, you can perform complex data transformations and analysis on your data with ease, making it a valuable tool for big data analytics and computation.
