How to Import Multiple CSV Files in a Single Load Using Apache Spark?

Apache Spark provides a flexible way to handle multiple CSV files using a combination of file path patterns and the Spark DataFrame API. This approach works in any of the languages Spark supports, such as Python, Scala, or Java. Below is an explanation of how to import multiple CSV files in a single load using PySpark, followed by an example of the output and an equivalent snippet in Scala.

Example Using PySpark

You can use Spark’s `read` method along with path patterns or a list of paths to load multiple CSV files at once.

Code Snippet in PySpark


from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("ImportMultipleCSV").getOrCreate()

# Define path to multiple CSV files
csv_paths = "hdfs:///path/to/csv/files/*.csv"

# Load multiple CSV files
df = spark.read.csv(csv_paths, header=True, inferSchema=True)

# Show DataFrame schema
df.printSchema()

# Display the first few rows of the DataFrame
df.show()

Output


root
 |-- column1: string (nullable = true)
 |-- column2: integer (nullable = true)
 |-- column3: double (nullable = true)
 
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|   data|      1|    3.4|
|   data|      2|    1.2|
|   data|      3|    2.3|
|   data|      4|    4.4|
+-------+-------+-------+

This code reads all CSV files matching the pattern `*.csv` in the specified directory. The `header=True` option ensures that the first row of each file is used as the header, and `inferSchema=True` automatically infers the column data types. Note that schema inference requires an extra pass over the data, so for large datasets you may prefer to supply an explicit schema.
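Instead of a glob pattern, `spark.read.csv` also accepts a list of paths, which is useful when you only want specific files. Here is a minimal sketch that reuses the SparkSession from above; the file names `jan.csv` and `feb.csv` are hypothetical placeholders:

# Load an explicit list of CSV files (file names are hypothetical)
csv_file_list = [
    "hdfs:///path/to/csv/files/jan.csv",
    "hdfs:///path/to/csv/files/feb.csv",
]

# All listed files are read into one DataFrame; they should share
# a compatible column structure
df = spark.read.csv(csv_file_list, header=True, inferSchema=True)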

Example Using Scala

The same approach works in Scala: specify the CSV paths and read them into a Spark DataFrame.

Code Snippet in Scala


import org.apache.spark.sql.SparkSession

// Initialize SparkSession
val spark = SparkSession.builder.appName("ImportMultipleCSV").getOrCreate()

// Define path to multiple CSV files
val csvPaths = "hdfs:///path/to/csv/files/*.csv"

// Load multiple CSV files
val df = spark.read.option("header", "true").option("inferSchema", "true").csv(csvPaths)

// Show DataFrame schema
df.printSchema()

// Display the first few rows of the DataFrame
df.show()
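Identifying the Source File of Each Row

When many files are merged into one DataFrame, it can be helpful to record which file each row came from. Spark provides the built-in `input_file_name` function for this. Below is a minimal PySpark sketch, assuming the `df` from the PySpark example above:

from pyspark.sql.functions import input_file_name

# Tag each row with the full path of the CSV file it was read from
df_with_source = df.withColumn("source_file", input_file_name())

# Show the source paths without truncation
df_with_source.show(truncate=False)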

Conclusion

Using path patterns or an explicit list of file paths lets you import multiple CSV files efficiently in a single load with Apache Spark. The DataFrame API, available in languages like Python and Scala, reads all matching files into a single DataFrame, enabling unified processing and analysis, provided the files share a compatible column structure.
