Apache Spark provides a flexible way to handle multiple CSV files using a combination of file path patterns and the Spark DataFrame API. This approach works in any language Spark supports, such as Python, Scala, or Java. Below is an explanation of how to import multiple CSV files in a single load using PySpark, followed by an example of the output.
Example Using PySpark
You can use Spark’s `read` interface together with a path pattern or a list of paths to load multiple CSV files at once.
Code Snippet in PySpark
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("ImportMultipleCSV").getOrCreate()
# Define path to multiple CSV files
csv_paths = "hdfs:///path/to/csv/files/*.csv"
# Load multiple CSV files
df = spark.read.csv(csv_paths, header=True, inferSchema=True)
# Show DataFrame schema
df.printSchema()
# Display the first few rows of the DataFrame
df.show()
Output
root
|-- column1: string (nullable = true)
|-- column2: integer (nullable = true)
|-- column3: double (nullable = true)
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| data| 1| 3.4|
| data| 2| 1.2|
| data| 3| 2.3|
| data| 4| 4.4|
+-------+-------+-------+
This code reads all CSV files matching the pattern `*.csv` in the specified directory into a single DataFrame. The `header=True` option treats the first row of each file as the header, and `inferSchema=True` automatically infers the column data types, at the cost of an extra pass over the data.
Example Using Scala
A similar process can be followed using Scala, where the CSV paths can be specified and read into a Spark DataFrame.
Code Snippet in Scala
import org.apache.spark.sql.SparkSession
// Initialize SparkSession
val spark = SparkSession.builder.appName("ImportMultipleCSV").getOrCreate()
// Define path to multiple CSV files
val csvPaths = "hdfs:///path/to/csv/files/*.csv"
// Load multiple CSV files
val df = spark.read.option("header", "true").option("inferSchema", "true").csv(csvPaths)
// Show DataFrame schema
df.printSchema()
// Display the first few rows of the DataFrame
df.show()
Conclusion
Using patterns or a list of file paths allows you to import multiple CSV files efficiently in a single load with Apache Spark. This can be done easily using the DataFrame API, available in languages like Python and Scala. The method ensures that all files are read into a single DataFrame, enabling unified processing and analysis.