How to Read Multiple Text Files into a Single RDD in Apache Spark?

Reading multiple text files into a single RDD in Apache Spark is a common task, especially when your data is spread across many files. It can be done efficiently with the `textFile` method on the SparkContext (accessible through the SparkSession in PySpark and Scala). Below, I’ll provide examples in PySpark, Scala, and Java.

Reading Multiple Text Files in PySpark

In PySpark, you can pass multiple file paths as a single comma-separated string, or use wildcard characters to read every matching file in a directory.


from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("ReadMultipleTextFiles").getOrCreate()

# Reading multiple text files
file_paths = "hdfs://path/to/file1.txt,hdfs://path/to/file2.txt"
rdd = spark.sparkContext.textFile(file_paths)

# Alternatively, you can use a wildcard 
rdd = spark.sparkContext.textFile("hdfs://path/to/directory/*.txt")

print(rdd.collect())

Sample output:

['Line 1 from file1', 'Line 2 from file1', ..., 'Line 1 from file2', 'Line 2 from file2', ...]
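A related option, if you also need to know which file each record came from: the SparkContext exposes `wholeTextFiles`, which returns an RDD of (path, content) pairs instead of individual lines. Here is a minimal PySpark sketch; the paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WholeTextFilesExample").getOrCreate()

# wholeTextFiles returns one record per file: (file path, entire file content)
pairs = spark.sparkContext.wholeTextFiles("hdfs://path/to/directory/*.txt")

# Split each file's content back into lines while keeping the source path
lines_with_source = pairs.flatMap(
    lambda pc: [(pc[0], line) for line in pc[1].splitlines()]
)
print(lines_with_source.take(5))

Note that `wholeTextFiles` loads each file as a single record, so it is best suited to many small files rather than a few very large ones.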

Reading Multiple Text Files in Scala

In Scala, the approach is very similar. You can either use comma-separated paths or wildcard characters to specify the file paths.


import org.apache.spark.sql.SparkSession

// Initialize SparkSession
val spark = SparkSession.builder.appName("ReadMultipleTextFiles").getOrCreate()

// Reading multiple text files
val filePaths = "hdfs://path/to/file1.txt,hdfs://path/to/file2.txt"
val rdd = spark.sparkContext.textFile(filePaths)

// Alternatively, you can use a wildcard
val rddWildcard = spark.sparkContext.textFile("hdfs://path/to/directory/*.txt")

rdd.collect().foreach(println)

Sample output:

Line 1 from file1
Line 2 from file1
...
Line 1 from file2
Line 2 from file2
...

Reading Multiple Text Files in Java

In Java, you can apply the same principles using the `textFile` method on the JavaSparkContext class.


import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadMultipleTextFiles {
    public static void main(String[] args) {
        // Initialize SparkConf and JavaSparkContext
        SparkConf conf = new SparkConf().setAppName("ReadMultipleTextFiles");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Reading multiple text files via comma-separated paths
        String filePaths = "hdfs://path/to/file1.txt,hdfs://path/to/file2.txt";
        JavaRDD<String> rdd = sc.textFile(filePaths);

        // Alternatively, you can use a wildcard
        JavaRDD<String> rddWildcard = sc.textFile("hdfs://path/to/directory/*.txt");

        rdd.collect().forEach(System.out::println);

        sc.stop();
    }
}

Sample output:

Line 1 from file1
Line 2 from file1
...
Line 1 from file2
Line 2 from file2
...

These examples demonstrate how you can read multiple text files into a single RDD in Spark using PySpark, Scala, and Java. Leveraging Spark’s capability to handle multiple file input paths or wildcards makes this task straightforward and efficient.
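One more pattern worth knowing: if the list of input paths is built programmatically (for example, filtered by date), you can read each path into its own RDD and merge them with `union`. Below is a minimal PySpark sketch; the paths in the list are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnionTextFiles").getOrCreate()
sc = spark.sparkContext

# Hypothetical list of paths, e.g. assembled from a date range
paths = ["hdfs://path/to/file1.txt", "hdfs://path/to/file2.txt"]

# Read each path into its own RDD, then merge them into a single RDD
rdds = [sc.textFile(p) for p in paths]
combined = sc.union(rdds)

print(combined.count())

Equivalently, you can join such a list with commas and pass the resulting string straight to `textFile`, as in the earlier examples.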
