When writing a single CSV file with Spark, the challenge is that Spark writes multiple part files by default. This happens because Spark processes data in parallel across multiple tasks, and each task writes its own part file. To get all of the data into one CSV file, you need to coalesce or repartition the RDD/DataFrame down to a single partition before writing it out.
Writing a Single CSV File Using PySpark
Here’s how you can write a single CSV file using PySpark:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("SingleCSVFileExample").getOrCreate()
# Sample data
data = [("James", "Smith", "USA", 34), ("Anna", "Jones", "UK", 30)]
# Create DataFrame
columns = ["firstname", "lastname", "country", "age"]
df = spark.createDataFrame(data, columns)
# Coalesce the data to a single partition and write as CSV
df.coalesce(1).write.csv("output/single_file.csv", header=True)
# Stop the Spark session
spark.stop()
This code writes the DataFrame as a single CSV part file inside a directory named single_file.csv
under output. The call coalesce(1)
reduces the number of partitions to one, ensuring that only one part file is created.
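An alternative to coalesce(1) is repartition(1). It produces the same single part file, but it forces a full shuffle: that adds overhead, yet it lets the stages before the write keep running in parallel, whereas coalesce(1) can pull the preceding computation onto a single task. A minimal sketch, using the same df and output path as above:
# repartition(1) shuffles the data but preserves upstream parallelism
df.repartition(1).write.csv("output/single_file.csv", header=True)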
Output
output/
└── single_file.csv/
├── _SUCCESS
└── part-00000-<unique-id>.csv
Note that Spark names the part file part-00000-<unique-id>.csv,
where the unique ID distinguishes files across write jobs. If you want a specific filename, you need to rename the file using a filesystem operation outside Spark.
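If the output directory is on the local filesystem, a small Python snippet can locate the part file and move it to the name you want. This is a minimal sketch using only the standard library; the paths assume the example above, and single_file_renamed.csv is an arbitrary target name:
import glob
import shutil
# Locate the single part file Spark produced (path from the example above)
part_file = glob.glob("output/single_file.csv/part-00000-*.csv")[0]
# Move it out of the Spark output directory under the desired name
shutil.move(part_file, "output/single_file_renamed.csv")
# Remove the leftover output directory (it still holds the _SUCCESS marker)
shutil.rmtree("output/single_file.csv")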
Writing a Single CSV File Using Scala
Equivalent code using Scala:
import org.apache.spark.sql.SparkSession
// Initialize Spark session
val spark = SparkSession.builder.appName("SingleCSVFileExample").getOrCreate()
// Sample data
val data = Seq(("James", "Smith", "USA", 34), ("Anna", "Jones", "UK", 30))
// Create DataFrame
import spark.implicits._
val df = data.toDF("firstname", "lastname", "country", "age")
// Coalesce the data to a single partition and write as CSV
df.coalesce(1).write.option("header", "true").csv("output/single_file.csv")
// Stop the Spark session
spark.stop()
Output
output/
└── single_file.csv/
├── _SUCCESS
└── part-00000-<unique-id>.csv
Similar to the PySpark example, this Scala code produces a single part file inside the single_file.csv output directory with the content of the DataFrame.
In both cases (Python and Scala), you need to rename the output file manually if you want a specific filename instead of Spark's default part-file naming; one way to do this on distributed storage is sketched below.
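On HDFS or cloud storage, where local file operations do not apply, the rename can go through the Hadoop FileSystem API instead. The sketch below does this from PySpark via the JVM gateway; note that spark._jvm and spark._jsc are internal, non-public attributes, so treat this as an assumption that may change between Spark versions, and run it before calling spark.stop(). The target name data.csv is arbitrary:
# Access the Hadoop FileSystem bound to the active Spark session
# (spark._jsc and spark._jvm are internal Spark attributes, not public API)
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
# Find the part file inside the Spark output directory
src = fs.globStatus(Path("output/single_file.csv/part-00000-*.csv"))[0].getPath()
# Rename it to the desired filename (data.csv is an arbitrary choice)
fs.rename(src, Path("output/data.csv"))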
These solutions are effective when you want to ensure that your data is written to a single CSV file while using Apache Spark’s distributed computing environment.