How to Write a Single CSV File Using Spark-CSV?

When writing CSV output with Spark, the challenge is that Spark writes one part file per partition by default. This happens because Spark processes data in parallel across multiple nodes, and each task writes its own part file. To produce a single CSV file, you need to coalesce or repartition the DataFrame down to one partition before writing it out.
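Either coalesce(1) or repartition(1) collapses the data to one partition; a minimal sketch of both (assuming df is an existing DataFrame and the output paths do not already exist):

# coalesce(1) avoids a full shuffle and is usually the cheaper option.
df.coalesce(1).write.csv("output/coalesced", header=True)

# repartition(1) forces a shuffle, which can be preferable when the
# existing partitions are heavily skewed.
df.repartition(1).write.csv("output/repartitioned", header=True)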

Writing a Single CSV File Using PySpark

Here’s how you can write a single CSV file using PySpark:


from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("SingleCSVFileExample").getOrCreate()

# Sample data
data = [("James", "Smith", "USA", 34), ("Anna", "Jones", "UK", 30)]

# Create DataFrame
columns = ["firstname", "lastname", "country", "age"]
df = spark.createDataFrame(data, columns)

# Coalesce the data to a single partition and write as CSV
df.coalesce(1).write.csv("output/single_file.csv", header=True)

# Stop the Spark session
spark.stop()

This code writes the content of the DataFrame as CSV under output/single_file.csv. The call coalesce(1) reduces the number of partitions to one, ensuring that only one part file is created. Note, however, that single_file.csv is a directory, not a file: Spark always writes its output as a directory of part files.
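One caveat: by default the write fails if the output path already exists. You can pass a save mode to overwrite it instead (a minimal sketch using the same DataFrame):

# mode="overwrite" replaces any existing output directory;
# the default mode raises an error if the path already exists.
df.coalesce(1).write.csv("output/single_file.csv", header=True, mode="overwrite")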

Output


output/
└── single_file.csv/
    ├── _SUCCESS
    └── part-00000-<unique-id>.csv

Note that Spark names the part file part-00000-<unique-id>.csv to identify it uniquely. If you want a specific filename, you'll need to rename the file with a filesystem operation outside Spark, as shown below.
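On a local filesystem, for example, you could do the rename with standard Python (a minimal sketch; the source path matches the example above, and output/data.csv is a hypothetical target name; on HDFS or S3 you would use the corresponding filesystem API instead):

import glob
import shutil

# Locate the single part file inside the Spark output directory.
part_file = glob.glob("output/single_file.csv/part-00000-*.csv")[0]

# Move it out under the desired name ("output/data.csv" is hypothetical),
# then remove the leftover Spark output directory.
shutil.move(part_file, "output/data.csv")
shutil.rmtree("output/single_file.csv")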

Writing a Single CSV File Using Scala

Equivalent code using Scala:


import org.apache.spark.sql.SparkSession

// Initialize Spark session
val spark = SparkSession.builder.appName("SingleCSVFileExample").getOrCreate()

// Sample data
val data = Seq(("James", "Smith", "USA", 34), ("Anna", "Jones", "UK", 30))

// Create DataFrame
import spark.implicits._
val df = data.toDF("firstname", "lastname", "country", "age")

// Coalesce the data to a single partition and write as CSV
df.coalesce(1).write.option("header", "true").csv("output/single_file.csv")

// Stop the Spark session
spark.stop()

Output


output/
└── single_file.csv/
    ├── _SUCCESS
    └── part-00000-<unique-id>.csv

Similar to the PySpark example, this Scala code writes a single part file inside the output/single_file.csv directory with the content of the DataFrame.

In both cases (Python and Scala), you may need to rename the output file manually if you want a specific filename instead of the default part file naming.
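Whichever naming approach you choose, a quick sanity check is to read the result back with Spark (a minimal sketch, assuming the Spark session is still active, i.e., before spark.stop() is called):

# Spark can read either the part-file directory or a renamed single file.
df_check = spark.read.csv("output/single_file.csv", header=True, inferSchema=True)
df_check.show()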

These solutions are effective when you want a single CSV file out of Apache Spark's distributed computing environment. Keep in mind, though, that collapsing to one partition funnels every row through a single task, so they are best suited to datasets small enough for one executor to handle.
