How to Export Data from Spark SQL to CSV: A Step-by-Step Guide

Exporting data from Spark SQL to CSV is a common requirement in data processing workflows. Apache Spark provides robust API support across different programming languages, making this task straightforward. Below, I’ll guide you through the steps to export data to CSV using PySpark, Scala, and Java. Code snippets are included for each language.

PySpark

First, let’s see how to export data from Spark SQL to CSV using PySpark. Assuming you already have a DataFrame (e.g., `df`), here are the steps:


from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder.appName("ExportToCSV").getOrCreate()

# Sample DataFrame
data = [("James", "Smith", "USA", 40), ("Anna", "Rose", "UK", 35)]
columns = ["firstname", "lastname", "country", "age"]
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

# Write DataFrame to CSV
df.write.csv("/path/to/output/directory", header=True, mode="overwrite")

spark.stop()

Output:


+---------+--------+-------+---+
|firstname|lastname|country|age|
+---------+--------+-------+---+
|    James|   Smith|    USA| 40|
|     Anna|    Rose|     UK| 35|
+---------+--------+-------+---+
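A note on the output: Spark treats the CSV path as a directory and writes one part file per partition (plus a `_SUCCESS` marker), not a single `.csv` file. If you need a single file and the data is small enough to fit on one executor, you can coalesce to one partition before writing. Here is a minimal PySpark sketch; the `sep` and `compression` values are illustrative choices, not requirements:


# Coalesce to a single partition so Spark writes one part file
# (only safe when the result comfortably fits on a single executor)
(df.coalesce(1)
   .write
   .option("header", "true")
   .option("sep", ",")             # field delimiter; "," is the default
   .option("compression", "gzip")  # optional: gzip the part files
   .mode("overwrite")
   .csv("/path/to/output/directory"))

Even with `coalesce(1)`, the output is still a directory containing a single `part-*.csv` file; Spark does not provide a way to name that file directly.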

Scala

Now, let’s proceed with Scala. Again, assuming you have the DataFrame `df` ready:


import org.apache.spark.sql.SparkSession

// Initialize SparkSession
val spark = SparkSession.builder.appName("ExportToCSV").getOrCreate()

// Sample DataFrame
import spark.implicits._
val data = Seq(("James", "Smith", "USA", 40), ("Anna", "Rose", "UK", 35))
val df = data.toDF("firstname", "lastname", "country", "age")

// Show the DataFrame
df.show()

// Write DataFrame to CSV
df.write
  .option("header", "true")
  .mode("overwrite")
  .csv("/path/to/output/directory")

spark.stop()

Output:


+---------+--------+-------+---+
|firstname|lastname|country|age|
+---------+--------+-------+---+
|    James|   Smith|    USA| 40|
|     Anna|    Rose|     UK| 35|
+---------+--------+-------+---+

Java

Finally, let’s take a look at how to achieve the same in Java. Assuming you have a `Dataset<Row>` named `df`:


import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

public class ExportToCSV {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ExportToCSV").getOrCreate();

        // Sample rows
        List<Row> data = Arrays.asList(
                RowFactory.create("James", "Smith", "USA", 40),
                RowFactory.create("Anna", "Rose", "UK", 35)
        );

        // Schema for the sample DataFrame
        StructType schema = new StructType(new StructField[]{
                new StructField("firstname", DataTypes.StringType, false, Metadata.empty()),
                new StructField("lastname", DataTypes.StringType, false, Metadata.empty()),
                new StructField("country", DataTypes.StringType, false, Metadata.empty()),
                new StructField("age", DataTypes.IntegerType, false, Metadata.empty())
        });

        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Show the DataFrame
        df.show();

        // Write DataFrame to CSV
        df.write()
                .option("header", "true")
                .mode("overwrite")
                .csv("/path/to/output/directory");

        spark.stop();
    }
}

Output:


+---------+--------+-------+---+
|firstname|lastname|country|age|
+---------+--------+-------+---+
|    James|   Smith|    USA| 40|
|     Anna|    Rose|     UK| 35|
+---------+--------+-------+---+
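The examples above build the DataFrame programmatically, but the result of a Spark SQL query exports the same way, because `spark.sql(...)` returns a DataFrame. Here is a minimal PySpark sketch; the view name `people` and the query itself are illustrative:


# Register the sample DataFrame as a temporary view
df.createOrReplaceTempView("people")

# The result of spark.sql(...) is a DataFrame, so it writes like any other
adults = spark.sql("SELECT firstname, lastname FROM people WHERE age >= 36")
adults.write.option("header", "true").mode("overwrite").csv("/path/to/output/directory")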

Conclusion

In this guide, we’ve shown how to export data from Spark SQL to CSV using PySpark, Scala, and Java. Each language has its own way of constructing a DataFrame, but all three export through the DataFrameWriter’s `csv` method. Make sure to replace `/path/to/output/directory` with the actual path where you want to save the output, and remember that Spark treats that path as a directory of part files rather than a single CSV file. By following these steps, you can easily export your Spark DataFrames to CSV for further use.
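As a quick sanity check after any of these exports, you can read the output directory back into a DataFrame. A minimal PySpark sketch:


# Read the exported directory back and inspect the contents
check = spark.read.option("header", "true").csv("/path/to/output/directory")
check.show()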
