How to Use DataFrame PartitionBy to Save a Single Parquet File Per Partition?

In Apache Spark, the `partitionBy` method is part of the DataFrameWriter API and lets you partition your data by one or more columns before writing it out. This is very useful when you want to segment your data into separate directories based on the values of those columns. Let's explore how to use the `partitionBy` method to save a single Parquet file per partition.

Using `partitionBy` in PySpark

Here's how you can use `partitionBy` in PySpark to write a DataFrame out as Parquet, with the output split into one subdirectory per value of the specified partition column:


from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("PartitionBy Example") \
    .getOrCreate()

# Create a sample DataFrame
data = [
    ("John", "Finance", 3000),
    ("Doe", "Finance", 4500),
    ("Jane", "IT", 6000),
    ("Smith", "IT", 7200),
    ("Anna", "HR", 4000)
]

columns = ["Name", "Department", "Salary"]

df = spark.createDataFrame(data, schema=columns)

# Write DataFrame as Parquet files, partitioned by "Department"
df.write.partitionBy("Department").parquet("/path/to/output/directory")

# Stop the Spark session
spark.stop()

In this example, the DataFrame is partitioned by the "Department" column, and each distinct value of that column is written to its own subdirectory under the specified path. The exact part-file names will vary, and a subdirectory may contain more than one file if that department's rows are spread across multiple Spark partitions. The output structure looks roughly like this:


/path/to/output/directory/Department=Finance/part-00000-<unique-identifier>.parquet
/path/to/output/directory/Department=IT/part-00001-<unique-identifier>.parquet
/path/to/output/directory/Department=HR/part-00002-<unique-identifier>.parquet

Using `partitionBy` in Scala

For Scala, the process is quite similar:


import org.apache.spark.sql.SparkSession

// Initialize Spark Session
val spark = SparkSession.builder
  .appName("PartitionBy Example")
  .getOrCreate()

// Create a sample DataFrame
val data = Seq(
  ("John", "Finance", 3000),
  ("Doe", "Finance", 4500),
  ("Jane", "IT", 6000),
  ("Smith", "IT", 7200),
  ("Anna", "HR", 4000)
)

val columns = Seq("Name", "Department", "Salary")

import spark.implicits._
val df = data.toDF(columns: _*)

// Write DataFrame as Parquet files, partitioned by "Department"
df.write.partitionBy("Department").parquet("/path/to/output/directory")

// Stop the Spark session
spark.stop()

The output structure will be similar to the PySpark example above.

Caveats and Tips

  • Ensure that the output path is accessible and has write permissions set for the user running the Spark job.
  • If you want only a single file per partition, control the number of output files by using the `coalesce` method. Keep in mind that `coalesce(1)` collapses the DataFrame into a single Spark partition, so the entire write runs in one task.

Example Using `coalesce`:


df.coalesce(1).write.partitionBy("Department").parquet("/path/to/output/directory")

With the DataFrame coalesced to a single partition, each "Department" subdirectory in the output directory will contain exactly one Parquet file.

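For larger datasets, funnelling everything through one task can be slow. A common alternative, sketched below for the same `df` and output path used above, is to repartition by the partition column before writing; since all rows for a given department land in the same Spark partition, each "Department" subdirectory should again receive exactly one part file while the write stays distributed:


# Repartition by the partition column so each department's rows
# sit in a single Spark partition, then write one file per value
df.repartition("Department").write.partitionBy("Department").parquet("/path/to/output/directory")
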
By using the `partitionBy` method properly, you can save your DataFrame in an organized manner, which can greatly improve the efficiency of data retrieval and processing tasks downstream.

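To illustrate that downstream benefit, here is a minimal PySpark sketch (assuming the session has not yet been stopped and the same output path as above) that reads the partitioned data back; because "Department" is a partition column, a filter on it lets Spark prune the scan to just the matching subdirectory:


# Read the partitioned output back; "Department" reappears as a regular column
df_read = spark.read.parquet("/path/to/output/directory")

# Filtering on the partition column prunes the scan to Department=IT
df_read.filter(df_read.Department == "IT").show()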