How to Read a DataFrame from a Partitioned Parquet File in Apache Spark?

When working with large datasets, it’s common to partition data into smaller, more manageable pieces. Apache Spark supports reading partitioned data from Parquet files efficiently. Below is a detailed explanation of the process, including code snippets in PySpark, Scala, and Java.

Reading a DataFrame from a Partitioned Parquet File

PySpark

To read a DataFrame from a partitioned Parquet file in PySpark, you can use the `spark.read.parquet` method. Here’s an example:


from pyspark.sql import SparkSession

# Creating a Spark session
spark = SparkSession.builder \
    .appName("Read Partitioned Parquet") \
    .getOrCreate()

# Reading the partitioned Parquet file
df = spark.read.parquet("path/to/partitioned_parquet")

# Show the DataFrame schema
df.printSchema()

# Show some data
df.show()

Output:


root
 |-- column1: string (nullable = true)
 |-- column2: integer (nullable = true)
 |-- partition_column: string (nullable = true)

+-------+-------+----------------+
|column1|column2|partition_column|
+-------+-------+----------------+
|    foo|     42|           part1|
|    bar|     43|           part2|
+-------+-------+----------------+
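Because the partition column's values come from the directory names, filtering on it lets Spark skip entire directories at read time (partition pruning). A minimal sketch, reusing `df` and the hypothetical column name from the example above:


from pyspark.sql import functions as F

# Filter on the partition column; Spark can skip directories
# whose values do not match (partition pruning)
part1_df = df.filter(F.col("partition_column") == "part1")

# The FileScan node in the physical plan lists the applied
# partition filters under PartitionFilters
part1_df.explain()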

Scala

In Scala, you can achieve the same result using the `spark.read.parquet` method as shown below:


import org.apache.spark.sql.SparkSession

// Creating a Spark session
val spark = SparkSession.builder
  .appName("Read Partitioned Parquet")
  .getOrCreate()

// Reading the partitioned Parquet file
val df = spark.read.parquet("path/to/partitioned_parquet")

// Show the DataFrame schema
df.printSchema()

// Show some data
df.show()

Output:


root
 |-- column1: string (nullable = true)
 |-- column2: integer (nullable = true)
 |-- partition_column: string (nullable = true)

+-------+-------+----------------+
|column1|column2|partition_column|
+-------+-------+----------------+
|    foo|     42|           part1|
|    bar|     43|           part2|
+-------+-------+----------------+

Java

The following Java code demonstrates how to read a partitioned Parquet file:


import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadPartitionedParquet {
    public static void main(String[] args) {
        // Creating a Spark session
        SparkSession spark = SparkSession.builder()
            .appName("Read Partitioned Parquet")
            .getOrCreate();

        // Reading the partitioned Parquet file
        Dataset<Row> df = spark.read().parquet("path/to/partitioned_parquet");

        // Show the DataFrame schema
        df.printSchema();

        // Show some data
        df.show();
    }
}

Output:


root
 |-- column1: string (nullable = true)
 |-- column2: integer (nullable = true)
 |-- partition_column: string (nullable = true)

+-------+-------+----------------+
|column1|column2|partition_column|
+-------+-------+----------------+
|    foo|     42|           part1|
|    bar|     43|           part2|
+-------+-------+----------------+

Explanation

In all three examples, the `parquet` method of the DataFrame reader (`spark.read` in PySpark and Scala, `spark.read()` in Java) reads data from the partitioned Parquet dataset located at `path/to/partitioned_parquet`. The resulting DataFrame includes all the columns stored in the Parquet files plus the partition columns, which Spark infers from the directory names rather than from the file contents.
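Such a layout is typically produced by writing with `partitionBy`. A minimal sketch, assuming a DataFrame `df` that contains a `partition_column` column:


# Writing with partitionBy creates one subdirectory per distinct
# value of the partition column (e.g. partition_column=part1)
df.write \
    .partitionBy("partition_column") \
    .parquet("path/to/partitioned_parquet")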

Given a directory structure like:


path/to/partitioned_parquet/
  ├── partition_column=part1
  │   └── data1.parquet
  └── partition_column=part2
      └── data2.parquet

Spark discovers these partitions automatically and includes them as columns in the DataFrame, which lets you filter or aggregate on partition values efficiently, as shown in the sketch below.
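You can also point the reader at a single partition directory. In that case the partition column disappears from the schema unless you set the `basePath` option, since Spark only discovers partitions below the path you pass in. A sketch using the example layout above:


# Reading one partition directly: partition_column is NOT
# part of the schema, because no partition directory is
# discovered below this path
part1_only = spark.read.parquet(
    "path/to/partitioned_parquet/partition_column=part1")

# Setting basePath anchors partition discovery at the dataset
# root, so partition_column is restored as a column
part1_with_col = spark.read \
    .option("basePath", "path/to/partitioned_parquet") \
    .parquet("path/to/partitioned_parquet/partition_column=part1")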
