How to Read a DataFrame from a Partitioned Parquet File in Apache Spark?

When working with large datasets, it’s common to partition data into smaller, more manageable pieces. Apache Spark supports reading partitioned data from Parquet files efficiently. Below is a detailed explanation of the process, including code snippets in PySpark, Scala, and Java.

Reading a DataFrame from a Partitioned Parquet File

PySpark

To read a DataFrame from a partitioned Parquet file in PySpark, you can use the `spark.read.parquet` method. Here’s an example:


from pyspark.sql import SparkSession

# Creating a Spark session
spark = SparkSession.builder \
    .appName("Read Partitioned Parquet") \
    .getOrCreate()

# Reading the partitioned Parquet file
df = spark.read.parquet("path/to/partitioned_parquet")

# Show the DataFrame schema
df.printSchema()

# Show some data
df.show()

Output:


root
 |-- column1: string (nullable = true)
 |-- column2: integer (nullable = true)
 |-- partition_column: string (nullable = true)

+-------+-------+----------------+
|column1|column2|partition_column|
+-------+-------+----------------+
|    foo|     42|           part1|
|    bar|     43|           part2|
+-------+-------+----------------+
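Because the partition column's values come from the directory names, filtering on it lets Spark skip entire directories at read time (partition pruning). A minimal sketch, reusing `df` and the hypothetical column name from the example above:


from pyspark.sql import functions as F

# Filter on the partition column; Spark can skip directories
# whose values do not match (partition pruning)
part1_df = df.filter(F.col("partition_column") == "part1")

# The FileScan node in the physical plan lists the applied
# partition filters under PartitionFilters
part1_df.explain()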

Scala

In Scala, you can achieve the same result using the `spark.read.parquet` method as shown below:


import org.apache.spark.sql.SparkSession

// Creating a Spark session
val spark = SparkSession.builder
  .appName("Read Partitioned Parquet")
  .getOrCreate()

// Reading the partitioned Parquet file
val df = spark.read.parquet("path/to/partitioned_parquet")

// Show the DataFrame schema
df.printSchema()

// Show some data
df.show()

Output:


root
 |-- column1: string (nullable = true)
 |-- column2: integer (nullable = true)
 |-- partition_column: string (nullable = true)

+-------+-------+----------------+
|column1|column2|partition_column|
+-------+-------+----------------+
|    foo|     42|           part1|
|    bar|     43|           part2|
+-------+-------+----------------+

Java

The following Java code demonstrates how to read a partitioned Parquet file:


import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadPartitionedParquet {
    public static void main(String[] args) {
        // Creating a Spark session
        SparkSession spark = SparkSession.builder()
            .appName("Read Partitioned Parquet")
            .getOrCreate();

        // Reading the partitioned Parquet file
        Dataset<Row> df = spark.read().parquet("path/to/partitioned_parquet");

        // Show the DataFrame schema
        df.printSchema();

        // Show some data
        df.show();
    }
}

Output:


root
 |-- column1: string (nullable = true)
 |-- column2: integer (nullable = true)
 |-- partition_column: string (nullable = true)

+-------+-------+----------------+
|column1|column2|partition_column|
+-------+-------+----------------+
|    foo|     42|           part1|
|    bar|     43|           part2|
+-------+-------+----------------+

Explanation

In all three examples, the `parquet` method of the DataFrame reader (`spark.read` in PySpark and Scala, `spark.read()` in Java) reads data from the partitioned Parquet dataset located at `path/to/partitioned_parquet`. The resulting DataFrame includes all the columns stored in the Parquet files plus the partition columns, which Spark infers from the directory names rather than from the file contents.
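Such a layout is typically produced by writing with `partitionBy`. A minimal sketch, assuming a DataFrame `df` that contains a `partition_column` column:


# Writing with partitionBy creates one subdirectory per distinct
# value of the partition column (e.g. partition_column=part1)
df.write \
    .partitionBy("partition_column") \
    .parquet("path/to/partitioned_parquet")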

Given a directory structure like:


path/to/partitioned_parquet/
  ├── partition_column=part1
  │   └── data1.parquet
  └── partition_column=part2
      └── data2.parquet

Spark discovers these partitions automatically and includes them as columns in the DataFrame, which lets you filter or aggregate on partition values efficiently, as shown in the sketch below.
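You can also point the reader at a single partition directory. In that case the partition column disappears from the schema unless you set the `basePath` option, since Spark only discovers partitions below the path you pass in. A sketch using the example layout above:


# Reading one partition directly: partition_column is NOT
# part of the schema, because no partition directory is
# discovered below this path
part1_only = spark.read.parquet(
    "path/to/partitioned_parquet/partition_column=part1")

# Setting basePath anchors partition discovery at the dataset
# root, so partition_column is restored as a column
part1_with_col = spark.read \
    .option("basePath", "path/to/partitioned_parquet") \
    .parquet("path/to/partitioned_parquet/partition_column=part1")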
