How to Quickly Get the Count of Records in a DataFrame?

When working with Apache Spark, one common task is to quickly get the count of records in a DataFrame. This is generally done with the `.count()` method, which returns the number of rows as a long. Below are examples in PySpark, Scala, and Java showing how you can achieve this.

PySpark

In PySpark, you can use the `.count()` method on a DataFrame object to get the number of records.


from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample DataFrame
data = [("Alice", 29), ("Bob", 31), ("Cathy", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, schema=columns)

# Get the count of records
count = df.count()
print(f"Number of records in the DataFrame: {count}")

Output:

Number of records in the DataFrame: 3

Scala

In Scala, the `.count()` method is used in a similar way as in PySpark. Here’s how you can get the count of records in a DataFrame:


import org.apache.spark.sql.SparkSession

// Create a Spark session
val spark = SparkSession.builder.appName("example").getOrCreate()

// Sample DataFrame
val data = Seq(("Alice", 29), ("Bob", 31), ("Cathy", 25))
val df = spark.createDataFrame(data).toDF("Name", "Age")

// Get the count of records
val count = df.count()
println(s"Number of records in the DataFrame: $count")

Output:

Number of records in the DataFrame: 3

Java

In Java, a DataFrame is represented as a `Dataset<Row>`, and you call its `.count()` method. The setup is a bit more verbose than in PySpark or Scala:


import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import java.util.Arrays;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // Create a Spark session
        SparkSession spark = SparkSession.builder().appName("example").getOrCreate();

        // Sample data
        List<Row> data = Arrays.asList(
            RowFactory.create("Alice", 29),
            RowFactory.create("Bob", 31),
            RowFactory.create("Cathy", 25)
        );

        StructType schema = new StructType(new StructField[]{
            new StructField("Name", DataTypes.StringType, false, Metadata.empty()),
            new StructField("Age", DataTypes.IntegerType, false, Metadata.empty())
        });

        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Get the count of records
        long count = df.count();
        System.out.println("Number of records in the DataFrame: " + count);
    }
}

Output:

Number of records in the DataFrame: 3

In each example, we create a Spark session, build a sample DataFrame, and then use the `.count()` method to retrieve the number of records. Keep in mind that `.count()` is an action: it triggers a Spark job that scans the DataFrame's data (or replays its lineage), so on a large, uncached DataFrame it can take time proportional to the data size. The API, however, is identical and straightforward regardless of the language you are using.
