When working with Apache Spark, a common task is getting the number of records in a DataFrame. This is done with the `.count()` method, which returns the number of rows. Below are examples in PySpark, Scala, and Java showing how to do this.
PySpark
In PySpark, you can use the `.count()` method on a DataFrame object to get the number of records.
```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample DataFrame
data = [("Alice", 29), ("Bob", 31), ("Cathy", 25)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, schema=columns)

# Get the count of records
count = df.count()
print(f"Number of records in the DataFrame: {count}")
```

Output:

```
Number of records in the DataFrame: 3
```
Scala
In Scala, the `.count()` method is used the same way as in PySpark. Here’s how you can get the count of records in a DataFrame:
```scala
import org.apache.spark.sql.SparkSession

// Create a Spark session
val spark = SparkSession.builder.appName("example").getOrCreate()

// Sample DataFrame
val data = Seq(("Alice", 29), ("Bob", 31), ("Cathy", 25))
val df = spark.createDataFrame(data).toDF("Name", "Age")

// Get the count of records
val count = df.count()
println(s"Number of records in the DataFrame: $count")
```

Output:

```
Number of records in the DataFrame: 3
```
Java
In Java, a DataFrame is represented as a `Dataset<Row>`, and you call the `.count()` method on it. The code is a bit more verbose than in PySpark or Scala, since the schema must be built explicitly:
```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class Main {
    public static void main(String[] args) {
        // Create a Spark session
        SparkSession spark = SparkSession.builder().appName("example").getOrCreate();

        // Sample data
        List<Row> data = Arrays.asList(
            RowFactory.create("Alice", 29),
            RowFactory.create("Bob", 31),
            RowFactory.create("Cathy", 25)
        );
        StructType schema = new StructType(new StructField[]{
            new StructField("Name", DataTypes.StringType, false, Metadata.empty()),
            new StructField("Age", DataTypes.IntegerType, false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Get the count of records
        long count = df.count();
        System.out.println("Number of records in the DataFrame: " + count);
    }
}
```

Output:

```
Number of records in the DataFrame: 3
```
In each example, we create a Spark session, build a sample DataFrame, and then call `.count()` to retrieve the number of records. Note that `.count()` is an action: it triggers a Spark job that scans the data, so on large DataFrames it is not free, but it is the straightforward way to obtain an exact row count regardless of the language or Spark API you are using.