How to Extract the First 1000 Rows of a Spark DataFrame?

To extract the first 1000 rows of a Spark DataFrame, you can use the `limit` function followed by `collect`. Because DataFrames are immutable, `limit` returns a new DataFrame containing at most the specified number of rows, and `collect` then retrieves those rows to the driver program. Here’s how you can do it in PySpark, Scala, and Java:

Using PySpark


from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()

# Create a DataFrame
data = [(i,) for i in range(10000)]  # Example data
df = spark.createDataFrame(data, ["number"])

# Extract the first 1000 rows
limited_df = df.limit(1000)
result = limited_df.collect()

# Display result
for row in result:
    print(row)

Output


Row(number=0)
Row(number=1)
Row(number=2)
...
Row(number=999)

Using Scala


import org.apache.spark.sql.SparkSession

// Create a Spark session
val spark = SparkSession.builder.appName("ExampleApp").getOrCreate()

// Create a DataFrame
val data = Seq.range(0, 10000).map(Tuple1(_))
val df = spark.createDataFrame(data).toDF("number")

// Extract the first 1000 rows
val limitedDF = df.limit(1000)
val result = limitedDF.collect()

// Display result
result.foreach(println)

Output


[0]
[1]
[2]
...
[999]

Using Java


import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.ArrayList;
import java.util.List;

public class ExtractRows {
    public static void main(String[] args) {
        // Create a Spark session
        SparkSession spark = SparkSession.builder().appName("ExampleApp").getOrCreate();

// Create a DataFrame
        List<Row> data = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            data.add(RowFactory.create(i));
        }

        StructType schema = new StructType(new StructField[]{
                new StructField("number", DataTypes.IntegerType, false, Metadata.empty())
        });

        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Extract the first 1000 rows
        Dataset<Row> limitedDF = df.limit(1000);
        List<Row> result = limitedDF.collectAsList();

        // Display result
        result.forEach(System.out::println);
    }
}

Output


[0]
[1]
[2]
...
[999]

By using the `limit` method followed by `collect`, you can extract the first 1000 rows of a Spark DataFrame efficiently. Note that `collect` brings every row of its DataFrame to the driver, so for very large datasets always apply `limit` first, or avoid collecting altogether, as sketched below.
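
As a shorthand, PySpark also offers `take`, which performs the same limit-then-collect in a single call. And if you don’t actually need the rows on the driver, you can keep them on the cluster and write the limited DataFrame out instead. Here is a minimal sketch of both options (the output path is an arbitrary example):


# take(n) is equivalent to limit(n) followed by collect()
first_1000 = df.take(1000)

# If the rows are only needed downstream, skip collect entirely and
# persist the limited DataFrame (the path is just an example)
df.limit(1000).write.mode("overwrite").parquet("/tmp/first_1000_rows")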
