How to Extract the First 1000 Rows of a Spark DataFrame?

To extract the first 1000 rows of a Spark DataFrame, you can use the `limit` function followed by `collect`. The `limit` function returns a new DataFrame containing at most the specified number of rows, and `collect` retrieves those rows to the driver program as a local collection. Here’s how you can do it in various languages:

Using PySpark


from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()

# Creating a DataFrame
data = [(i,) for i in range(10000)]  # Example data
df = spark.createDataFrame(data, ["number"])

# Extract the first 1000 rows
limited_df = df.limit(1000)
result = limited_df.collect()

# Display result
for row in result:
    print(row)

Output


Row(number=0)
Row(number=1)
Row(number=2)
...
Row(number=999)
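
One caveat worth knowing: on a DataFrame with multiple partitions, `limit` makes no guarantee about which rows you get unless the data is explicitly ordered first. If you need a deterministic "first 1000" (say, the 1000 smallest values), sort before limiting. A minimal PySpark sketch, reusing the `df` from the example above:


# Sort first so "first 1000" is well-defined, then limit and collect
first_rows = df.orderBy("number").limit(1000).collect()

print(first_rows[0])   # Row(number=0)
print(first_rows[-1])  # Row(number=999)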

Using Scala


import org.apache.spark.sql.SparkSession

// Create a Spark session
val spark = SparkSession.builder.appName("ExampleApp").getOrCreate()

// Create an example DataFrame with 10,000 rows
import spark.implicits._
val df = (0 until 10000).toDF("number")

// Extract the first 1000 rows
val limitedDF = df.limit(1000)
val result = limitedDF.collect()

// Display result
result.foreach(println)

Output


[0]
[1]
[2]
...
[999]

Using Java


import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ExtractRows {
    public static void main(String[] args) {
        // Create a Spark session
        SparkSession spark = SparkSession.builder().appName("ExampleApp").getOrCreate();

        // Create example data (10,000 rows)
        List<Row> data = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            data.add(RowFactory.create(i));
        }

        // Define the schema with a single integer column
        StructType schema = new StructType(new StructField[]{
                new StructField("number", DataTypes.IntegerType, false, Metadata.empty())
        });

        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Extract the first 1000 rows
        Dataset<Row> limitedDF = df.limit(1000);
        List<Row> result = limitedDF.collectAsList();

        // Display result
        result.forEach(System.out::println);
    }
}

Output


[0]
[1]
[2]
...
[999]

By using the `limit` method followed by `collect`, you can extract the first 1000 rows of a Spark DataFrame efficiently: because `limit` is applied first, only those 1000 rows are ever transferred to the driver. The usual caution about `collect` (it brings an entire dataset onto the driver node) applies when it is called on an unrestricted DataFrame, so limit, filter, or aggregate before collecting.
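
If you only need the rows on the driver and have no further use for the limited DataFrame, PySpark also offers shorter paths. A minimal sketch, reusing the `df` from the PySpark example above (the `toPandas` call assumes pandas is installed):


# take(n) returns the first n rows as a list of Row objects,
# equivalent in result to limit(n) followed by collect()
rows = df.take(1000)

# To get a local pandas DataFrame instead, limit first so that
# only 1000 rows are transferred to the driver
pdf = df.limit(1000).toPandas()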
