To extract the first 1000 rows of a Spark DataFrame, you can use the `limit` method followed by `collect`. `limit` returns a new DataFrame containing at most the specified number of rows, and `collect` retrieves those rows to the driver program. Here's how you can do it in PySpark, Scala, and Java:
Using PySpark
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
# Creating a DataFrame
data = [(i,) for i in range(10000)] # Example data
df = spark.createDataFrame(data, ["number"])
# Extract the first 1000 rows
limited_df = df.limit(1000)
result = limited_df.collect()
# Display result
for row in result:
    print(row)
Output
Row(number=0)
Row(number=1)
Row(number=2)
...
Row(number=999)
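PySpark also offers `take`, which combines the limit and collect into a single call and returns a Python list of Row objects, and if you want the rows locally as a pandas DataFrame, `toPandas` works on the limited DataFrame. A brief sketch, assuming the `df` created above:
# Equivalent shorthand: returns the first 1000 rows as a list of Row objects
result = df.take(1000)
# Or pull the limited rows into a pandas DataFrame for local analysis
pandas_df = df.limit(1000).toPandas()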
Using Scala
import org.apache.spark.sql.SparkSession
// Create a Spark session
val spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
// Creating a DataFrame
val data = Seq.range(0, 10000).map(Tuple1(_))
val df = spark.createDataFrame(data).toDF("number")
// Extract the first 1000 rows
val limitedDF = df.limit(1000)
val result = limitedDF.collect()
// Display result
result.foreach(println)
Output
[0]
[1]
[2]
...
[999]
Using Java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ExtractRows {
    public static void main(String[] args) {
        // Create a Spark session
        SparkSession spark = SparkSession.builder().appName("ExampleApp").getOrCreate();

        // Creating a DataFrame
        List<Row> data = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            data.add(RowFactory.create(i));
        }
        StructType schema = new StructType(new StructField[]{
            new StructField("number", DataTypes.IntegerType, false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(data, schema);

        // Extract the first 1000 rows
        Dataset<Row> limitedDF = df.limit(1000);
        List<Row> result = limitedDF.collectAsList();

        // Display result
        result.forEach(System.out::println);
    }
}
Output
[0]
[1]
[2]
...
[999]
By using the `limit` method followed by `collect`, you can extract the first 1000 rows of a Spark DataFrame efficiently. Note that `collect` loads the resulting rows into the driver's memory, so on very large datasets it should only be called after `limit` (or a similar reduction) has cut the result down to a size the driver can hold.
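One caveat: unless the DataFrame has an explicit ordering, Spark does not guarantee which 1000 rows `limit` returns, since the result depends on how the data is partitioned. If you need a deterministic "first" 1000, sort before limiting; and if the goal is to persist a sample rather than inspect it, writing the limited DataFrame out avoids collecting to the driver entirely. A PySpark sketch, where the output path is a placeholder:
# Deterministic "first" 1000 rows: order by a key column before limiting
first_1000 = df.orderBy("number").limit(1000).collect()
# Persist a 1000-row sample without bringing it to the driver (path is hypothetical)
df.limit(1000).write.mode("overwrite").parquet("/tmp/first_1000_rows")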