How to Take a Random Row from a PySpark DataFrame?

To take a random row from a PySpark DataFrame, you can use the `sample` method, which randomly samples a fraction of the rows, or you can order the rows by a random value with `rand()` and take the first one. Here’s a detailed explanation of both approaches, with examples.

Setting Up an Example DataFrame

Let’s start by creating a sample PySpark DataFrame:


from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

# Initialize Spark session
spark = SparkSession.builder.appName("RandomRowExample").getOrCreate()

# Create a sample DataFrame
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29), ("David", 40)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()

+-----+---+
| Name|Age|
+-----+---+
|Alice| 34|
|  Bob| 45|
|Cathy| 29|
|David| 40|
+-----+---+

Method 1: Using `sample` with a Small Fraction

One way to get a random row is to sample a small fraction with replacement, and then take the first row:


# Sample a small fraction of rows with replacement
random_row = df.sample(withReplacement=True, fraction=0.1).limit(1)
random_row.show()

Note: The output may vary each time you run the command, as the sampling is random. Also be aware that `sample` is probabilistic: with a fraction as small as 0.1 on a four-row DataFrame, the result will often be empty, so you may need to raise the fraction or retry (see the sketch after the output below).


+-----+---+
| Name|Age|
+-----+---+
|Cathy| 29|
+-----+---+
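Because a small fraction can come back empty, a more robust pattern is to retry the sample and fall back to a full random sort if nothing is returned. Here is a minimal sketch; the helper name `sample_one_row`, the fraction of 0.5, and the retry count are illustrative choices, not part of the PySpark API:

from pyspark.sql.functions import rand

# Hypothetical helper (not part of PySpark): retry sampling until a row
# appears, then fall back to a full random sort if it never does.
def sample_one_row(df, fraction=0.5, max_tries=10):
    for _ in range(max_tries):
        rows = df.sample(withReplacement=False, fraction=fraction).limit(1).collect()
        if rows:
            return rows[0]
    # Guaranteed (but more expensive) fallback: sort by a random value
    return df.orderBy(rand()).first()

row = sample_one_row(df)
print(row)  # e.g. Row(Name='David', Age=40) -- varies between runs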

Method 2: Using `orderBy` with `rand()`

Another way to get a random row is to order the DataFrame by a random value with `rand()` and take the first row:


# Order by a random value and take the first row
random_row = df.orderBy(rand()).limit(1)
random_row.show()

+-----+---+
| Name|Age|
+-----+---+
|  Bob| 45|
+-----+---+

Again, the output may vary each time the command is run. Unlike sampling, this approach always returns exactly one row (as long as the DataFrame is non-empty), but it sorts the entire DataFrame, which can be expensive for large datasets.
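If you need the same “random” row on every run (for example, in tests), both `rand()` and `sample` accept an optional seed. A minimal sketch, reusing the `df` defined above; the seed value 42 is an arbitrary choice:

from pyspark.sql.functions import rand

# A fixed seed makes the random ordering deterministic across runs
df.orderBy(rand(seed=42)).limit(1).show()

# `sample` takes a seed as well
df.sample(withReplacement=False, fraction=0.5, seed=42).limit(1).show()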

Conclusion

In summary, you can take a random row from a PySpark DataFrame either with the `sample` method or by combining `orderBy` with `rand()`. `sample` is cheap because it avoids a sort, but it can return an empty result when the fraction is small; `orderBy(rand())` always returns a row but sorts the entire DataFrame, which can be costly at scale. Choose based on the size of your data and whether you need a guaranteed result.
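If you want the chosen row as a plain Python object for further processing on the driver, `first()` is a convenient shortcut. A quick sketch using the Method 2 approach:

from pyspark.sql.functions import rand

# first() returns a pyspark.sql.Row, or None if the DataFrame is empty
row = df.orderBy(rand()).first()
if row is not None:
    print(row["Name"], row["Age"])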
