How to Find Median and Quantiles Using Spark?

Finding the median and quantiles of a dataset is a common requirement in data analysis. Apache Spark provides several ways to achieve this; the most common is the `approxQuantile` method available on DataFrames. Below are detailed explanations and examples using PySpark (Python) and Scala.

Finding Median and Quantiles Using PySpark

Preparation: Create a Spark Session and Sample Data

First, let’s start with creating a Spark session and some sample data for demonstration purposes.


from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("MedianAndQuantiles").getOrCreate()

# Sample data
data = [(1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,), (10,)]
columns = ["value"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
|   10|
+-----+

Calculating Median

The textbook way to find the median is to sort the data and pick the middle value, averaging the two middle values when the number of elements is even. Sorting an entire distributed dataset just to extract one value is expensive, so Spark instead provides `approxQuantile`, which computes approximate quantiles within a configurable relative error. Note that `approxQuantile` returns an actual element of the dataset; it does not interpolate between the two middle values.


# Calculate median: approxQuantile returns a list with one value
# per requested probability
median = df.approxQuantile("value", [0.5], 0.001)[0]

print("Median:", median)

Median: 5.0
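For comparison, the textbook rule described above — sort the data, take the middle element, and average the two middle elements for an even count — can be sketched in plain Python. The `exact_median` helper below is illustrative, not a Spark API; the standard library's `statistics.median` implements the same rule:

```python
import statistics

def exact_median(values):
    """Sort and take the middle value; average the two middle values
    when the number of elements is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

data = list(range(1, 11))
print(exact_median(data))        # 5.5
print(statistics.median(data))   # 5.5
```

On this even-sized sample the exact median interpolates to 5.5, whereas `approxQuantile` returns an element drawn from the data itself.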

Calculating Quantiles

You can compute quantiles similarly by providing a list of quantiles you are interested in.


# Calculate quantiles
quantiles = df.approxQuantile("value", [0.25, 0.5, 0.75], 0.001)

print("Quantiles: 25th Percentile: {}, 50th Percentile (Median): {}, 75th Percentile: {}".format(quantiles[0], quantiles[1], quantiles[2]))

Quantiles: 25th Percentile: 3.0, 50th Percentile (Median): 5.0, 75th Percentile: 8.0
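Because `approxQuantile` returns values drawn from the data itself, it never interpolates. If you need exact, interpolated percentiles, Spark also ships a `percentile` SQL aggregate (usable via `expr("percentile(value, array(0.25, 0.5, 0.75))")`), which interpolates linearly at rank p·(n−1), the same rule NumPy uses by default. A plain-Python sketch of that interpolation rule (`linear_percentile` is an illustrative helper, not a Spark API):

```python
def linear_percentile(values, p):
    """Exact percentile with linear interpolation at rank p * (n - 1)."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(s) - 1)
    frac = pos - lower
    return s[lower] + frac * (s[upper] - s[lower])

data = list(range(1, 11))
print([linear_percentile(data, p) for p in (0.25, 0.5, 0.75)])
# [3.25, 5.5, 7.75]
```

Keep in mind that the exact `percentile` aggregate has to buffer and sort the data, so it is considerably more expensive than `approxQuantile` on large datasets.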

Finding Median and Quantiles Using Scala

Preparation: Create a Spark Session and Sample Data


import org.apache.spark.sql.SparkSession

// Create a Spark session
val spark = SparkSession.builder.appName("MedianAndQuantiles").getOrCreate()

// Sample data
import spark.implicits._
val data = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")

data.show()

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
|   10|
+-----+

Calculating Median

In Scala, the same `approxQuantile` method is available, exposed through `DataFrameStatFunctions`, so it is called as `data.stat.approxQuantile`.


// Calculate median
val median = data.stat.approxQuantile("value", Array(0.5), 0.001)

println(s"Median: ${median(0)}")

Median: 5.0

Calculating Quantiles


// Calculate quantiles
val quantiles = data.stat.approxQuantile("value", Array(0.25, 0.5, 0.75), 0.001)

println(f"Quantiles: 25th Percentile: ${quantiles(0)}, 50th Percentile (Median): ${quantiles(1)}, 75th Percentile: ${quantiles(2)}")

Quantiles: 25th Percentile: 3.0, 50th Percentile (Median): 5.0, 75th Percentile: 8.0

Both PySpark and Scala calculate median and quantiles through the same underlying `approxQuantile` implementation, a variant of the Greenwald-Khanna algorithm. Because it avoids a full sort of the data, it is efficient on large datasets, which makes it a good fit for a distributed engine like Apache Spark. The `relativeError` parameter controls the trade-off: smaller values give more accurate results at higher cost, and 0.0 requests exact quantiles.
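The accuracy guarantee of `relativeError` can be made concrete: per the Spark documentation, if x is the value returned for probability p over N rows, the rank of x lies roughly within (p ± err)·N. A plain-Python sketch of that rank window (`rank_window` is an illustrative helper, not a Spark API):

```python
def rank_window(n, p, relative_error):
    """Acceptable rank range for an approxQuantile(p) result over n rows."""
    target = p * n
    slack = relative_error * n
    return (target - slack, target + slack)

# On 10 rows with relativeError=0.001 the window around the median rank
# is essentially exact: only the element at rank 5 qualifies.
lo, hi = rank_window(10, 0.5, 0.001)
print(lo, hi)
```

On the ten-row sample, the window is (4.99, 5.01), which is why the call above can only return the element at rank 5; a looser error such as 0.1 would widen the window to (4.0, 6.0) and allow neighboring values.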
