Apache Spark is a powerful distributed data processing engine that is widely used for big data analytics. PySpark, Spark's Python API, lets developers write Spark applications in Python. One of the most useful methods PySpark provides when working with DataFrames is the 'between' method, which is commonly used to filter rows by checking whether a column's values fall within a specified range. In this article, we will explore the usage of the 'between' method in PySpark through examples.
Understanding the ‘between’ Method
The ‘between’ method in PySpark is used on DataFrame columns to filter the data. It takes two arguments, which define the lower and upper bounds of the range. Here is the syntax of the ‘between’ method:
DataFrame[columnName].between(lowerBound, upperBound)
This method returns a Column of boolean values indicating whether each of the column's values falls within the specified range. That boolean Column can then be passed to the DataFrame's 'filter' or 'where' method to perform the actual filtering.
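As a minimal sketch (assuming a DataFrame named 'df' with a numeric 'age' column, like the one built in the next section; the variable names are purely illustrative), the typical pattern looks like this:
in_range = df["age"].between(15, 20)   # a Column of booleans, not a DataFrame
filtered = df.filter(in_range)         # keep only the rows where the flag is True
filtered_alt = df.where(in_range)      # 'where' is an alias for 'filter'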
Setting Up the PySpark Environment
Before we dive into examples, make sure that you have Spark installed and a PySpark session is started. Use the following code snippet to start a Spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark Between Method Example") \
    .getOrCreate()
Replace “PySpark Between Method Example” with an appropriate name for your application.
Creating a DataFrame for Examples
Let’s create a simple DataFrame to use with our ‘between’ method examples. We will define a list of tuples, each representing a record with an ID, name, and age, and convert it to a DataFrame:
data = [(1, "Alice", 12),
        (2, "Bob", 20),
        (3, "Charlie", 15),
        (4, "David", 25),
        (5, "Eve", 18)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, schema=columns)
df.show()
The expected output of the DataFrame ‘df’ will be:
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 12|
| 2| Bob| 20|
| 3|Charlie| 15|
| 4| David| 25|
| 5| Eve| 18|
+---+-------+---+
Now that our DataFrame is created, we can use the ‘between’ method to filter rows.
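Before filtering, it can be helpful to look at the boolean Column that 'between' produces. In the short sketch below, the alias 'in_range' is just an illustrative name:
# Select the raw True/False flags produced by 'between' alongside each row.
df.select("name", "age", df["age"].between(15, 20).alias("in_range")).show()
# Alice (12) and David (25) fall outside the range, so their flag is false.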
Using the ‘between’ Method to Filter Data
Filtering a Single Column
To filter rows where the age is between 15 and 20 inclusive, we can use the ‘between’ method as follows:
filtered_df = df.filter(df["age"].between(15, 20))
filtered_df.show()
The code above uses the ‘filter’ method of the DataFrame to keep only the rows where the ‘age’ column’s value is between 15 and 20. The output will be:
+---+-------+---+
| id| name|age|
+---+-------+---+
| 2| Bob| 20|
| 3|Charlie| 15|
| 5| Eve| 18|
+---+-------+---+
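For comparison, the same result can be obtained with explicit comparison operators; 'between' is essentially a more readable shorthand for the following equivalent sketch:
# Equivalent filter written with >= and <= combined using the & operator.
# Each comparison is wrapped in parentheses because & binds more tightly.
equivalent_df = df.filter((df["age"] >= 15) & (df["age"] <= 20))
equivalent_df.show()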
Chaining Filters with the ‘between’ Method
The ‘between’ method can be chained with other filters to apply multiple conditions. For example, you might want to filter not only by age but also filter out a specific name:
chained_filter_df = df.filter(df["age"].between(15, 20) & (df["name"] != "Charlie"))
chained_filter_df.show()
This will filter the DataFrame rows to include only those where age is between 15 and 20 and the name is not “Charlie”. The expected output is:
+---+----+---+
| id|name|age|
+---+----+---+
| 2| Bob| 20|
| 5| Eve| 18|
+---+----+---+
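The boolean Column returned by 'between' can also be negated with the '~' operator when you want the rows outside a range. A short sketch:
# Keep only rows whose age is NOT between 15 and 20 (here, Alice and David).
outside_df = df.filter(~df["age"].between(15, 20))
outside_df.show()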
Tips for Using the ‘between’ Method Effectively
– Be mindful of the inclusivity of the ‘between’ method; the range you specify is inclusive of both the lower and upper bounds.
– The ‘between’ method can also be used with dates or timestamps, as long as the column data type is appropriate and the bounds are specified in the right format.
– When chaining multiple filters, make sure to use parentheses to group conditions appropriately, as shown in the chained filter example above.
– The ‘between’ method can make the code more readable compared to using greater than or equal to and less than or equal to conditions separately.
Conclusion
The ‘between’ method in PySpark is a convenient way to filter DataFrame rows based on a single column’s value being within a specified range. In this article, we explored how to use the ‘between’ method with different types of filtering. By using ‘between’, you can make your data processing workflows more efficient and your code more readable. Whether you are working with numbers, dates, or timestamps, ‘between’ can be an excellent tool in your PySpark arsenal.
With this knowledge, you should be able to apply the ‘between’ method to your PySpark DataFrames with confidence and streamline your data analysis tasks.