How to Filter PySpark DataFrame Column with None Values?

Filtering a PySpark DataFrame to remove rows where a specific column contains `None` values is a very common operation. It can be done with the `filter()` or `where()` methods provided by PySpark. Below is a detailed explanation, along with code snippets, showing how to accomplish this.

Using filter() or where() Methods

You can use the `filter()` or `where()` method (in PySpark, `where()` is simply an alias for `filter()`) to remove rows where a particular column is `None`. Here is an example:

Example in PySpark

Let’s assume you have the following PySpark DataFrame:


from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize SparkSession
spark = SparkSession.builder.appName("FilterNoneValues").getOrCreate()

# Sample Data
data = [
    (1, "Alice", None),
    (2, "Bob", 29),
    (3, "Cathy", None),
    (4, "David", 40)
]

# Create DataFrame
df = spark.createDataFrame(data, ["id", "name", "age"])

# Show the original DataFrame
df.show()

+---+-----+----+
| id| name| age|
+---+-----+----+
|  1|Alice|null|
|  2|  Bob|  29|
|  3|Cathy|null|
|  4|David|  40|
+---+-----+----+

Now, to filter out rows where the `age` column is `None`, you can use the following code:


# Filter out rows where 'age' is None
filtered_df = df.filter(col("age").isNotNull())

# Show the filtered DataFrame
filtered_df.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  2|  Bob| 29|
|  4|David| 40|
+---+-----+---+
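
Since `where()` is an alias for `filter()`, the two methods are interchangeable here. As a quick sketch using the same `df` as above, each of the following lines produces the same result:


# where() is an alias for filter()
filtered_df = df.where(col("age").isNotNull())

# Both methods also accept a SQL expression string
filtered_df = df.filter("age IS NOT NULL")

# Column attribute access works as well
filtered_df = df.filter(df.age.isNotNull())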

Explanation

In the above example:

  • We started by creating a SparkSession and a sample DataFrame.
  • We then used the `filter()` method combined with `isNotNull()`, a method on the `Column` object returned by `col()` (which is imported from `pyspark.sql.functions`), to remove rows where the `age` column is `None`.
  • Finally, we displayed the filtered DataFrame using the `show()` method.
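
As an aside, if the goal is simply to drop rows with nulls in one or more columns, PySpark also provides `na.drop()` (equivalently, `dropna()`). A minimal sketch with the same DataFrame:


# Drop rows where 'age' is null, via the DataFrameNaFunctions API
filtered_df_na = df.na.drop(subset=["age"])

# Equivalent shorthand on the DataFrame itself
filtered_df_na = df.dropna(subset=["age"])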

Using SQL Queries

Another way to achieve the same result is by using SQL queries. This can sometimes be more intuitive, especially for users familiar with SQL:

Example Using SQL Queries


# Register DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

# SQL query to filter out rows where 'age' is None
filtered_df_sql = spark.sql("SELECT * FROM people WHERE age IS NOT NULL")

# Show the filtered DataFrame
filtered_df_sql.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  2|  Bob| 29|
|  4|David| 40|
+---+-----+---+

In this example, we registered the DataFrame as a temporary SQL view and used a SQL query to exclude rows where the `age` column is `NULL` (Python's `None` is stored as SQL `NULL` in Spark).

Explanation

In the second example:

  • We registered the DataFrame as a SQL temporary view using the `createOrReplaceTempView` method.
  • We then wrote a SQL query to select only the rows where the `age` column is not `NULL`.
  • Finally, we executed the query using the `spark.sql()` method and displayed the resulting DataFrame using the `show()` method.
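
Conversely, if you need to inspect the rows that are being removed, the same tools work in reverse. A short sketch using `isNull()` and the matching SQL predicate:


# Keep only the rows where 'age' is None
null_rows_df = df.filter(col("age").isNull())

# The same filter expressed in SQL
null_rows_sql = spark.sql("SELECT * FROM people WHERE age IS NULL")

null_rows_df.show()

+---+-----+----+
| id| name| age|
+---+-----+----+
|  1|Alice|null|
|  3|Cathy|null|
+---+-----+----+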

Both methods are effective ways to filter out rows containing `None` values in a specific column of a PySpark DataFrame. The choice between them can depend on personal preference or the specific use case.
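
One caveat: for floating-point columns, `None` (SQL `NULL`) and the special value `NaN` are distinct, and `isNotNull()` alone will not exclude `NaN`. If your data may contain both, a sketch combining `isnan()` from `pyspark.sql.functions` with `isNotNull()` covers both cases (assuming the column is, or can be cast to, a float or double type):


from pyspark.sql.functions import col, isnan

# Exclude rows where 'age' is null OR NaN
clean_df = df.filter(col("age").isNotNull() & ~isnan(col("age")))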
