How Can You Delete Columns in a PySpark DataFrame?

When working with PySpark, you will often need to remove columns from a DataFrame. Two common approaches are the `drop` method, which names the columns to remove, and the `select` method, which names only the columns to keep. Both are explained below with examples.

Method 1: Using the `drop` Method

The `drop` method removes one or more columns from a DataFrame. Because DataFrames are immutable, `drop` does not modify the original; it returns a new DataFrame without the specified columns.


from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("DeleteColumnsExample").getOrCreate()

# Sample data
data = [
    ("James", "Smith", "M", 3000),
    ("Anna", "Rose", "F", 4100),
    ("Robert", "Williams", "M", 6200)
]

# Columns for the DataFrame
columns = ["firstname", "lastname", "gender", "salary"]

# Creating DataFrame
df = spark.createDataFrame(data, schema=columns)

# Display original DataFrame
df.show()

# Drop the 'gender' column
df_dropped = df.drop("gender")

# Display DataFrame after dropping 'gender' column
df_dropped.show()

+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|  3000|
|     Anna|    Rose|     F|  4100|
|   Robert|Williams|     M|  6200|
+---------+--------+------+------+

+---------+--------+------+
|firstname|lastname|salary|
+---------+--------+------+
|    James|   Smith|  3000|
|     Anna|    Rose|  4100|
|   Robert|Williams|  6200|
+---------+--------+------+

Method 2: Using the `select` Method

Another way to remove columns is to use the `select` method to list only the columns you want to keep. Any column you leave out of the list is absent from the resulting DataFrame.


# Select fields excluding 'gender'
df_selected = df.select("firstname", "lastname", "salary")

# Display DataFrame after selecting specific columns
df_selected.show()

+---------+--------+------+
|firstname|lastname|salary|
+---------+--------+------+
|    James|   Smith|  3000|
|     Anna|    Rose|  4100|
|   Robert|Williams|  6200|
+---------+--------+------+

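These are two common ways to remove columns from a PySpark DataFrame. Both are lazy transformations that return a new DataFrame and leave the original unchanged. `drop` is usually more concise when you remove a few columns from a wide DataFrame, while `select` is clearer when you keep only a few.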

