When working with PySpark, you will often need to remove columns from a DataFrame. Two common approaches are the `drop` method, which names the columns to remove, and the `select` method, which names only the columns you want to keep. Below, I’ll walk through both with examples.
Method 1: Using the `drop` Method
The `drop` method removes one or more columns from a DataFrame. It returns a new DataFrame without the specified columns; the original DataFrame is left unchanged.
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("DeleteColumnsExample").getOrCreate()
# Sample data
data = [
    ("James", "Smith", "M", 3000),
    ("Anna", "Rose", "F", 4100),
    ("Robert", "Williams", "M", 6200)
]
# Columns for the DataFrame
columns = ["firstname", "lastname", "gender", "salary"]
# Creating DataFrame
df = spark.createDataFrame(data, schema=columns)
# Display original DataFrame
df.show()
# Drop the 'gender' column
df_dropped = df.drop("gender")
# Display DataFrame after dropping 'gender' column
df_dropped.show()
+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|  3000|
|     Anna|    Rose|     F|  4100|
|   Robert|Williams|     M|  6200|
+---------+--------+------+------+

+---------+--------+------+
|firstname|lastname|salary|
+---------+--------+------+
|    James|   Smith|  3000|
|     Anna|    Rose|  4100|
|   Robert|Williams|  6200|
+---------+--------+------+
Method 2: Using the `select` Method
Another way to remove columns is to use the `select` method and list only the columns you want to keep; any column you leave out is absent from the resulting DataFrame.
# Select fields excluding 'gender'
df_selected = df.select("firstname", "lastname", "salary")
# Display DataFrame after selecting specific columns
df_selected.show()
+---------+--------+------+
|firstname|lastname|salary|
+---------+--------+------+
|    James|   Smith|  3000|
|     Anna|    Rose|  4100|
|   Robert|Williams|  6200|
+---------+--------+------+
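These are two common ways to remove columns from a PySpark DataFrame. Both are lazy transformations that return a new DataFrame and leave the original unchanged, so the choice mostly comes down to readability: `drop` is more concise when you are removing a few columns from a wide DataFrame, while `select` is clearer when you only want to keep a few.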