Retrieving the names of a DataFrame's columns in PySpark is straightforward. Every PySpark DataFrame exposes a `columns` attribute that returns the names of its columns as a list of strings.
Using the `columns` Attribute
You can use the `columns` attribute directly on the DataFrame object. Here is an example:
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
# Create a DataFrame
data = [(1, "Alice"), (2, "Bob")]
columns = ["ID", "Name"]
df = spark.createDataFrame(data, columns)
# Retrieve column names
column_names = df.columns
print(column_names)
The output will be a list of column names:
['ID', 'Name']
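Because `columns` returns a plain Python list, you can use it with ordinary list operations. As a minimal sketch reusing the `df` created above, here is a membership check before selecting a column:
# Check for a column by name before selecting it
if "Name" in df.columns:
    df.select("Name").show()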
Using the `dtypes` Attribute
Another way to retrieve the column names, together with their data types, is the `dtypes` attribute:
# Retrieve column names along with their data types
column_info = df.dtypes
print(column_info)
The output will be a list of tuples where each tuple contains the column name and its data type:
[('ID', 'bigint'), ('Name', 'string')]
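Since `dtypes` is just a list of `(name, type)` tuples, a list comprehension is a handy way to pick out columns of a given type. A small sketch, assuming you want every string-typed column:
# Collect the names of all string-typed columns
string_cols = [name for name, dtype in df.dtypes if dtype == "string"]
print(string_cols)  # ['Name']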
Using the `schema` Attribute
If you want more detailed information about columns, you can use the `schema` attribute of the DataFrame:
# Retrieve schema
schema_info = df.schema
print(schema_info)
The output will be the DataFrame's `StructType`. The exact formatting depends on your Spark version; recent versions (3.4+) print an eval-able form:
StructType([StructField('ID', LongType(), True), StructField('Name', StringType(), True)])
while older versions print the same information as StructType(List(StructField(ID,LongType,true),StructField(Name,StringType,true))).
You can also iterate over `df.schema.fields` to get each column's details separately:
for field in df.schema.fields:
    print(f"Column Name: {field.name}, Data Type: {field.dataType}")
The output will be along these lines (recent Spark versions render the types as `LongType()` and `StringType()`):
Column Name: ID, Data Type: LongType
Column Name: Name, Data Type: StringType
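A `StructType` also supports lookup by name, which is convenient when you only need one column's details. A minimal sketch using the `ID` field from the example above; `nullable` tells you whether the column accepts nulls:
# Look up a single field by name from the schema
id_field = df.schema["ID"]
print(id_field.name, id_field.dataType, id_field.nullable)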
These are the most common ways to retrieve a DataFrame's column names in PySpark. Choose whichever one provides the level of detail your use case requires.