Determining the data type of a column in a DataFrame is a common operation when working with Apache Spark. PySpark, the Python API for Spark, provides a straightforward way to achieve this. Below are the steps along with code snippets and explanations for determining the data type of a column using PySpark.
Using the `dtypes` Attribute
The `dtypes` attribute of a DataFrame returns a list of tuples, where each tuple contains the column name and its corresponding data type.
Let’s create a sample DataFrame and determine the data types of its columns.
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
    .appName("Determine Data Types") \
    .getOrCreate()
# Sample data
data = [("Alice", 25), ("Bob", 30), ("Cathy", 28)]
# Define schema
schema = ["Name", "Age"]
# Create DataFrame
df = spark.createDataFrame(data, schema)
# Show the DataFrame
df.show()
# Output the data types of each column
column_data_types = df.dtypes
print(column_data_types)
The output for the given code snippet would be:
+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
| Bob| 30|
|Cathy| 28|
+-----+---+
[('Name', 'string'), ('Age', 'bigint')]
Each tuple pairs the column name with Spark's simple string name for the type; here the `Age` column, inferred from Python integers, is reported as `bigint`, the string form of Spark's `LongType`.
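Because `dtypes` is just a list of 2-tuples, a convenient pattern, sketched below, is to convert it into a dictionary so you can look up a column's type string by name:
# Build a {column name: type string} mapping from dtypes
dtype_map = dict(df.dtypes)
print(dtype_map["Age"])  # bigint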
Accessing the Data Type of a Specific Column
If you are interested in the data type of a specific column, you can filter the `dtypes` list as shown below:
# Get the data type of a specific column, e.g., 'Age'
age_data_type = next(dt for name, dt in df.dtypes if name == 'Age')
print(f"The data type of the 'Age' column is: {age_data_type}")
The output for the given code snippet would be:
The data type of the 'Age' column is: bigint
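Note that `next` raises `StopIteration` if no column matches. If the column might be absent, you can pass a default value as the second argument to `next`; a minimal defensive variant:
# Fall back to None instead of raising when the column does not exist
age_data_type = next((dt for name, dt in df.dtypes if name == 'Age'), None)
if age_data_type is None:
    print("Column 'Age' not found")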
Using the `schema` Attribute
Another way to determine the data type of a column is the `schema` attribute, which returns the DataFrame's schema as a `StructType` object containing one `StructField` per column.
# Get the schema of the DataFrame
schema = df.schema
# Print the schema
print(schema)
The output for the given code snippet would be (the exact rendering of `StructType` varies slightly across Spark versions):
StructType(List(StructField(Name,StringType,true), StructField(Age,LongType,true)))
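If you only need a human-readable view rather than a programmatic object, the `printSchema` method renders the same information as an indented tree:
# Print the schema as a tree
df.printSchema()
which prints:
root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)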
You can get the data type of a specific column from the schema by filtering the StructFields:
# Get the data type of a specific column, e.g., 'Age', using schema attribute
age_data_type_schema = [field.dataType for field in schema if field.name == 'Age'][0]
print(f"The data type of the 'Age' column is: {age_data_type_schema}")
The output for the given code snippet would be:
The data type of the 'Age' column is: LongType
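As a shortcut, `StructType` also supports indexing by field name, so the list comprehension above can be replaced with a direct lookup:
# Look up the StructField by name and read its dataType
age_data_type_schema = df.schema['Age'].dataType
print(age_data_type_schema)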
By leveraging the `dtypes` and `schema` attributes, you can easily determine the data types of columns in a PySpark DataFrame. This information is useful for debugging, type checks, and ensuring that the data conforms to the expected schema.
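As a closing illustration, here is a minimal sketch of such a type check, assuming you expect `Age` to be a 64-bit integer (the error message is just an example):
from pyspark.sql.types import LongType
# Verify that the 'Age' column carries the expected Spark type
if not isinstance(df.schema['Age'].dataType, LongType):
    raise TypeError("Expected 'Age' to be LongType")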