How to Determine the Data Type of a Column Using PySpark?

Determining the data type of a column in a DataFrame is a common operation when working with Apache Spark, and PySpark, the Python API for Spark, provides several straightforward ways to do it. Below are the main approaches, with code snippets and explanations.

Using the `dtypes` Attribute

The `dtypes` attribute of a DataFrame returns a list of tuples, where each tuple contains the column name and its corresponding data type.

Let’s create a sample DataFrame and determine the data types of its columns.


from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("Determine Data Types") \
    .getOrCreate()

# Sample data
data = [("Alice", 25), ("Bob", 30), ("Cathy", 28)]

# Column names only; Spark will infer the data types from the values
columns = ["Name", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
df.show()

# Output the data types of each column
column_data_types = df.dtypes
print(column_data_types)

The output for the given code snippet would be:


+-----+---+
| Name|Age|
+-----+---+
|Alice| 25|
|  Bob| 30|
|Cathy| 28|
+-----+---+

[('Name', 'string'), ('Age', 'bigint')]

As the output shows, each tuple in the `dtypes` list pairs a column name with its data type rendered as a Spark SQL type string (e.g. `string`, `bigint`).
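`Age` was inferred as `bigint` because the sample values are plain Python integers. If you need a specific type rather than relying on inference, you can pass an explicit schema instead of bare column names; as a minimal sketch, here it is given as a datatype string (the `df_typed` name is just for illustration):

# Create the DataFrame with an explicit schema instead of type inference
df_typed = spark.createDataFrame(data, "Name: string, Age: int")
print(df_typed.dtypes)  # [('Name', 'string'), ('Age', 'int')]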

Accessing Data Type of a Specific Column

If you are interested in the data type of a specific column, you can filter the `dtypes` list as shown below:


# Get the data type of a specific column, e.g., 'Age'
# (the default of None avoids a StopIteration if the column doesn't exist)
age_data_type = next((dt for name, dt in df.dtypes if name == 'Age'), None)
print(f"The data type of the 'Age' column is: {age_data_type}")

The output for the given code snippet would be:


The data type of the 'Age' column is: bigint
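If you need to look up several columns, it can be convenient to turn the `dtypes` list into a plain Python dictionary first (the `dtype_map` name below is just illustrative):

# Build a column-name -> data-type mapping from dtypes
dtype_map = dict(df.dtypes)

print(dtype_map['Age'])   # bigint
print(dtype_map['Name'])  # string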

Using the `schema` Attribute

Another way to determine the data type of a column is the `schema` attribute, which returns the DataFrame's schema as a `StructType` object. Unlike `dtypes`, it exposes full `StructField` objects carrying each column's name, data type, and nullability.


# Get the schema of the DataFrame
schema = df.schema

# Print the schema
print(schema)

The output for the given code snippet would be (the exact `StructType` formatting varies by Spark version; this is the Spark 2.x style):


StructType(List(StructField(Name,StringType,true), StructField(Age,LongType,true)))
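For a quick, human-readable view of the same information, `printSchema()` renders the schema as an indented tree, which is especially handy for nested struct columns:

# Print the schema as an indented tree
df.printSchema()

The output for the given code snippet would be:

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)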

You can get the data type of a specific column from the schema by filtering the StructFields:


# Get the data type of a specific column, e.g., 'Age', using the schema attribute
age_data_type_schema = [field.dataType for field in schema if field.name == 'Age'][0]
print(f"The data type of the 'Age' column is: {age_data_type_schema}")

The output for the given code snippet would be (newer PySpark versions render the type as `LongType()`):


The data type of the 'Age' column is: LongType
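A shorter route to the same result: `StructType` supports lookup by field name, so you can index the schema directly instead of filtering the fields yourself. The `simpleString()` call shown below converts the type object back to the short string form that `dtypes` uses:

# Look up the StructField for 'Age' directly by name
age_field = df.schema['Age']

print(age_field.dataType)                 # LongType
print(age_field.dataType.simpleString())  # bigint
print(age_field.nullable)                 # True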

By leveraging the `dtypes` and `schema` attributes, you can easily determine the data types of columns in a PySpark DataFrame. This information is useful for debugging, type checks, and ensuring that the data conforms to the expected schema.
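As a sketch of the type-check use case, you can compare a column's data type against the type classes in `pyspark.sql.types`, for example to validate a column before a numeric operation:

from pyspark.sql.types import IntegerType, LongType

# Validate that 'Age' is an integer column before aggregating it
age_type = df.schema['Age'].dataType
if isinstance(age_type, (IntegerType, LongType)):
    print("'Age' is an integer column; safe to aggregate")
else:
    raise TypeError(f"Expected an integer 'Age' column, got {age_type}")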
