How to Load a CSV File with PySpark: A Step-by-Step Guide

Loading a CSV file with PySpark involves initializing a Spark session, reading the CSV file, and performing operations on the DataFrame. Here’s a step-by-step guide:

Step 1: Initialize Spark Session

First, we need to initialize a Spark session. This is the entry point for any Spark-related application.


from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder \
    .appName("CSV Loader") \
    .getOrCreate()

Step 2: Read the CSV File

Use the `spark.read.csv` method to read the CSV file into a DataFrame. You can configure options such as whether the file has a header row, whether to infer the schema, and which field delimiter to use.


# Define the path to the CSV file
csv_file_path = "/path/to/your/file.csv"

# Read the CSV file into a DataFrame
df = spark.read.csv(csv_file_path, header=True, inferSchema=True)

Step 3: Show DataFrame Content

Use the `show` method to display the content of the DataFrame.


# Show the content of the DataFrame
df.show()

Output might look something like this:


+---+-------+---+----------+
| id|   name|age|      date|
+---+-------+---+----------+
|  1|  Alice| 25|2021-01-01|
|  2|    Bob| 30|2021-02-01|
|  3|Charlie| 35|2021-03-01|
+---+-------+---+----------+

Step 4: Print DataFrame Schema

Use the `printSchema` method to display the schema of the DataFrame.


# Print the schema of the DataFrame
df.printSchema()

Output might look something like this:


root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- date: string (nullable = true)

Step 5: Perform Data Analysis

You can now perform various operations on the DataFrame, such as filtering, grouping, and aggregating.


# Filter rows where age is greater than 30
filtered_df = df.filter(df.age > 30)

# Show the filtered DataFrame
filtered_df.show()

Output might look something like this:


+---+-------+---+----------+
| id|   name|age|      date|
+---+-------+---+----------+
|  3|Charlie| 35|2021-03-01|
+---+-------+---+----------+

And that’s it! You’ve successfully loaded a CSV file using PySpark and performed some basic data operations on it.
