Install PySpark on Linux: – Apache Spark is a powerful open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. PySpark is the Python API for Spark, letting Python developers write Spark applications in familiar Python while taking advantage of Spark’s distributed processing. Installing PySpark on a Linux system can seem daunting, but with this step-by-step guide you’ll have it up and running in no time.
Prerequisites
Before installing PySpark, ensure that you have the following prerequisites installed on your Linux system:
1. Java
Apache Spark requires Java to be installed. You can check if Java is already installed by running:
java -version
If Java is not installed, you can install it using the following commands for different Linux distributions:
For Debian/Ubuntu:
sudo apt update
sudo apt install default-jdk
For Red Hat/CentOS:
sudo yum install java-1.8.0-openjdk
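Spark 3.5.x runs on Java 8, 11, or 17. If Spark has trouble locating Java later, you can also export JAVA_HOME. The path below is only an assumption about where default-jdk usually lands on Debian/Ubuntu, so adjust it to match your system:
# Assumed JDK location on Debian/Ubuntu; verify with: readlink -f $(which java)
export JAVA_HOME=/usr/lib/jvm/default-java
export PATH=$PATH:$JAVA_HOME/bin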
2. Python
PySpark requires Python to be installed. You can check if Python is already installed by running:
python3 --version
If Python is not installed, you can install it using the following commands for different Linux distributions:
For Debian/Ubuntu:
sudo apt update
sudo apt install python3
For Red Hat/CentOS:
sudo yum install python3
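Step 3 below installs PySpark with pip, so make sure pip is available for Python 3 as well. A quick check and, if needed, an install (package names assume the default Debian/Ubuntu and Red Hat/CentOS repositories):
# Check whether pip is available for Python 3
python3 -m pip --version
# Debian/Ubuntu
sudo apt install python3-pip
# Red Hat/CentOS
sudo yum install python3-pip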
3. Apache Spark
You’ll need to download Apache Spark itself. This guide walks you through downloading and setting it up, but you can always refer to the official downloads page at https://spark.apache.org/downloads.html for the latest instructions and releases.
Step 1: Download Apache Spark
To download Apache Spark, visit the official download page and grab the version that suits your needs. As of this writing, you can download Spark 3.5.2 (pre-built for Hadoop 3) with:
wget https://dlcdn.apache.org/spark/spark-3.5.2/spark-3.5.2-bin-hadoop3.tgz
Once downloaded, extract the archive using:
tar xvf spark-3.5.2-bin-hadoop3.tgz
Move the extracted folder to a convenient location:
sudo mv spark-3.5.2-bin-hadoop3 /opt/spark
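As a quick sanity check before configuring anything else, you can run the bundled launcher directly from the new location to confirm the archive extracted correctly:
/opt/spark/bin/spark-submit --version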
Step 2: Setup Environment Variables
Setting up environment variables ensures that PySpark knows where to locate Spark. You’ll need to add the Spark installation directory to your PATH and set the SPARK_HOME environment variable.
Edit your `.bashrc` file:
nano ~/.bashrc
Add the following lines at the end:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
To apply the changes, source the `.bashrc` file:
source ~/.bashrc
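You can confirm the variables took effect in the current shell, for example:
echo $SPARK_HOME          # should print /opt/spark
spark-submit --version    # should now resolve from $SPARK_HOME/bin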
Step 3: Install PySpark
PySpark is a Python package and can be installed using pip:
pip install pyspark
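If you prefer to keep project dependencies isolated, you can do this inside a virtual environment instead. The version pin below is optional and simply mirrors the Spark 3.5.2 release downloaded earlier, since mixing a pip-installed pyspark version with a different release under SPARK_HOME can lead to confusing errors:
# Optional: install PySpark in an isolated virtual environment
python3 -m venv pyspark-env
source pyspark-env/bin/activate
pip install pyspark==3.5.2   # pin to match the Spark release under /opt/spark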
Step 4: Verify Installation
To verify that PySpark is correctly installed and configured, you can run a simple PySpark job. Open a Python interpreter and try importing PySpark:
import pyspark
from pyspark.sql import SparkSession

# Create a local SparkSession and print its version
spark = SparkSession.builder.master("local").appName("my_app").getOrCreate()
print(spark.version)
spark.stop()
If everything is installed correctly, you should see the version of Spark printed without any errors. Here is an example output:
3.5.2
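Because $SPARK_HOME/bin is now on your PATH, you can also launch the interactive PySpark shell directly; it creates a SparkSession for you (available as `spark`) and prints the Spark version banner on startup:
pyspark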
Step 5: Running a Sample PySpark Job
Let’s run a sample PySpark job to ensure everything is functioning as expected. Create a Python script named `sample_job.py` with the following content:
from pyspark.sql import SparkSession
# Initialize Spark Session
spark = SparkSession.builder.master("local").appName("Sample Job").getOrCreate()
# Create a DataFrame
data = [("James", "Smith", "USA", 40), ("Anna", "Rose", "UK", 35), ("Robert", "Williams", "USA", 50)]
columns = ["First Name", "Last Name", "Country", "Age"]
df = spark.createDataFrame(data, columns)
# Display DataFrame
df.show()

# Stop the Spark session
spark.stop()
Run the script using:
python3 sample_job.py
You should see output similar to:
+----------+---------+-------+---+
|First Name|Last Name|Country|Age|
+----------+---------+-------+---+
|     James|    Smith|    USA| 40|
|      Anna|     Rose|     UK| 35|
|    Robert| Williams|    USA| 50|
+----------+---------+-------+---+
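Alternatively, you can submit the same script through Spark's own launcher, which is the more common way to run jobs against a cluster; here it simply runs locally:
spark-submit sample_job.py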
Conclusion
Congratulations! You have successfully installed PySpark on your Linux system. With this setup, you are now ready to use PySpark for your big data processing needs. Apache Spark, combined with the ease of Python, provides a powerful tool for data analytics, machine learning, and more. Happy coding!