Installing Apache Spark on Linux Ubuntu: A Step-by-Step Guide

Apache Spark is a powerful open-source distributed computing system that provides a fast, general-purpose cluster-computing framework. Originally developed at UC Berkeley’s AMPLab, Spark has quickly gained popularity among data scientists and engineers for its ease of use and its high performance on large-scale data processing. In this comprehensive guide, we will walk through the steps of installing Apache Spark on a Linux Ubuntu system. Whether you’re a seasoned data professional or a beginner, this guide will help you set up Spark on your machine.

Prerequisites

Before we begin installing Apache Spark, make sure that your system meets the following prerequisites:

  • Operating System: Ubuntu 16.04 or later
  • Java: Oracle JDK or OpenJDK, version 8, 11, or 17 (the Java versions supported by Spark 3.5.x)
  • Scala: Although Spark comes with a built-in Scala version, having Scala installed on your system is beneficial for development.
  • Python: If you plan to use PySpark with Spark 3.5.x, Python 3.8 or later should be installed.
  • Hardware: Minimum of 4GB of RAM; however, 8GB or more is recommended for improved performance.
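
If you are not sure whether your machine meets these requirements, a couple of standard Ubuntu commands give a quick overview of the OS release and available memory (an optional check; the exact output will of course vary from system to system):

$ lsb_release -a   # prints the Ubuntu release, e.g. 20.04 or 22.04
$ free -h          # shows total and available RAM in human-readable units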

Before proceeding with Spark installation, ensure that Java and Scala are correctly installed on your system by entering the following commands in your terminal:

$ java -version
$ scala -version

If Java and Scala are properly installed, the terminal will display their respective versions.
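
If you plan to use PySpark, it is also worth confirming which Python 3 interpreter is available, since this is the interpreter you will point PYSPARK_PYTHON at in Step 4 (the path /usr/bin/python3 used later is the Ubuntu default, but yours may differ):

$ python3 --version   # should report Python 3.8 or later for Spark 3.5.x
$ which python3       # shows the interpreter path to use for PYSPARK_PYTHON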

Step 1: Installing Java

If Java is not installed on your system, you can install it by following these steps:

$ sudo apt update
$ sudo apt install openjdk-8-jdk

After the installation, you can verify it by running ‘java -version’ again. Spark 3.5.x also runs on Java 11 and 17, so you can install openjdk-11-jdk or openjdk-17-jdk instead if you prefer a newer JDK.
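
Some of Spark’s scripts and many downstream tools rely on the JAVA_HOME environment variable. Setting it is optional for a basic local setup, but if you want to set it, the following is a typical approach on Ubuntu (the JVM path below matches the openjdk-8-jdk package on amd64; verify the actual path on your system first):

$ readlink -f $(which java)   # reveals the real JDK path behind the java symlink
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
$ source ~/.bashrc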

Step 2: Installing Scala

If Scala is not installed, you can install it using the following commands:

$ sudo apt install scala

Verify the Scala installation by running ‘scala -version’.

Step 3: Downloading and Installing Apache Spark

Once you have installed all the prerequisites, you can download Apache Spark from the official website. Here’s how you can do it:

$ wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
$ tar xvf spark-3.5.0-bin-hadoop3.tgz

Note: The version shown here may not be the latest, so check the download page on the Apache Spark website and adjust the version number in the URL and file names accordingly.
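
The extracted folder can live anywhere, but a common convention, used here purely as an example, is to move it to /opt/spark so that the SPARK_HOME path configured in the next step stays short and stable:

$ sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
$ ls /opt/spark   # should list bin, sbin, conf, jars, examples, among others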

Step 4: Configuring Spark

To configure Spark, you’ll need to set environment variables and update the system’s PATH. Add the following lines to your user’s .bashrc file:

export SPARK_HOME=/path/to/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=/usr/bin/python3

Replace “/path/to/spark” with the actual directory path of Spark on your system (for example, /opt/spark if you moved it there in Step 3, or the full path to the extracted spark-3.5.0-bin-hadoop3 directory). Then, source your .bashrc file to apply the changes.

$ source ~/.bashrc
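
To confirm that the configuration took effect, you can check that the Spark commands now resolve on your PATH (assuming SPARK_HOME points at the directory from Step 3):

$ spark-submit --version   # prints the Spark version banner, e.g. version 3.5.0
$ which spark-shell        # should resolve to a path under $SPARK_HOME/bin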

Step 5: Starting Spark (For a Standalone Cluster)

With everything set up, you can start the Spark master and worker processes.

$ start-master.sh
$ start-worker.sh spark://your-master-ip:7077

Replace “your-master-ip” with the IP address of your Spark master node; on a single-node setup, you would typically use ‘localhost’. Whether you need to run the Spark master and worker processes at all depends on how you intend to use Apache Spark:

  1. For a Standalone Cluster: If you’re setting up a standalone Spark cluster, you need to start both the Spark Master server and the Spark Worker nodes. In a standalone cluster, the master manages resource allocation across applications, while the workers actually execute the tasks.
  2. For Single-Node Usage: If you’re only using Spark on a single machine (e.g., your local machine for development or testing), you don’t need to explicitly start the master and worker services. Spark can run in local mode, where it simply uses the resources of the machine it’s running on.
  3. When Using a Resource Manager (like YARN or Mesos): If you’re integrating Spark with a resource manager such as YARN or Mesos, you don’t start the Spark master and workers this way at all. Instead, the resource manager takes over resource allocation and task scheduling for Spark applications.
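
If you did start a standalone cluster (case 1 above), you can check that the daemons are running with the JDK’s jps tool, and shut them down cleanly when you are finished using the stop scripts in $SPARK_HOME/sbin:

$ jps              # should list a Master and a Worker process
$ stop-worker.sh   # stops the worker on this machine
$ stop-master.sh   # stops the master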

Accessing Spark Web UI

After starting Spark, you can access the Spark Web UI to monitor the cluster and job progress. By default, the Spark Web UI runs on port 8080 on the master node. You can access it by navigating to http://localhost:8080 in your web browser.

Step 6: Running a Spark Job

To ensure that Spark is installed and configured correctly, you can run an example provided with the Spark distribution:

$ spark-submit --class org.apache.spark.examples.JavaWordCount --master local $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar input.txt

This command runs the JavaWordCount example with the local master on a text file named input.txt, counting the number of occurrences of each word. Make sure input.txt exists in your current directory (any plain-text file will do), or pass the path to another file instead.
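
If you do not have a text file handy, the distribution also ships the SparkPi example, which takes no input file and simply estimates the value of Pi, making it a convenient smoke test:

$ spark-submit --class org.apache.spark.examples.SparkPi --master local $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 10

Near the end of the output you should see a line similar to “Pi is roughly 3.14…”.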

Step 7: Using the Spark Shell

The Spark shell provides an interactive environment to run Spark code. To start the Spark shell, simply run the ‘spark-shell’ command in your terminal:

$ spark-shell

This will open the Scala-based Spark shell where you can interactively type and execute Spark commands.
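
As a quick sanity check inside the shell, you can repeat the word-count idea from Step 6 in a few lines of Scala (a minimal sketch; replace input.txt with the path to any text file on your machine, and note that the shell pre-creates a SparkSession for you as the variable spark):

scala> val lines = spark.read.textFile("input.txt")
scala> val counts = lines.flatMap(_.split("\\s+")).groupBy("value").count()   // "value" is the default column of a Dataset[String]
scala> counts.show(10)   // prints the first 10 (word, count) rows
scala> :quit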

Conclusion

You have successfully installed Apache Spark on your Linux Ubuntu system! With Spark installed, you can start building and running your big data applications. Keep in mind that Spark can also be configured to run in a distributed environment, and in such cases, the installation and setup process may require additional steps such as setting up a cluster manager like Apache Hadoop YARN or Apache Mesos and configuring network settings.

Whether you are using Spark for data processing, machine learning, or stream processing, having Spark installed on a local environment can be beneficial for testing, development, and learning purposes. As you become more familiar with Spark, you’ll be able to leverage its full potential in processing large datasets quickly and efficiently.
