How Does Apache Spark Work Internally?

Apache Spark is a distributed computing framework designed for fast, efficient processing of large-scale data. It achieves this by dividing work into tasks spread across the nodes of a cluster, and by combining in-memory computing with directed acyclic graph (DAG) scheduling to optimize execution. Below, we explore the internal workings of Apache Spark in greater detail.

High-Level Architecture

Spark consists of three main components:

Spark Driver

The driver is the Spark component responsible for converting a user program into units of work, called tasks, that can be executed by the executors. It holds the SparkContext, builds the DAG of operations, and schedules tasks. The driver also collects the results of all the tasks it has scheduled.

Spark Executors

Executors are worker processes launched on the nodes of the cluster where Spark runs its tasks. Each executor runs multiple tasks and reports their status and results back to the driver.

Cluster Manager

Spark can run on various cluster managers like YARN, Apache Mesos, Kubernetes, or its standalone cluster manager. The cluster manager is responsible for resource management and job scheduling.
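
In practice, the choice of cluster manager is expressed through the master URL, which is usually passed to spark-submit with --master rather than hard-coded. As a minimal, illustrative PySpark sketch (the application name and URLs are placeholders):

from pyspark import SparkConf, SparkContext

# Common master URLs:
#   "local[*]"                     - run locally on all cores (no cluster manager)
#   "yarn"                         - submit to a YARN cluster
#   "spark://<host>:7077"          - Spark's standalone cluster manager
#   "k8s://https://<host>:<port>"  - submit to a Kubernetes cluster
conf = SparkConf().setAppName("clusterManagerExample").setMaster("local[*]")
sc = SparkContext.getOrCreate(conf)

print(sc.master)   # shows which master URL / cluster manager is in use
sc.stop()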

Detailed Functioning of Apache Spark

1. Job Submission

When a Spark application is submitted, the driver program is initialized, and a SparkContext is created. This SparkContext communicates with the cluster manager to allocate resources across the cluster.
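
As a rough sketch of what happens at submission time, the resources that the SparkContext asks the cluster manager for can be set on the SparkConf (the application name and values below are only placeholders, not recommendations):

from pyspark import SparkConf, SparkContext

# Resources requested from the cluster manager for this application
conf = (
    SparkConf()
    .setAppName("jobSubmissionExample")
    .set("spark.executor.instances", "2")   # number of executors to launch
    .set("spark.executor.cores", "2")       # CPU cores per executor
    .set("spark.executor.memory", "2g")     # memory per executor
)

# Creating the SparkContext registers the application with the cluster
# manager, which then launches the requested executors.
sc = SparkContext.getOrCreate(conf)
print(sc.applicationId)
sc.stop()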

2. Directed Acyclic Graph (DAG) Creation

As transformations are applied, Spark builds a logical DAG of stages that represents the operations to be performed on the data; nothing is actually executed until an action is called. Each stage contains a set of tasks based on the partitioning of the data.
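
The lineage behind this DAG can be inspected with toDebugString(); here is a small sketch (in recent PySpark versions the method returns bytes):

from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf().setAppName("dagExample"))

rdd = sc.parallelize(range(100), numSlices=4)
squared = rdd.map(lambda x: x * x)
evens = squared.filter(lambda x: x % 2 == 0)

# toDebugString() shows the chain of RDDs (the lineage) that Spark turns
# into a DAG of stages once an action is called.
print(evens.toDebugString().decode("utf-8"))
sc.stop()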

3. Task Scheduling and Execution

The DAG is submitted to the DAG Scheduler, which divides it into stages of tasks, with stage boundaries placed at shuffle operations. The tasks are then handed to the Task Scheduler, which distributes them to the available executors.
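
To illustrate, the number of tasks in a stage follows the number of partitions, and a shuffling transformation such as reduceByKey introduces a new stage (a minimal sketch with illustrative values):

from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf().setAppName("taskSchedulingExample"))

# 4 partitions -> the first stage runs as 4 parallel tasks
rdd = sc.parallelize(range(1000), numSlices=4)
print(rdd.getNumPartitions())   # 4

# reduceByKey requires a shuffle, so the DAG Scheduler splits the job
# into two stages: one before the shuffle and one after it.
pairs = rdd.map(lambda x: (x % 10, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())

sc.stop()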

4. In-Memory Computation

Spark leverages in-memory computing using the Resilient Distributed Dataset (RDD) abstraction. Data can be cached in memory across the cluster, reducing the need for expensive disk I/O operations.
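
Caching is opt-in and is requested with cache() or persist(); a minimal sketch, assuming the data fits in executor memory:

from pyspark import SparkConf, SparkContext, StorageLevel

sc = SparkContext.getOrCreate(SparkConf().setAppName("cachingExample"))

rdd = sc.parallelize(range(100000), numSlices=8)
squared = rdd.map(lambda x: x * x)

# Keep the computed partitions in executor memory; cache() is shorthand
# for persist(StorageLevel.MEMORY_ONLY) on RDDs.
squared.persist(StorageLevel.MEMORY_ONLY)

print(squared.count())                                 # first action: computes and caches
print(squared.filter(lambda x: x % 7 == 0).count())    # reuses the cached partitions

sc.stop()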

5. Fault Tolerance

RDDs provide fault tolerance through lineage information. If a partition of an RDD is lost, Spark can recompute it by replaying the recorded transformations, starting from the original data source or from intermediate results that are still available.
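
When a lineage chain grows very long, recomputation becomes expensive; checkpointing writes an RDD to reliable storage and truncates its lineage. A small sketch (the checkpoint directory is just an illustrative local path):

from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf().setAppName("lineageExample"))

# A chain of transformations builds up lineage information.
rdd = sc.parallelize(range(100), numSlices=4)
for _ in range(5):
    rdd = rdd.map(lambda x: x + 1)

# Checkpointing persists the data and truncates the lineage, so recovery
# does not have to replay the whole chain of transformations.
sc.setCheckpointDir("/tmp/spark-checkpoints")   # illustrative path
rdd.checkpoint()

print(rdd.count())            # the action materializes the checkpoint
print(rdd.isCheckpointed())   # True once the job has run

sc.stop()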

6. Job Completion

Throughout the execution of tasks, results are sent back to the driver. Once all tasks are completed, the job is considered finished, and the driver process can collect and further process the results.
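
Different actions return different kinds of results to the driver, which merges the per-task outputs into a single local value or collection; for example:

from pyspark import SparkConf, SparkContext

sc = SparkContext.getOrCreate(SparkConf().setAppName("jobCompletionExample"))

rdd = sc.parallelize(range(1, 11))

# Each action triggers a job; the per-task results are merged by the driver.
print(rdd.count())   # 10
print(rdd.sum())     # 55
print(rdd.take(3))   # [1, 2, 3]

sc.stop()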

Example Code Snippet

Let’s take a simple example of how Spark works with PySpark:


from pyspark import SparkConf, SparkContext

# Initialize Spark Context
conf = SparkConf().setAppName("exampleApp")
sc = SparkContext(conf=conf)

# Create RDD from a list
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
rdd = sc.parallelize(data)

# Define transformations
rdd2 = rdd.map(lambda x: x * x)
rdd3 = rdd2.filter(lambda x: x > 20)

# Action to collect the results back to the driver
result = rdd3.collect()
print(result)
# Output: [25, 36, 49, 64, 81]

# Stop the SparkContext when the application is done
sc.stop()

In this example:

  • The driver initializes a Spark context and creates an RDD from a list of numbers.
  • Transformations are applied lazily to create new RDDs (`rdd2` squares each number in the list, `rdd3` keeps only the values greater than 20).
  • An action is executed to collect the results back to the driver and print them.

This simple example highlights how Spark divides a job into tasks, distributes them across nodes, and collects the results. Understanding the internal workings of Spark helps in writing optimized code and effectively troubleshooting performance issues.

