A running Spark application should be terminated in a way that preserves its data and state. Simply killing the process may result in data corruption or incomplete processing. Below are the recommended approaches:
1. Graceful Shutdown Using `spark.stop()`
2. Utilizing Cluster Manager Interfaces
3. Sending Signals to the Application Process
Let’s explore each method in detail:
1. Graceful Shutdown Using `spark.stop()`
The recommended way to terminate a Spark application safely is to call the `spark.stop()` method. This stops the underlying SparkContext and releases executors and other cluster resources cleanly. It is usually placed at the end of your Spark job.
# Python (PySpark) Example
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SampleApp").getOrCreate()
# Your data processing logic goes here
# When all processing is done
spark.stop()
// Scala Example
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("SampleApp").getOrCreate()
// Your data processing logic goes here
// When all processing is done
spark.stop()
Output:
INFO SparkContext: Successfully stopped SparkContext
INFO ShutdownHookManager: Shutdown complete.
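If your processing code can throw an error, wrap it in a try/finally block so that `spark.stop()` still runs and resources are released. A minimal sketch of this pattern (the processing step is a placeholder):
# Python (PySpark) Example: stop Spark even on failure
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SampleApp").getOrCreate()
try:
    # Your data processing logic goes here
    spark.range(10).count()
finally:
    # Runs whether the processing succeeded or raised an exception
    spark.stop()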
2. Utilizing Cluster Manager Interfaces
If you are running your Spark application on a cluster manager like YARN, Mesos, or Kubernetes, you can use their respective interfaces to terminate the Spark job safely. These interfaces usually provide ways to gracefully stop a running application.
YARN
Use the `yarn application -kill <application_id>` command to stop the application:
yarn application -kill application_1234567890123_0001
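If you do not know the application ID, you can list the running YARN applications first:
yarn application -list -appStates RUNNING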
Mesos
You can use the Mesos UI or the Mesos command-line tool to stop the application.
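As one option, the Mesos master exposes a teardown endpoint that shuts down a framework by ID. A sketch, assuming the master is reachable at `<mesos-master>:5050` and `<framework_id>` is the ID shown for your Spark application in the Mesos UI:
curl -X POST http://<mesos-master>:5050/master/teardown -d 'frameworkId=<framework_id>'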
Kubernetes
Use the `kubectl delete pod <driver-pod-name>` command to delete the Spark driver pod:
kubectl delete pod spark-pi-driver
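Spark on Kubernetes labels driver pods with `spark-role=driver`, so you can usually find the driver pod name with:
kubectl get pods -l spark-role=driver
By default, `kubectl delete pod` sends SIGTERM and waits for a termination grace period (30 seconds unless configured otherwise) before forcing the pod to stop, which gives the driver a chance to shut down cleanly.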
3. Sending Signals to the Application Process
You can send specific signals to the Spark application’s process to trigger a safe termination. Sending a SIGTERM signal allows the JVM to shut down gracefully and run its registered shutdown hooks (Spark registers one that stops the SparkContext).
UNIX
# Identify the process ID (PID) of your Spark driver
ps aux | grep "org.apache.spark.deploy.SparkSubmit"
# Send SIGTERM signal to the process
kill -15 <PID>
Output:
INFO StandaloneSchedulerBackend: Asking each executor to shut down
INFO SparkContext: Successfully stopped SparkContext
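If your driver is a long-running PySpark process (for example, a streaming job), you can also handle SIGTERM inside the driver itself and stop Spark explicitly. A minimal sketch, where the application name and processing logic are placeholders:
# Python (PySpark) Example: stop Spark on SIGTERM
import signal
import sys
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LongRunningApp").getOrCreate()
def handle_sigterm(signum, frame):
    # Stop the SparkContext cleanly, then exit the driver
    spark.stop()
    sys.exit(0)
signal.signal(signal.SIGTERM, handle_sigterm)
# Your long-running processing logic goes here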
Windows
On Windows, you can use the `taskkill` command. The `/T` flag terminates the process along with any child processes, and the `/F` flag forces termination.
taskkill /PID <PID> /T
Note that `/F` forcefully terminates the process and does not allow a graceful shutdown, so it should only be used if the application is unresponsive.
Ensuring a safe termination of a Spark application prevents data corruption and ensures that all tasks are completed as expected. Always prefer using methods like `spark.stop()` or cluster manager interfaces whenever possible.