Apache Spark on YARN stores logs in locations that depend on how the cluster is configured, in particular the YARN NodeManager log directories and whether log aggregation is enabled. Understanding where these logs are stored is crucial for debugging and monitoring.
Log Storage in Spark on YARN
When Spark runs on YARN, logs are typically stored in the following locations:
1. Driver Logs
The Spark driver program’s logs capture the work done by the driver, including job and task scheduling and application-level logic. In client deploy mode the driver runs on the machine that submitted the application, so its logs appear on that machine’s console or wherever the logging configuration directs them; in cluster deploy mode the driver runs inside a YARN container and its logs are handled like any other container log.
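For example, a custom Log4j configuration file can be supplied for the driver at submission time. The sketch below assumes client deploy mode, a Spark version that still uses Log4j 1.x (before Spark 3.3), and placeholder names for the configuration file and application; on Spark 3.3+ the equivalent would be a `log4j2.properties` file referenced via `-Dlog4j.configurationFile`.
# Sketch: point the driver at a custom Log4j 1.x config in client deploy mode.
# /path/to/driver-log4j.properties and my_app.py are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/driver-log4j.properties" \
  my_app.py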
2. Executor Logs
Executor logs capture the work done by the Spark executors, which YARN records as the stdout and stderr files of each executor’s container. These logs are vital for understanding performance and diagnosing failures encountered during task execution.
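Executors can likewise be pointed at a custom Log4j configuration. The sketch below (same assumptions and placeholder names as above) ships the file with `--files`, which places it in each container’s working directory where the relative `file:` URL can resolve.
# Sketch: ship a custom Log4j 1.x config to every executor container.
# log4j.properties and my_app.py are placeholder names.
spark-submit \
  --master yarn \
  --files log4j.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  my_app.py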
3. YARN Container Logs
Since Spark runs on YARN, each Spark executor (and, in cluster deploy mode, the driver inside the ApplicationMaster) runs in a YARN container, and the logs for these containers are managed by YARN. Container logs can be accessed through the YARN ResourceManager web interface, with the `yarn logs` command, or directly on the nodes where the containers ran.
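Once you know the application ID (printed by spark-submit and shown in the ResourceManager UI), the `yarn logs` CLI is usually the quickest way to pull container logs. The IDs below are placeholders; the command generally requires log aggregation to be enabled (see below), and very old Hadoop releases may also require a `-nodeAddress` argument alongside `-containerId`.
# Fetch all container logs for an application (placeholder application ID).
yarn logs -applicationId application_1700000000000_0001

# Narrow the output to a single container, e.g. the driver's or one executor's
# (placeholder container ID).
yarn logs -applicationId application_1700000000000_0001 \
  -containerId container_1700000000000_0001_01_000002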
Key Locations for Logs Storage
1. Local Node Directory
On each worker node, YARN containers write their logs to a directory specified by the YARN configuration `yarn.nodemanager.log-dirs` (the Apache default is `${yarn.log.dir}/userlogs`; many distributions use something like `/var/log/hadoop-yarn/containers`). Within this directory each application gets its own subdirectory, which in turn contains one subdirectory per container.
# Example YARN configuration (yarn-site.xml)
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/var/log/hadoop-yarn/containers</value>
</property>
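On the node itself, the layout typically looks like the sketch below: one subdirectory per application, one per container, with the container’s stdout and stderr files inside (the IDs are placeholders and the exact file names can vary).
# Inspect a container's log directory directly on a worker node.
ls /var/log/hadoop-yarn/containers/application_1700000000000_0001/container_1700000000000_0001_01_000002/
# typical contents: stderr  stdout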
2. HDFS (Hadoop Distributed File System)
If the configuration `yarn.log-aggregation-enable` is set to `true`, YARN aggregates the container logs and stores them in HDFS once an application finishes. The HDFS location is determined by `yarn.nodemanager.remote-app-log-dir`. Aggregated logs provide a single, centralized place to read container logs after the application has completed, even if the worker nodes are no longer reachable.
# Example HDFS log aggregation configuration (yarn-site.xml)
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/app-logs</value>
</property>
Logs can then be found in HDFS under the configured directory, `/app-logs` in this example. Within it, logs are grouped by user and application ID, typically as `<remote-app-log-dir>/<user>/logs/<application-id>`, where `logs` is the default value of `yarn.nodemanager.remote-app-log-dir-suffix`.
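Assuming the configuration above, the aggregated logs for a finished application can be listed in HDFS as sketched below (the user name and application ID are placeholders, and newer Hadoop releases may add extra subdirectories to this layout). Because the files are written in an aggregated binary format, read them with the `yarn logs` command rather than `hdfs dfs -cat`.
# List the aggregated log files (roughly one per node) for a finished application.
hdfs dfs -ls /app-logs/someuser/logs/application_1700000000000_0001

# Read them back in plain text via the YARN CLI.
yarn logs -applicationId application_1700000000000_0001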
3. YARN ResourceManager Web UI
The ResourceManager Web UI provides a way to access logs for running and completed applications. You can navigate to the specific application and find links to the logs for each container.
For example, if the YARN ResourceManager runs on `localhost` with port `8088`, you can access it via:
http://localhost:8088
From the ResourceManager UI, you can drill down into the specific application and access the logs for each container, including the driver and executor logs.
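The ResourceManager also exposes a REST API, which is convenient for scripting. The sketch below assumes the same address as above and a placeholder application ID; the per-application response includes, among other fields, a link to the ApplicationMaster’s container logs.
# List applications known to the ResourceManager (JSON response).
curl -s "http://localhost:8088/ws/v1/cluster/apps"

# Details for a single application, including log-related URLs.
curl -s "http://localhost:8088/ws/v1/cluster/apps/application_1700000000000_0001"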
In summary, Spark on YARN logs can be accessed through the local node directories, HDFS when log aggregation is enabled, the `yarn logs` CLI, and the YARN ResourceManager Web UI. Knowing these locations makes debugging and monitoring Spark applications on YARN far more efficient.