Apache Spark on YARN stores logs in locations that depend on how the cluster is configured, in particular the YARN NodeManager log directories and whether log aggregation is enabled. Understanding where these logs are stored is crucial for debugging and monitoring.
Log Storage in Spark on YARN
When Spark runs on YARN, logs are typically stored in the following locations:
1. Driver Logs
The Spark driver program’s logs capture the work done by the driver, including job and task scheduling and application-level logic. In client deploy mode the driver runs on the machine that submitted the application, so its logs appear on that machine’s console or wherever the logging configuration directs them; in cluster deploy mode the driver runs inside a YARN container and its logs are handled like any other container log.
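For example, a custom Log4j configuration file can be supplied for the driver at submission time. The sketch below assumes client deploy mode, a Spark version that still uses Log4j 1.x (before Spark 3.3), and placeholder names for the configuration file and application; on Spark 3.3+ the equivalent would be a `log4j2.properties` file referenced via `-Dlog4j.configurationFile`.
# Sketch: point the driver at a custom Log4j 1.x config in client deploy mode.
# /path/to/driver-log4j.properties and my_app.py are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/driver-log4j.properties" \
  my_app.py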
2. Executor Logs
Executor logs capture the work done by the Spark executors, which YARN records as the stdout and stderr files of each executor’s container. These logs are vital for understanding performance and diagnosing failures encountered during task execution.
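Executors can likewise be pointed at a custom Log4j configuration. The sketch below (same assumptions and placeholder names as above) ships the file with `--files`, which places it in each container’s working directory where the relative `file:` URL can resolve.
# Sketch: ship a custom Log4j 1.x config to every executor container.
# log4j.properties and my_app.py are placeholder names.
spark-submit \
  --master yarn \
  --files log4j.properties \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties" \
  my_app.py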
3. YARN Container Logs
Since Spark runs on YARN, each Spark executor (and, in cluster deploy mode, the driver inside the ApplicationMaster) runs in a YARN container, and the logs for these containers are managed by YARN. Container logs can be accessed through the YARN ResourceManager web interface, with the `yarn logs` command, or directly on the nodes where the containers ran.
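Once you know the application ID (printed by spark-submit and shown in the ResourceManager UI), the `yarn logs` CLI is usually the quickest way to pull container logs. The IDs below are placeholders; the command generally requires log aggregation to be enabled (see below), and very old Hadoop releases may also require a `-nodeAddress` argument alongside `-containerId`.
# Fetch all container logs for an application (placeholder application ID).
yarn logs -applicationId application_1700000000000_0001

# Narrow the output to a single container, e.g. the driver's or one executor's
# (placeholder container ID).
yarn logs -applicationId application_1700000000000_0001 \
  -containerId container_1700000000000_0001_01_000002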
Key Locations for Logs Storage
1. Local Node Directory
On each worker node, YARN containers write their logs to a directory specified by the YARN configuration `yarn.nodemanager.log-dirs` (the Apache default is `${yarn.log.dir}/userlogs`; many distributions use something like `/var/log/hadoop-yarn/containers`). Within this directory each application gets its own subdirectory, which in turn contains one subdirectory per container.
# Example YARN configuration (yarn-site.xml)
<property>
  <name>yarn.nodemanager.log-dirs</name>
  <value>/var/log/hadoop-yarn/containers</value>
</property>
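On the node itself, the layout typically looks like the sketch below: one subdirectory per application, one per container, with the container’s stdout and stderr files inside (the IDs are placeholders and the exact file names can vary).
# Inspect a container's log directory directly on a worker node.
ls /var/log/hadoop-yarn/containers/application_1700000000000_0001/container_1700000000000_0001_01_000002/
# typical contents: stderr  stdout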
2. HDFS (Hadoop Distributed File System)
If the configuration `yarn.log-aggregation-enable` is set to `true`, YARN aggregates the container logs and stores them in HDFS once an application finishes. The HDFS location is determined by `yarn.nodemanager.remote-app-log-dir`. Aggregated logs provide a single, centralized place to read container logs after the application has completed, even if the worker nodes are no longer reachable.
# Example HDFS log aggregation configuration (yarn-site.xml)
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/app-logs</value>
</property>
Logs can then be found in HDFS under the configured directory, `/app-logs` in this example. Within it, logs are grouped by user and application ID, typically as `<remote-app-log-dir>/<user>/logs/<application-id>`, where `logs` is the default value of `yarn.nodemanager.remote-app-log-dir-suffix`.
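Assuming the configuration above, the aggregated logs for a finished application can be listed in HDFS as sketched below (the user name and application ID are placeholders, and newer Hadoop releases may add extra subdirectories to this layout). Because the files are written in an aggregated binary format, read them with the `yarn logs` command rather than `hdfs dfs -cat`.
# List the aggregated log files (roughly one per node) for a finished application.
hdfs dfs -ls /app-logs/someuser/logs/application_1700000000000_0001

# Read them back in plain text via the YARN CLI.
yarn logs -applicationId application_1700000000000_0001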
3. YARN ResourceManager Web UI
The ResourceManager Web UI provides a way to access logs for running and completed applications. You can navigate to the specific application and find links to the logs for each container.
For example, if the YARN ResourceManager runs on `localhost` with port `8088`, you can access it via:
http://localhost:8088
From the ResourceManager UI, you can drill down into the specific application and access the logs for each container, including the driver and executor logs.
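The ResourceManager also exposes a REST API, which is convenient for scripting. The sketch below assumes the same address as above and a placeholder application ID; the per-application response includes, among other fields, a link to the ApplicationMaster’s container logs.
# List applications known to the ResourceManager (JSON response).
curl -s "http://localhost:8088/ws/v1/cluster/apps"

# Details for a single application, including log-related URLs.
curl -s "http://localhost:8088/ws/v1/cluster/apps/application_1700000000000_0001"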
In summary, Spark on YARN logs can be accessed through the local node directories, HDFS when log aggregation is enabled, the `yarn logs` CLI, and the YARN ResourceManager Web UI. Knowing these locations makes debugging and monitoring Spark applications on YARN far more efficient.