AWS EMR Disk Scaling

Laveena Bachani
5 min read · Jan 24, 2022

Running out of disk while running a heavy job is a common problem on an AWS EMR cluster. We can scale compute by adding more task nodes dynamically at run time, but for disk scaling there is no explicit solution.

We will explore how to manage the disk as well as scale it dynamically.

Unhealthy cluster due to high disk utilization.

Debugging running out of disk space:

First of all, how do we debug whether the cluster is out of disk? There are two challenges:

  • Looking at a cluster and manually finding the problem is time consuming and not effective.
  • The number of nodes is not fixed, due to autoscaling.

Solution for debugging: take snapshots of the cluster at regular intervals by logging the output of Linux disk commands. Snapshots of all the nodes were taken through parallel SSH from the master node to all the core and task nodes. We used YARN commands to get the task node and core node IP addresses. Linux commands such as df -h, du and hdfs dfs -du -h helped in finding which folders and disk partitions grow aggressively with load. A minimal sketch of such a snapshot script is shown below.
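
Here is a minimal sketch of that kind of snapshot script, run from the master node (the hadoop SSH user, the output paths, and the exact commands are my illustration, not the original script):

#!/bin/bash
# Sketch: snapshot disk usage of every core/task node reported by YARN.
# Assumes passwordless SSH from the master to the nodes as the hadoop user.
ts=$(date +%Y%m%d-%H%M%S)
outdir=/home/hadoop/disk-snapshots/$ts
mkdir -p "$outdir"

# `yarn node -list` prints one line per running node, e.g.
# "ip-10-0-0-12.ec2.internal:8041  RUNNING  ..."
nodes=$(yarn node -list 2>/dev/null | awk 'NR>2 {split($1, a, ":"); print a[1]}')

for host in $nodes; do
  ssh -o StrictHostKeyChecking=no hadoop@"$host" \
    'hostname; df -h; echo; du -sh /mnt/yarn /mnt/var/log 2>/dev/null' \
    > "$outdir/$host.txt" &
done
wait

# HDFS usage is collected once, from the master.
hdfs dfs -du -h / > "$outdir/hdfs-du.txt"

Running something like this from cron every few minutes builds the time series of snapshots used for the observations below.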

Cluster disk snapshot using the command df -h for the core nodes.

Two observations:

  1. There is always at least 23% utilization, which means some data is never cleaned up.
  2. There are five spikes, i.e. utilization >80%, resulting in an unhealthy cluster. The static disk allocation was not enough for the load.

Other observations from the HDFS logs for the Experian cluster:

  1. /user and /home are almost constant.
  2. /var is increasing.
  3. /tmp is slowly increasing.

1. Purging Existing Files:
For the first problem, we will purge the existing files in storage. EMR core nodes can fill up for various reasons:

  • Spark Application history logs
  • Hive temporary logs
  • YARN containers logs
  • Localized files during a Hadoop/Spark job run using the YARN framework

Let's go through these issues one by one:

YARN logs (/mnt/var/log/hadoop-yarn/):

Ideally, the container logs on the local machines should be deleted by these components, in this order:

  1. By the YARN NodeManager after log aggregation (logic altered by the EMR team).
  2. By LogPusher after the retention period.
  3. By IC's DSM when its heuristics are satisfied.

In YARN, if log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS when the Spark application completes, and after aggregation they are expected to be deleted from the local machine by the NodeManager's AppLogAggregatorImpl. On EMR, however, they are kept on the local machines because LogPusher needs them to push to S3 (LogPusher cannot push logs from HDFS). By default, YARN keeps application logs on HDFS for 48 hours; to change this, set the yarn.log-aggregation.retain-seconds property in /etc/hadoop/conf/yarn-site.xml on all nodes. After the application logs are copied to HDFS, they also remain on the local disk so that LogPusher can push them to Amazon Simple Storage Service (Amazon S3); the default retention period there is four hours. To reduce that retention period, modify the /etc/logpusher/hadoop.config file.
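
As a hedged sketch (43200 seconds, i.e. 12 hours, is only an example value, and the edit has to be repeated on every node, e.g. over the same parallel SSH used for the snapshots):

# Sketch only: set the HDFS retention for aggregated logs to 12 hours.
# If the property already exists in yarn-site.xml, edit its value instead
# of appending a new one.
sudo sed -i 's#</configuration>#  <property><name>yarn.log-aggregation.retain-seconds</name><value>43200</value></property>\n</configuration>#' /etc/hadoop/conf/yarn-site.xml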

The NodeManager should be restarted after changing the yarn-site.xml configs:
sudo stop hadoop-yarn-nodemanager
sudo start hadoop-yarn-nodemanager

Spark Application history logs (hdfs:///var/log/spark/apps/)

Spark job history files are located in /var/log/spark/apps on the core node. When the filesystem cleaner runs, Spark deletes job history files that are older than seven days. To reduce the default retention period, perform the following steps:

  1. Connect to the master node using SSH.
  2. Open the /etc/spark/conf/spark-defaults.conf file on the master node.
  3. Reduce the value of the spark.history.fs.cleaner.maxAge property.

By default, the filesystem history cleaner runs once a day. The frequency is specified in the spark.history.fs.cleaner.interval property.
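
For illustration, a hedged sketch of shortening both settings on the master node (1d and 12h are example values, not the author's; if the properties already exist in the file, edit them in place instead of appending):

# Sketch only: append shorter history-cleaner settings to spark-defaults.conf.
cat <<'EOF' | sudo tee -a /etc/spark/conf/spark-defaults.conf
spark.history.fs.cleaner.enabled   true
spark.history.fs.cleaner.maxAge    1d
spark.history.fs.cleaner.interval  12h
EOF
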
Restart the Spark history server after changing these configs:
sudo stop spark-history-server
sudo start spark-history-server

Hive temporary logs (hdfs:///tmp/hive/)

Abnormal termination of a Hive session leaves dangling files in hdfs:///tmp/hive/ and hdfs:///tmp/hive/<user>/_tez_session_dir. There are a few configs and tools that can clean these up, i.e. setting hive.scratchdir.lock and running the cleardanglingscratchdir tool, but unfortunately they did not work for me in practice. So I ran a cron job to delete files that were not touched in the last 6 hours (job.sh.template).
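
For illustration, a minimal sketch of that kind of cleanup (this is not the author's job.sh.template; the 6-hour cutoff comes from the text above, and paths containing spaces are not handled):

#!/bin/bash
# Sketch: delete files under hdfs:///tmp/hive that have not been modified in
# the last 6 hours, based on the timestamp columns of `hdfs dfs -ls -R`.
cutoff=$(date -d '-6 hours' +%s)
hdfs dfs -ls -R /tmp/hive 2>/dev/null | awk '$1 !~ /^d/ {print $6" "$7" "$8}' | \
while read -r day time path; do
  mtime=$(date -d "$day $time" +%s)
  if [ "$mtime" -lt "$cutoff" ]; then
    hdfs dfs -rm -skipTrash "$path"
  fi
done

# Example crontab entry to run it every hour:
# 0 * * * * /home/hadoop/clean_tmp_hive.sh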

Localized files during a Hadoop/Spark job run
/mnt/yarn/ can fill up for different reasons; it contains two folders:

  1. appcache: During a MapReduce job, intermediate data and working files are written to temporary local files. Because this data includes the potentially very large output of map tasks, you need to ensure that the yarn.nodemanager.local-dirs property, which controls the location of local temporary storage for YARN containers, is configured to use disk partitions that are large enough.
  2. filecache: During resource localization, the YARN NodeManager downloads resources from the supported sources (such as HDFS, HTTP, and so on) to the NodeManager node's local directory.
    After the job finishes, the NodeManagers automatically clean up the localized files immediately by default. Troubleshoot inside and see if application logs are showing up for applications that are currently running.
    Tune the configs below (a sketch of setting them follows after this list):
    yarn.nodemanager.localizer.cache.cleanup.interval-ms: interval between cache cleanups.
    yarn.nodemanager.localizer.cache.target-size-mb: target size of the localizer cache in MB, per local directory.
    yarn.nodemanager.delete.thread-count: number of threads used for cleanup.
    Restart the NodeManager after changing the configs:
    sudo stop hadoop-yarn-nodemanager
    sudo start hadoop-yarn-nodemanager
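
For completeness, here is a hedged sketch of supplying the same properties through an EMR configuration classification at cluster-creation time, instead of editing yarn-site.xml on each node (the values below are examples, not recommendations):

# Sketch only: a "yarn-site" classification with example localizer settings.
cat > yarn-localizer.json <<'EOF'
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.localizer.cache.cleanup.interval-ms": "300000",
      "yarn.nodemanager.localizer.cache.target-size-mb": "5120",
      "yarn.nodemanager.delete.thread-count": "8"
    }
  }
]
EOF
# Pass the file when creating the cluster (other flags omitted):
# aws emr create-cluster ... --configurations file://yarn-localizer.json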

2. Statically Increasing the Disk Volume

Running multiple applications (>=4) fills the /mnt folder to 90% of its capacity, mainly because /mnt/yarn on the core nodes fills up. Based on the analysis, the original disk space wasn't sufficient to support the heavy applications running on the cluster, even with normal traffic; a single application can take up to 700–800 GB of disk. The space can be increased on EMR without rebooting the cluster.
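
A hedged sketch of that kind of online resize (the volume ID, device name, and filesystem type below are placeholders; check lsblk and the filesystem on your node first):

# 1) Grow the EBS volume itself (from anywhere with AWS CLI access):
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 500

# 2) On the affected core node, grow the partition (only if the volume is
#    partitioned) and then the filesystem backing /mnt:
lsblk
sudo growpart /dev/nvme1n1 1
sudo xfs_growfs /mnt        # for XFS; use resize2fs on the device for ext4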

3. Dynamically Increasing Disk Space

EFS is Amazon Elastic File System, which provides shared storage for thousands of EC2 instances with scalable performance and a pay-for-what-you-use model. We can mount EFS on, or in place of, any folder on EMR. Various AWS blogs and my own analysis suggest that the /yarn/usercache folder can fill up under heavy workload.
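
As a rough sketch (the filesystem ID is a placeholder, the exact usercache path depends on yarn.nodemanager.local-dirs on your cluster, and the EFS mount targets and security groups must already allow NFS from the node's subnet), mounting EFS over the usercache directory on a core node looks roughly like:

# Sketch only: swap the YARN usercache directory for an EFS mount.
sudo yum -y install amazon-efs-utils
sudo stop hadoop-yarn-nodemanager              # stop YARN before swapping the dir
sudo mount -t efs -o tls fs-0123456789abcdef0:/ /mnt/yarn/usercache
sudo chown yarn:yarn /mnt/yarn/usercache       # EFS root is owned by root by default
sudo start hadoop-yarn-nodemanager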

I created a dataset of ~2 GB with 42 columns and 9,100,279 rows, and tested Hive queries and Spark jobs on the cluster (1 master and 1 core node) with EFS mounted on the /yarn/usercache folder of the core node. The queries ran successfully, and the cost varied with the space used by the intermediate data these jobs wrote to the usercache folder.

Cluster Snapshot after deploying the solution.
