AWS EMR Memory Scaling

Unhealthy cluster due to high memory usage.
  • Looking at a cluster and manually finding the problem was time-consuming and not effective.
  • The number of nodes was indefinite because of autoscaling.
Cluster memory snapshot using the df -h command on the core nodes. Note that df -h reports disk usage, so "memory" here means disk space on the nodes.
  1. Utilization never drops below 23%, which means some memory was never cleaned up.
  2. There are five spikes, i.e. utilization > 80%, each resulting in an unhealthy cluster. The static memory allocation was not enough for the load.
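The check above can be scripted rather than read off a snapshot by eye. The following is a minimal sketch, assuming POSIX-style `df -P` output (capacity in column 5, mount point in column 6); the function name `flag_full_filesystems` is hypothetical, and the 80% default mirrors the spike threshold mentioned above.

```shell
# Print mount points whose utilization is above a threshold (default 80).
# Reads `df -P` style output on stdin, so it works on live output or a saved snapshot.
# Assumption: mount points contain no spaces.
flag_full_filesystems() {
  local threshold="${1:-80}"
  awk -v limit="$threshold" '
    NR > 1 {                       # skip the header row
      use = $5
      sub(/%/, "", use)            # strip the percent sign from the capacity column
      if (use + 0 > limit) print $6, use "%"
    }'
}
```

On a core node it could be run as `df -P | flag_full_filesystems 80`, which prints one line per filesystem over the threshold and nothing when all are below it.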
  1. user and home are almost constant.
  2. var is increasing.
  3. tmp is slowly increasing.
  • Spark Application history logs
  • Hive temporary logs
  • YARN containers logs
  • Localized files created during a Hadoop/Spark job run using the YARN framework
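To see which of these categories is actually taking up the space on a given node, one option is to rank subdirectories by size. This is only a sketch: the function name `largest_dirs` is hypothetical, the default of /var is an assumption based on the growth noted above, and `du -d 1` assumes a reasonably recent GNU or BSD `du`.

```shell
# List the largest immediate subdirectories of a directory, biggest first.
# Sizes are in kilobytes; the first row is normally the total for the directory itself.
# Unreadable paths are skipped silently, so run as root for a complete picture.
largest_dirs() {
  local dir="${1:-/var}" count="${2:-10}"
  du -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

Running `largest_dirs /var 5` and then repeating it on the biggest entry is a quick way to drill down to the log or cache directory that keeps growing.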
  1. Connect to the master node using SSH.
  2. Open the /etc/spark/conf/spark-defaults.conf file on the master node.
  3. Reduce the value of the spark.history.fs.cleaner.maxAge property.
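Steps 2 and 3 can also be done without opening an editor. The sketch below assumes the usual "key value" line format of spark-defaults.conf; the function name `set_spark_conf` is hypothetical, and the values in the usage note are illustrative rather than recommendations from this article.

```shell
# Set a "key value" line in a Spark properties file.
# Replaces the line if the key is already present, otherwise appends it.
# Assumption: the caller may write to the file and its directory (e.g. a root shell).
set_spark_conf() {
  local file="$1" key="$2" value="$3"
  awk -v k="$key" -v v="$value" '
    $1 == k { print k, v; found = 1; next }   # overwrite an existing setting
    { print }                                 # keep every other line unchanged
    END { if (!found) print k, v }            # append when the key was missing
  ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

For example, `set_spark_conf /etc/spark/conf/spark-defaults.conf spark.history.fs.cleaner.maxAge 12h` would lower the retention age to 12 hours. The cleaner only runs when `spark.history.fs.cleaner.enabled` is `true`, so it is worth checking that setting as well.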
  1. appcache: During a MapReduce job, intermediate data and working files are written to temporary local files. Because this data includes the potentially very large output of map tasks, you need to ensure that the yarn.nodemanager.local-dirs property, which controls the location of local temporary storage for YARN containers, is configured to use disk partitions that are large enough.
  2. filecache: During resource localization, the YARN NodeManager (NM) downloads resources from a supported source (such as HDFS, HTTP, and so on) to the NodeManager node's local directory.
    After the job finishes, the NodeManagers automatically clean up the localized files by default. To troubleshoot, look inside the directory and check whether the application logs present belong to applications that are currently running.
    Change the below configs:
    yarn.nodemanager.localizer.cache.cleanup.interval-ms : Interval in between cache cleanups. : Target size of localizer cache in MB, per local directory.
    Restart the NodeManager after changing the configs.
    sudo stop hadoop-yarn-nodemanager
    sudo start hadoop-yarn-nodemanager
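The config change above can be sketched as yarn-site.xml entries. Two things here are assumptions rather than statements from the article: the second property name, `yarn.nodemanager.localizer.cache.target-size-mb`, is inferred from its description ("Target size of localizer cache in MB"), and the values are purely illustrative. The helper name `yarn_property` is hypothetical.

```shell
# Print one Hadoop-style <property> element, ready to paste inside
# the <configuration> element of yarn-site.xml.
yarn_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Illustrative values only: run the cache cleanup every 5 minutes.
yarn_property yarn.nodemanager.localizer.cache.cleanup.interval-ms 300000
# ASSUMPTION: property name inferred from the description in the text; about 5 GB target.
yarn_property yarn.nodemanager.localizer.cache.target-size-mb 5120
```

The script only prints the XML; it does not modify any file, so the output can be reviewed before it is added to the NodeManager configuration and the service is restarted.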
Cluster Snapshot after deploying the solution.




Laveena Bachani
