AWS EMR Memory Scaling

Unhealthy cluster due to high memory usage.
  • Looking at a cluster and manually finding the problem was time-consuming and not effective.
  • The number of nodes is indefinite because of autoscaling.
Cluster memory snapshot taken with the command df -h on the core nodes (df reports disk space usage per mounted filesystem).
  1. There is always at least 23% utilization, which means some memory was never cleaned up.
  2. There are five spikes, i.e. utilization >80%, each resulting in an unhealthy cluster. The static memory allocation was not enough for the load.
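The spikes above were found by reading the snapshot by eye. A minimal sketch of how the same check could be scripted, assuming POSIX-style df -P output; the function name mounts_over_threshold is hypothetical and the 80% level is the one from the snapshot:

```shell
# Hypothetical helper: read `df -P` style output on stdin and print
# "<mount point> <use%>" for every filesystem above the given threshold.
mounts_over_threshold() {
  local threshold="$1"
  awk -v t="$threshold" '
    NR > 1 {                 # skip the header row
      use = $5               # 5th column is the capacity, e.g. "85%"
      sub(/%/, "", use)
      if (use + 0 > t) print $6, use
    }'
}

# Example on a core node:
#   df -P | mounts_over_threshold 80
```

Since the number of nodes changes with autoscaling, a check like this would have to run on every core node, not only once.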
  1. user and home are almost constant.
  2. var is increasing.
  3. tmp is slowly increasing.
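To see which directories account for this growth, per-directory sizes can be compared between two points in time. A rough sketch, assuming du and sort are available on the node; the function name largest_subdirs is hypothetical:

```shell
# Hypothetical helper: print the N largest immediate subdirectories of a
# directory as "<size in KB> <path>", largest first.
largest_subdirs() {
  local dir="$1"
  local count="${2:-5}"
  # -s summarizes each subdirectory, -x stays on one filesystem,
  # -k reports kilobytes; errors from unreadable paths are discarded.
  du -sxk "$dir"/*/ 2>/dev/null | sort -rn | head -n "$count"
}

# Example on a core node:
#   largest_subdirs /var 5
```

Running it twice some hours apart and comparing the output shows which directory is the one that keeps growing.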
  • Spark Application history logs
  • Hive temporary logs
  • YARN container logs
  • Files localized during a Hadoop/Spark job run on the YARN framework
  1. Connect to the master node using SSH.
  2. Open the /etc/spark/conf/spark-defaults.conf file on the master node.
  3. Reduce the value of the spark.history.fs.cleaner.maxAge property.
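The edit in step 3 could also be scripted. A minimal sketch, assuming the file uses the usual "key value" layout of spark-defaults.conf; the function name set_spark_property is hypothetical and 12h is an illustrative value, not one taken from this cluster:

```shell
# Hypothetical helper: set `key value` in a spark-defaults.conf style file,
# replacing an existing entry for the key or appending one if absent.
set_spark_property() {
  local file="$1"
  local key="$2"
  local value="$3"
  local scratch
  scratch="$(mktemp)"
  # Keep every line whose first field is not the key, then append the setting.
  awk -v k="$key" '$1 != k' "$file" > "$scratch"
  printf '%s %s\n' "$key" "$value" >> "$scratch"
  cat "$scratch" > "$file"   # overwrite in place, keeping owner and mode
  rm -f "$scratch"
}

# Example on the master node (writing under /etc usually needs root):
#   set_spark_property /etc/spark/conf/spark-defaults.conf \
#     spark.history.fs.cleaner.maxAge 12h
```

In stock Spark the default for spark.history.fs.cleaner.maxAge is 7d, and the cleaner only runs when spark.history.fs.cleaner.enabled is true, so that property is worth checking as well. The history server most likely needs a restart before the new value is picked up.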
  1. appcache: During a MapReduce job, intermediate data and working files are written to temporary local files. Because this data includes the potentially very large output of map tasks, you need to ensure that the yarn.nodemanager.local-dirs property, which controls the location of local temporary storage for YARN containers, is configured to use disk partitions that are large enough.
  2. filecache: During resource localization, the YARN NodeManager (NM) downloads resources from a supported source (such as HDFS, HTTP, and so on) to the NodeManager node's local directory.
    After the job finishes, the NodeManagers clean up the localized files automatically and immediately by default. Troubleshoot inside these directories and check whether the application logs shown there belong to applications that are currently running.
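    The inspection described above can start with the size of each cache directory. A sketch, assuming the standard NodeManager layout (usercache/<user>/appcache, usercache/<user>/filecache, and a public filecache at the top level); the function name yarn_cache_sizes is hypothetical, and /mnt/yarn in the example is an assumed value of yarn.nodemanager.local-dirs that should be checked against yarn-site.xml:

```shell
# Hypothetical helper: report the size in KB of every appcache and filecache
# directory under one YARN local directory.
yarn_cache_sizes() {
  local local_dir="$1"
  local d
  # Per-user caches live under usercache/<user>/; the public filecache
  # sits directly under the local directory.
  for d in "$local_dir"/usercache/*/appcache \
           "$local_dir"/usercache/*/filecache \
           "$local_dir"/filecache; do
    [ -d "$d" ] && du -sk "$d"
  done
  return 0
}

# Example on a core node (path is an assumption):
#   yarn_cache_sizes /mnt/yarn
```

    Application directories that remain under appcache for applications that are no longer running are the ones to look at first.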
    Change the configs below:
    yarn.nodemanager.localizer.cache.cleanup.interval-ms: Interval in between cache cleanups.
    yarn.nodemanager.localizer.cache.target-size-mb: Target size of localizer cache in MB, per local directory.
    Restart NodeManager after resetting the configs.
    sudo stop hadoop-yarn-nodemanager
    sudo start hadoop-yarn-nodemanager
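    For reference, the localizer cache settings go into yarn-site.xml as property elements. A small sketch that prints such entries; the function name yarn_site_property is hypothetical, and the values are illustrative only (stock Hadoop defaults are 600000 ms for yarn.nodemanager.localizer.cache.cleanup.interval-ms and 10240 MB for yarn.nodemanager.localizer.cache.target-size-mb):

```shell
# Hypothetical helper: print one <property> element for yarn-site.xml.
yarn_site_property() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Illustrative values, not taken from the article: clean up every
# 5 minutes and cap the cache at 5 GB per local directory.
yarn_site_property yarn.nodemanager.localizer.cache.cleanup.interval-ms 300000
yarn_site_property yarn.nodemanager.localizer.cache.target-size-mb 5120
```

    The printed elements belong inside the configuration element of yarn-site.xml on each node, followed by the NodeManager restart shown above.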
Cluster Snapshot after deploying the solution.




Laveena Bachani
