artur-rodrigues.com

Fetching real CPU load from within an EC2 instance

When monitoring the CPU utilization of a Linux machine, it is common to look at the CPU load reported by top or uptime:

load average: 1.01, 0.75, 0.63
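The three numbers are the 1-, 5- and 15-minute load averages. As a quick illustration, they can be pulled out of an uptime-style line with a few lines of Python (a small sketch; `parse_load_averages` is a name of my choosing, not something from this post):

```python
def parse_load_averages(line):
    """Extract the 1-, 5- and 15-minute load averages from an
    uptime-style line such as 'load average: 1.01, 0.75, 0.63'."""
    _, _, values = line.partition('load average:')
    return [float(v) for v in values.split(',')]

parse_load_averages('load average: 1.01, 0.75, 0.63')
# -> [1.01, 0.75, 0.63]
```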

This is perfectly fine for bare-metal servers. However, a virtualized server is usually given only a fraction of the resources of the host it runs on. As a result, the measurements from top, sar and others can be deceiving, as they are derived from the physical cores of the host server. This is true for EC2 instances but, fortunately, AWS also provides load metrics for the guest instances through CloudWatch, with minute granularity.

Another great thing about EC2 instances is their ability to introspect. The feature is called Instance Metadata and User Data, and it allows an instance to query its own id, as shown:

$ http http://169.254.169.254/latest/meta-data/instance-id
HTTP/1.0 200 OK
Accept-Ranges: bytes
Connection: keep-alive
Content-Length: 10
Content-Type: text/plain
Date: Tue, 04 Aug 2015 12:42:19 GMT
ETag: "2717422980"
Last-Modified: Tue, 07 Jul 2015 15:33:52 GMT
Server: EC2ws

i-6ce03970
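The same lookup is easy to do from Python with requests. Here is a small sketch (`fetch_instance_id` is my own name for it; the injectable `fetch` parameter and the short timeout are my additions, so the call fails fast when run outside an instance):

```python
def fetch_instance_id(fetch=None):
    """Return this instance's id from the EC2 metadata endpoint.

    `fetch` defaults to requests.get and is injectable for testing.
    """
    if fetch is None:
        from requests import get as fetch  # lazy import
    url = 'http://169.254.169.254/latest/meta-data/instance-id'
    # The endpoint only answers from within an instance; time out quickly.
    return fetch(url, timeout=2).text.strip()
```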

With the id in hand, and given the right permissions, it is a matter of making the appropriate calls to CloudWatch to get the instance load. I’ve written a Nagios-compatible Python monitoring script that takes two thresholds: one for the warning state (exit code 1) and another for the critical state (exit code 2):

#!/usr/bin/python

from sys import exit
from argparse import ArgumentParser
from datetime import datetime, timedelta
from operator import itemgetter
from requests import get
from boto3.session import Session


parser = ArgumentParser(description='EC2 load checker')
parser.add_argument(
    '-w', action='store', dest='warn_threshold', type=float, default=0.85)
parser.add_argument(
    '-c', action='store', dest='crit_threshold', type=float, default=0.95)
arguments = parser.parse_args()

# Replace the placeholders with your own credentials
session = Session(
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
    region_name='us-east-1')
cw = session.client('cloudwatch')

instance_id = get(
    'http://169.254.169.254/latest/meta-data/instance-id').text

now = datetime.utcnow()
past = now - timedelta(minutes=30)
# An EndTime in the future ensures the most recent datapoint is included
future = now + timedelta(minutes=10)

results = cw.get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='CPUUtilization',
    Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
    StartTime=past,
    EndTime=future,
    Period=300,
    Statistics=['Average'])

datapoints = results['Datapoints']
if not datapoints:
    print('No datapoints received')
    exit(3)  # Nagios UNKNOWN
last_datapoint = sorted(datapoints, key=itemgetter('Timestamp'))[-1]
utilization = last_datapoint['Average']
load = round((utilization/100.0), 2)
timestamp = str(last_datapoint['Timestamp'])
print("{0} load at {1}".format(load, timestamp))

if load < arguments.warn_threshold:
    exit(0)
elif load > arguments.crit_threshold:
    exit(2)
else:
    exit(1)

Notice that I’m using a 5-minute average, which is the default granularity for instances launched without detailed monitoring enabled. Remember to use your own credentials and to install both requests and boto3.
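The threshold logic at the end of the script can also be factored into a pure function, which makes it easy to unit test away from AWS (a sketch; `load_status` is my own name, and it mirrors the comparisons in the script above):

```python
def load_status(load, warn, crit):
    """Map a load reading to a Nagios exit code:
    0 = OK, 1 = WARNING, 2 = CRITICAL."""
    if load < warn:
        return 0
    if load > crit:
        return 2
    return 1

load_status(0.37, warn=0.7, crit=0.9)
# -> 0
```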

$ /usr/lib/nagios/plugins/check_ec2_load -w 0.7 -c 0.9
0.37 load at 2015-08-04 11:48:00+00:00

$ echo $?
0