July 26, 2020

Basic CloudWatch Setup

Following on from the previous post, I needed to set up some basic alerting and dashboards. I use Opsgenie for paging, but anything that can integrate with SNS will work well. Here I’ll walk through how to create CloudWatch alarms from the agent metrics, plus a basic dashboard, using CloudFormation.

I have done some quick maths: with this approach, it will cost about US$6.00 per instance per month. If you have a lot of instances, you may want to consider what that means for your bill.
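To sanity-check that number, here is a minimal sketch of the arithmetic in Python. The unit prices are my assumptions (approximate list prices, billed per month, with no free-tier allowance applied) rather than figures from this post, and the counts assume three monitored mount points and the eight alarms in the template further down. Check the current CloudWatch pricing for your region before relying on it.

```python
# Rough monthly CloudWatch cost for one instance, in US cents.
# ASSUMPTION: the unit prices below are illustrative list prices, not from the post.
CENTS_PER_CUSTOM_METRIC = 30   # assumed: US$0.30 per custom metric per month
CENTS_PER_ALARM = 10           # assumed: US$0.10 per standard-resolution alarm per month
CENTS_PER_DASHBOARD = 300      # assumed: US$3.00 per dashboard per month


def monthly_cost_cents(mount_points: int, alarms: int, dashboards: int = 1) -> int:
    """Estimate the monthly cost in cents, ignoring any free-tier allowance."""
    # Two disk metrics (used_percent, inodes_free) per mount point, plus memory and swap.
    custom_metrics = 2 * mount_points + 2
    return (
        custom_metrics * CENTS_PER_CUSTOM_METRIC
        + alarms * CENTS_PER_ALARM
        + dashboards * CENTS_PER_DASHBOARD
    )


# Three mount points (/, /boot, /u01) and the eight alarms in the template.
cost = monthly_cost_cents(mount_points=3, alarms=8)
print(f"US${cost / 100:.2f} per instance per month")
```

Under those assumptions the dashboard is the largest single item, so sharing one dashboard across several instances would be the first place to look for savings.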

Opsgenie and SNS setup

Adding a new SNS topic that can send messages to Opsgenie is really straightforward. Opsgenie have a great guide.

Once you have your API key, you can create the topic with this CloudFormation:

AWSTemplateFormatVersion: "2010-09-09"
Description: SNS Topic for Opsgenie
Parameters:
  OpsgenieAPIKey:
    Type: String
    Description: Your Opsgenie API Key.
    NoEcho: true
Resources:
  opsgeniesns:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: opsgenie
      Subscription:
        - Endpoint: !Join [ '', [ 'https://api.opsgenie.com/v1/json/cloudwatch?apiKey=', !Ref OpsgenieAPIKey ]]
          Protocol: https

I deployed this within my Control Tower/StackSets configuration, but you can also deploy it into a single account.

CloudWatch Configuration

CWAgent Metrics

In the last post we created a basic configuration. I’ll quickly recap here:

/etc/basic_cloudwatch.json

{
    "agent": {
        "metrics_collection_interval": 300,
        "run_as_user": "root"
    },
    "metrics": {
        "append_dimensions": {
            "InstanceId": "${aws:InstanceId}"
        },
        "metrics_collected": {
            "disk": {
                "drop_device": true,
                "ignore_file_system_types": [
                    "overlay",
                    "sysfs",
                    "devtmpfs",
                    "tmpfs",
                    "nfs4"
                ],
                "measurement": [
                    "used_percent",
                    "inodes_free"
                ],
                "resources": [
                    "*"
                ]
            },
            "mem": {
                "measurement": [
                    "mem_used_percent"
                ]
            },
            "swap": {
                "measurement": [
                    "swap_used_percent"
                ]
            }
        }
    }
}

This is loaded into the CloudWatch agent with the following command:

/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -s -m ec2 -c file:/etc/basic_cloudwatch.json

The default namespace is CWAgent, and you can find metrics from the OS in here:

[Image: CWAgent Namespace]

[Image: CWAgent Metrics]
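The metric names that show up in the CWAgent namespace are not quite the names written in the configuration: as far as I can tell, the agent prefixes each measurement with its section name unless the prefix is already there, which is why "used_percent" under "disk" is alarmed on as disk_used_percent later in this post. A minimal Python sketch of that naming rule follows. The rule as coded is my assumption about the agent's behaviour, and it only handles measurements given as plain strings.

```python
import json

# A trimmed copy of the agent configuration, inlined to keep the sketch self-contained.
CONFIG = json.loads("""
{
    "metrics": {
        "metrics_collected": {
            "disk": {"measurement": ["used_percent", "inodes_free"]},
            "mem": {"measurement": ["mem_used_percent"]},
            "swap": {"measurement": ["swap_used_percent"]}
        }
    }
}
""")


def metric_names(config: dict) -> list[str]:
    """Return the metric names expected in the CWAgent namespace."""
    names = []
    for plugin, settings in config["metrics"]["metrics_collected"].items():
        for measurement in settings["measurement"]:
            # Assumed convention: prefix with the section name unless already present.
            prefix = plugin + "_"
            if measurement.startswith(prefix):
                names.append(measurement)
            else:
                names.append(prefix + measurement)
    return names


print(metric_names(CONFIG))
```

These are the names to use for MetricName when writing alarms or dashboard widgets against the agent metrics.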

Dashboard

Using these you can create dashboards along with AWS/EC2 metrics:

[Image: Example Dashboard]
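Under the hood a dashboard is a JSON document containing a list of widgets, and each metric in a widget is an array of namespace, metric name, then dimension name/value pairs. As a rough illustration, this Python sketch builds one disk widget in the same shape as the template later in this post. The function name, instance ID and region are hypothetical, chosen only for the example.

```python
import json


def disk_widget(instance_id: str, region: str, fstype: str, paths: list[str]) -> dict:
    """Build one dashboard widget plotting disk_used_percent for each path."""
    metrics = [
        ["CWAgent", "disk_used_percent", "path", path,
         "InstanceId", instance_id, "fstype", fstype,
         {"label": f"{path} disk_used_percent"}]
        for path in paths
    ]
    return {
        "type": "metric",
        "x": 0, "y": 0, "width": 12, "height": 9,
        "properties": {
            "metrics": metrics,
            "view": "timeSeries",
            "stacked": False,
            "region": region,
            "title": "disk space",
            "stat": "Average",
            "period": 300,
            "yAxis": {"left": {"min": 0, "max": 100}},
        },
    }


# Hypothetical instance ID and region, for illustration only.
body = {"widgets": [disk_widget("i-0123456789abcdef0", "ap-southeast-2", "xfs", ["/", "/boot"])]}
print(json.dumps(body, indent=2))
```

The dimension values have to match what the agent publishes exactly (path, InstanceId and fstype here), otherwise the widget renders with no data.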

Alarms

And you can also create Alarms:

[Image: Example Alarms]
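An alarm compares a statistic against a threshold over a number of evaluation periods. To make that concrete, here is a simplified Python model of a GreaterThanThreshold alarm with a threshold of 90 and two evaluation periods, the values used in the template below. It is only a sketch: it assumes every period has a datapoint, and it ignores missing-data handling and the INSUFFICIENT_DATA state. The sample values are made up.

```python
def alarm_states(datapoints: list[float], threshold: float, evaluation_periods: int) -> list[str]:
    """Return the alarm state after each datapoint for a GreaterThanThreshold alarm.

    Simplified model: the alarm is in ALARM only when the most recent
    `evaluation_periods` datapoints all exceed the threshold.
    """
    states = []
    for i in range(len(datapoints)):
        # The evaluation window is the last `evaluation_periods` datapoints seen so far.
        window = datapoints[max(0, i - evaluation_periods + 1): i + 1]
        breaching = len(window) == evaluation_periods and all(v > threshold for v in window)
        states.append("ALARM" if breaching else "OK")
    return states


# Five-minute averages of disk_used_percent (made-up values).
print(alarm_states([85, 92, 89, 91, 95, 88], threshold=90, evaluation_periods=2))
```

With a 300 second period and two evaluation periods, a single spike above 90% does not page anyone; the value has to stay above the threshold for two consecutive five-minute averages.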

CloudFormation

Of course, because I need to deploy this several times, I’ve created a template:

AWSTemplateFormatVersion: "2010-09-09"
Description: Setup an instance for auto-healing and basic monitoring.
Parameters:
  InstanceId:
    Type: String
    Description: The instance Id you want setup with alarms.
  fstype:
    Type: String
    Description: The filesystem type used.
    AllowedValues:
      - xfs
      - ext4
    Default: xfs
  monitoru01bool:
    Type: String
    Description: Set this to true if you want/have a /u01 partition to monitor.
    AllowedValues:
      - true
      - false
    Default: false
  monitorbootbool:
    Type: String
    Description: Set this to true if you want/have a /boot partition to monitor.
    AllowedValues:
      - true
      - false
    Default: true
  AlarmSNSTopic:
    Type: String
    Description: Name of the SNS Topic to send alarms to (must be in the same region and account)
Conditions:
  monitoru01: !Equals [ !Ref monitoru01bool, true ]
  monitorboot: !Equals [ !Ref monitorbootbool, true ]
Resources:
  ec2recovery:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Trigger a recovery when the system status check fails for 15 consecutive minutes.
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Statistic: Minimum
      Period: 60
      EvaluationPeriods: 15
      ComparisonOperator: GreaterThanThreshold
      Threshold: 0
      AlarmActions: [ !Sub "arn:aws:automate:${AWS::Region}:ec2:recover" ]
      Dimensions:
      - Name: InstanceId
        Value: !Ref InstanceId
  rootspacealarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      ComparisonOperator: GreaterThanThreshold
      MetricName: disk_used_percent
      Namespace: CWAgent
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 90
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
        - Name: path
          Value: /
        - Name: fstype
          Value: !Ref fstype
      AlarmActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
      OKActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
  bootspacealarm:
    Type: AWS::CloudWatch::Alarm
    Condition: monitorboot
    Properties:
      ComparisonOperator: GreaterThanThreshold
      MetricName: disk_used_percent
      Namespace: CWAgent
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 90
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
        - Name: path
          Value: /boot
        - Name: fstype
          Value: !Ref fstype
      AlarmActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
      OKActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
  u01spacealarm:
    Type: AWS::CloudWatch::Alarm
    Condition: monitoru01
    Properties:
      ComparisonOperator: GreaterThanThreshold
      MetricName: disk_used_percent
      Namespace: CWAgent
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 90
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
        - Name: path
          Value: /u01
        - Name: fstype
          Value: !Ref fstype
      AlarmActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
      OKActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
  memoryalarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      ComparisonOperator: GreaterThanThreshold
      MetricName: mem_used_percent
      Namespace: CWAgent
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 90
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      AlarmActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
      OKActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
  swapalarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      ComparisonOperator: GreaterThanThreshold
      MetricName: swap_used_percent
      Namespace: CWAgent
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 90
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      AlarmActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
      OKActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
  cpualarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      ComparisonOperator: GreaterThanThreshold
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 90
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      AlarmActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
      OKActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
  cpuburstcreditalarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      ComparisonOperator: LessThanThreshold
      MetricName: CPUCreditBalance
      Namespace: AWS/EC2
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 100
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      AlarmActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
      OKActions: [ !Sub "arn:aws:sns:${AWS::Region}:${AWS::AccountId}:${AlarmSNSTopic}"]
  dashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardBody: !Sub '{
    "widgets": [
        {
            "type": "metric",
            "x": 0,
            "y": 0,
            "width": 12,
            "height": 9,
            "properties": {
                "metrics": [
                    [ "CWAgent", "disk_used_percent", "path", "/", "InstanceId", "${InstanceId}", "fstype", "${fstype}", { "label": "/ disk_used_percent" }  ],
                    [ ".", "disk_inodes_free", ".", ".", ".", ".", ".", ".", { "yAxis": "right", "label": "/ disk_inodes_free" } ],
                    [ "CWAgent", "disk_used_percent", "path", "/boot", "InstanceId", "${InstanceId}", "fstype", "${fstype}", { "label": "/boot disk_used_percent" } ],
                    [ ".", "disk_inodes_free", ".", ".", ".", ".", ".", ".", { "yAxis": "right", "label": "/boot disk_inodes_free" } ],
                    [ "CWAgent", "disk_used_percent", "path", "/u01", "InstanceId", "${InstanceId}", "fstype", "${fstype}", { "label": "/u01 disk_used_percent" } ],
                    [ ".", "disk_inodes_free", ".", ".", ".", ".", ".", ".", { "yAxis": "right", "label": "/u01 disk_inodes_free" } ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "${AWS::Region}",
                "title": "disk space",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "left": {
                        "min": 0,
                        "max": 100
                    },
                    "right": {
                        "min": 0
                    }
                }
            }
        },
        {
            "type": "metric",
            "x": 12,
            "y": 0,
            "width": 12,
            "height": 9,
            "properties": {
                "metrics": [
                    [ "CWAgent", "mem_used_percent", "InstanceId", "${InstanceId}" ],
                    [ ".", "swap_used_percent", ".", "." ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "${AWS::Region}",
                "title": "memory and swap percent",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "left": {
                        "min": 0,
                        "max": 100
                    },
                    "right": {
                        "min": 0,
                        "showUnits": false
                    }
                }
            }
        },
        {
            "type": "metric",
            "x": 0,
            "y": 9,
            "width": 12,
            "height": 9,
            "properties": {
                "metrics": [
                    [ "AWS/EC2", "CPUUtilization", "InstanceId", "${InstanceId}" ],
                    [ ".", "CPUCreditBalance", ".", ".", { "yAxis": "right" } ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "${AWS::Region}",
                "title": "CPU",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "left": {
                        "min": 0,
                        "max": 100
                    },
                    "right": {
                        "min": 0,
                        "showUnits": true
                    }
                }
            }
        },
        {
            "type": "metric",
            "x": 12,
            "y": 9,
            "width": 12,
            "height": 9,
            "properties": {
                "metrics": [
                    [ "AWS/EC2", "NetworkIn", "InstanceId", "${InstanceId}" ],
                    [ ".", "NetworkPacketsIn", ".", ".", { "yAxis": "right" } ],
                    [ ".", "NetworkPacketsOut", ".", ".", { "yAxis": "right" } ],
                    [ ".", "NetworkOut", ".", "." ]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "${AWS::Region}",
                "title": "Network",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "right": {
                        "showUnits": true,
                        "min": 0
                    },
                    "left": {
                        "min": 0
                    }
                }
            }
        }
    ]
  }'
      DashboardName: !Ref InstanceId

This has a few specifics for the environment I’m dealing with, but it can easily be modified to fit yours.
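Since the template takes one instance per stack, deploying it several times mostly means generating a set of parameters per instance. The following Python sketch builds that parameter list in the ParameterKey/ParameterValue shape the CloudFormation API expects. The helper name and the instance IDs are hypothetical, and actually creating the stacks (for example with the AWS CLI or an SDK) is left out.

```python
def stack_parameters(instance_id: str, topic_name: str, fstype: str = "xfs",
                     monitor_boot: bool = True, monitor_u01: bool = False) -> list[dict]:
    """Build the parameter list for one instance, using the template's parameter names."""
    values = {
        "InstanceId": instance_id,
        "fstype": fstype,
        # The template expects the strings "true" / "false" for its boolean switches.
        "monitorbootbool": str(monitor_boot).lower(),
        "monitoru01bool": str(monitor_u01).lower(),
        "AlarmSNSTopic": topic_name,
    }
    return [{"ParameterKey": key, "ParameterValue": value} for key, value in values.items()]


# Hypothetical instance IDs; "opsgenie" is the topic name from the SNS template above.
for instance_id in ["i-0123456789abcdef0", "i-0fedcba9876543210"]:
    print(instance_id, stack_parameters(instance_id, "opsgenie", monitor_u01=True))
```

The defaults mirror the template's own defaults, so only the instance ID and topic name are required for a typical host.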


Summary

As you can see, it is really easy to get some basic monitoring set up that alerts you via Opsgenie, along with a simple dashboard that gives you a view of your EC2 system performance.

© Greg Cockburn
