Identify CloudWatch metrics for real-time application monitoring

IT teams running real-time apps on AWS must monitor application health and performance. Amazon CloudWatch is a useful tool for monitoring SQS, Kinesis and Lambda.

Real-time applications running on AWS are easily scalable, highly maintainable and quick to update. But it's difficult to test and monitor real-time apps built using microservices in AWS.

Real-time application monitoring requires more than just avoiding errors; administrators must also ensure that events process on time. Amazon CloudWatch enables real-time application monitoring for AWS utilities, including Amazon Simple Queue Service (SQS), Amazon Kinesis and AWS Lambda functions, to prevent backlogs.

Monitor SQS

While SQS is suitable for near-real-time apps that require reliability more than speed, it's also very efficient for managing costs and scaling services that may take longer than five minutes to process. Because SQS is a true queue, CloudWatch metrics monitor the number of messages being processed, the number of messages written and the number of deleted messages.

One key metric that is often overlooked is the number of messages deleted. IT teams can use this metric to determine if processing engines are running. For example, when processing news releases, some teams know that if the processing engine hasn't deleted a message from the queue in over 15 minutes, there's probably an issue.
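As a rough sketch of that check, the Python below builds the arguments for a CloudWatch alarm that fires when nothing has been deleted from a queue for 15 minutes. The queue name `news-releases`, the alarm naming scheme, the five-minute period and the choice to treat missing data as a breach are all illustrative assumptions, not settings taken from the example above.

```python
def stalled_queue_alarm(queue_name, minutes=15, period_s=300):
    """Build put_metric_alarm arguments for a queue whose consumers may have stopped."""
    return {
        "AlarmName": f"{queue_name}-no-deletes",      # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "NumberOfMessagesDeleted",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Sum",
        "Period": period_s,                            # assumed evaluation period, in seconds
        "EvaluationPeriods": (minutes * 60) // period_s,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",     # alarm when fewer than one delete
        "TreatMissingData": "breaching",               # assumption: no data means no deletes
    }
```

With the boto3 SDK and valid credentials, the dictionary could be passed straight through, for example `boto3.client("cloudwatch").put_metric_alarm(**stalled_queue_alarm("news-releases"))`.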

The number of messages in the queue can also be a good indicator of the queue backlog. This metric can tell DevOps teams if they need to take scaling actions, but it can also be configured to trigger scaling operations automatically. For example, if a queue has more than 1,000 messages that need to be processed, teams can automatically add more processing servers to handle the load. When the queue drops to normal levels for an hour, the team can terminate any extra instances that were started -- keeping a minimum number of instances running to maintain smooth operations.
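The scaling rule described above can be sketched as a small decision function. This is an illustration under stated assumptions rather than an Auto Scaling policy: the function name, the five-minute sampling interval, the one-instance step and the minimum and maximum counts are all hypothetical, and the backlog readings are assumed to come from a queue-depth metric such as `ApproximateNumberOfMessagesVisible`.

```python
def desired_instances(backlog_history, current, minimum=2, maximum=10,
                      scale_out_at=1000, quiet_samples=12):
    """Return the instance count to run next.

    backlog_history holds queue depths, oldest first, one sample every
    five minutes, so 12 samples cover one hour.
    """
    if not backlog_history:
        return max(current, minimum)
    # More than 1,000 waiting messages: add one processing server.
    if backlog_history[-1] > scale_out_at:
        return min(current + 1, maximum)
    recent = backlog_history[-quiet_samples:]
    # Scale in only after a full hour of samples at or below the threshold.
    if len(recent) == quiet_samples and all(d <= scale_out_at for d in recent):
        return minimum
    # Otherwise hold the current size, never dropping below the minimum.
    return max(current, minimum)
```

Requiring a full hour of quiet samples before scaling in keeps the fleet from shrinking during a brief lull in the middle of a burst.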

Monitor Kinesis

Amazon Kinesis enables developers to build applications that are decoupled between services while maintaining quick throughput. But Kinesis doesn't have queues like SQS, so it's not as easy to monitor if there is a backlog or delay somewhere along the pipeline. The service does offer a small set of metrics that help IT teams identify which types of delays they could encounter. There are two key metrics important for real-time application monitoring with any Kinesis stream: GetRecords.IteratorAgeMilliseconds and GetRecords.Latency.

The IteratorAgeMilliseconds metric tells DevOps teams the age of the oldest record in the stream that hasn't been processed. For example, DevOps teams may want to instruct CloudWatch to send out an alert if that age exceeds five seconds. This delay is the time from when a record is put into the stream until it is processed. A high IteratorAge could indicate that IT teams need to allow more processing to happen simultaneously by scaling out more processing instances. Or, if the application is running in Lambda, a delay could indicate that the Lambda function is being throttled.

Latency tells DevOps teams how long a single record takes to be processed, which excludes any backlog. It measures only the time from the Get call until the record is processed or removed from the stream. If this number is very high, it may indicate an issue with the function that handles stream events, or that the function simply needs to be reworked to process events faster.
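One way to read the two Kinesis metrics together is sketched below. The function and its return labels are hypothetical, the five-second age limit comes from the example above, and the one-second latency limit is an assumption picked only for illustration. The inputs are assumed to be recent datapoints for `GetRecords.IteratorAgeMilliseconds` and `GetRecords.Latency`.

```python
def diagnose_stream(iterator_age_ms, latency_ms,
                    max_age_ms=5000, max_latency_ms=1000):
    """Classify a stream's state from its two delay metrics."""
    if latency_ms > max_latency_ms:
        # Each record is slow regardless of backlog: look at the processing function.
        return "slow-processing"
    if iterator_age_ms > max_age_ms:
        # Records are waiting although each one is handled quickly:
        # scale out consumers, or check whether a Lambda consumer is throttled.
        return "backlog"
    return "healthy"
```

Latency is checked first because a slow processing function will usually push iterator age up as well, so fixing it may clear the backlog without any scaling.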

Monitor Lambda functions

In the real-time application monitoring world, DevOps teams also need to examine a few key metrics for Lambda functions. Even though Lambda handles automatic scaling, there are safeguards in place to prevent functions from running out of control.

There's a built-in limit to how many functions can run concurrently per region. IT teams can increase these limits by submitting support tickets through the AWS console; however, developers need to know if they're hitting the limit before requesting an increase. Adding alerts to the Throttled Invocations metric in CloudWatch will identify these issues.

Additionally, CloudWatch enables DevOps teams to monitor Invocation Count, Duration and Errors. Alerts should be placed on the Invocation Errors metric to notify developers when a function has an unusually high number of failures. Logs should also be added to notify teams of any standard error, but logs aren't always enough -- unexpected events can occur. It's also important to add alerts to the Invocation Duration metric. If the average invocation duration exceeds four minutes, for example, consider using a service other than Lambda, which had a five-minute maximum execution time at the time of writing (the limit has since been raised to 15 minutes).
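The three Lambda alerts discussed in this section might be defined as in the sketch below. It uses the CloudWatch API metric names in the `AWS/Lambda` namespace (`Throttles`, `Errors` and `Duration`, the last reported in milliseconds), which correspond to the console labels used above. The four-minute duration threshold comes from the example above; the function name, alarm names, five-minute period and error threshold are assumptions.

```python
FOUR_MINUTES_MS = 4 * 60 * 1000  # Duration is reported in milliseconds


def lambda_alarms(function_name, max_errors=5):
    """Build put_metric_alarm arguments for throttling, errors and duration."""
    base = {
        "Namespace": "AWS/Lambda",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Period": 300,                      # assumed five-minute evaluation window
        "EvaluationPeriods": 1,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    specs = [
        ("Throttles", "Sum", 0),            # any throttled invocation at all
        ("Errors", "Sum", max_errors),      # assumed tolerance for failures per window
        ("Duration", "Average", FOUR_MINUTES_MS),
    ]
    return [
        {**base,
         "AlarmName": f"{function_name}-{metric.lower()}",  # hypothetical naming scheme
         "MetricName": metric,
         "Statistic": statistic,
         "Threshold": threshold}
        for metric, statistic, threshold in specs
    ]
```

Each dictionary in the returned list could be passed to boto3's `put_metric_alarm` as keyword arguments, typically alongside an `AlarmActions` entry that names a notification topic.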

Analyze metrics and detect trends

Real-time application monitoring is useful, but it's also important to detect trends over longer periods of time. For example, a media company may notice that it has a high volume of stories to process between 5 a.m. and 11 a.m. The IT team can set up Auto Scaling rules to automatically spin up extra servers to process events at 4:30 a.m. and then return to normal at 11:30 a.m. This ensures there's enough capacity and the company doesn't waste time waiting for a backlog to clear. The team could also put rules in place to automatically handle random bursts of traffic.
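The 4:30 a.m. and 11:30 a.m. rules could be expressed as two scheduled actions on an Auto Scaling group, sketched below as arguments for boto3's `put_scheduled_update_group_action`. The group name, action names and capacity numbers are hypothetical. Recurrence uses cron syntax and is evaluated in UTC by default, so the hours would need adjusting for the newsroom's local time.

```python
def morning_schedule(group_name, normal=2, peak=6):
    """Return two scheduled-action definitions: scale out at 04:30, back at 11:30."""
    def action(name, cron, capacity):
        return {
            "AutoScalingGroupName": group_name,
            "ScheduledActionName": f"{group_name}-{name}",  # hypothetical naming scheme
            "Recurrence": cron,          # cron fields: minute hour day month weekday
            "DesiredCapacity": capacity,
        }

    return [
        action("morning-scale-out", "30 4 * * *", peak),
        action("morning-scale-in", "30 11 * * *", normal),
    ]
```

Each dictionary would be passed as keyword arguments to `put_scheduled_update_group_action` on a boto3 `autoscaling` client.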

Real-time application monitoring in AWS enables IT teams to detect patterns and proactively take action. It's not necessarily an issue if an event is delayed a minute or two, but it is a problem if all events are delayed by two minutes each morning.
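To illustrate the difference between a one-off delay and a recurring one, the sketch below flags hours of the day in which events were delayed on several distinct dates. The function name, the two-minute threshold and the three-day minimum are assumptions, and the samples are assumed to be delay measurements already exported from monitoring data.

```python
from collections import defaultdict
from datetime import datetime


def recurring_delay_hours(samples, threshold_s=120, min_days=3):
    """Return hours of day (0-23) that saw delays on at least min_days dates.

    samples is an iterable of (datetime, delay_seconds) pairs.
    """
    days_by_hour = defaultdict(set)
    for timestamp, delay_s in samples:
        if delay_s >= threshold_s:
            # Record the calendar date so repeated delays on one day count once.
            days_by_hour[timestamp.hour].add(timestamp.date())
    return sorted(hour for hour, days in days_by_hour.items()
                  if len(days) >= min_days)
```

A single slow afternoon produces no result, while a two-minute delay at 6 a.m. on three separate mornings is reported as hour 6.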
