AWS Improves Container Monitoring, Part 3: Anomaly Detection

AWS is previewing Anomaly Detection, a CloudWatch component that allows for greater visibility into containerized applications using microservice architectures.

Anomaly Detection is aimed at improving monitoring by making it easier to set up alarms.

Traditional alarms in AWS are based on static thresholds. For example, you might configure an alarm to fire once an instance's CPU utilization exceeds 90%. The user has to do the work to make sure that the threshold is reasonable, and the threshold stays fixed until someone edits the alarm.

The problem with static thresholds is that they do not account for seasonality or other variance. Setting the threshold too low will result in false, non-actionable alerts triggered by normal workload behavior. False alerts waste time and desensitize administrators to the truly important alerts. If you set the threshold too high, performance might suffer before you ever receive an alert, or you might receive an alert only after the situation has already deteriorated.
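The trade-off can be seen in a few lines of Python. This is only an illustrative sketch: the `static_alerts` helper and the hourly CPU figures are made up for the example and are not taken from AWS.

```python
def static_alerts(samples, threshold):
    """Return the indices of samples that exceed a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Hypothetical hourly CPU utilization (%) for one day: a routine
# business-hours peak, then an unusual jump at a quiet hour (index 22).
cpu = [20, 18, 17, 16, 18, 25, 40, 60, 75, 85, 88, 86,
       84, 85, 87, 83, 70, 55, 40, 30, 25, 22, 65, 21]

low = static_alerts(cpu, 80)   # fires throughout the routine peak
high = static_alerts(cpu, 90)  # never fires, not even for the odd jump

print(low)
print(high)
```

With the threshold at 80, every alert lands on the routine daytime peak and the unusual late-evening jump goes unnoticed. With the threshold at 90, nothing fires at all.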

Anomaly Detection solves this problem by allowing AWS users to have more dynamic thresholds on their alarms. It creates a band around performance metrics (e.g. CPU or memory utilization) based on the range within which those metrics would be expected to fall.

The service uses machine learning to determine an appropriate default band size, and it lets the user adjust that size. It also uses machine learning to adapt to the usage patterns of the workload in your environment.

For example, it will learn when normal spikes in usage occur and increase the threshold to avoid setting off false alarms for these normal spikes (e.g. peak hours for visits to your website). Currently, it uses the past two weeks of usage data.
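To make the idea of a learned band concrete, here is a minimal sketch that assumes a deliberately simple stand-in for the model: a per-hour mean plus or minus a multiple of the standard deviation. AWS does not describe its model this way; the `hourly_band` function, the `width` parameter, and the sample data are all hypothetical.

```python
from statistics import mean, pstdev


def hourly_band(history, width=2.0):
    """Build a per-hour (lower, upper) band from hourly samples.

    history: list of (hour_of_day, value) pairs, e.g. two weeks of data.
    width:   band half-width in standard deviations (user-adjustable).
    """
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    band = {}
    for hour, values in by_hour.items():
        center = mean(values)
        spread = pstdev(values)
        band[hour] = (center - width * spread, center + width * spread)
    return band


# Hypothetical two weeks of CPU data: quiet at 03:00, busy at 14:00.
history = []
for day in range(14):
    wobble = day % 3 - 1  # -1, 0, or +1
    history.append((3, 15 + wobble))
    history.append((14, 85 + 2 * wobble))

band = hourly_band(history)
low_3, high_3 = band[3]
low_14, high_14 = band[14]
```

In this toy model, 85% CPU falls inside the 14:00 band but far above the 03:00 band, so the same reading is normal at peak time and anomalous overnight. Raising `width` widens the band, which loosely mirrors editing the band size.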

Anomaly Detection also allows users to exclude certain time periods from the machine learning training. This allows you to avoid training it on outlier data that might skew the model.
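Conceptually, excluding a period amounts to dropping its samples before the model sees them. The sketch below illustrates that idea only. The `drop_excluded` helper, the dates, and the load-test scenario are hypothetical, and this is not how AWS implements the feature.

```python
from datetime import datetime


def drop_excluded(samples, excluded_ranges):
    """Keep only samples whose timestamp falls outside every excluded range.

    samples:         list of (timestamp, value) pairs
    excluded_ranges: list of (start, end) pairs; start inclusive, end exclusive
    """
    return [
        (ts, value)
        for ts, value in samples
        if not any(start <= ts < end for start, end in excluded_ranges)
    ]


samples = [
    (datetime(2019, 8, 1, 12), 40.0),
    (datetime(2019, 8, 2, 12), 99.0),  # hypothetical load test
    (datetime(2019, 8, 3, 12), 42.0),
]
load_test = [(datetime(2019, 8, 2, 0), datetime(2019, 8, 3, 0))]

training = drop_excluded(samples, load_test)
```

Only the two ordinary samples remain in `training`, so the load-test spike cannot inflate the band. In the CloudWatch API, exclusions of this kind are supplied as `ExcludedTimeRanges` in the anomaly detector's configuration.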

Once the band is established, users can decide whether to set the alarm for usage patterns that are outside the band, below the band, or above the band.
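The three choices can be sketched as follows. The operator names and the `ANOMALY_DETECTION_BAND` expression come from the CloudWatch API, but everything else is an assumption: the `breaches` and `band_alarm_params` helpers, the alarm name, and the period are made up for illustration. The parameter layout should be checked against the current API reference before use, particularly because the feature is in preview.

```python
# CloudWatch comparison operators used with anomaly-detection bands.
OPERATORS = {
    "outside": "LessThanLowerOrGreaterThanUpperThreshold",
    "below": "LessThanLowerThreshold",
    "above": "GreaterThanUpperThreshold",
}


def breaches(value, lower, upper, mode):
    """Return True if value violates the band in the chosen direction."""
    if mode == "above":
        return value > upper
    if mode == "below":
        return value < lower
    if mode == "outside":
        return value < lower or value > upper
    raise ValueError(f"unknown mode: {mode!r}")


def band_alarm_params(name, namespace, metric, mode, width=2):
    """Build keyword arguments shaped for CloudWatch's PutMetricAlarm."""
    return {
        "AlarmName": name,
        "ComparisonOperator": OPERATORS[mode],
        "EvaluationPeriods": 1,
        "ThresholdMetricId": "band",  # alarm compares m1 against this band
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric},
                    "Period": 300,
                    "Stat": "Average",
                },
            },
            {"Id": "band", "Expression": f"ANOMALY_DETECTION_BAND(m1, {width})"},
        ],
    }


params = band_alarm_params("cpu-above-band", "AWS/ECS", "CPUUtilization", "above")
```

A dictionary like `params` could be passed to boto3 as `cloudwatch.put_metric_alarm(**params)`. The sketch stops short of making that call so that it runs without AWS credentials.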

Users can receive alerts from the alarm via email or SMS and can even trigger a Lambda function from the alert.

Our Take

False positives are a large problem for operations teams. Relying on static thresholds for alarms does not reflect reality in all scenarios.

Out-of-the-box thresholds that are dynamic and based on machine learning will be a welcome solution to cloud operations teams.

Anomaly Detection should make monitoring easier and give AWS users a head start, but AWS users should avoid treating it as a set-it-and-forget-it feature and should not expect it to do all the heavy lifting for them.

IT professionals will have to experiment with Anomaly Detection to ensure that it suits their service’s performance-monitoring needs. They may still need to supplement it with their own or a third-party solution to provide adequate monitoring for certain workloads.


Want to Know More?

AWS Improves Container Monitoring, Part 1: Observability

AWS Improves Container Monitoring, Part 2: Container Insights