API Outage

Incident Report for Robin

Postmortem

API Outage (2016-04-11 UTC)

Summary

A faulty load balancer policy resulted in approximately 70 minutes of degraded availability for the Robin API between 07:30 and 08:40 UTC on Monday, April 11. During this time, web, mobile, and room display clients were unable to connect to the service consistently.

Timeline

The issue was identified at 07:30 UTC (03:30 EDT) after we observed a sudden spike in service latency that made the API unreachable for clients. Full API service was restored by 08:40 UTC after we overrode a faulty autoscaling policy that had prevented new server instances from being launched.

Contributing Factors

Our API's load balancer runs an automated health check that identifies server instances needing replacement, as well as the need for additional instances under increased load. An issue with this health check caused all API instances to report as unhealthy simultaneously. In response, our autoscaling policy terminated all of the marked instances and attempted to replace them with new, healthy boxes.

When the autoscaling policy attempted to replace the API instances, it was unable to do so because the process for launching new instances was completely suspended. This led to a situation where the API knew it was not healthy but lacked the ability to self-correct.
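
To make the failure mode concrete, the following is a minimal sketch of the interaction described above. It is an illustration only: the `Group` class, its fields, and the instance names are hypothetical and do not represent Robin's actual infrastructure or any cloud provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """Hypothetical stand-in for an autoscaling group (illustration only)."""
    desired: int
    launch_suspended: bool
    instances: list = field(default_factory=list)

    def reconcile(self, health):
        """Terminate instances reported unhealthy, then try to restore capacity."""
        # Keep only the instances that the health check reports as healthy.
        self.instances = [i for i in self.instances if health.get(i, False)]
        if self.launch_suspended:
            # Replacements cannot be launched, so capacity is not restored.
            return
        replacement = 0
        while len(self.instances) < self.desired:
            replacement += 1
            self.instances.append(f"new-{replacement}")


# Every instance reports unhealthy at once while launching is suspended.
group = Group(desired=3, launch_suspended=True,
              instances=["api-1", "api-2", "api-3"])
group.reconcile({"api-1": False, "api-2": False, "api-3": False})
print(len(group.instances))
```

In this sketch the group ends up with no running instances, which mirrors the situation in which the service could detect that it was unhealthy but could not recover on its own.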

Remediation

The autoscaling policy was updated to use a more thorough application health check, which will prevent false positives like the ones seen in this incident.
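
This report does not describe the new health check in detail. As a rough sketch of what an application-level check might look like, the example below runs several named checks and reports a failure only when one of them actually fails. The function name `application_health` and the check names are assumptions made for illustration, not the implementation deployed for the Robin API.

```python
from typing import Callable, Dict, Tuple


def application_health(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, str]]:
    """Run every named check; return an HTTP-style status and per-check results."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that raises is reported as an error rather than crashing.
            results[name] = f"error: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


# Hypothetical checks; real ones would probe the application's dependencies.
status, details = application_health({
    "database": lambda: True,
    "cache": lambda: True,
})
print(status, details)
```

Returning per-check results alongside the status makes it easier to see which dependency caused an instance to be marked unhealthy.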

Posted Apr 11, 2016 - 13:46 EDT

Resolved

We had a brief outage this morning between 07:30 and 08:40 UTC, during which dashboard and mobile clients were unable to connect to our API. The issue has since been resolved.

Posted Apr 11, 2016 - 12:17 EDT