A faulty load balancer policy resulted in approximately 70 minutes of degraded availability for the Robin API between 07:30 and 08:40 UTC on Monday, April 11. During this time, web, mobile, and room display clients were unable to connect to the service consistently.
The issue was identified at 07:30 UTC (03:30 EDT), when we observed a sudden spike in service latency that made the API unreachable for clients. Full API service was restored by 08:40 UTC, after we overrode a faulty autoscaling policy that was preventing new server instances from being launched.
Our API's load balancer runs an automated health check that identifies server instances that need to be replaced, or that need additional instances alongside them to handle increased load. An issue with this health check caused all API instances to report as unhealthy at the same time. In response, our autoscaling policy terminated all of the marked instances and attempted to replace them with new, healthy ones.
When the autoscaling policy attempted to replace the API instances, it could not, because the process for launching new instances was completely suspended. This left the API in a state where it was known to be unhealthy but had no way to self-correct.
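The failure mode described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not our production configuration: the `Instance` and `AutoscalingGroup` classes, the `launches_suspended` flag, and the `simulate` helper are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    # Hypothetical stand-in for one API server instance.
    name: str
    healthy: bool = True


@dataclass
class AutoscalingGroup:
    # Hypothetical model of an autoscaling policy; not a real cloud API.
    desired: int
    launches_suspended: bool = False
    instances: list = field(default_factory=list)
    _counter: int = 0

    def launch(self) -> None:
        self._counter += 1
        self.instances.append(Instance(f"api-{self._counter}"))

    def reconcile(self) -> int:
        # Step 1: terminate every instance the health check marked unhealthy.
        self.instances = [i for i in self.instances if i.healthy]
        # Step 2: launch replacements, unless launching is suspended.
        if not self.launches_suspended:
            while len(self.instances) < self.desired:
                self.launch()
        return len(self.instances)


def simulate(launches_suspended: bool) -> int:
    """Return the number of running instances after one reconcile pass."""
    group = AutoscalingGroup(desired=3)
    for _ in range(3):
        group.launch()
    group.launches_suspended = launches_suspended
    # The faulty health check marks every instance unhealthy at once.
    for instance in group.instances:
        instance.healthy = False
    return group.reconcile()


if __name__ == "__main__":
    print("launches suspended:", simulate(True))
    print("launches allowed:  ", simulate(False))
```

With launches suspended, the group terminates every instance and cannot start replacements, which leaves no capacity. With launches allowed, the same false alarm costs a full replacement cycle but the group returns to its desired size.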
The autoscaling policy has been updated to use a more thorough application health check, which will prevent the kind of false positive seen in this incident.
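This report does not detail the new health check, so the following is only a rough sketch of what an application-level check might look like. It assumes a health check that runs several named probes and reports each result individually; the `application_health` function and the probe names are hypothetical and chosen purely for illustration.

```python
from typing import Callable, Dict


def application_health(probes: Dict[str, Callable[[], bool]]) -> dict:
    """Run each named probe and report per-check and overall status.

    A probe that raises an exception is recorded as failing rather than
    crashing the health check itself.
    """
    results = {}
    for name, probe in probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False
    return {"healthy": all(results.values()), "checks": results}


if __name__ == "__main__":
    # Hypothetical probes; a real check would query actual dependencies.
    report = application_health({
        "database": lambda: True,
        "request_handler": lambda: True,
    })
    print(report)
```

Reporting each probe separately makes it easier to tell a genuinely unhealthy instance from a fault in the health check itself.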