Room Display & Dashboard Outages
Incident Report for Robin
Postmortem

Summary

Between 20:45 UTC (Jan. 5) and 03:30 UTC (Jan. 6), the majority of tablets experienced API request errors. The errors were caused by an extraneous 301 redirect introduced to the production environment. Clients that contacted the affected server cached the redirect and kept using an incorrect API server indefinitely, even though the redirect was removed almost immediately.

At 03:15 UTC an update was rolled out to tablet devices, repointing them back to the correct servers. Service was restored to the majority of affected devices by 03:30 UTC.

Investigation

Upon reports of misbehaving tablets, we quickly realized that many clients were receiving erroneous 404 responses. This type of response suggested that these clients were making requests to the wrong locations.

We noticed that the number of requests to our application servers had nearly halved. We soon found that an HTTP-to-HTTPS permanent redirect was pointing clients to the wrong server cluster. Because a 301 redirect is permanent, the HTTP clients on the devices cache it and stop using the previous, correct endpoint. Even after the 301 redirect was removed, this heavy caching meant the devices continued to use the incorrect location.
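
The caching behavior described above can be illustrated with a minimal sketch. This is a toy model, not the actual tablet client: the `CachingClient` and `FakeServers` classes and the `example.test` URLs are all hypothetical, and the server is simplified to return 200 once the bad redirect is removed.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, str]  # (status code, Location target or body)


class CachingClient:
    """Toy client that remembers every 301 it sees, with no expiry."""

    def __init__(self, fetch: Callable[[str], Response]) -> None:
        self.fetch = fetch
        self.redirect_cache: Dict[str, str] = {}

    def get(self, url: str, max_hops: int = 5) -> Response:
        for _ in range(max_hops):
            if url in self.redirect_cache:
                # Cached 301: the original URL is never contacted again.
                url = self.redirect_cache[url]
                continue
            status, value = self.fetch(url)
            if status != 301:
                return status, value
            self.redirect_cache[url] = value
            url = value
        raise RuntimeError("too many redirects")


class FakeServers:
    """Hypothetical stand-in for the correct and incorrect clusters."""

    def __init__(self) -> None:
        self.bad_redirect_active = True

    def fetch(self, url: str) -> Response:
        if url == "http://api.example.test/v1" and self.bad_redirect_active:
            return 301, "https://wrong.example.test/v1"
        if url.startswith("https://wrong.example.test"):
            return 404, "not found"
        return 200, "ok"


servers = FakeServers()
client = CachingClient(servers.fetch)

first = client.get("http://api.example.test/v1")    # follows the bad 301
servers.bad_redirect_active = False                 # redirect removed server-side
second = client.get("http://api.example.test/v1")   # still uses the cached 301
fresh = CachingClient(servers.fetch).get("http://api.example.test/v1")

print(first, second, fresh)
```

In this model the client that saw the redirect keeps receiving 404 responses after the server is fixed, while a client with an empty cache succeeds, which matches the split between affected and unaffected devices.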

Remediation

The faulty redirect had already been removed from the original cluster by the time the issue was understood. However, further steps were needed to mitigate the problem on the large number of devices that had been affected.

  1. A proxy server was deployed to the cluster that the 301 had redirected the clients to, which would point clients back to a correct URL.

  2. All tablets were remotely updated to alter their API endpoint slightly so that it could not be recognized as the same endpoint that was 301 redirected. This allowed those tablets to immediately come back online.

  3. We are evaluating our deployment process to identify how the 301 redirect made it into the application cluster, and we are adjusting our internal processes to protect against a similar issue in the future.

  4. Our clients and servers will be updated with safer caching headers to help mitigate any future accidental 301 redirects.
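
Steps 2 and 4 above both depend on how a client keys and expires cached redirects. The following is a rough sketch under stated assumptions, not Robin's implementation: the `RedirectCache` class, the trailing-slash alteration, the 300-second `max-age`, and the `example.test` URLs are all hypothetical. It assumes a client that keys cached redirects by exact URL and honors a `Cache-Control: max-age` directive on the redirect response.

```python
from typing import Dict, Optional, Tuple


def redirect_headers(location: str, max_age_seconds: int = 300) -> Dict[str, str]:
    """Headers for a 301 whose cached lifetime is bounded (assumed policy)."""
    return {"Location": location, "Cache-Control": f"max-age={max_age_seconds}"}


def parse_max_age(headers: Dict[str, str]) -> Optional[int]:
    """Return max-age in seconds, or None if the header does not set one."""
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


class RedirectCache:
    """Redirect cache keyed by exact URL, with optional expiry per entry."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, Optional[float]]] = {}

    def store(self, url: str, headers: Dict[str, str], now: float) -> None:
        max_age = parse_max_age(headers)
        expires = None if max_age is None else now + max_age
        self._entries[url] = (headers["Location"], expires)

    def lookup(self, url: str, now: float) -> Optional[str]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        target, expires = entry
        if expires is not None and now >= expires:
            del self._entries[url]  # expired: the client asks the origin again
            return None
        return target


cache = RedirectCache()
# A 301 with no caching headers is kept with no expiry in this model.
cache.store("http://api.example.test/v1",
            {"Location": "https://wrong.example.test/v1"}, now=0.0)
# A 301 with a bounded max-age is forgotten once the lifetime passes.
cache.store("http://api.example.test/v2",
            redirect_headers("https://wrong.example.test/v2"), now=0.0)

stuck = cache.lookup("http://api.example.test/v1", now=86400.0)
altered = cache.lookup("http://api.example.test/v1/", now=86400.0)
expired = cache.lookup("http://api.example.test/v2", now=301.0)

print(stuck, altered, expired)
```

Here the unbounded entry is still returned a day later, the slightly altered URL misses the cache entirely, and the entry stored with a bounded `max-age` has expired, so an accidental redirect would stop affecting clients after a few minutes.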

Posted Jan 06, 2017 - 16:02 EST

Resolved
The issue should now be resolved. Tablets should restore themselves automatically. If a tablet fails to reconnect on its own please contact support@robinpowered.com.

A post mortem will be posted tomorrow (January 6th).
Posted Jan 05, 2017 - 23:52 EST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 05, 2017 - 22:09 EST
Update
The issue has been identified as an erroneous 301 redirect that caused clients, such as the tablets, to use wrong endpoints for the API. We're implementing various strategies and updates to mitigate the effects of this and restore proper API access to these clients.
Posted Jan 05, 2017 - 21:02 EST
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 05, 2017 - 19:41 EST
Update
We're continuing to investigate the outage. We've identified that about 60% of tablets may be experiencing intermittent issues where it might appear that the tablet is offline or unpaired.
Posted Jan 05, 2017 - 18:53 EST
Investigating
We're investigating reports of room displays being unable to connect to Robin and sync their events.
Posted Jan 05, 2017 - 17:32 EST