Between 19:00 and 20:00 UTC, our API experienced high response latency that affected all clients on the platform. The latency was caused by an inefficient query in our calendar syncing operations, which exhausted the database's resources.
At 19:05 UTC, a rollback of the previous deployment was issued to quickly mitigate the issue. The database did not fully recover until 20:00 UTC, once the remaining problem queries had been terminated.
By 19:05 UTC, our engineering team had been notified through platform monitoring of latency issues affecting all customers. Shortly after the rollback, at 19:15 UTC, the problem query was identified still running in the database.
We rolled back the deployment as an initial step to revert the inefficient query. Calendar syncing was temporarily halted at 19:15 UTC to take load off the primary database. Lingering problem queries were identified and terminated in production, freeing up the database's resources by 19:45 UTC. Syncing was re-enabled at 20:00 UTC and fully caught up by 21:30 UTC.
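The write-up does not name the database or the exact procedure used to find the lingering queries, but on PostgreSQL, for example, long-running statements can be found via the `pg_stat_activity` view and terminated with `pg_terminate_backend`. A minimal, hypothetical sketch of that triage step (the threshold and row shape here are assumptions, not details from the incident):

```python
# Hypothetical sketch: find queries that have been running longer than a
# threshold, shaped like rows from PostgreSQL's pg_stat_activity view.
# The 5-minute cutoff is an assumption for illustration.
from datetime import datetime, timedelta, timezone

LONG_RUNNING = timedelta(minutes=5)  # assumed threshold

def find_runaway_queries(activity_rows, now=None):
    """Return (pid, query) pairs for active queries exceeding the threshold.

    activity_rows: dicts shaped like pg_stat_activity rows, e.g.
    {"pid": 123, "state": "active", "query_start": datetime, "query": "..."}.
    """
    now = now or datetime.now(timezone.utc)
    return [
        (row["pid"], row["query"])
        for row in activity_rows
        if row["state"] == "active" and now - row["query_start"] > LONG_RUNNING
    ]

# Each offending backend would then be terminated server-side, e.g. with:
#   SELECT pg_terminate_backend(<pid>);
```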
To prevent similar issues in the future, we are introducing tooling and testing to catch inefficient queries before they reach production. Additionally, we are making configuration changes to the database and platform that will significantly reduce recovery time.
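The specific tooling is not described, but one common shape for such a pre-deploy check is to run each new query through `EXPLAIN (FORMAT JSON)` and flag plans that sequentially scan large tables. A hypothetical sketch, assuming a PostgreSQL-style JSON plan tree (the row-count cutoff is an illustrative assumption):

```python
# Hypothetical pre-deploy check: walk an EXPLAIN (FORMAT JSON) plan tree
# and flag sequential scans over large tables before the query ships.
SEQ_SCAN_ROW_LIMIT = 100_000  # assumed cutoff for "large"

def flags_from_plan(plan_node):
    """Collect warnings for risky nodes in a PostgreSQL-style plan tree.

    plan_node: the top-level "Plan" dict from EXPLAIN (FORMAT JSON) output.
    """
    flags = []

    def walk(node):
        if (node.get("Node Type") == "Seq Scan"
                and node.get("Plan Rows", 0) > SEQ_SCAN_ROW_LIMIT):
            flags.append(
                f"Seq Scan on {node.get('Relation Name')} "
                f"(~{node['Plan Rows']} rows)"
            )
        for child in node.get("Plans", []):
            walk(child)

    walk(plan_node)
    return flags
```

A check like this can run in CI against a staging schema, failing the build when any flag is raised, so an inefficient query is caught before it reaches production.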