Between 19:00 and 20:00 UTC, our API experienced high response latency that affected all clients on the platform. The latency was caused by an inefficient query in our calendar syncing operations, which exhausted the database's resources.
At 19:05 UTC, a rollback of the previous deployment was issued to quickly mitigate the issue. The database did not fully recover until 20:00 UTC, once the remaining problem queries had been terminated.
By 19:05 UTC, our engineering team had been notified through platform monitoring of latency issues affecting all customers. Shortly after the rollback, at 19:15 UTC, the problem query was identified still running in the database.
We rolled back the deployment as an initial step to revert the inefficient query. Calendar syncing was temporarily halted at 19:15 UTC to take load off the primary database. Lingering problem queries were identified and terminated in production, freeing up the database's resources by 19:45 UTC. Syncing was re-enabled at 20:00 UTC and fully caught up by 21:30 UTC.
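The write-up does not name the database or the exact procedure used to find the lingering queries, but on PostgreSQL, for example, long-running statements can be found via the `pg_stat_activity` view and terminated with `pg_terminate_backend`. A minimal, hypothetical sketch of that triage step (the threshold and row shape here are assumptions, not details from the incident):

```python
# Hypothetical sketch: find queries that have been running longer than a
# threshold, shaped like rows from PostgreSQL's pg_stat_activity view.
# The 5-minute cutoff is an assumption for illustration.
from datetime import datetime, timedelta, timezone

LONG_RUNNING = timedelta(minutes=5)  # assumed threshold

def find_runaway_queries(activity_rows, now=None):
    """Return (pid, query) pairs for active queries exceeding the threshold.

    activity_rows: dicts shaped like pg_stat_activity rows, e.g.
    {"pid": 123, "state": "active", "query_start": datetime, "query": "..."}.
    """
    now = now or datetime.now(timezone.utc)
    return [
        (row["pid"], row["query"])
        for row in activity_rows
        if row["state"] == "active" and now - row["query_start"] > LONG_RUNNING
    ]

# Each offending backend would then be terminated server-side, e.g. with:
#   SELECT pg_terminate_backend(<pid>);
```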
To prevent similar issues in the future, we are introducing tooling and testing to catch inefficient queries before they reach production. Additionally, we are making configuration changes to the database and platform that will significantly reduce recovery time.
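The specific tooling is not described, but one common shape for such a pre-deploy check is to run each new query through `EXPLAIN (FORMAT JSON)` and flag plans that sequentially scan large tables. A hypothetical sketch, assuming a PostgreSQL-style JSON plan tree (the row-count cutoff is an illustrative assumption):

```python
# Hypothetical pre-deploy check: walk an EXPLAIN (FORMAT JSON) plan tree
# and flag sequential scans over large tables before the query ships.
SEQ_SCAN_ROW_LIMIT = 100_000  # assumed cutoff for "large"

def flags_from_plan(plan_node):
    """Collect warnings for risky nodes in a PostgreSQL-style plan tree.

    plan_node: the top-level "Plan" dict from EXPLAIN (FORMAT JSON) output.
    """
    flags = []

    def walk(node):
        if (node.get("Node Type") == "Seq Scan"
                and node.get("Plan Rows", 0) > SEQ_SCAN_ROW_LIMIT):
            flags.append(
                f"Seq Scan on {node.get('Relation Name')} "
                f"(~{node['Plan Rows']} rows)"
            )
        for child in node.get("Plans", []):
            walk(child)

    walk(plan_node)
    return flags
```

A check like this can run in CI against a staging schema, failing the build when any flag is raised, so an inefficient query is caught before it reaches production.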