Below is a recap of Cerkl's outage from 12/20/2017 to 12/23/2017 and the steps we’ve taken to prevent future issues.
SO, WHAT HAPPENED?
12/20, 8:44am EST: We encountered an issue with the primary Cerkl database.
12/20, 8:46am EST: As per protocol, we attempted to restore from the most recent full backup, which was taken at 12/20 6:45am EST. The restore failed. We retried 2 more times with the same result.
12/20, 10:20am EST: After further research, we found that the error was being produced by a custom Google restoration process. We opened a support case with Google. We have the highest level of support available, which, according to our agreement with Google, should produce a response within 1 hour.
12/20, 10:54am EST: Google support replied that they were looking into the error.
Beginning at 10:54am, our case was transferred to 14 different people within Google over the course of the next 55 hours. During that time, we spoke to Google support 19 times in attempts to get them to move more quickly.
12/21, 2:14am EST: Google confirmed what we had told them in the original ticket: their restoration process was failing.
12/22, 5:15pm EST: Google was finally able to restore a version of our primary database and give us access to the files we needed to finish the process.
12/23, 3:35am EST: Our engineering team completed the restore with no data loss. After hours of validation and testing, the site was back online at 8:05am EST.
WHAT WAS THE ROOT PROBLEM?
Google uses a proprietary restoration process that, for privacy and security reasons, keeps Google staff from accessing or viewing our data. The downside is that when a proprietary process doesn’t work as expected, only Google can fix it. We were handcuffed and at the mercy of Google to resolve the issue.
WHAT HAVE WE DONE TO PREVENT THIS IN THE FUTURE?
In addition to the database replication across 2 zones with failover, daily full backups, and binary logging for point-in-time restore that we already had, we’ve added a nightly database dump that is saved to an isolated environment. That means that if we ever have another issue with Google’s restoration process, we won’t be held hostage: we can now manage the restoration without them. This new process is already in place.
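For readers curious what a nightly dump job of this kind can look like, here is a minimal Python sketch. It is an illustration only, not our production job: the function name `nightly_dump`, the `dump-YYYY-MM-DD.sql.gz` file naming, and the retention count are all hypothetical, and the dump command is passed in by the caller rather than assumed. The sketch streams the dump tool's output into a compressed file, refuses to keep an empty or failed dump, and prunes older copies.

```python
import gzip
import subprocess
from datetime import date
from pathlib import Path


def nightly_dump(dump_cmd, dest_dir, today=None, keep=14):
    """Run dump_cmd, store its stdout gzip-compressed in dest_dir, prune old dumps.

    dump_cmd is whatever command writes a database dump to stdout
    (hypothetical here; supplied by the caller as an argument list).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    today = today or date.today()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / f"dump-{today.isoformat()}.sql.gz"
    tmp = target.with_name(target.name + ".partial")

    # Stream the dump tool's stdout into a compressed temporary file.
    size = 0
    with subprocess.Popen(dump_cmd, stdout=subprocess.PIPE) as proc, \
            gzip.open(tmp, "wb") as out:
        for chunk in iter(lambda: proc.stdout.read(1 << 20), b""):
            out.write(chunk)
            size += len(chunk)

    # Never keep a failed or empty dump: it would give false confidence.
    if proc.returncode != 0 or size == 0:
        tmp.unlink(missing_ok=True)
        raise RuntimeError("dump failed or produced no data")

    # Rename only after the dump is complete, so readers never see a partial file.
    tmp.replace(target)

    # ISO dates sort lexicographically, so the oldest dumps come first.
    dumps = sorted(dest.glob("dump-*.sql.gz"))
    for old in dumps[:-keep]:
        old.unlink()
    return target
```

A scheduler would call something like `nightly_dump([...dump command...], "/path/to/isolated/storage")` once a night; both arguments are placeholders. Writing to a `.partial` file first means an interrupted run never overwrites or masquerades as a good backup.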
In addition, we’ll be launching a status page that will give you real-time visibility into the availability of the different components of the Cerkl platform, along with uptime history for each component and incident reports.
Outside of what we can control, we’re going to be working with Google to improve their support processes. Regardless, your relationship is with us and we value that relationship and your success.
Again, thank you for your patience. I can personally assure you that the availability of Cerkl is of the utmost importance.
Posted Dec 20, 2017 - 12:00 EST