CallRail customers recently experienced a significant service issue due to a configuration error in our cloud storage systems. On June 28, 2019, we inadvertently deleted a portion of greeting messages used in customers’ call flows, as well as a portion of call recordings older than 30 days. We are truly sorry for the disruption we caused you, your business, and your clients.
Over the past two weeks, we’ve been focused on helping customers repair call flows and discussing how we will prevent similar incidents in the future. In the interest of transparency, I’d like to share what happened, what we’ve learned, and how we plan to move forward from here.
Note: The next section gets a little technical. If you just want to know what we plan to do about it, skip ahead.
CallRail stores customer data in two primary places: a PostgreSQL database and Amazon's S3 storage system. The first is used for things like accounts, users, and tracking number configurations, and was not affected by this incident. The second stores customer files: call recordings, MP3s used as greetings for incoming calls, logo images, and historical copies of invoices.
If you aren’t familiar with S3, you can think of it as a cloud storage system for web applications. It’s the utility-scale version of Dropbox or iCloud Drive. (Dropbox was actually built on S3 for its first 10 years.)
Inside of S3, files are stored as multiple copies on many disks spread across several data centers, so the chance of losing any one file is about 1 in 100 billion. This level of physical redundancy is far more reliable than most backup strategies. So how does one lose a portion of customer data through a system with that level of reliability? The short answer is: human error.
In this case, we had written a lifecycle policy intended to delete certain types of temporary files after 30 days. That policy had a filter limiting it to those temporary files, but the filter was inadvertently removed in a change that passed our standard code review process. As a result, all files older than 30 days were scheduled for deletion, including the affected customer data and other backups of that data.
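For illustration, here is roughly what such a rule looks like in the structure that boto3's put_bucket_lifecycle_configuration expects (the rule ID and "tmp/" prefix are hypothetical, not our actual configuration). The Filter block is what scopes the rule to temporary files; removing it makes the rule match every object in the bucket:

```python
# Hypothetical sketch of an S3 lifecycle rule, in the shape boto3's
# put_bucket_lifecycle_configuration expects. Names are illustrative only.
SCOPED_RULE = {
    "ID": "expire-temp-files",
    "Status": "Enabled",
    "Filter": {"Prefix": "tmp/"},  # limits the rule to temporary files
    "Expiration": {"Days": 30},    # delete matching objects after 30 days
}

# With the Filter removed, the same rule applies to every object in the
# bucket -- the class of mistake described above.
UNSCOPED_RULE = {k: v for k, v in SCOPED_RULE.items() if k != "Filter"}

lifecycle_config = {"Rules": [SCOPED_RULE]}
```

The one-line difference between the two dictionaries is the entire incident: the expiration behavior is identical, only the set of objects it applies to changes.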
In analyzing this event after the fact, we realized that we hadn't fully accounted for the possibility of human error when designing this system. We take a great number of steps to ensure reliability and durability for our PostgreSQL database: multiple streaming replicas (some on time delay), snapshots, and point-in-time recovery logs are all part of our strategy. That covers a range of concerns, from hardware failure to engineering mistakes to some types of malicious intent. But our strategy for customer file storage was less rigorous — it covered hardware failure and some types of engineering mistakes, but not everything.
The change was made on the afternoon of Thursday, June 27, 2019. Files started being deleted at midnight, and we realized the mistake on the morning of Friday, June 28. Our engineering team immediately set out to stop the deletion and determine the scope of impact. By that afternoon we had identified and notified the customers affected by the loss of greeting messages, and had begun helping them restore their call flows. We spent most of the weekend focused on this effort.
By Monday, July 1, that urgent concern was mostly resolved, and we began investigating the scope of loss for customer call recordings. We spent the next two days recovering as much as possible, and on Wednesday, July 3, we were able to notify customers affected by the second issue.
We’re committed to rebuilding your trust and to ensuring that nothing like this happens again. The solutions we’re presenting here are the result of numerous internal discussions, as well as feedback from customers like you.
The following projects are underway to prevent issues like this in the future:
Segregation of data types in our cloud storage system. Mixing temporary files with more permanent ones opened the possibility for a configuration error to inadvertently apply to the wrong type of data. We’ll segregate these classes of files into unique storage buckets so that the rules and data policies are clear, future changes are isolated to one type of data, and the chance of mistakes is reduced.
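As a sketch of what segregation buys us (bucket names are hypothetical), each class of data gets its own bucket, and expiration rules are attached only to the temporary-files bucket. A misconfigured rule is then confined to that bucket and can never reach recordings or greetings:

```python
# Hypothetical bucket layout after segregation. Lifecycle rules are
# attached per bucket, so an expiration rule on the temp bucket cannot
# affect permanent customer data in the other buckets.
BUCKETS = {
    "example-temp-files": {
        "Rules": [
            {"ID": "expire-temp", "Status": "Enabled",
             "Filter": {}, "Expiration": {"Days": 30}},
        ],
    },
    "example-call-recordings": {"Rules": []},  # no expiration rules at all
    "example-greetings": {"Rules": []},
}

def expiration_rules(bucket):
    """Return any rules that could delete data in the given bucket."""
    return [r for r in BUCKETS[bucket]["Rules"] if "Expiration" in r]
```

With this layout, the blast radius of a bad filter is structurally limited: even a rule with no filter at all only matches objects that were already meant to be temporary.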
Cold storage backups. We had other backups of these call recordings, but they were stored in the same place and therefore got caught in the same configuration error. To eliminate this possibility in the future, we’ll implement a cold storage system that stores archival copies of customer data in a separate place safe from changes to our production environment.
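One way to picture this (a sketch, not our final design — bucket names and the Glacier storage class choice are assumptions) is building the parameters for an S3 copy into a separate archive bucket using boto3's copy_object call, which accepts a StorageClass for the destination copy:

```python
# Hypothetical sketch: parameters for copying a customer file into a
# separate cold-storage bucket, in the form boto3's copy_object accepts.
def archive_copy_request(source_bucket, key, archive_bucket):
    """Build copy_object parameters for an archival cold-storage copy."""
    return {
        "CopySource": {"Bucket": source_bucket, "Key": key},
        "Bucket": archive_bucket,       # lives outside production's reach
        "Key": key,
        "StorageClass": "GLACIER",      # archival tier for the copy
    }

request = archive_copy_request(
    "example-call-recordings", "recordings/abc123.mp3", "example-cold-archive"
)
```

The important property isn't the storage class itself but the separation: lifecycle policies and other configuration changes in the production buckets have no effect on the archive copies.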
Grow our Site Reliability Engineering practice. Our Infrastructure and Development Operations teams have not had enough dedicated time to focus on reliability engineering. We’re planning to grow that team by 50 percent in the coming months so we can dedicate more effort to anticipating and preventing incidents like this one.
Reexamine our Incident Response plan. While we were generally able to detect, investigate, and communicate about this problem quickly, this was the largest incident we’ve experienced and there are a number of things we could have done better. We would have preferred to identify and notify the affected customers sooner, and our communication was not as clear as it could have been in all cases. We’ll be working across the company to find ways we can improve our communication plans.
Every day, over 100,000 businesses trust CallRail to measure their marketing efforts and connect them with their customers. It weighs on me heavily that we’ve given you reason to question that confidence. We’re reflecting and learning from this mistake, and our team and our platform will be stronger than ever in the future.
Thank you for your trust and for the opportunity to serve you.
Chief Technical Officer, CallRail