CallRail suffered two significant service interruptions on December 12 and 13, 2016. We know that tens of thousands of businesses rely on CallRail to connect with their customers, and that any downtime is a significant disruption. We take that responsibility seriously, and so we’d like to share what happened and how we plan to prevent similar problems in the future.
These two incidents occurred on consecutive days, but had completely independent causes. The timing was purely an unfortunate coincidence, so I’ll address the two incidents separately.
At 1:53p EST we noticed an increase in request queueing times at our front-end load balancers. Our lead devops engineer immediately began investigating the issue and determined that average database transaction time had nearly tripled. Within a few minutes he had brought in several of our lead engineers and myself to determine the source of this delay.
There were no obvious changes to our traffic patterns, but the primary database was under abnormally high CPU load. We began looking for code changes that might have introduced a slow database query. No detrimental changes were immediately obvious, but a feature change had been deployed earlier in the day that introduced a new query. Although the application had been stable since this deployment, we speculated that this new query had altered the makeup of data cached in RAM at the database. With frequently-accessed data removed from RAM, other transactions could slow down over time. At 2:12p we made the decision to roll back the change and see if database performance improved. No immediate effect was apparent, but we continued monitoring under the assumption that data evicted from the cache would take a while to repopulate.
With performance still not improved by 2:30p, we decided to turn off session polling. This feature helps optimize keyword pool performance for large pools with high visitor turnover. It’s not required for keyword tracking, but helps reduce misattributions. However, it accounts for about 25% of our write throughput on the database. Disabling this feature immediately dropped our request queueing time in half, but did not impact CPU usage at the database and therefore did not bring application performance to acceptable levels.
Our engineering team continued to look for slow queries in the application. Eventually one of our monitoring tools helped us identify a frequent query against our users table that was consuming an inordinate amount of CPU time. This particular query was a complex join central to our user permission system, and had been in the application since at least February. Its performance was acceptable at the time, but as data size grew the join became unwieldy and the database was spending an increasing number of CPU cycles to return the result.
With the slow query identified by 3:00p, we began working to optimize it. Since the query was central to our access control, it was difficult to refactor and crucial to thoroughly validate any changes. Different team members worked on three different approaches in parallel. One promising approach failed our test suite, which caused us to realize it would require a larger refactor than we expected. In the end we were able to rearrange the conditions on the join in order to more fully leverage indexes and optimize the query. After thorough verification against our test suite and in a non-production environment, as well as benchmarking against our production database, we deployed this change at 4:28p. Application performance immediately returned to normal levels. After close monitoring for another hour we re-enabled session polling and returned the application to full operation.
In summary, we shipped a change which increased database load in a new way, and while prior performance tests indicated that the queries were fast, we failed to anticipate the cascading effect against the memory cache at scale. That, combined with another query’s degrading performance over time pushed us against resource limits that caused the application to fail.
Application access was intermittent during the period from 1:53p until 4:32p. During this time many requests timed out or were dropped by the load balancers.
Our call routing, reporting dashboard, and session logging all share this primary database, which meant that many calls failed over to our backup routing system. Others experienced long ring times while our application responded with routing instructions.
In the short term, we are immediately scaling up resources on our database cluster. The new instance types will have 50% more RAM, which will improve page caching. We had already scheduled this migration, but have fast-tracked the project and expect it to be complete by December 19.
We already have a project underway to separate session management traffic from our primary database. This will significantly reduce load on the primary database and application servers, which helps prevent overall system load from affecting call routing. We have a project scheduled for 2017 to entirely separate call routing from our primary application infrastructure, which will ensure that calls complete as expected even if other parts of the platform fail.
Finally, we need to be even more diligent in checking our monitoring tools for early signs of performance problems. We’ll be incorporating additional reviews of this data into our development process to help ensure we catch growing problems like this before they snowball into system outages.
At 11:32a EST we noticed our request queuing time was increasing again, and that time spent in database had increased by a factor of 6. Assuming a repeat of the prior day’s issue, we immediately began looking for other slow queries in the application.
Our monitoring tools indicated general database slowness, but no specific query or tables with performance problems. In addition, CPU usage had not increased. By 11:45a we had identified the disk as the performance bottleneck instead. Our throughput against the disk had fallen significantly, particularly with writes, and our average queued I/O operations count had jumped significantly. This appeared to be a different problem, likely at the infrastructure level.
CallRail’s production database is several terabytes large. We use Amazon’s EBS to store this, backed by SSDs with provisioned IOPS. We were getting less than 25% of our provisioned IOPS. At this point we opened a ticket with Amazon to get more details, and began looking for ways to reduce system load.
At 11:53a EST we put the application in maintenance mode, in hopes that allowing the disk queue to catch up would restore performance. We brought the application back at 12:01p EST, but disk performance immediately became a bottleneck again. At 12:07p we put the application back into maintenance mode.
At 12:14p EST we decided to disable session tracking for keyword pools. This disables dynamic number insertion for customer websites using keyword pools. It also drastically reduces the write throughput to our database. We expected that we could continue serving the application and routing calls even with poor disk performance. By 12:17p the application was back and calls were routing through our primary infrastructure again.
We continued to wait for updates from AWS, but responses were inconclusive. It was unclear if this was a general EBS failure across the entire AWS region, or if it was isolated to a small subset (or even just our disk). We considered failing over to the hot standby database we maintain in a separate AWS availability zone, but did not want to take that step without first understanding whether or not this was a region-wide failure. (AWS availability zones are described as independent shared-nothing facilities, but previous large-scale AWS failures have proven the zones to not be completely isolated.)
This decision was clouded by the fact that it takes 3 days to rebuild a new hot standby database, during which time we would be without a backup database. That replica is primarily for disaster recovery purposes. We also take periodic snapshots and archive the database write-ahead transaction log, so the risk for data loss is extremely low even with one database. But if we were to lose the second database the entire platform would be down for 3 days while we rebuilt a new master. That seemed like an unacceptable risk to take without knowing for certain that the replica’s performance would be unaffected by whatever EBS failures we were seeing. In addition, the replica database would not have a properly warmed cache and so application performance would still be sub-par until that had completed. We were particularly sensitive to this issue given the prior day’s experience.
Finally, at 2:30p we began slowly reintroducing session traffic to the application. In the intervening two hours we built and tested a tool allowing us to re-enable session traffic for increasing percentages of our customers. Disk performance held steady as we scaled traffic back in small increments, and by 3:35p we had re-enabled all session tracking.
We are still communicating with AWS to help us understand full technical details of this disk failure. The current conclusion is that the disk suffered a performance degradation due to failure of underlying hardware. The failure healed itself as blocks were re-routed to other systems and full performance was restored within a few hours. This is expected performance and behavior for EBS volumes when components fail.
Call routing was handled by our backup routing system for two periods of 10 minutes each around 12p EST. This system was designed to deliver calls to customer’s primary phone number when our system is down. However, a separate bug has caused this system to not always update the fallback destination when destination numbers are changed. This bug affected less than 3% of numbers, but caused many calls to be delivered to unexpected destinations.
Session tracking was disabled for about 3 hours during this incident. During this time calls to keyword pool numbers continued to route, but DNI was not active. Visitors to customer websites using keyword pools would have seen the main business phone number.
EBS volumes with provisioned IOPS are expected to perform within 90% of provisioned performance at least 99.9% of the time over the course of a year. Unfortunately that allows for up to 9 hours of degraded performance in a year. Given that expectation, we need to engineer around that limitation and better design for failure.
As an immediate step, we are going to begin running three database servers in our production infrastructure. Having a second replica ready to go will make our team more confident in failing over the first time even without complete understanding of the underlying problem.
Our backup routing system has not aged gracefully, and both this event and the prior day’s served to highlight that once again. Our roadmap calls for independent, geographically redundant call routing systems that guarantee full call flow functionality even if the primary application fails. We are accelerating that roadmap.
In the near term, we also recognize that as customer reliance on complex routing features has increased, our previous product decisions related to how we select the backup destination number are increasingly inaccurate. We will be developing tools to let our support team assist customers with overriding our automatic selection of that fallback destination in the coming weeks.
The downtime incidents of December 12 and 13 had isolated causes, but highlighted the interdependence of our systems on our primary application database. We already have plans to reduce this coupling, but those projects are not yet complete. One major component of this plan has already been underway for several months, but orchestrating a migration of this scale takes time and diligence. We expect to have that complete in Q1 of 2017, at which point a significant amount of load will be isolated from our call routing infrastructure. We will also be working to replace our backup routing with a fully-functional redundant system to eliminate call routing problems when other application failures occur.
We understand that customers place great trust in us when they use our platform in their quest to reach new customers. We take pride in building a call tracking platform that provides rock-solid stability and performance. We fell short of that goal this week, but we’re emerging better prepared for the future.
VP of Engineering | CallRail