On May 14, 2018, CallRail customers experienced a significant service interruption. This prevented calls from being delivered and impacted the availability of the application dashboard. We know any failure of call routing is a major disruption to our customers' businesses, and we take that responsibility extremely seriously. We have spent the last week discussing this incident internally and with our partners, and want to share a bit about what happened and the steps we are taking to prevent a reoccurrence.

|Time|Call and dashboard behavior|
|---|---|
|1:30pm to 2:39pm EDT|Calls were tracked, but had elevated failure rates. Calls that succeeded during this timeframe followed normal Call Flow routing patterns and were recorded if configured. The application dashboard was available as expected.|
|2:39pm to 3:22pm EDT|Calls were delivered directly to destination numbers via our backup routing system. During this time calls were not tracked or recorded, and did not follow Call Flows. The application dashboard was also unavailable. Analytics data for these calls has since been recovered.|
|3:22pm to 4:09pm EDT|Calls were tracked, but had elevated failure rates. Calls that succeeded during this timeframe followed normal Call Flow routing patterns and were recorded if configured. The application dashboard was available as expected.|
|4:09pm to 4:40pm EDT|Calls were delivered directly to destination numbers via our backup routing system. During this time calls were not tracked or recorded, and did not follow Call Flows. The application dashboard was also unavailable. Analytics data for these calls has since been recovered.|
|4:40pm EDT|Normal call behavior resumed, and the application dashboard became available again.|

At 1:45pm EDT, the CallRail team began to notice elevated call failure rates with one of our major carrier partners. Our engineering team immediately started an investigation, but was unable to find a root cause in a reasonable timeframe. By 2:00pm EDT support tickets began escalating, and we notified our carrier partner to engage their help. We modified call routing plans in an effort to isolate the problem, and we rolled back all application changes to a known good state from the prior week. However, the situation only seemed to worsen and neither CallRail nor our carrier partner could identify the source of the problem.
By 2:30pm EDT it was apparent that a majority of calls were now failing and our efforts were not resolving the issue. At 2:39pm EDT we made the decision to take the application offline in order to force all calls to backup routing.
Backup routing is a last-ditch effort to deliver calls to their destination. It operates at our carrier, without any interaction from our systems. When backup routing takes over, calls are delivered to the primary destination number, but recording, tracking and analytics, and advanced Call Flow control are unavailable. Once backup routing was forced, calls began completing to their destinations again.
Our team continued to search for a root cause with our carrier partner. No other customers of our carrier partner were reporting issues, and traffic was flowing smoothly through backup routing. With no paths to follow, we decided to bring the application back online at 3:22pm EDT. Calls completed as expected, so we decided to monitor closely.
By 3:39pm EDT our monitoring tools were hinting at a problem again, but the evidence was still inconclusive. We continued searching for a root cause, and by 3:47pm EDT CallRail identified a configuration at our carrier partner that limited the number of calls we could forward in a given period of time. At 3:55pm EDT we contacted our carrier partner and asked them to raise this limit.
When we exceed this limit, calls are queued and delivered when resources are available. Initially this caused a delay of one, two, or five seconds. Of course any delay is undesirable, but most calls were completing as expected with potentially an extra ring on the caller’s end. However, our monitoring tools were indicating a growing problem.
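To illustrate why a small overage can turn into long waits, here is a minimal sketch of a queue behind a fixed forwarding limit. It is a simplified model, not a description of our carrier's actual queueing behavior: the function name `queue_delays`, the steady arrival rate, and every number in it are assumptions chosen only for illustration.

```python
def queue_delays(arrivals_per_sec, limit_per_sec, seconds):
    """Estimate how long a newly queued call waits at the end of each second.

    Simple fluid model: calls arrive at a steady rate, at most
    `limit_per_sec` of them are forwarded, and the rest wait in a FIFO queue.
    """
    backlog = 0.0
    delays = []
    for _ in range(seconds):
        # The backlog grows by the excess over the limit and never goes negative.
        backlog = max(0.0, backlog + arrivals_per_sec - limit_per_sec)
        # A call joining the queue now waits for the backlog ahead of it to drain.
        delays.append(backlog / limit_per_sec)
    return delays


# Hypothetical numbers: traffic runs 10% over a limit of 10 calls per second.
delays = queue_delays(arrivals_per_sec=11, limit_per_sec=10, seconds=300)
print(delays[59])   # 6.0 seconds of delay after one minute
print(delays[299])  # 30.0 seconds of delay after five minutes
```

In this model the delay keeps climbing for as long as traffic stays above the limit, which matches the pattern of short initial delays followed by a growing problem.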
Changing this limit proved not to be straightforward. By 4:09pm EDT the problem had grown significantly, and calls were queueing for 20 or 30 seconds, which is long enough that the caller would hang up. We decided to take the app offline again in order to force calls to backup routing. By nature of its simplified routing plan, backup routing eliminated simulcalls and therefore reduced the number of outbound calls our platform would need to place, which would keep us safely under this configuration limit until it could be raised.
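The effect of eliminating simulcalls can be sketched with a few lines of arithmetic. This is an illustrative model only: the function `outbound_dials`, the destination lists, and the assumption that the first listed number is the primary destination are all hypothetical, not our routing implementation.

```python
def outbound_dials(inbound_calls, backup_routing=False):
    """Count outbound dials needed for a batch of inbound calls.

    Each inbound call is represented by its list of destination numbers.
    Normal routing with simulcall dials every destination at once; backup
    routing dials only the primary (first) destination.
    """
    return sum(1 if backup_routing else len(destinations)
               for destinations in inbound_calls)


# Three hypothetical inbound calls with 3, 1, and 2 destinations.
calls = [["dest-a", "dest-b", "dest-c"], ["dest-d"], ["dest-e", "dest-f"]]
print(outbound_dials(calls))                       # 6 dials with simulcall
print(outbound_dials(calls, backup_routing=True))  # 3 dials under backup routing
```

With the same inbound volume, dialing only one destination per call lowers the number of outbound calls counted against the limit.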
At 4:40pm EDT we determined that the normal tapering of calls at the end of the east-coast business day had decreased call traffic enough that our prior limit could handle the volume. We brought the application back online at this time and normal call routing resumed.
At 5:30pm EDT we received word that our carrier partner had raised this limit and we declared the incident resolved.
Unfortunately, because of how our backup routing mechanism works, it is not possible to retrieve recordings of calls that occurred while backup routing was enabled. Even if it were possible, our backup routing system cannot play greeting messages, so recovering recordings would be inappropriate given notification policy and compliance concerns.
A preliminary version of this postmortem stated that we were attempting to recover analytics and call log data for calls that occurred during backup routing. Our team was able to recover this data, and all calls routed during the backup routing periods identified above are now visible in your dashboard. Analytics and visitor session data were properly tracked during this downtime period.
The immediate remediation for this failure is to raise the dial limit with our carrier partner. This has been raised by 50% in the short-term, and we are currently working with them to determine an appropriate level for longer-term stability.
CallRail and our partner have identified several other remediations to be undertaken in the coming weeks to prevent events like this in the future.
CallRail’s performance and uptime are a source of pride for our entire team, and are one of the main reasons our customers choose us. We failed to deliver on that promise last week. Ultimately, this incident was caused by growth: the CallRail platform handles tens of thousands of phone calls every hour, and this incident highlighted a key choke point we had not yet anticipated. We have learned from this oversight, and are taking steps to resolve those concerns. We hope this explanation has clarified the cause of this event, and provided confidence in our stability as we continue to grow.