Upstream Network Impairment
Incident Report for CallRail
Postmortem for call connectivity downtime, May 14, 2018

On May 14, 2018, CallRail customers experienced a significant service interruption. This prevented calls from being delivered and impacted the availability of the application dashboard. We know any failure of call routing is a major disruption to our customers' businesses, and we take that responsibility extremely seriously. We have spent the last week discussing this incident internally and with our partners, and want to share what happened and the steps we are taking to prevent a recurrence.

Impact Timeline

1:30pm to 2:39pm EDT: Calls were tracked, but had elevated failure rates. Calls that succeeded during this timeframe followed normal Call Flow routing patterns and were recorded if configured. The application dashboard was available as expected.
2:39pm to 3:22pm EDT: Calls were delivered directly to destination numbers via our backup routing system. During this time calls were not tracked, and would not have been recorded or followed Call Flows. The application dashboard was also unavailable. Analytics data for these calls has since been recovered.
3:22pm to 4:09pm EDT: Calls were tracked, but had elevated failure rates. Calls that succeeded during this timeframe followed normal Call Flow routing patterns and were recorded if configured. The application dashboard was available as expected.
4:09pm to 4:40pm EDT: Calls were delivered directly to destination numbers via our backup routing system. During this time calls were not tracked, and would not have been recorded or followed Call Flows. The application dashboard was also unavailable. Analytics data for these calls has since been recovered.
4:40pm EDT: Normal call behavior resumed, and the application dashboard became available again.

Incident History

At 1:45pm EDT, the CallRail team began to notice elevated call failure rates with one of our major carrier partners. Our engineering team immediately began investigating, but was initially unable to identify a root cause. By 2:00pm EDT support tickets were escalating, and we notified our carrier partner to engage their help. We modified call routing plans in an effort to isolate the problem, and we rolled back all application changes to a known good state from the prior week. However, the situation only worsened, and neither CallRail nor our carrier partner could identify the source of the problem.

By 2:30pm EDT it was apparent that a majority of calls were now failing and our efforts were not resolving the issue. At 2:39pm EDT we made the decision to take the application offline in order to force all calls to backup routing.

Backup routing is a last-ditch effort to deliver calls to their destination. It operates entirely at our carrier, without any interaction with our systems. When backup routing takes over, calls are delivered to the primary destination number, but recording, tracking and analytics, and advanced Call Flow control are unavailable. Once backup routing was forced, calls began completing to their destinations again.

Our team continued to search for a root cause with our carrier partner. No other customers of our carrier partner were reporting issues, and traffic was flowing smoothly through backup routing. With no paths to follow, we decided to bring the application back online at 3:22pm EDT. Calls completed as expected, so we decided to monitor closely.

By 3:39pm EDT our monitoring tools were hinting at a problem again, but evidence was again inconclusive. We continued searching for a root cause, and by 3:47pm EDT CallRail identified a configuration at our carrier partner that limited the number of calls we could forward in a given period of time. We contacted our carrier partner asking to raise this limit at 3:55pm EDT.

When we exceed this limit, calls are queued and delivered as resources become available. Initially this caused delays of one, two, or five seconds. Any delay is undesirable, but most calls were still completing as expected, with perhaps an extra ring on the caller's end. However, our monitoring tools were indicating a growing problem.
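To illustrate why the delays grew steadily rather than spiking, the queuing behavior above can be sketched as a simple simulation. The numbers, rates, and function names here are purely illustrative assumptions, not CallRail's actual traffic or tooling:

```python
def simulate_queue_delay(arrival_rate, dial_limit, seconds):
    """Sketch of per-second queueing when outbound dials exceed a carrier limit.

    arrival_rate: calls arriving per second (hypothetical)
    dial_limit:   calls the carrier will place per second (hypothetical)
    Returns the backlog (queued calls) at the end of each second.
    """
    backlog = 0
    depths = []
    for _ in range(seconds):
        backlog += arrival_rate               # new calls needing to be placed
        backlog -= min(backlog, dial_limit)   # carrier drains up to its limit
        depths.append(backlog)
    return depths

# At just 110% of the limit, the backlog (and thus caller wait time) climbs
# gradually each second rather than spiking all at once.
print(simulate_queue_delay(arrival_rate=110, dial_limit=100, seconds=5))
# [10, 20, 30, 40, 50]
```

This is why the problem initially looked like an extra ring or two and only later became calls queuing long enough for callers to hang up.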

Changing this limit proved not to be straightforward. By 4:09pm EDT the problem had grown significantly, and calls were queueing for 20 to 30 seconds, long enough that callers would hang up. We decided to take the application offline again in order to force calls to backup routing. Because of its simplified routing plan, backup routing eliminates simulcalls, which reduced the number of outbound calls our platform needed to place and kept us safely under this configuration limit until it could be raised.

At 4:40pm EDT we determined that the normal tapering of calls at the end of the east-coast business day had decreased call traffic enough that our prior limit could handle the volume. We brought the application back online at this time and normal call routing resumed.

At 5:30pm EDT we received word that our carrier partner had raised this limit and we declared the incident resolved.

Data Recovery

Unfortunately, due to how our backup routing mechanism works, it is not possible to retrieve recordings of calls that occurred while backup routing was enabled. Even if it were, our backup routing system cannot play greeting messages, so callers would not have been notified that their call was being recorded; recovering these recordings would therefore raise notification-policy and compliance concerns.

A preliminary version of this postmortem suggested we were attempting to recover analytics and call log data for calls that occurred during backup routing. Our team was able to recover this data, and all calls routed during the periods identified as backup routing above are now visible in your dashboard. Analytics and visitor session data were properly tracked during this downtime period.

Remediation

The immediate remediation for this failure was to raise the dial limit with our carrier partner. The limit has been raised by 50% in the short term, and we are currently working with them to determine an appropriate level for longer-term stability.

CallRail and our partner have identified several other remediations to be undertaken in the coming weeks to prevent events like this in the future.

  1. With a better understanding of this limit, CallRail will monitor our platform's performance against it.
  2. Our carrier partner will provide tools to monitor this limit as well, and will monitor our performance against this and other account-based limits.
  3. CallRail will implement additional call monitoring to provide earlier notification of slow-growing problems such as this one. In retrospect, our existing monitoring focused on large, sudden shifts, which hindered our detection of this slower-growing problem.
  4. Our carrier partner is working to enable faster modification of this and other configuration settings in emergency situations. They have also provided a better communication channel between our teams for situations such as this.
  5. CallRail is building finer-grained control over which calls go to backup routing. This will be usable without requiring the entire application to be taken down, and in emergency situations will allow us to limit the impact to a smaller set of customers.
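The monitoring described in items 1 and 3 above can be sketched as an alert that fires on sustained utilization against a known limit, rather than on sudden spikes. This is a minimal illustration under assumed thresholds and sample data, not our actual monitoring implementation:

```python
def check_utilization(dials_per_minute, limit_per_minute, warn_at=0.8, windows=3):
    """Flag sustained high utilization against a known carrier dial limit.

    Alerts only when utilization stays at or above `warn_at` for `windows`
    consecutive samples, so a slow-growing trend is caught even though no
    single sample looks like a dramatic shift.
    Returns the sample indices at which an alert fires.
    """
    streak = 0
    alerts = []
    for minute, dials in enumerate(dials_per_minute):
        utilization = dials / limit_per_minute
        streak = streak + 1 if utilization >= warn_at else 0
        if streak >= windows:
            alerts.append(minute)
    return alerts

# Utilization creeps upward; the alert fires once three consecutive
# minutes are at or above 80% of the limit.
print(check_utilization([50, 70, 85, 90, 95, 60], limit_per_minute=100))
# [4]
```

A spike-focused alert would have stayed quiet on this series, which is exactly the gap described in remediation item 3.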

Conclusion

CallRail’s performance and uptime are a source of pride for our entire team, and one of the main reasons our customers choose us. We failed to deliver on that promise last week. Ultimately, this incident was caused by growth: the CallRail platform handles tens of thousands of phone calls every hour, and this incident highlighted a key choke point we had not yet anticipated. We have learned from this oversight and are taking steps to address it. We hope this explanation has clarified the cause of this event and provided confidence in our stability as we continue to grow.

Posted May 15, 2018 - 09:42 EDT

Resolved
We identified a resource limit with one of our upstream telephony providers which was causing calls to fail. We've received confirmation from our provider that the limit has been increased and all functionality has now been restored.

We are working with our provider to determine appropriate next steps to ensure that this issue does not occur again.
Posted May 14, 2018 - 17:47 EDT
Update
We are continuing to monitor for any further issues.
Posted May 14, 2018 - 17:05 EDT
Update
We have identified the issue and are working with our upstream telephony provider to resolve.
Posted May 14, 2018 - 16:43 EDT
Update
We have moved the CallRail application back into emergency maintenance mode to force all calls to backup routing. We are continuing to work with our provider.
Posted May 14, 2018 - 16:12 EDT
Update
We are continuing to monitor for any further issues.
Posted May 14, 2018 - 15:29 EDT
Monitoring
The application has been restored and call connection rates are improving. We are continuing to monitor call outcomes and are working with our providers to determine the root cause.
Posted May 14, 2018 - 15:28 EDT
Update
We are continuing to investigate this issue.
Posted May 14, 2018 - 14:49 EDT
Update
We are still investigating an issue routing phone calls at our upstream telephony provider. For the time being we have put our application into emergency maintenance mode to force all calls to backup routing. We are continuing to work with our provider to restore connectivity.
Posted May 14, 2018 - 14:43 EDT
Investigating
Beginning at 1:30pm EDT, we identified a decrease in completed calls with our upstream network providers. We are working with our upstream providers to restore connectivity for affected customers.
Posted May 14, 2018 - 14:08 EDT
This incident affected: Application Dashboard, Call Routing, Backup Call Routing, DNI and Keyword Tracking, and Call Recording.