More about yesterday’s events—and beyond
Dear Customers and Partners,
I hope you’ve seen the letter from Phil Koen, Intermedia’s CEO, explaining the cause of yesterday’s network interruption and what we’re planning to do.
As Intermedia’s COO, I want to apologize again for the interruption to your business. And I want to thank you for being an Intermedia customer or partner.
In this note, I will address some of the common questions we’ve been receiving since Phil Koen’s note was published.
1) Doesn’t Intermedia have redundant systems?
Short answer: All of Intermedia’s systems are indeed redundant. The problem was that the routers connecting all our systems each received an invalid update, which prevented them from passing traffic and took our services offline.
Long answer: Let me be more specific about redundancy. Intermedia has invested tens of millions of dollars into our hardware and systems. Every element of our cloud has redundancy. This includes:
- Redundant databases
- Redundant servers
- Redundant inbound and outbound Tier 1 Internet connections via providers that include Sprint, Level 3 and Verizon
- Redundant power with back-up generators
- Redundant backups of all data, refreshed nightly
- Redundant routers in our network core and network edge
- Large, geographically distributed teams of smart, highly experienced system administrators, engineers and support personnel
- Multiple top-tier, physically secure datacenters throughout the US and Europe (10 in total)
Our cloud is architected to meet and exceed industry redundancy standards. In yesterday’s case, redundancy was not the issue. We could have had hundreds of extra redundant routers, but the nature of this error would have corrupted all of them.
I am confident that this issue has been identified and addressed. Nevertheless, our team is working with our networking equipment vendor on an in-depth architectural review. This will help us assure the integrity of our routers—and our other network components—in the future.
2) I know that Intermedia has a 99.999% uptime Service Level Agreement. Are you going to honor it?
Short answer: Of course. And we’re streamlining the process by waiving the 72-hour reporting requirement. We will credit your account directly. You will hear from us shortly with details.
Long answer: Intermedia’s 99.999% Service Level Agreement guarantees downtime of less than 26 seconds per month. In normal months, we see very few claims on our SLA. I’m very proud of our uptime.
Unfortunately, this is no longer a normal month. Yesterday’s event clearly exceeded the limits of our SLA. And we’re fully prepared to honor our commitment for compensation.
Our policies normally require customers making an SLA claim to open a support ticket within 72 hours of an event. In the case of this event, though, we are waiving that requirement. You do not need to contact us to request an SLA credit.
We are identifying the easiest method for issuing SLA credits to all our customers. You will hear more about this soon. In the meantime, just to reiterate, there’s no need to open a support ticket for the purpose of SLA credit.
(However, I do welcome you to contact your account manager if you have questions about the SLA, or contact Support about anything else.)
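As a rough check on the 26-second figure quoted in the long answer above, here is a minimal sketch of the arithmetic. It assumes a 30-day month, which is an assumption for illustration only; the measurement window actually used by the SLA is not stated in this note. The function name is likewise hypothetical.

```python
# Sketch: how much downtime a given uptime percentage allows.
# Assumption: a 30-day month; the SLA's real measurement window may differ.
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(uptime_percent: float, days: int = 30) -> float:
    """Return the downtime budget, in seconds, for the given uptime percentage."""
    total_seconds = days * SECONDS_PER_DAY
    return total_seconds * (100.0 - uptime_percent) / 100.0


print(round(allowed_downtime_seconds(99.999), 2))  # prints 25.92
```

Under that assumption, 99.999% uptime leaves a budget of just under 26 seconds per month, which matches the figure above.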
3) Why did I have so much trouble contacting Support yesterday?
Short answer: Our phones and email went down, too.
Long answer: Intermedia isn’t just the provider of our Office in the Cloud—we’re also a customer. Our phones, email, chat, content management and call center tools all come from the same cloud as the services we provide to you. This router issue disrupted our services as well.
I continue to have full trust in our network and in the people who maintain it. Therefore, we will continue to rely on our own services. To mitigate risk, we will take a number of steps to isolate our critical communications tools from other parts of our network.
In addition, we will use alternate communications channels more regularly—not just in the case of service events, but to make sure we’re more available to you on an everyday basis.
More about this in question #4.
4) How are you going to improve your communications?
Short answer: In addition to isolating our communications tools within our network, we’re going to integrate more channels into our communications processes.
Long answer: Our customer communications processes and systems have historically been built around three touch-points: phone support, email support, and our customer notification tool (which utilizes SMS, HostPilot, PBX outage messages, and alternate email addresses).
Due to this network routing issue, we were unable to communicate with many of you via our phones and email for part of the day. And not all of you received messages via our customer notification tool.
I know how frustrating this was. It was just as frustrating for me to watch our phones sitting silently, knowing you were trying to reach us.
Many of you were updated via our website, blog and Facebook page. Many others reached out to us via Twitter. At one point, we were receiving 40 messages a minute. It took us some time to cope with this flow, but we were eventually able to respond individually to as many questions as we could. We were slower to use these alternative communications platforms than I would have liked, but once we did, it was clear that customers appreciated getting regular updates and being able to communicate with us.
With this experience in mind, our operations team is improving its use of Twitter, Facebook, LinkedIn, Spiceworks and other platforms in our communications processes.
More than that, we will use our website and our blog much more actively to communicate with you. This will go beyond service events: we will also broadcast news about upgrades, service enhancements, bug fixes and much more.
5) What happens next?
I want to reiterate what Phil said in his letter: all our systems are running normally. This event is behind us. Our priority now is to complete our RFO (Reason for Outage) report. We’re working with our network equipment vendor to do so. We will share it with you when it’s complete.
As I’ve outlined above, there are specific steps that we’re already taking. The RFO will certainly give us more ideas. We are also listening very closely to customers and partners. We plan to invest in and implement changes that will make us better.
In the meantime, I thank you for your continued trust in Intermedia. If you have any concerns or questions, don’t hesitate to contact us at 800.379.7729 or firstname.lastname@example.org. Also, please continue visiting our blog for further updates.
Chief Operating Officer, Intermedia