WebCOBRA.com Outages
How to Handle the Inevitable
Who Does What, When
The goal of WebCOBRA.com Operations is to provide a stable platform 24 hours a day, 7 days a week. Thanks to our planning and execution, over the last two years (Feb 2005 - Feb 2007) we had 0 minutes of unscheduled downtime and only 85.1 minutes of scheduled downtime. This gives us a Service Level of 99.91%.
But what about the remaining 0.09%? What would Travis Software Corp. do if we did have an unscheduled outage? In short: someone with the authority to decide whether it is in fact an outage declares it, and then we shut the system off and notify our users while we work to stabilize the system.
Defining an Outage
An outage is defined as a service level interruption affecting all users. This could be a networking problem, such as the website being unreachable. It could be a significant slowdown where the system feels unresponsive. Or it could be a programming error, where the application alters data that it should not, to an extent that restoring from a backup is necessary.
Who Can Declare an Outage
The following people can declare an outage:
- President of Travis Software Corp.
- Director of Development
- Director of Support
- Team Lead: Network Systems Administrator
There are few occasions when the President, CSA, and Director of Support are all out of the office at the same time. If that happens, the remaining personnel at Travis, such as the Assistant Director, should meet with the ranking Network Systems Administrator and decide on their own whether an outage should be declared.
Declaring an Outage
Once the decision to declare an outage has been reached, Travis personnel should be notified. The decision maker should send out an email to "All Travis" that contains the following information:
Subject: WebCOBRA.com Outage Declared at MM/DD/YYYY HH:mm
Text: We have declared an outage for WebCOBRA.com. We will be shutting off access to the servers from the outside world and redirecting visitors to an unexpected outage page. More information will be available soon.
We will be emailing all of our current WebCOBRA.com customers and telling them of the outage. When the outage has been resolved, we will email both you and our WebCOBRA.com customers.
Actions to Take During Outage
Once an outage has been declared, the first step is to secure the system. The second is to notify users. Then work should start on correcting the system.
Turning System Off
To turn off the system, set the Network Load Balancer so that all WebCOBRA nodes are turned off. Both the Director of Development and the Team Lead of Network Operations will know how to do this.
Next, the WebCOBRA.com Asynchronous Services need to be turned off at each Application Server. This will stop Letter Production.
In the event that both the CSA and the NSA are unavailable, our Hosting Provider will be told that we need to initiate the Outage Protocol, and will be asked to log in to the network devices and turn off our nodes and the WebCOBRA.com Asynchronous Services.
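The shutdown order and the fallback rule above can be sketched in code. This is only an illustration under stated assumptions: the function names, the callables passed in, and the server names are hypothetical, and the real load balancer and service controls are site-specific and not described in this document.

```python
from typing import Callable, Iterable, List


def choose_operator(csa_available: bool, nsa_available: bool) -> str:
    """Decide who carries out the shutdown, following the fallback rule above."""
    if csa_available or nsa_available:
        return "Travis staff"
    # Neither is reachable: ask the Hosting Provider to run the Outage Protocol.
    return "Hosting Provider"


def turn_system_off(
    disable_all_nodes: Callable[[], None],
    stop_async_services: Callable[[str], None],
    app_servers: Iterable[str],
) -> List[str]:
    """Run the shutdown steps in the documented order and return a log."""
    log: List[str] = []
    # Step 1: take every WebCOBRA node out of the Network Load Balancer.
    disable_all_nodes()
    log.append("load balancer: all WebCOBRA nodes off")
    # Step 2: stop the Asynchronous Services on each Application Server,
    # which stops Letter Production.
    for server in app_servers:
        stop_async_services(server)
        log.append(f"{server}: Asynchronous Services stopped")
    return log


if __name__ == "__main__":
    # Stand-in actions that only print; the real ones are assumptions here.
    steps = turn_system_off(
        disable_all_nodes=lambda: print("disabling nodes"),
        stop_async_services=lambda s: print(f"stopping services on {s}"),
        app_servers=["app1", "app2"],  # hypothetical server names
    )
    print(steps)
    print(choose_operator(csa_available=False, nsa_available=False))
```

Returning a log of completed steps gives the person writing the post-outage report a record of what was turned off and in what order.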
Notifying Users
We use Jangomail to communicate planned outages to our WebCOBRA.com customers, so it is a natural fit for communicating unplanned outages as well.
Once the system has been turned off, either the CSA or Administration should log in to Jangomail and send an email to our customers that contains the following information:
Subject: WebCOBRA.com Service Interruption at MM/DD/YYYY HH:mm
Text: WebCOBRA.com has been informed of a Service Level Interruption in our Houston West Datacenter. We have locked down the WebCOBRA.com service and will bring the system up as soon as possible.
Once we have verification that we have solved the issue and the system is fully operational, we will email you again letting you know you can log back into the system.
Post Outage
Once the outage has been resolved and the system is back online, we need to notify the users that they can log back in. We'll send another email using Jangomail to the same list of recipients in the WebCOBRA.com Current Customers group, telling them:
Subject: WebCOBRA.com Service Restored at MM/DD/YYYY HH:mm
Text: The Service Level Interruption experienced at our Houston West Datacenter has been resolved. You can log back into the system at https://app.webcobra.com/exec
We are currently working on a system that will prevent this kind of outage from occurring in the future. Thank you for your understanding.
In the most extreme case, this email should also state whether we had to restore the customer's database from the previous night's backup.
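The three email subjects in this document share the same MM/DD/YYYY HH:mm timestamp, so they can be generated consistently. The following is a minimal sketch, not part of any existing tooling: the function name and the dictionary keys are hypothetical, and a 24-hour clock is assumed for HH:mm.

```python
from datetime import datetime

# Subject prefixes copied from the three email templates in this document.
SUBJECT_PREFIXES = {
    "declared": "WebCOBRA.com Outage Declared at",
    "interruption": "WebCOBRA.com Service Interruption at",
    "restored": "WebCOBRA.com Service Restored at",
}


def subject_line(kind: str, when: datetime) -> str:
    """Build a subject line with the timestamp as MM/DD/YYYY HH:mm (24-hour assumed)."""
    return f"{SUBJECT_PREFIXES[kind]} {when.strftime('%m/%d/%Y %H:%M')}"


if __name__ == "__main__":
    # Example date chosen only for illustration.
    print(subject_line("restored", datetime(2007, 2, 14, 9, 5)))
```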
Learning from What Happened
After the outage is resolved, the Director of Development is responsible for producing a report within 24 hours detailing the cause of the problem and the fix.
Then, the next day, the Managers of Support, Sales, Executives, and WebCOBRA.com Operations should meet to discuss the outage. This postmortem will dissect how Travis acted in the crisis and what can be done to improve our handling of the next one.

