Posts

Friday Thought: All Outages are Not Equal

Last week Google Docs experienced an outage lasting about 30 minutes. Almost immediately, the “reconsider the cloud” articles and blogs began to appear. Articles like this one on Ars Technica immediately lump the Google Docs outage with other cloud outages, including Amazon’s outage earlier this year and the ongoing problems with Microsoft’s BPOS and Office365 services.

And while no outages are good, they are not all the same. In most cases, the nature of the outages and their impact reflect the nature of the architecture and the service provider.

  • The Google Docs outage was caused by a memory error and was exposed by an update.  Google acknowledged the error and resolved the issue in under 45 minutes.
  • Amazon’s outage was a network failure that took an entire data center offline.  Customers that signed up for redundancy were not impacted.
  • Microsoft’s flurry of outages, including a 6-hour outage that took Microsoft almost 90 minutes to fully acknowledge, appears to be related to DNS, load, and other operational issues.

Why is it important to understand the cause and nature of the outage?  With this understanding, you can provide rational comparisons between cloud and in-house systems and between vendors.

Every piece of software has bugs and some bugs are more serious than others.  Google’s architecture enables Google to roll forward and roll back changes rapidly across their entire infrastructure.  The fact that a problem was identified and corrected in under an hour is evidence of the effectiveness of their operations and architecture.

To compare Google to in-house systems, consider that Microsoft releases bug fixes and updates monthly, and these generally require server reboots.  Depending on the size and use of each server (file/print, Exchange, etc.), multiple reboots may be necessary and reboots can run well over an hour.  In the last two years, over 50% of all “patch Tuesday” releases have been followed up with updates, emergency patches, or hot-fixes with the recommendation of immediate action.  Fixing a bug in one of Microsoft’s releases can take from hours to days.  Comparatively, under an hour is not so shabby.

When looking across cloud vendors, the nature of the outage is also important.  Amazon customers that chose not to pay extra for redundancy knowingly assumed a small risk that their systems could become unavailable due to a large error or event.  Just like any IT decision, each business must make a cost/benefit analysis.

Customers should understand the level of redundancy provided with their service and the extra costs involved to ensure better availability.

The most troubling of the cloud outages are Microsoft’s.  Why?  Because the causes appear to relate to an inability to manage a high-volume, multi-tenant infrastructure.  Just like you cannot watch TV without electricity, you cannot run online services (or much of anything on a computer) without DNS.  That Microsoft continues to struggle with DNS, routing, and other operational issues leads me to believe that their infrastructure lacks the architecture and operating procedures to prove reliable.

Should cloud outages make us wary? Yes and no.  Yes to the extent that customers should understand what they are buying with a cloud solution — not just features and functions, but ecosystem.  No, to the extent that when put in perspective, cloud solutions are still generally proving more reliable and available than in-house systems.

 

 

Tuesday Take-Away: The True Role of the SLA

As you look towards cloud solutions for more cost-effective applications, infrastructure, or services, you are going to hear (and learn) a lot about Service Level Agreements, or SLAs.  Much of what you will hear is a big debate about the value of SLAs and what SLAs offer you, the customer.

Unfortunately, some vendors are framing the value of their SLAs based on the compensation customers receive when the vendor fails to meet its service level commitments.  The best example of this attitude is Microsoft’s comparison of its cash payouts to Google’s SLA that provides free days of service.  Microsoft touts its cash refunds as a better response to failure.  Why any company would send out a marketing message that begins with “When we fail …” is beyond me.  But that is a subject for another post someday.

That said, Microsoft and those customers who are comforted by the compensation are totally missing the point of the SLA in the first place.  Any compensation for excessive downtime is irrelevant with respect to the actual cost and impact on your business.  And unless a vendor is failing miserably and often, the compensation itself is not going to change the vendor’s track record.

The true role of the SLA is to communicate the vendor’s commitment to providing you with service that meets defined expectations for Performance, Availability, and Reliability (PAR).  The SLA should also communicate how the vendor defines and sets priorities for problems and how they will respond based on those priorities.  A good SLA will set expectations and define the method of measuring if those expectations are met.

Consider the Microsoft and Google example again.  Microsoft sets an expectation that you will have downtime.  While the downtime is normally scheduled in advance, it may not be.  Google, in contrast, sets an expectation that you should have no downtime, ever.  The details follow.

Microsoft’s SLA is typical in that it excludes maintenance windows, periods of time the system will be unavailable for scheduled or emergency maintenance.  While Microsoft does not schedule these windows at a regular weekly or monthly time, it does promise to give you reasonable notice for maintenance windows.  The SLA, however, allows Microsoft to declare emergency maintenance windows with little or no notice.

In August 2010, Microsoft’s BPOS service had 6 emergency maintenance windows, totaling more than 10 hours, in response to customers losing connectivity to the service, along with 30 hours of scheduled maintenance windows.  Customers experienced more than 40 hours of downtime that month, all of which falls within the boundaries of the SLA and its expectations.  On August 17, 2011, Microsoft experienced a data center failure that resulted in loss of Exchange access for its Office365 customers in North America for as long as five hours.  The system was down for 90 minutes before Microsoft acknowledged this as an outage.
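To put the August 2010 figures in perspective, here is a rough back-of-the-envelope sketch in Python.  It treats the 10 and 30 hour figures above as lower bounds, and it assumes (as described above, not quoted from the SLA text) that every maintenance window is excluded from the SLA calculation.  The function and variable names are my own, for illustration only.

```python
# Figures from the August 2010 example; both are "more than", so lower bounds.
EMERGENCY_HOURS = 10
SCHEDULED_HOURS = 30
HOURS_IN_AUGUST = 31 * 24  # 744 hours in the month


def availability(downtime_hours: float, period_hours: float) -> float:
    """Return the fraction of the period during which the service was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 1 - downtime_hours / period_hours


# What customers actually lived through that month.
experienced = availability(EMERGENCY_HOURS + SCHEDULED_HOURS, HOURS_IN_AUGUST)

# Assumption: all maintenance windows are excluded, so none of the
# 40 hours counts against the vendor when the SLA is measured.
sla_measured = availability(0, HOURS_IN_AUGUST)

print(f"Experienced availability:  {experienced:.2%}")
print(f"SLA-measured availability: {sla_measured:.2%}")
```

Under these assumptions, customers saw roughly 94.6% availability at best, while the SLA measurement would still read 100%.  The gap between those two numbers is what the exclusions buy the vendor.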

Google’s SLA sets an expectation for system availability 24x7x365, with no scheduled downtime for maintenance and no emergency maintenance windows.

The difference in SLAs sets a very different expectation and makes a statement about how each vendor builds, manages, and provides the services you pay for.

When comparing SLAs, understand the role of maintenance windows and other “exceptions” that give the vendor an out.  Also, look at the following.

  • Definitions for critical, important, normal, and low priority issues
  • Initial response times for issues based on priority level
  • Target time to repair for issues based on priority level
  • Methods of communicating system status and health
  • Methods of informing customers of issues and actions/results

Remember, if you need to use the compensation clause, your vendor has already failed.