Tuesday Take-Away: The True Role of the SLA

As you look toward cloud solutions for more cost-effective applications, infrastructure, or services, you are going to hear (and learn) a lot about Service Level Agreements, or SLAs.  Much of what you will hear is a debate about the value of SLAs and what they offer you, the customer.

Unfortunately, some vendors are framing the value of their SLAs based on the compensation customers receive when the vendor fails to meet its service level commitments.  The best example of this attitude is Microsoft’s comparison of its cash payouts to Google’s SLA, which provides free days of service.  Microsoft touts its cash refunds as a better response to failure.  Why any company would send out a marketing message that begins with “When we fail …” is beyond me.  But that is a subject for another post someday.

That said, Microsoft, and those customers who are comforted by the compensation, are missing the point of the SLA in the first place.  Any compensation for excessive downtime is trivial compared with the actual cost and impact on your business.  And unless a vendor is failing miserably and often, the compensation itself is not going to change the vendor’s track record.

The true role of the SLA is to communicate the vendor’s commitment to providing you with service that meets defined expectations for Performance, Availability, and Reliability (PAR).  The SLA should also communicate how the vendor defines and sets priorities for problems and how it will respond based on those priorities.  A good SLA will set expectations and define the method of measuring whether those expectations are met.

Consider the Microsoft and Google example again.  Microsoft sets an expectation that you will have downtime.  While the downtime is normally scheduled in advance, it may not be.  Google, in contrast, sets an expectation that you should have no downtime, ever.  The details follow.

Microsoft’s SLA is typical in that it excludes maintenance windows: periods of time when the system will be unavailable for scheduled or emergency maintenance.  While Microsoft does not schedule these windows on a regular weekly or monthly basis, it does promise to give you reasonable notice of them.  The SLA, however, allows Microsoft to declare emergency maintenance windows with little or no notice.

In August 2010, Microsoft’s BPOS service had 6 emergency maintenance windows, totaling more than 10 hours, in response to customers losing connectivity to the service, along with 30 hours of scheduled maintenance windows.  Customers experienced more than 40 hours of downtime that month, all of it within the boundaries of the SLA and its expectations.  On August 17, 2011, Microsoft experienced a data center failure that resulted in loss of Exchange access for its Office 365 customers in North America for as long as five hours.  The system was down for 90 minutes before Microsoft acknowledged this as an outage.
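
To put the August 2010 numbers in perspective, here is a rough sketch of the arithmetic.  It assumes the stated minimums (30 scheduled hours plus 10 emergency hours), a 31-day month, and an SLA that drops maintenance windows from both the downtime and the measurement period.  That exclusion rule and the function name are my assumptions for illustration, not the wording of any vendor's contract.

```python
HOURS_IN_AUGUST = 31 * 24  # 744 hours in a 31-day month


def availability(total_hours, downtime_hours, excluded_hours=0.0):
    """Return availability as a percentage of the counted hours.

    excluded_hours is downtime the SLA does not count (for example,
    maintenance windows). Under the assumed rule it is removed from
    both the downtime and the measurement period.
    """
    counted_hours = total_hours - excluded_hours
    counted_downtime = downtime_hours - excluded_hours
    return 100.0 * (counted_hours - counted_downtime) / counted_hours


scheduled_hours = 30.0   # scheduled maintenance windows
emergency_hours = 10.0   # emergency windows ("more than 10", so a floor)
downtime_hours = scheduled_hours + emergency_hours

# What customers lived through: every hour of downtime counts.
experienced = availability(HOURS_IN_AUGUST, downtime_hours)

# What the SLA measures if all of that downtime was declared maintenance.
sla_measured = availability(HOURS_IN_AUGUST, downtime_hours,
                            excluded_hours=downtime_hours)

print(f"Experienced availability:  {experienced:.2f}%")
print(f"SLA-measured availability: {sla_measured:.2f}%")
```

Under these assumptions, customers saw roughly 94.6% availability while the SLA could still report 100%.  The gap between the two numbers is the whole story of the exclusions.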

Google’s SLA sets an expectation for system availability 24x7x365, with no scheduled downtime for maintenance and no emergency maintenance windows.

The two SLAs set very different expectations, and each makes a statement about how the vendor builds, manages, and provides the services you pay for.

When comparing SLAs, understand the role of maintenance windows and other “exceptions” that give the vendor an out.  Also, look at the following:

  • Definitions for critical, important, normal, and low priority issues
  • Initial response times for issues based on priority level
  • Target time to repair for issues based on priority level
  • Methods of communicating system status and health
  • Methods of informing customers of issues and actions/results
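
One way to make the first three items concrete is to write a vendor's priority definitions and targets down as data and check real incidents against them.  The sketch below is only an illustration: every definition and number in it is a hypothetical placeholder, not taken from Microsoft's, Google's, or any other vendor's SLA, and the names PriorityTarget, SLA_TARGETS, and met_targets are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorityTarget:
    definition: str                # how the vendor defines this priority
    initial_response_hours: float  # time allowed to acknowledge the issue
    target_repair_hours: float     # time allowed to restore service


# Hypothetical placeholder values; fill these in from the SLA you are reviewing.
SLA_TARGETS = {
    "critical": PriorityTarget("service unavailable for all users", 0.5, 4.0),
    "important": PriorityTarget("major feature degraded", 2.0, 8.0),
    "normal": PriorityTarget("minor feature degraded", 8.0, 48.0),
    "low": PriorityTarget("cosmetic issue or question", 24.0, 120.0),
}


def met_targets(priority, response_hours, repair_hours):
    """Return True if an incident met both targets for its priority level."""
    target = SLA_TARGETS[priority]
    return (response_hours <= target.initial_response_hours
            and repair_hours <= target.target_repair_hours)


# An outage acknowledged after 90 minutes and resolved after five hours,
# checked against the hypothetical "critical" targets above.
print(met_targets("critical", response_hours=1.5, repair_hours=5.0))
```

If a vendor's SLA does not give you enough detail to fill in a table like this, that gap tells you something about the expectations the vendor is willing to set.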

Remember, if you need to use the compensation clause, your vendor has already failed.