Once again, a flagship Microsoft cloud service blows through its Service Level Agreement like a blizzard through the Midwest. The February 22nd outage, which impacted all Azure users worldwide, lasted more than 12 hours.
The culprit: Microsoft failed to purchase and deploy replacements for expiring SSL certificates. In other words, Microsoft neglected to renew one of the most basic components securing the Azure service.
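Failures like this are exactly what automated expiry monitoring exists to prevent. As a minimal sketch (not Microsoft's tooling, just an illustration using Python's standard `ssl` module), a script can parse the `notAfter` field that `getpeercert()` returns and alert well before the deadline:

```python
import ssl
import datetime

def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate expires.

    `not_after` is the certificate's notAfter field in the format
    returned by ssl.SSLSocket.getpeercert(), e.g. "Feb 22 12:00:00 2013 GMT".
    `now` defaults to the current UTC time; it is a parameter here so the
    check is deterministic and testable.
    """
    # cert_time_to_seconds parses the GMT timestamp into epoch seconds.
    expires = datetime.datetime.utcfromtimestamp(ssl.cert_time_to_seconds(not_after))
    now = now or datetime.datetime.utcnow()
    return (expires - now).days

# Example: a cert expiring Feb 22, checked on Feb 1, has 21 full days left.
remaining = days_until_expiry("Feb 22 12:00:00 2013 GMT",
                              datetime.datetime(2013, 2, 1))
print(remaining)  # 21
```

In production the `notAfter` string would come from a live TLS handshake (`ssl.create_default_context()` plus `wrap_socket(...).getpeercert()`), and anything under, say, 30 days would page an operator. The point is that the check is a few lines of standard-library code; letting it fall through the cracks is a process failure, not a technical one.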
As noted on RedmondMag.com:
“Furious customers wanted to know how something as simple as renewing a SSL cert could fall through the cracks. Even worse, how could that become a single point of failure capable of bringing down the entire service throughout the world?”
Once again, an operational error puts thousands of customers in the dark. And this outage is one in a string of major service outages, including:
- An unacknowledged 7.5-hour outage for North American Outlook.com users on February 25th
- A pair of Office 365 outages lasting more than 9 hours in November
Microsoft described the issue as “a breakdown in our procedures.” If not for the disruption and financial impact on thousands of companies, that statement might be considered almost comical. Ironically, a different certificate error was behind a major Azure outage in February 2012.
To put this in perspective, how would you respond if your internal IT department had Microsoft’s track record of catastrophic failure?
It is difficult to trust that Microsoft has the operational maturity and rigor to design and manage multi-tenant, hosted services. The Azure outage, and others like it, demonstrate immaturity, negligence, or incompetence. Do the reasons matter, given the frequency and impact? With certificate outages in two consecutive annual renewal cycles, it is hard to believe that Microsoft is learning from its mistakes.