Microsoft Explains Lengthy Windows Azure Outage -- Visual Studio Magazine

Microsoft Explains Lengthy Windows Azure Outage

The service disruption happened on Dec. 28, 2012 and affected 1.8 percent of Microsoft's total Windows Azure Storage accounts.

By Kurt Mackie
01/18/2013

It was another reminder of the potential pitfalls of cloud computing: late last year, many Microsoft customers in the South experienced a lengthy outage to their Windows Azure Storage service.

It wasn't the first outage experienced by Azure, but it was the longest. That's why Microsoft on Wednesday issued a detailed apology and explanation for a two-day Windows Azure Storage disruption that affected its U.S. South-region customers in late December.

The service disruption happened on Dec. 28, 2012 and affected 1.8 percent of Microsoft's total Windows Azure Storage accounts, according to a Microsoft blog post. Thousands of businesses may have experienced problems with online services and Web sites, although Microsoft didn't provide a number. In May, a Microsoft official said that Windows Azure was supporting "high tens of thousands" of customers.

The Windows Azure Storage service was eventually fully restored on Dec. 30, 2012, but customers initially were kept in the dark about outage details for 1.5 hours. The problem was associated with a single "storage stamp," which is Microsoft's name for a regional unit consisting of multiple storage node stacks. Microsoft's Windows Azure cloud-based service typically depends on multiple storage stamps per region.

Windows Azure subscribers get updates about the service's performance through Microsoft's Primary Service Health Dashboard. However, during the time of this incident, they couldn't get the details for 1.5 hours because the Dashboard relied on the very storage stamp that had experienced problems, according to Microsoft's explanation.

"On December 28, 2012, from 7:30 am (PST) to approximately 9:00 am (PST) the Primary Service Health Dashboard was unavailable, because it relied on data in the affected storage stamp," Microsoft explained in the blog post.

Microsoft attributed the cause of the service disruption to human error, but it likely was an easy error to make, given the system's complexity, as described in the blog post. The problem arose because of the way storage nodes are brought back into service after being taken out for maintenance. A certain configuration that protects the nodes from being overwritten needs to be turned on when bringing the nodes back into service, but a technician forgot to turn on that protection, according to Microsoft. That error led to a node overwrite and service disruption.

The resulting two-day delay in restoring service was associated with Microsoft's attempt to restore the data at the failed storage stamp location with no loss of customer data. While Microsoft does have a georedundant service for Windows Azure that could have restored the data from another location, taking that approach would have lost about 8 GB of recent data for all of Microsoft's Windows Azure customers.

Microsoft's blog post indicated that the company would credit its Windows Azure Storage customers 100 percent for this service disruption in their December bills. Normally, Microsoft's service level agreement for Windows Azure Storage provides for a service credit of just 10 percent (99.9 percent uptime) or 25 percent (99 percent uptime).

Microsoft is also promising to improve the service in the future. It plans to improve its georeplication service to respond quicker should another such storage service disruption occur. Procedures associated with the Primary Service Health Dashboard failure have already been improved, according to Microsoft's blog post.

However, dashboard problems during a Windows Azure service disruption have been seen before. In February, the dashboard went down in association with a purported "leap year bug" service failure. The dashboard management service was restored after a near 24-hour blackout period.

About the Author

Kurt Mackie is senior news producer for 1105 Media's Converge360 group.

Printable Format

comments powered by Disqus

Featured

Low-Coding in the Age of AI: Dataverse Embraces Copilot, Claude and Cursor

Microsoft is extending Dataverse into coding-agent marketplaces while expanding its MCP tools, certification program and governance controls.
Visual Studio Takes Aim at Copilot Billing Shock

Beyond Copilot usage visibility, the June update delivers several other enhancements centered on AI-assisted development, security and quality-of-life improvements. Here's a quick rundown of the remaining additions announced by Microsoft.
Claude AI Gets Yet Another Boost in VS Code 1.128

The July 8, 2026, Visual Studio Code update expands agent workflows, chat attachments, browser-tab controls, OS-level shortcuts and enterprise telemetry management.
TypeScript 7 Arrives to Rock VS Code with Go-Powered Speed

Microsoft says TypeScript 7, announced July 8, brings native Go performance to VS Code, Visual Studio and other editors.