This typo sparked a Microsoft Azure outage

Errant code fix deleted an entire server rather than database snapshots


Microsoft Azure DevOps, a suite of application lifecycle services, stopped working in the South Brazil region for about ten hours on Wednesday due to a basic code error.

On Friday Eric Mattingly, principal software engineering manager, offered an apology for the disruption and revealed the cause of the outage: a simple typo that deleted seventeen production databases.

Mattingly explained that Azure DevOps engineers occasionally take snapshots of production databases to look into reported problems or test performance improvements. And they rely on a background system that runs daily and deletes old snapshots after a set period of time.
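
For illustration, a retention-based cleanup job of that general shape might look like the sketch below. Every name in it, from the client object to the retention window, is a hypothetical stand-in, not Microsoft's actual code.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=14)  # assumed retention window, for illustration only


def cleanup_old_snapshots(client, server_name: str) -> None:
    """Delete snapshot databases older than the retention window.

    `client` stands in for a thin wrapper around the Azure SQL management
    API; this is a sketch of the general pattern, not Azure DevOps' job.
    """
    cutoff = datetime.now(timezone.utc) - RETENTION
    for db in client.list_databases(server_name):
        if db.is_snapshot and db.created_at < cutoff:
            client.delete_database(server_name, db.name)
```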

During a recent sprint – a time-boxed chunk of project work, in Agile jargon – Azure DevOps engineers performed a code upgrade, replacing deprecated Microsoft.Azure.Management.* packages with supported Azure.ResourceManager.* NuGet packages.

The result was a large pull request – a batch of code changes that has to be reviewed and merged into the applicable project – swapping API calls in the old packages for their equivalents in the newer ones. The typo lurked in that pull request, and it led the background snapshot deletion job to delete an entire server rather than just a snapshot database.

"Hidden within this pull request was a typo bug in the snapshot deletion job which swapped out a call to delete the Azure SQL Database to one that deletes the Azure SQL Server that hosts the database," said Mattingly.

Azure DevOps has tests to catch such issues, but according to Mattingly, the errant code only runs under certain conditions and thus isn't well covered by existing tests. Those conditions, presumably, require the presence of a database snapshot old enough to be caught by the deletion script.
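
The coverage gap is easy to picture with the hypothetical job above: a test only exercises the deletion branch if its fixture contains a snapshot old enough to expire. A minimal sketch, using a made-up fake client that simply records which delete calls the job makes:

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


class RecordingClient:
    """Fake client for illustration: records which delete calls the job makes."""

    def __init__(self, databases):
        self._databases = databases
        self.deleted_databases = []
        self.deleted_servers = []

    def list_databases(self, server_name):
        return self._databases

    def delete_database(self, server_name, name):
        self.deleted_databases.append(name)

    def delete_server(self, server_name):
        self.deleted_servers.append(server_name)


def test_cleanup_only_deletes_expired_snapshots():
    now = datetime.now(timezone.utc)
    client = RecordingClient([
        SimpleNamespace(name="prod-db", is_snapshot=False,
                        created_at=now - timedelta(days=400)),
        SimpleNamespace(name="old-snap", is_snapshot=True,
                        created_at=now - timedelta(days=30)),
    ])
    cleanup_old_snapshots(client, "sql-server-1")
    assert client.deleted_databases == ["old-snap"]
    # This assertion is what would flag the server-delete typo, but only in an
    # environment that has an expired snapshot to begin with.
    assert client.deleted_servers == []
```

With no aged snapshot in the fixture, the loop body never runs, so the correct and the typo'd versions of the job are indistinguishable to the test.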

Mattingly said Sprint 222 was deployed internally (Ring 0) without incident due to the absence of any snapshot databases. Several days later, the software changes were deployed to the customer environment (Ring 1) for the South Brazil scale unit (a cluster of servers for a specific role). That environment had a snapshot database old enough to trigger the bug, which led the background job to delete the "entire Azure SQL Server and all seventeen production databases" for the scale unit.

The data has all been recovered, but it took more than ten hours. There are several reasons for that, said Mattingly.

One is that customers can't revive Azure SQL Servers themselves, so on-call Azure engineers had to do it – a process that took about an hour in many cases.

Another reason is that the databases had different backup configurations: some were configured for Zone-redundant backup and others were set up for the more recent Geo-zone-redundant backup. Reconciling this mismatch added many hours to the recovery process.

"Finally," said Mattingly, "Even after databases began coming back online, the entire scale unit remained inaccessible even to customers whose data was in those databases due to a complex set of issues with our web servers."

These issues arose from a server warmup task that iterated through the list of available databases with a test call. Databases still being recovered threw an error that led the warmup test "to perform an exponential backoff retry resulting in warmup taking ninety minutes on average, versus sub-second in a normal situation."
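
To put numbers on that, here is a minimal sketch, with hypothetical names rather than Microsoft's actual warmup code, of a warmup loop that probes each database and backs off exponentially when the probe fails:

```python
import time


def warm_up(databases, probe, max_attempts=10, base_delay=1.0):
    """Probe each database once, retrying with exponential backoff on failure.

    `probe` is a stand-in for the warmup test call; a database still being
    restored makes every attempt fail, so the loop sleeps 1, 2, 4, ... seconds
    before giving up on that database and moving to the next one.
    """
    for db in databases:
        delay = base_delay
        for _ in range(max_attempts):
            try:
                probe(db)
                break
            except ConnectionError:
                time.sleep(delay)
                delay *= 2
```

With ten attempts at a one-second base delay, a single still-recovering database accounts for roughly 17 minutes of sleeping (1 + 2 + ... + 512 seconds), so probing a handful of them is enough to reach the ninety-minute figure Mattingly mentions.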

Further complicating matters, the recovery process was staggered: once one or two of the servers started taking customer traffic again, they would get overloaded and go down. Ultimately, restoring service required blocking all traffic to the South Brazil scale unit until everything was sufficiently ready to rejoin the load balancer and handle traffic.

Various fixes and reconfigurations have been put in place to prevent the issue from recurring.

"Once again, we apologize to all the customers impacted by this outage," said Mattingly. ®
