Microsoft disclosed that the five-hour outage to Microsoft 365 worldwide on July 21 was due to an incorrectly built update that company engineers installed while working on an internal central configuration repository. The incident affected many of the company's customers and cloud services, which also use the Enterprise Configuration Service (ECS).
Read also: best keyboard for programming
Microsoft centralized cloud services allow you to make large-scale dynamic changes to the operation of services and cloud applications, as well as targeted changes, such as specific configurations for each partner, enterprise client or group of end users, using special tools. Misconfiguration of the central corporate ECS configuration service resulted in cascading outages in Microsoft 365 services and affected enterprise users in multiple regions.
Moreover, the incident was initially minor. It looked in the logs as a crash in Microsoft Teams. Then its influence began to expand dramatically to other cloud services. The outage ultimately affected several Microsoft 365 services with Teams integration that also use ECS, including Exchange Online, Windows 365, and Office Online.
As a result, corporate users around the world have begun reporting to Microsoft that they cannot use Microsoft Teams and several other Microsoft 365 services or features.
"This issue impacted the ability for users to connect to the desktop, web, and mobile clients of Microsoft Teams," Microsoft explained in its preliminary report.
Microsoft said telemetry identified 300,000 affected customers. Companies in the Asia Pacific (APAC) region were the hardest hit, as it was a business day there at the time of the outage. European and American customers were less affected. In addition, affected corporate customers experienced the most problems with Direct Routing and Skype MFA services not working.
According to a Microsoft report, the crash occurred on July 21 at 4:05 am Moscow time. The company's engineers fixed most of the problems associated with it within five hours. By 9:00 Moscow time, Microsoft 365 was back to normal. Some customers were still seeing residual issues with Microsoft cloud services until 16:00 Moscow time.
The investigation revealed that the incident affected customers who were attempting to use one or more of the following Microsoft 365 services and features (all of which were affected to varying degrees by the outage):
- Exchange Online (there was a delay in sending mail);
- Microsoft 365 admin center (access was denied);
- Microsoft Word on multiple cloud services (didn't download);
- Microsoft Forms (inability to use through Teams);
- Microsoft Graph API (any service using this API was affected)
- Office Online (there were problems accessing Microsoft Word);
- SharePoint Online (there were problems accessing Microsoft Word;
- Project Online (access was closed);
- PowerPlatform and PowerAutomate (inability to deploy a new environment using databases);
- automatic updates in Microsoft Managed Desktop (access was denied);
- Yammer (had problems launching Yammer);
- Windows 365 (cannot add or create new cloud PCs).
“The update to deploy to our ECS service contained a code defect that affected backward compatibility with other services using ECS. The end result was that services using ECS returned incorrect configurations to all of their connected partner services, ”the company admitted. “This resulted in downstream company and cloud services connected to Microsoft receiving a 200 status message indicating that the configuration pull was successful. In fact, it contained a distorted and non-working JSON object," the company's experts specified.
Microsoft stated that as a result of this incident, the company will improve the failover mechanism of the Microsoft Teams service to fall back to a cached version of the ECS configuration in the event of a similar failure with future ECS updates. The company will also implement tools to further isolate faults to limit their impact, as well as adjust monitoring thresholds to better detect these low-level faults in the early stages of their occurrence.
Sources: Python.Engineering, BleepingComputer.com