AWS attributed Tuesday’s extended disruption to outdated processes and human error, according to a postmortem published Thursday.
The post, which classified the incident as a “service disruption,” states that the problem started at 12:37 p.m. ET when an authorized Amazon Simple Storage Service (Amazon S3) team attempted to resolve an issue that had caused the S3 billing system to behave slower than expected. One of the team members, following AWS guidelines, attempted to execute a command that would remove some of the servers for an S3 subsystem, but incorrectly entered one of the inputs.
As a result, too many servers were taken down, including those that supported two additional S3 subsystems that manage metadata and location information, as well as the allocation of new storage. Compounding problems and creating a Catch-22 scenario, the latter subsystem required the former to be in operation. The capacity removal required a full restart, and Amazon S3 was unable to service requests.
It appears the outage was due to fat fingering under pressure with an arcane, hardly used command, said Mike Matchett, senior analyst and consultant at Taneja Group.
“It need not ever have happened, but was too easy a mistake to make,” he said. “Once made, it cascaded into a major outage.”
The impact spread to additional services in the US East-1 region that rely on Amazon S3, including the S3 console, new Elastic Compute Cloud instance launches, Elastic Block Store volumes and AWS Lambda.
AWS’ system is designed to support the removal of significant capacity and is built for occasional failure, according to the company. It has performed this particular operation since S3 was first built, but a complete restart of the affected subsystems hadn’t been done in one of the larger regions in years, the company said.
It’s surprising that a system like AWS is vulnerable at this scope to manual errors, said Carl Brooks, an analyst with 451 Research. The initial failure is understandable, but the compounding impact shows a larger structural flaw in how AWS manages uptime, he added.
“It says something for one manual process to have that much disruptive effect,” Brooks said. “They claim they’ve been working against [these types of failures] all this time, and clear that work has not been completed.”
Amazon S3 fully recovered by 4:54 p.m. ET, and related services recovered afterward, depending on backlogs.
AWS said it has since modified the tool it uses to perform the debugging task to limit how fast it can remove capacity and to put in additional safeguards to prevent a subsystem from going below its minimum capacity. It’s also reviewing other operational tasks for similar architectural problems, and plans to improve the recovery time of critical S3 subsystems by breaking down services into smaller segments and limit the blast radius of a failure, the company said in the postmortem. That work was planned for S3 later this year but that timeline was pushed up following this incident. Company officials did not comment specifically on when this work would be completed.
Critical operations that involve shutting down key resources should be fully scripted and tested often, and it shows a level of hubris that AWS hadn’t tried this restart in years, Matchett said. He was also critical of having the health dashboard connected to S3, saying the management plane for high-availability workloads should have been on a completely separate one than the resources it was controlling.
“It really looks like AWS has built a bigger house of cards than even they are aware of,” Matchett said.
Changes also have been made to the Service Health Dashboard to run across multiple regions.