AWS suffered its first major disruption in more than a year on Tuesday -- and it couldn't have happened to a worse service or in a worse location.
Problems with Amazon Simple Storage Service (S3) started in the U.S. East-1 region sometime around 12:30 p.m. EST Tuesday, and the AWS disruption lasted more than four hours before being fully resolved. S3 is integral to multiple AWS cloud services, and the region is one of the major hubs for its customers' data.
According to AWS, as of 1:49 p.m. PST, "We are fully recovered for operations for adding new objects in S3, which was the last operation showing a high-error rate. The Amazon S3 service is operating normally."
"U.S. East-1 isn't a good region for Amazon to experience an outage, because it's definitely one that gets a lot of use," said Jason Read, founder of CloudHarmony Inc., a Gartner-owned company in Laguna Beach, Calif., that monitors public cloud uptimes.
Amazon S3 object retrieval, listing and deletion had been fully recovered earlier, by 4:12 p.m. EST; adding new objects was the last operation to return to normal.
U.S. East-1 is the original and oldest AWS region, and it's believed to be one of the largest in terms of client usage. The disruption also highlights the reach of AWS, as large swaths of the internet were affected Tuesday afternoon.
Within two hours of the start of the disruption, AWS said it had identified the root cause of the problem -- though it didn't publicly identify it -- and was working to resolve the issue.
"It's primarily a slowdown, but there are some services that are in trouble right now," said Dave Bartoletti, principal analyst at Forrester Research.
That AWS disruption included query failures on some database services, as well as issues with Amazon Simple Email Service.
The most important thing is there have been no reports of data loss, but businesses large and small are certainly being affected, Bartoletti said.
User gripes about Amazon S3 accessibility were notable on social media before there was any way to detect a problem on the AWS website, which was temporarily unavailable. The AWS health dashboard was accessible, and status icons remained green, despite the continued disruption. That, apparently, was because the icons themselves were stored in the very region experiencing the problems. A temporary banner was placed at the top of the page to acknowledge the issue, and the dashboard wasn't fully functional until two hours after the problems started.
CloudHarmony measures uptime for AWS in multiple ways, and it was unable to reach any services during the disruption. Read described the AWS disruption as an outage, but Bartoletti disagreed.
"In my view, an outage is: 'I can't reach it, it's dead,'" Bartoletti said. "That's not the case here. It is incredibly slow, and some people can't reach it at all. Some are more affected than others."
The delineation between calling it a disruption or an AWS outage may seem like a semantics argument, but it matters to service-level agreements and to credits that customers receive when AWS misses its uptime requirements.
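To illustrate why the label matters, here is a minimal sketch of how a period counted as downtime could translate into a monthly uptime figure and a service credit. The 30-day month, the credit tiers and the function names are all hypothetical, chosen only for illustration; actual agreements define their own measurement method and tiers.

```python
from datetime import timedelta


def monthly_uptime_percent(downtime: timedelta, days_in_month: int) -> float:
    """Percent of the month a service counted as available."""
    total = timedelta(days=days_in_month)
    if downtime < timedelta(0) or downtime > total:
        raise ValueError("downtime must be between zero and the length of the month")
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1 - downtime / total)


def service_credit_percent(uptime_percent: float, tiers) -> float:
    """Return the largest credit whose uptime threshold was missed."""
    credit = 0.0
    for threshold, credit_percent in tiers:
        if uptime_percent < threshold:
            credit = max(credit, credit_percent)
    return credit


# Hypothetical tiers (assumption, not a provider's actual terms):
# (uptime threshold in percent, credit in percent of the monthly bill).
EXAMPLE_TIERS = [(99.9, 10.0), (99.0, 25.0)]

# Case 1: the whole event, about 4 hours 19 minutes, is counted as downtime.
counted_as_outage = monthly_uptime_percent(timedelta(hours=4, minutes=19), 30)

# Case 2: the event is treated as a slowdown and none of it counts as downtime.
counted_as_slowdown = monthly_uptime_percent(timedelta(0), 30)

print(round(counted_as_outage, 2), service_credit_percent(counted_as_outage, EXAMPLE_TIERS))
print(round(counted_as_slowdown, 2), service_credit_percent(counted_as_slowdown, EXAMPLE_TIERS))
```

Under these assumed tiers, the same event produces a credit when it is counted as downtime and none when it is classed as a slowdown, which is why the wording carries financial weight.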
The last major AWS disruption came in August 2015; CloudHarmony tracked it at just under 25 minutes. Since then, there have been no AWS outages in any S3 region that extended as long as this latest one, Read said.
Update: What really happened? On the AWS Cloud Cover blog, Trevor Jones digs into the causes -- and Amazon's explanation -- of the recent AWS outage heard around the world.
Trevor Jones is a news writer with SearchCloudComputing and SearchAWS. Contact him at firstname.lastname@example.org.