An Amazon Web Services outage Sunday morning impacted consumer services such as Netflix, but the service disruptions were mild at worst.
The issues began with DynamoDB, Amazon's NoSQL database service. An Amazon Web Services (AWS) Service Health Dashboard update at about 5 a.m. Pacific Time (PDT) on Sept. 20 said the root cause began with a portion of the metadata service within DynamoDB -- an internal sub-service that manages table and partition information.
The exact problem with the metadata service was not identified, and Amazon declined to offer further details about what went wrong. AWS users including Netflix experienced intermittent errors, making the AWS outage a high-profile event.
Recovery efforts then focused on restoring metadata operations, and APIs were throttled as the recovery work took place, resulting in service disruptions and performance slowdowns in the U.S.-East-1 region of AWS. No other regions were affected.
While there were service disruptions, most amounted to increased error rates and slow performance on API calls between services; EC2 instances themselves remained up and running, customers said.
"All of my EC2 instances and RDS instances stayed up, my websites, S3 buckets and CloudFront all continued to respond as well," said E.J. Brennan, a freelance developer based in Massachusetts who works with large enterprise clients.
The biggest effect Brennan saw was increased error rates on the Simple Queue Service (SQS).
"That prevented a handful of users from completing some non-critical tasks," Brennan said. "I will be looking into a redundancy option for that particular service down the road because that is not something I was prepared for, but no real harm was done."
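The intermittent errors and throttling described above are the classic failure mode that client-side retries with exponential backoff are designed to absorb. A minimal sketch in Python of that pattern (the helper name, defaults, and the flaky callable are illustrative, not part of any AWS SDK):

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5,
                      retryable=(Exception,)):
    """Retry `operation` with exponential backoff and full jitter.

    `operation` is any zero-argument callable (e.g. a lambda wrapping an
    SQS send). All names and defaults here are hypothetical examples.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            # Sleep a random interval between 0 and base_delay * 2**attempt
            # seconds so retrying clients don't all hammer the API at once.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))

# Example: a flaky call that fails twice with a throttling-style error,
# then succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

result = call_with_backoff(flaky, base_delay=0.01)
```

The jitter matters: without it, every client retries on the same schedule, which can prolong exactly the kind of cascading slowdown seen in this outage.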
Consultants working with dozens of clients also said all was relatively quiet on the AWS front Sunday morning.
"While most of our customers are in U.S.-East, not a single one was affected," said Glenn Grant, CEO of G2 Technology Group in Boston.
AWS outages used to be far more frequent and disruptive than they have been over the last year and a half; while Amazon suffered serious high-profile disruptions in its earlier years, it topped a CloudHarmony survey of cloud uptime last year, with just 2.41 hours of downtime.
In fact, the biggest disruption enterprises felt from this AWS outage came from critics who cited it as evidence that cloud computing is inherently riskier than on-premises deployments.
"We're going to have a bunch of flak today about AWS going down," said Jason McMunn, chief cloud architect at Ditech Mortgage Corp., based in Fort Washington, Pa. "There are some legacy people who don't understand the cloud, they're going to pick up on the headlines, and our point is that it's like seeing the freeway closed in California and people saying, 'You should not use freeways – freeways are scary, you should just use surface roads'."
Details of the AWS disruptions
Amazon first posted updates to its Service Health Dashboard at 3 a.m. PDT, indicating increased error rates with DynamoDB API requests in the U.S.-East Northern Virginia region. The first disruptions came at 2:13 a.m. PDT, with Cognito, DynamoDB, EC2, Kinesis, CodeCommit, CodeDeploy, Directory Service, Key Management Service and Elastic Load Balancer issues surfacing at that time.
For the next 30 minutes, the issues continued to cascade, affecting Lambda, Elastic MapReduce, CloudWatch, Amazon WorkSpaces and many other Amazon services. In fact, the list of services unaffected by the Sept. 20 disruptions is shorter than the list of services involved. Unaffected services included API Gateway, CloudFront, ElastiCache, Route 53, SimpleDB, CloudHSM, Data Pipeline, Direct Connect, IAM, S3 and Service Catalog.
All issues were resolved by 11 a.m. PDT.