
Amazon S3 issues at the center of Monday morning snafu

AWS customers' ability to launch new instances in the US-East region was affected by an uptick in error rates on the Simple Storage Service, according to Amazon's status page.

Amazon S3 experienced increased error rates in the U.S.-East region in Northern Virginia Monday morning, which affected the creation of new instances.

Amazon Web Services' (AWS) Simple Storage Service (S3) began to show increased error rates around midnight Pacific Time (PDT), according to AWS' Service Health Dashboard.

"Between 12:08 AM and 3:40 AM PDT in the Amazon S3 US-STANDARD Region (i.e. U.S.-East), Amazon S3 experienced elevated error rates due to a configuration error in one of the systems that Amazon S3 uses to manage request traffic," according to information posted on the dashboard.

The issue was compounded by the fact that AWS initially pursued the wrong root cause. Once the true root cause was found, Amazon resolved the issue "relatively quickly" and restored normal operations, according to the dashboard.

However, during the event, the elevated S3 API error rate impacted services that depend on S3, such as Elastic MapReduce, which relies on S3 for object storage, and the Elastic Compute Cloud (EC2), which relies on S3 to store some Amazon Machine Images. The Relational Database Service, ElastiCache, Elastic Load Balancing, CloudSearch and CloudTrail in U.S.-East (Northern Virginia) also showed increased latencies during this time.

Some customers had difficulty launching new EC2 instances in US-East. Others reported slowdowns with Amazon S3.


"Our customers have only been slightly affected [by] some slower response times between 3 a.m. and 5 a.m. Eastern, but since then the performance has been fine for us," said Joe Emison, CTO and founder of Asheville, N.C.-based BuildFax Inc.

The events were a reminder that it's important to design cloud applications that take infrastructure failures into account, Emison said.

"Even though AWS tells everyone to design for failure, it seems clear that far too many people are designing for AWS to be up all the time without any increased latency or increased error rates," Emison said. "Our code auto-retries connections and commands to S3, and so that’s why the main outcome we’ve seen is a slight slowdown as opposed to complete failure."

Analysts said AWS' response to the problem has been much improved since the early days of cloud outages.

"This is nearly a gold standard for transparency and communication for cloud providers," said Carl Brooks, analyst with 451 Research based in New York. "About the only way they could improve would be to give out more details about their data center operations, but this is the way to handle events like these."

Amazon did not respond to requests for comment.

Beth Pariseau is senior news writer for SearchAWS. Write to her at [email protected] or follow @PariseauTT on Twitter.

Next Steps

AWS outage worries enterprise customers

AWS outages: What's behind; what to do about failures?

The AWS role in disaster recovery
