How much are users to blame for their AWS outage woes?

Adam Riglian

The storm that rolled through the mid-Atlantic region at the end of June left people without power, telephones and, perhaps most importantly, without Netflix and Pinterest.

Power outages combined with a failure in the generators at one of Amazon's East Coast datacenters led to an Amazon Web Services outage that affected several high-profile customers and many more run-of-the-mill users. Most were back online within six hours, but the hailstorm directed at Amazon continues.

"To some extent this was a natural disaster that no one could anticipate," said Jeff Kaplan, managing director of Wellesley, Mass.-based consultancy THINKstrategies Inc. "Every service provider, whether it's traditional telcos like Verizon, power companies or cloud providers like Amazon, were going to be threatened and vulnerable."

Kaplan added that in an age where cloud services are becoming increasingly important, "you would hope they would have greater safeguards in place."

In this instance, Amazon's safeguards for power outages failed. One of two datacenters in an Availability Zone didn't successfully switch to generator power, causing the problem. While Amazon has been transparent and apologetic since the outage, others are questioning how much blame falls on users.

Tel Aviv-based cloud usage analytics outfit Newvem monitors AWS usage and informs users of any problems. In closed beta since April with 500 participants, Newvem has gathered data showing nearly half of enterprises aren't using AWS correctly.

"We found that there's a pretty big gap between the best practices Amazon will recommend to [its] users and users finding them and implementing them," said Cameron Peron, vice president of marketing and business development at Newvem.

Elastic Load Balancing (ELB) is the AWS feature that automatically distributes incoming traffic across multiple instances, with the goal of making it harder for the application to fail. Newvem reports that 20% of users haven't correctly configured ELBs, including 27% of first-time Amazon users. Additionally, 40% of users did not back up their data with Elastic Block Store (EBS), meaning their data, applications and infrastructure were exposed to complete loss in the event of an outage.
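To make the two gaps concrete, here is a minimal sketch of the kind of check a monitoring tool could run. It is not Newvem's method and it calls no AWS API. It assumes that a "correctly configured" load balancer means instances spread across at least two Availability Zones, a definition the article does not spell out. The `Instance` record, the `audit` function and the one-day snapshot threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical record describing one instance behind a load balancer."""
    instance_id: str
    zone: str                          # Availability Zone label, e.g. "us-east-1a"
    last_snapshot: Optional[datetime]  # None if the volume was never backed up


def audit(instances: List[Instance], now: datetime,
          max_snapshot_age: timedelta = timedelta(days=1)) -> List[str]:
    """Return plain-text warnings for a deployment (assumed rules, see lead-in)."""
    warnings = []

    # Assumption: traffic should be spread over at least two zones so that
    # losing one zone does not take the whole application down.
    zones = {inst.zone for inst in instances}
    if len(zones) < 2:
        warnings.append("load balancer covers fewer than two Availability Zones")

    # Assumption: every volume should have a snapshot newer than the threshold.
    for inst in instances:
        if inst.last_snapshot is None:
            warnings.append(f"{inst.instance_id}: no snapshot on record")
        elif now - inst.last_snapshot > max_snapshot_age:
            warnings.append(f"{inst.instance_id}: snapshot is stale")

    return warnings


if __name__ == "__main__":
    now = datetime(2020, 1, 1, 12, 0)  # arbitrary reference time
    fleet = [
        Instance("web-1", "us-east-1a", now - timedelta(hours=3)),
        Instance("web-2", "us-east-1a", None),
    ]
    for warning in audit(fleet, now):
        print(warning)
```

In this made-up fleet both instances sit in the same zone and one has never been backed up, so the audit flags both of the problems described above.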

The statistics muddle the narrative around the safety of cloud computing, suggesting that misuse of the technology is a bigger problem than the technology itself.

"A lot of people, they just don't know about the best practices, that's the first thing," Peron said. "The second thing is they don't know where in the cloud they need to implement it. Amazon will say that [the information is] on their site and it's available, but when you have a DevOps guy or an IT manager who's doing a million other things, he can't sift through all the information from Amazon."

Kaplan believes users aren't properly configuring their Amazon setups either because they do not understand the reasons for doing so or because they believe they can get away with the risk and choose to save money rather than pay for services that would protect them. To fix the problems of the first group, Kaplan thinks both sides need to take responsibility.

"Amazon is not required to provide that redundancy, but it never fully explained to its users the benefits of acquiring a level of service in which that redundancy is built in," he said.

When Newvem reports problems to its users, many have no idea the problem existed or that there was a way to fix it, often with free services. Peron said his company will often tell clients about glaring problems, like having data ports exposed to the open Internet, only to find the client doesn't fix the problem.

"No matter how good the technology gets, it still depends on humans to properly deploy it," Kaplan said.

Putting the latest outage into perspective

Amazon has had outages in the past, most notably in April 2011, and, given that failures like the one that caused the most recent outage are unpredictable, it's possible that another could happen in the future.

"With every event of this nature, you would hope that the overall service provider community is going to get smarter and better," Kaplan said. "Failure is tolerable as long as it's not regular."

He notes that while Amazon's outages have created a very public stir about the risk factors involved in cloud computing, people are still voting for the technology with their IT dollars.

"These incidents aren't creating a backlash, it may be slowing some of the movement to [cloud services], but it's not causing people to abandon them in droves," he said.

Part of the reason is that many see the advantages of the cloud as far outweighing the risks. Many do not believe they could re-create a better data center environment in-house, either because of expense or lack of know-how.

"It's always been the assumption that AWS or any cloud vendor is far better in its uptime or availability than traditional data centers are in uptime and availability," he said.

Kaplan added that even users who are leaving Amazon over the outages aren't necessarily quitting the cloud altogether, but merely moving to Rackspace or another competitor.

