
AWS outages: What's behind them, and what to do about failures?

Learn what's behind AWS outages and how to plan for failures before they happen.

Should my organization be concerned about whole zone failures in Amazon Web Services (AWS)? How should organizations make good decisions about cloud service availability?

If "almost-always available" is good enough for your service, then there's no need to be concerned with whole zone failure. But when is "almost always available" good enough?

Let's start with the basics. AWS services are available on a geographic basis within regions. Regions are the largest service availability grouping. Within regions are availability zones, and there are multiple zones per region. Availability zones are isolated from one another but, according to AWS documentation, are connected by low-latency links.
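To see that grouping for yourself, a minimal sketch along these lines -- assuming the boto3 Python SDK and configured AWS credentials -- lists each region and the availability zones it contains:

    # List every region and the availability zones inside it.
    # Assumes boto3 is installed and AWS credentials are configured.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]

    for region in regions:
        regional_ec2 = boto3.client("ec2", region_name=region)
        zones = regional_ec2.describe_availability_zones()["AvailabilityZones"]
        zone_names = [z["ZoneName"] for z in zones if z["State"] == "available"]
        print(f"{region}: {', '.join(zone_names)}")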

This regional approach can be a good thing if customers have requirements that their data not leave a national boundary. Then, the application or service provider can provide a compliant offering.

Unfortunately, not all regions are equally reliable. Looking at some AWS outages, we can see that zone issues are often a root cause. Here are some examples:

  • April 2012: A human-error networking issue during maintenance caused Amazon Elastic Block Store (EBS) volumes to go down. Network availability was interrupted on a massive scale in US EAST-1. When availability returned, EBS servers frantically looked for locations to mirror data -- the scale was so large that most of the EBS servers got "stuck," which is AWS' term for locked-up or non-responsive. The network change should have been a non-event, yet it led to massive downtime.
  • June 2012: Due to a storm on the East Coast, most US EAST-1 data centers experienced power fluctuations. All but one recovered gracefully, but in the single data center -- one of two serving the same availability zone in US EAST-1 -- the generators didn't come up to full power and the servers ran on the uninterruptible power supply (UPS). When the UPS failed, the servers went down. This cascaded to bring down EBS and the Amazon Elastic Load Balancer (ELB) across the entire US EAST-1 region.
  • October 2012: An EBS data collection server in US EAST-1 failed and got "stuck." The server was replaced rather than rebooted, and DNS was updated to point to the new server, but the DNS update's replication process failed and numerous systems continued to look for the old, "stuck" server. This failure to find the server triggered a memory leak bug, which consumed memory on the affected servers until they were no longer able to process valid requests. This, in turn, affected the overall availability of EBS in the zone.

AWS outages in 2013, such as the August 25 failure, were similar to those in 2012 in one respect: the unreliability of US EAST-1.

A reason for concern is that the impact of failures in a single zone may not stay isolated to that zone and can spread to multiple zones within the region. Also, failures always seem to involve EBS, putting every component and service built on top of it at risk.

CEOs and enterprise architects evaluating AWS have to decide if "almost always" is a sufficient availability strategy. Sometimes it can be. As a security researcher, I occasionally need a powerful machine to handle research tasks -- scanning, complex calculations, or brute force reconnaissance -- against servers owned by customers who have requested these services. Using AWS servers has always worked well for me, because if a server is unavailable for a few hours, it's not a major problem for those customers. On the other hand, if I'm offering a cloud-based security monitoring and detection service that must be available 24/7 at 99.99%, I cannot rely on AWS for that offering.
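To put numbers on "almost always," translate an availability target into a downtime budget. The quick sketch below (plain Python; the targets are simply the figures discussed here, not AWS commitments) shows how little downtime 99.99% actually allows:

    # Translate an availability target into the downtime it permits per year.
    MINUTES_PER_YEAR = 365 * 24 * 60

    def downtime_budget_minutes(availability_pct):
        """Minutes of downtime per year allowed by an availability target."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    for target in (99.0, 99.9, 99.99):
        print(f"{target}% availability allows ~{downtime_budget_minutes(target):.0f} "
              "minutes of downtime per year")

A 99.99% target leaves roughly 53 minutes of downtime per year -- less than a single one of the incidents described above.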

What strategy can AWS customers take to minimize downtime? The first step seems obvious: Only use US EAST -- which, at press time, has the lowest service cost -- for items that only need to be available "almost always." For critical services, use another region.

Many of my customers have implemented development environments in US EAST while running production operations in one of the US WEST regions. Second, design, implement and test not just inter-zone failover, but also cross-region failover. The costs associated with cross-region failover can be higher than inter-zone failover, due to the networking costs of data replication.
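One building block of cross-region failover is keeping copies of critical data in the standby region. The following sketch, assuming boto3 and a placeholder EBS snapshot ID, copies a snapshot from US EAST-1 to US WEST-2 and waits for the copy to finish:

    # Copy an EBS snapshot from the primary region to a standby region.
    # The snapshot ID is a placeholder -- substitute one of your own.
    import boto3

    SOURCE_REGION = "us-east-1"
    TARGET_REGION = "us-west-2"
    SOURCE_SNAPSHOT_ID = "snap-0123456789abcdef0"  # placeholder

    # copy_snapshot is called against the destination region's client.
    target_ec2 = boto3.client("ec2", region_name=TARGET_REGION)
    response = target_ec2.copy_snapshot(
        SourceRegion=SOURCE_REGION,
        SourceSnapshotId=SOURCE_SNAPSHOT_ID,
        Description="Cross-region DR copy",
    )
    print("Started copy, new snapshot:", response["SnapshotId"])

    # Don't rely on the copy for failover until it has completed.
    waiter = target_ec2.get_waiter("snapshot_completed")
    waiter.wait(SnapshotIds=[response["SnapshotId"]])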

The big decision criterion must be the price the provider and the customer put on high availability. If it's at the heart of either business, be prepared to pay a larger price for greater availability.

Moving to the cloud is a definite paradigm shift. Operations personnel need to stop thinking in terms of cross-server redundancy and the availability of individual servers, and start thinking in terms of services. For AWS clients, that means thinking in terms of cloud service availability -- such as EBS, ELB and Relational Database Service (RDS) instances.
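In practice, that shift means monitoring the services themselves rather than individual hosts. A minimal sketch, again assuming boto3 and using placeholder identifiers, polls the status of an EBS volume and an RDS instance:

    # Poll the status of AWS services instead of individual servers.
    # The volume and database identifiers are placeholders.
    import boto3

    REGION = "us-east-1"
    VOLUME_ID = "vol-0123456789abcdef0"   # placeholder EBS volume
    DB_INSTANCE_ID = "my-rds-instance"    # placeholder RDS instance

    ec2 = boto3.client("ec2", region_name=REGION)
    rds = boto3.client("rds", region_name=REGION)

    # EBS: per-volume status checks (ok / impaired / insufficient-data).
    for status in ec2.describe_volume_status(VolumeIds=[VOLUME_ID])["VolumeStatuses"]:
        print(status["VolumeId"], status["VolumeStatus"]["Status"])

    # RDS: instance-level status (available, backing-up, failed, ...).
    for db in rds.describe_db_instances(DBInstanceIdentifier=DB_INSTANCE_ID)["DBInstances"]:
        print(db["DBInstanceIdentifier"], db["DBInstanceStatus"])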

When planning a high-availability service, don't just look at the services themselves; look at the dependencies underlying the desired service. For instance, RDS runs on top of EBS and therefore has a dependency on it. Matrixing service requirements, dependencies, availability zones and regions can be a complex exercise at first, but one that pays off in spades down the road.
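The matrix itself can start as something very simple -- for example, a small data structure that maps each offering to the AWS services it depends on and the regions and zones where it must survive. The names below are invented purely for illustration:

    # A toy dependency matrix: each offering lists the AWS services it relies
    # on and the regions/zones where it must remain available.
    SERVICE_MATRIX = {
        "monitoring-frontend": {
            "depends_on": ["ELB", "EC2", "EBS"],
            "regions": {"us-east-1": ["a", "b"], "us-west-2": ["a"]},
        },
        "detection-engine": {
            "depends_on": ["RDS", "EBS"],   # RDS itself sits on EBS
            "regions": {"us-west-2": ["a", "b"]},
        },
    }

    def services_exposed_to(aws_service, region):
        """Which offerings are at risk if aws_service degrades in region?"""
        return [
            name for name, spec in SERVICE_MATRIX.items()
            if aws_service in spec["depends_on"] and region in spec["regions"]
        ]

    print(services_exposed_to("EBS", "us-east-1"))  # ['monitoring-frontend']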

This was first published in April 2014
