Despite the benefits of a move to the cloud, providers like AWS can't guarantee trouble-free IT management. Errors and downtime can cause cloud disruptions -- even when enterprises thoroughly plan a cloud project and take a realistic look at capabilities and terms of service. Enterprises that prepare for eventual AWS downtime can mitigate its effects; those that don't are in for a rude awakening.
Each organization will have different thresholds for acceptable AWS downtime, depending on its needs and the business case, said Kiyoto Tamura, vice president of marketing at Treasure Data Inc., a software company based near San Francisco. But when a critical service like Amazon Simple Storage Service (S3) goes down, everyone feels it. "When [S3] went down earlier this year, literally one half of the internet had trouble," Tamura said. "It's a testament to S3's popularity, as well as the danger of putting all your data in one system -- no matter how durable and available it's designed to be."
What counts as an acceptable level of downtime also depends on the workload or business function the infrastructure supports. Some business functions, such as online retail, banking and emergency services, require 100% availability; other business functions -- HR and some business intelligence apps -- can operate in an environment that provides less than 100% availability. It's up to each organization to determine workload availability requirements and implement the appropriate architecture -- a task in and of itself.
"An architect must have a deep understanding of the AWS environment to properly architect a highly available infrastructure," said Paul Duvall, CTO with consultants Stelligent Systems Inc.
Several AWS features and capabilities rely on other AWS features. Cloud architects need to understand these interdependencies to avoid poor results, as AWS customers experienced with the recent Amazon S3 disruption. An issue with the S3 environment brought down other AWS utilities, including the CloudFormation environment that S3 supports.
"Understanding the CloudFormation dependency on S3, an architect could have designed some redundancy," Duvall said. "What IT professionals need to remember is that it doesn't matter what platform the application is running on. It is the responsibility of the IT department to design and implement an architecture that ensures the viability and resilience of an organization's mission-critical applications and data."
Set expectations for AWS downtime
Most AWS customers expect somewhere between 99.9% and 99.999% uptime, which translates to an allowable outage time of roughly 8.76 hours down to about five minutes per year. Most customers are unhappy when the annual downtime is concentrated into a single event, said Simon Jones, evangelist and director of marketing communications at Cedexis. So, when S3 went down in late February, the length of the outage caused angst among AWS customers.
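The arithmetic behind those uptime figures is easy to check. The sketch below assumes a 365-day year and converts an availability percentage into a yearly downtime budget; the function name `allowed_downtime_hours` is illustrative and is not part of any AWS tool. Under that assumption, 99.9% works out to 8.76 hours per year, 99.99% to about 53 minutes and 99.999% to a little over five minutes.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Print the budget for three common uptime targets.
for target in (99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

The same budget can be spent as many brief blips or as one long outage, which is why a single multi-hour event can feel worse than the percentage suggests.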
"For many AWS customers, the reality is that most will tolerate some downtime almost fatalistically," said Avi Freedman, co-founder and CEO of Kentik Inc., a network traffic intelligence company based in San Francisco. "Zero downtime is achievable, but only with investment in architecture and layered or multivendor solutions."
Take action to boost uptime
Users can take several steps to reduce or minimize the risk of AWS downtime, including:
- Avoid putting workloads into the AWS U.S. East region, if at all possible. "It's the biggest, oldest and most unreliable zone," Freedman said.
- Implement a multiregional architecture, as network and configuration failures regularly cascade into regional outages for the three major cloud providers -- AWS, Azure and Google.
- Don't rely on AWS to deliver 100% of packets reliably over the internet.
- Regularly monitor the performance and availability of content delivery networks, as well as hybrid and multicloud environments.
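As a rough illustration of how the multiregional and monitoring steps fit together, the sketch below walks an ordered list of regions and returns the first one that a health probe reports as available. It is a minimal sketch under stated assumptions, not a production failover mechanism: `pick_healthy_region` and the probe passed to it are hypothetical names, and a real deployment would typically rely on DNS-level or load-balancer health checks rather than application code alone.

```python
from typing import Callable, Iterable, Optional


def pick_healthy_region(
    regions: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region whose probe succeeds, or None if all fail.

    `is_healthy` is a caller-supplied (hypothetical) probe, for example a
    function that requests a status endpoint deployed in that region.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A probe that raises is treated the same as an unhealthy region.
            continue
    return None


# Example with a stubbed probe: the preferred region is down, so the
# second choice is selected.
status = {"us-west-2": False, "eu-west-1": True, "us-east-1": True}
print(pick_healthy_region(["us-west-2", "eu-west-1", "us-east-1"], status.get))
```

Listing regions in order of preference keeps the policy explicit, and passing the probe in as a parameter makes the selection logic testable without network access.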
"We see more people working with service providers to avoid downtime," said Brett Moss, SVP and General Manager of Hyperscale Cloud at Ensono, an infrastructure management company in Downers Grove, Ill. And with hyperscale clouds like AWS, enterprises can generally architect environments to avoid downtimes, he added. But there are additional costs for this.
The largest AWS customers likely have the skills and knowledge to fully back up workloads. But most customers move to the cloud to reduce costs; the full implications of that aren't always thought through.
Acceptable cloud downtime and error rates also depend on the nature of the application. For example, a real-time messaging app such as Slack has very little tolerance for downtime, but a batch-image processing app can be down for some time before it negatively affects customers, said Greg Arnette, founder and CTO of cloud-based email archiving and analytics platform Sonian.
"Successful AWS architectures plan for downtime of specific cloud services and can harness the failover and redundancy that AWS offers to provide continuity," Arnette said. AWS customers with resilient architectures plan for some amount of error and build retries into their code. But, generally, "if an error from an API call persists for over five minutes, that would be unacceptable for many customers," he added.
"All AWS customers probably care about downtime, but probably only a third really understand how to avoid it," Moss said. Additionally, some companies learn that certain functions can tolerate lower performance and downtime, saving money on disaster recovery efforts.