AWS has proven to be a reliable and cost-effective public cloud resource for large and small businesses. But it isn't perfect.
According to data from CloudHarmony, a cloud performance analyst now owned by Gartner, Amazon was generally one of the most reliable cloud providers in 2015, but there were occasions when key services were offline for as much as an hour. Other, longer outages have also been recorded.
So, even though the likelihood of suffering a significant outage is small, any IT pro knows that just a few minutes of downtime can seem like an eternity when end users -- accustomed to instant gratification -- cannot get what they want. There are a few disaster recovery (DR) measures IT teams can take to survive a cloud outage.
Last September, AWS suffered a disruption in its Northern Virginia data center when one of its major platforms failed. That issue lasted for approximately five hours. For most companies that deploy their products to AWS, those would have been five very long and financially painful hours. Yet, one of the businesses directly affected, All Star Slots, came back online because it planned for failure by using AWS DR tools.
AWS DR is designed to handle failures, and it offers tools and functions that allow developers and infrastructure engineers to build systems that can withstand failures of entire AWS regions, or even continents, explained Chris Kay, CTO of All Star Slots, an online casino. On top of that, there are tools that allow for automated failure testing in both development and production environments to simulate failures at large and small scale. Kay recommended using those tools to speed the deployment of new servers, "automatically, if needed."
"Outages can occur in any environment, and AWS is no exception," said Aater Suleman, CEO of Flux7, an IT consultancy in Austin, Texas. What makes AWS different is that this chance of failure is not hidden; it is acknowledged and designed for.
Suleman agreed that AWS DR tools help manage outages. "It is up to the user to take advantage of these tools," he said. For example, to protect instances, AWS provides Auto Scaling groups and a new instance restore feature, which allows developers to restore a single Elastic Compute Cloud instance, with original disk volume and network settings.
Likewise, to handle failure of an entire availability zone (AZ), which is roughly analogous to a data center failure, AWS Auto Scaling groups can perform automatic failover with no human intervention, Suleman said.
Oscar Moncada, the director of technology operations at Events.com, said that prior to an outage, it is wise to "build a highly available infrastructure utilizing multiple AZs and regions." If businesses do this correctly, he said, they won't have to worry about the "during" and "after" phases of an outage.
Use multiple zones to avoid danger
During an outage, IT teams should also plan to start up servers in a second AZ or region to compensate, and then redirect traffic using Amazon Route 53, a highly available and scalable cloud domain name system Web service, Moncada said. Then, in the post-outage phase, slowly redirect traffic back to the original region or AZ.
"Ideally, you want to have infrastructure in multiple AZs and regions before an outage happens," Moncada said. This provides high availability and redundancy during an outage, regardless of its length. If budget is a concern, host the main infrastructure in one region using multiple AZs, and then keep a smaller version of the infrastructure in a different region, either on standby or stopped, Moncada added.
"In the case of an outage in your main region, you can turn on the standby infrastructure and temporarily redirect traffic to it using Route 53," Moncada said. "This can all be scripted to happen automatically, allowing you to sleep better at night."
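As a rough illustration of the DNS half of that script, the Python sketch below builds the change batch for a pair of Route 53 failover records: a primary record tied to a health check and a secondary record pointing at the standby region. The domain name, IP addresses and health check ID are placeholders, not values from any of the companies quoted here, and the sketch does not cover starting the stopped standby servers.

```python
from typing import Optional


def failover_record(name: str, ip: str, role: str, set_id: str,
                    health_check_id: Optional[str] = None,
                    ttl: int = 60) -> dict:
    """Build one Route 53 failover record set; role is PRIMARY or SECONDARY."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,   # distinguishes records sharing a name
        "Failover": role,
        "TTL": ttl,                # a short TTL lets clients follow a failover quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(name: str, primary_ip: str, standby_ip: str,
                          primary_health_check_id: str) -> dict:
    """Build a change batch that upserts the primary and standby records."""
    return {
        "Comment": "Primary region with standby failover",
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(
                 name, primary_ip, "PRIMARY", "primary-region",
                 primary_health_check_id)},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record(
                 name, standby_ip, "SECONDARY", "standby-region")},
        ],
    }


# Placeholder values for illustration only.
batch = failover_change_batch(
    "app.example.com.", "192.0.2.10", "198.51.100.10",
    "hypothetical-health-check-id")
```

With boto3 installed and credentials configured, a batch like this could be submitted with `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, where the hosted zone ID is the team's own.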
That was more or less the approach at All Star Slots. Kay built for high availability not only at the AZ level within AWS, but also at the regional level -- replicating content, data and applications across numerous regions and continents. The availability of these resources is actively monitored -- not just for failure, but also for degraded service or performance. When failures arise, traffic is routed intelligently to an AWS region that is working.
Once the issues are resolved, traffic is rerouted automatically to the original location. Data is automatically synchronized between continents, and "life continues as normal," Kay said.
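The routing decision described above can be sketched in a few lines of Python. This is not All Star Slots' code; the region names, latency threshold and `RegionStatus` type are assumptions made for illustration. A region counts as usable only if it is both healthy and not degraded, the preferred region wins whenever it is usable, and traffic returns to it once it recovers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RegionStatus:
    name: str
    healthy: bool        # result of a basic up/down health check
    latency_ms: float    # recent observed latency, used to spot degradation


def pick_region(preferred: str, statuses: List[RegionStatus],
                max_latency_ms: float = 500.0) -> Optional[str]:
    """Return the region that should receive traffic, or None if none is usable."""
    # Treat a slow region like a failed one: monitor for degradation, not just outage.
    usable = [s for s in statuses
              if s.healthy and s.latency_ms <= max_latency_ms]
    if not usable:
        return None
    # Stay on, or return to, the preferred region whenever it is usable.
    if any(s.name == preferred for s in usable):
        return preferred
    # Otherwise send traffic to the fastest region that is still working.
    return min(usable, key=lambda s: s.latency_ms).name


# Hypothetical snapshot: the preferred region is down, two others are up.
during_outage = [
    RegionStatus("us-east-1", healthy=False, latency_ms=0.0),
    RegionStatus("us-west-2", healthy=True, latency_ms=120.0),
    RegionStatus("eu-west-1", healthy=True, latency_ms=180.0),
]
target = pick_region("us-east-1", during_outage)
```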
All Star Slots automates the testing of such systems by "programmatically destroying our infrastructure -- on purpose -- and watching as our infrastructure self-manages the change in resources available to it and self-heals where the damage occurred," Kay added. Everything is managed and controlled via code automatically, "without a human lifting a finger."
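To make the idea concrete, here is a toy, in-memory Python simulation of that kind of test. It is a sketch under simplifying assumptions, not Kay's tooling and not an AWS API: the `Fleet` class and zone names are hypothetical. The test destroys instances on purpose, then checks that a reconcile step restores the desired capacity in the zones that remain available.

```python
import random
from collections import Counter
from typing import Dict, Iterable, List


class Fleet:
    """Toy model of a self-healing group of servers spread across zones."""

    def __init__(self, zones: List[str], desired: int):
        self.zones = zones
        self.desired = desired
        self.instances: Dict[int, str] = {}   # instance id -> zone
        self._next_id = 0
        self.reconcile()

    def reconcile(self, unavailable: Iterable[str] = ()) -> None:
        """Launch replacements until capacity is back at the desired count."""
        down = set(unavailable)
        available = [z for z in self.zones if z not in down]
        if not available:
            return  # nowhere to launch; capacity stays short
        while len(self.instances) < self.desired:
            counts = Counter(self.instances.values())
            # Place each new instance in the least-loaded available zone.
            zone = min(available, key=lambda z: counts[z])
            self.instances[self._next_id] = zone
            self._next_id += 1

    def kill_random(self, count: int, rng: random.Random) -> List[int]:
        """Destroy random instances on purpose (small-scale failure)."""
        victims = rng.sample(sorted(self.instances),
                             min(count, len(self.instances)))
        for instance_id in victims:
            del self.instances[instance_id]
        return victims

    def kill_zone(self, zone: str) -> List[int]:
        """Destroy every instance in one zone (large-scale failure)."""
        victims = [i for i, z in self.instances.items() if z == zone]
        for instance_id in victims:
            del self.instances[instance_id]
        return victims


rng = random.Random(0)   # fixed seed keeps the test run repeatable
fleet = Fleet(["zone-a", "zone-b", "zone-c"], desired=6)

fleet.kill_random(3, rng)
fleet.reconcile()                          # heal after random instance loss

fleet.kill_zone("zone-a")
fleet.reconcile(unavailable=["zone-a"])    # heal around a lost zone
```

In a real environment, the kill steps would call the provider's APIs and the reconcile step would be handled by tooling such as Auto Scaling groups; the value of the test is the final check that capacity actually comes back.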
During the September AWS outage, the company was notified of the issues within the AWS infrastructure and platforms automatically via pager and information flowing into Splunk. The All Star Slots IT team then monitored network traffic as it was pulled from the failing region and passed over to other regions automatically.
Not a single staff member at the company had to take action. After AWS resolved the issues, the infrastructure self-healed, synchronized and was then automatically made available again.
"Obviously, having such a large and active infrastructure can be expensive," Kay said. "However, there are some very creative and fantastic ways of keeping your costs managed while ensuring performance remains high."