
What's the best disaster recovery strategy for AWS outages?

With recent news of AWS outages, we want to ensure our enterprise is ready for anything. How should we design a disaster recovery plan to minimize outages?

AWS outages happen. And while the vast majority of outages are relatively short -- just a matter of hours over the course of a year -- each occurrence corresponds to a tangible loss of productivity and revenue for organizations that depend on the public cloud.

Providers such as Amazon Web Services (AWS) are quick to downplay outages, often describing them as "incidents" instead. But no matter how you label service outages, they are often unavoidable -- even with workloads in mature cloud providers. AWS customers must understand the disaster recovery (DR) and business continuity (BC) options available for applications running in the public cloud.

One practical approach to DR and BC with cloud services is to implement a multisite strategy that runs critical workloads in an active-active configuration between the enterprise and cloud provider. This means a critical workload is configured to run simultaneously in both the local data center and the public cloud, with the cloud environment set up to duplicate the local production environment.

For example, consider a critical enterprise workload that requires both an application server and database server. A service such as Amazon Route 53 DNS can channel traffic to both local and cloud sites, and the enterprise can determine how much of that traffic should go to a certain location, allowing AWS to handle more or less of the total load. The traffic directed to each site is processed through a load balancer and proxy server, and then passed to an application server, which also interacts with a database server. During normal production, both sites can usually share the database at one site, while the duplicate database at the other site is kept synchronized through replication -- a master-slave database relationship.
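As a rough illustration of the traffic split described above, the following Python sketch builds the kind of weighted record set that Route 53 accepts, one record per site. The host name, site labels, IP addresses and 70/30 weights are all hypothetical, and the sketch only constructs the request body; it does not call AWS.

```python
def weighted_record_changes(name, sites, ttl=60):
    """Build a Route 53 ChangeBatch with one weighted A record per site.

    `sites` maps a site label to (ip_address, weight). Route 53 weights are
    relative integers from 0 to 255, not percentages.
    """
    changes = []
    for label, (ip, weight) in sites.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {label} must be 0-255, got {weight}")
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "A",
                "SetIdentifier": label,  # distinguishes records sharing a name
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": ip}],
            },
        })
    return {"Comment": "active-active traffic split", "Changes": changes}


def traffic_share(sites):
    """Return the approximate fraction of DNS answers each site receives."""
    total = sum(weight for _, weight in sites.values())
    if total == 0:
        # When every weight is 0, Route 53 routes to all records equally.
        return {label: 1 / len(sites) for label in sites}
    return {label: weight / total for label, (_, weight) in sites.items()}


# Hypothetical sites; the addresses come from documentation-only IP ranges.
SITES = {
    "on-prem": ("203.0.113.10", 70),
    "aws": ("198.51.100.20", 30),
}
change_batch = weighted_record_changes("app.example.com.", SITES)
shares = traffic_share(SITES)
```

With boto3, a body like `change_batch` would typically be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the hosted zone ID. Because DNS answers are cached, the real split only approximates the configured weights.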

In enterprises that operate some workloads in AWS and some on-premises, data is synchronized and traffic is shared between the local data center and AWS. When a disruption occurs -- at either the local or cloud site -- all user traffic will fail over to the remaining site. When the outage is resolved, data is resynchronized and traffic fails back -- allowing both sites to share the user load again.
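The failover and failback behavior can be sketched as a small decision function: given the normal routing weights and the current health of each site, it returns the weights to apply. This is a minimal sketch under assumed names (`failover_weights`, the site labels and the 70/30 split are hypothetical). It does not perform health checks or data resynchronization, which would have to finish before traffic fails back.

```python
def failover_weights(normal_weights, healthy):
    """Return routing weights given which sites are currently healthy.

    normal_weights: site label -> weight used when every site is up.
    healthy: site label -> True/False; a missing site counts as down.
    """
    up = [site for site in normal_weights if healthy.get(site, False)]
    if not up:
        raise RuntimeError("no healthy site available to receive traffic")

    weights = {}
    for site, weight in normal_weights.items():
        if site in up:
            # A surviving site keeps its weight; bump a zero weight to 1 so
            # that at least one record still carries a nonzero weight.
            weights[site] = weight or 1
        else:
            weights[site] = 0  # drain the failed site
    return weights


NORMAL = {"on-prem": 70, "aws": 30}  # hypothetical normal split

during_aws_outage = failover_weights(NORMAL, {"on-prem": True, "aws": False})
after_recovery = failover_weights(NORMAL, {"on-prem": True, "aws": True})
```

In this sketch, `during_aws_outage` sends everything to the local data center, and `after_recovery` restores the original split, which corresponds to failback.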

It's important for organizations to consider the costs of such an active-active configuration. AWS costs are usually lower during normal operations than during a failover because the AWS deployment is only handling a portion of the total traffic load, but the actual traffic level and corresponding costs can be adjusted over a wide range, depending on enterprise needs and preferences. There's no rule that says you need to split the traffic 50/50. AWS can handle most of the production traffic or only a small part of the production load, which affects the number of compute instances employed and the choice of database replication methods.
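To make the relationship between traffic split and instance count concrete, here is a back-of-the-envelope sketch. The peak load of 2,000 requests per second and the capacity of 250 requests per second per instance are invented figures for illustration only; real sizing would come from load testing.

```python
def instances_needed(peak_rps, aws_percent, rps_per_instance):
    """Instances AWS needs to serve its share of peak traffic, rounded up.

    Integer arithmetic is used so the rounding is exact.
    """
    if not 0 <= aws_percent <= 100:
        raise ValueError("aws_percent must be between 0 and 100")
    if peak_rps < 0 or rps_per_instance <= 0:
        raise ValueError("peak_rps must be >= 0 and rps_per_instance > 0")
    numerator = peak_rps * aws_percent
    denominator = 100 * rps_per_instance
    return -(-numerator // denominator)  # ceiling division


# Hypothetical workload figures, not measurements.
PEAK_RPS = 2000
RPS_PER_INSTANCE = 250

sizing = {
    percent: instances_needed(PEAK_RPS, percent, RPS_PER_INSTANCE)
    for percent in (10, 30, 50, 100)
}
for percent, count in sizing.items():
    print(f"AWS share {percent:>3}% -> {count} instance(s)")
```

The 100% row is the failover case, when AWS carries the full load. The gap between that figure and the normal-operations figure is roughly what the enterprise avoids paying for most of the year.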

Organizations can invoke AWS Auto Scaling to ramp up compute resources when AWS needs to meet the full traffic load, and then scale back when the companion site is restored. However, the local data center must also have the scalability to handle the full traffic load in response to AWS outages -- or any other public cloud provider outage.
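One way to express this in configuration terms is to set the Auto Scaling group's minimum size to the normal-operations instance count and its maximum size to the full-load count, and to verify separately that the local site can absorb peak traffic on its own. The sketch below only builds the parameters and runs the check; the group name and all capacity figures are hypothetical.

```python
def scaling_bounds(group_name, normal_instances, full_load_instances):
    """Build Auto Scaling group size settings for an active-active site.

    MinSize covers the normal AWS share of traffic; MaxSize leaves room to
    absorb the full load if the local data center fails.
    """
    if normal_instances < 1:
        raise ValueError("normal_instances must be at least 1")
    if full_load_instances < normal_instances:
        raise ValueError("full_load_instances must be >= normal_instances")
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": normal_instances,
        "MaxSize": full_load_instances,
        "DesiredCapacity": normal_instances,
    }


def local_site_can_take_over(local_max_rps, peak_rps):
    """True if the local data center alone can serve peak traffic."""
    return local_max_rps >= peak_rps


# Hypothetical figures: 3 instances normally, 8 for the full load.
bounds = scaling_bounds("web-asg", normal_instances=3, full_load_instances=8)
local_ok = local_site_can_take_over(local_max_rps=1500, peak_rps=2000)
```

With boto3, a dictionary like `bounds` could be unpacked into the Auto Scaling client's `update_auto_scaling_group` call. In this example `local_ok` is false, which flags that the local data center would be overwhelmed during an AWS outage and needs more headroom.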
