

Best practices for AWS disaster recovery

Dan Sullivan reviews best practices for AWS disaster recovery to help users identify and plan for unexpected problems related to a disaster.

Disasters come in many forms, from fires and floods that damage data centers to power and hardware failures that disrupt operations. Amazon Web Services and other providers of infrastructure as a service allow virtually any business to implement disaster recovery without maintaining duplicate hardware in a secondary site.

Best practices for AWS disaster recovery (DR) in the cloud address compute, storage, networking, deployment, security and planning needs, and can help an organization avoid unanticipated problems in the event of an outage or disaster.

The first phase of DR planning is to identify your recovery point objective (RPO) and recovery time objective (RTO). A recovery point is the state of your systems that will be restored on your DR infrastructure. In practice, this is the last point at which data and machine images were saved to your DR systems.

The RTO is the maximum acceptable time between losing and restoring services. One of the factors affecting recovery time is the time required to prepare machine images for use on DR servers.

If DR depends on backup files, your recovery point may be the previous night, when the latest backup was uploaded to cloud storage. If you are maintaining hot standby databases and constantly replicating data between databases, your recovery point would be the last transaction committed to the standby database.
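To make the two objectives concrete, here is a minimal Python sketch that measures a data-loss window and a downtime window and compares each with a target. The timestamps, the 24-hour and one-hour targets, and the function names are all assumptions made for illustration, not values from any particular DR plan.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Return how much recent data is lost: the gap between the last
    saved recovery point and the moment of failure."""
    if failure_time < last_recovery_point:
        raise ValueError("failure_time must not precede the last recovery point")
    return failure_time - last_recovery_point


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    """True when the measured window is within the stated objective."""
    return actual <= objective


# Hypothetical scenario: nightly backup at 01:00, failure at 15:30,
# services restored on DR servers at 17:00 the same day.
last_backup = datetime(2024, 1, 10, 1, 0)
failure = datetime(2024, 1, 10, 15, 30)
restored = datetime(2024, 1, 10, 17, 0)

loss = data_loss_window(last_backup, failure)  # data written in this window is gone
downtime = restored - failure                  # time services were unavailable

print("Data-loss window:", loss, "within 24h target:", meets_objective(loss, timedelta(hours=24)))
print("Downtime:", downtime, "within 1h target:", meets_objective(downtime, timedelta(hours=1)))
```

In this scenario the nightly backup satisfies a 24-hour recovery point target, but the restore misses a one-hour recovery time target, which would point toward faster image preparation or a warm standby.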

Responsible imaging

You have several options for the machine images your DR servers will run. You could use machine images from the AWS catalog, but if the catalog changes, you may find that your preferred image is no longer available.

A second option is maintaining a copy of your preferred image in Amazon Simple Storage Service (S3). This gives you more control than if you were depending on a catalog image, but it also means you are responsible for patching and keeping the image up to date.

A third option is rebuilding images with the latest available components and libraries as part of your DR process. You may want to use Chef or Puppet scripts to automate the process. In this case, it's best to use the same scripts in your DR environment that you use in your production environment.

Another option for disaster recovery is to use AWS CloudFormation, a service for defining and launching collections of AWS resources from templates. It offers templates for commonly used configurations, such as a single-server Active Directory or a LAMP stack.
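As a rough illustration of what such a template contains, the Python sketch below builds a minimal CloudFormation template as a dictionary and serializes it to JSON. The logical names `DrImageId` and `DrAppServer`, the description text and the `t3.micro` instance type are assumptions chosen for the example; a real DR stack would typically define many more resources.

```python
import json


def build_dr_template(instance_type: str = "t3.micro") -> dict:
    """Build a minimal CloudFormation template describing one EC2 instance.

    The machine image is passed in as a parameter so the same template
    can launch whichever image the DR plan currently calls for.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative DR stack with a single application server",
        "Parameters": {
            "DrImageId": {  # hypothetical parameter name
                "Type": "AWS::EC2::Image::Id",
                "Description": "Machine image to launch in the DR environment",
            }
        },
        "Resources": {
            "DrAppServer": {  # hypothetical logical resource name
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "DrImageId"},
                    "InstanceType": instance_type,
                },
            }
        },
    }


template_json = json.dumps(build_dr_template(), indent=2)
print(template_json)
```

Keeping the template in version control alongside your deployment scripts makes it easier to rebuild the same stack in a DR region and to test it on a schedule.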

Regardless of which tools you use, test your scripts regularly to ensure that your application stack runs as expected with the latest versions of the software used to build your images.

Amazon Elastic Block Store (EBS) is helpful for storing persistent data that is local to a server. For example, if you maintain a configuration database on one of your servers, you can store its data files on an EBS volume and, during recovery, attach that volume to a new server instance. For data that needs to be accessible independently of your servers, use S3.
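The step of attaching a data volume to a replacement server might look something like the following Python sketch. It assumes a client object that exposes an `attach_volume` method taking `VolumeId`, `InstanceId` and `Device` keyword arguments, as the boto3 EC2 client does. The volume ID, instance ID and device name are placeholders, and a small recording stub stands in for the real client so the example runs without AWS credentials.

```python
from typing import Any, Dict, List


def reattach_data_volume(ec2: Any, volume_id: str, instance_id: str,
                         device: str = "/dev/sdf") -> Dict[str, Any]:
    """Attach an existing EBS volume to a replacement instance.

    `ec2` is expected to behave like a boto3 EC2 client. The volume and
    the instance must be in the same Availability Zone.
    """
    return ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)


class RecordingClient:
    """Stub used only for this sketch: records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def attach_volume(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {"State": "attaching", **kwargs}


client = RecordingClient()
result = reattach_data_volume(
    client,
    volume_id="vol-0123456789abcdef0",  # placeholder ID
    instance_id="i-0123456789abcdef0",  # placeholder ID
)
print(result["State"], result["Device"])
```

With real credentials, the stub would be replaced by `boto3.client("ec2")`. The file system on the volume still has to be mounted inside the new instance before the database can use it.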

Allocate Elastic IP addresses and keep them for use with DR servers. Using known IP addresses in scripts can help streamline some recovery operations. For example, you could reserve a particular IP address for a database server and use that address in a Java Database Connectivity (JDBC) configuration.
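As a small example of using a reserved address in configuration, this Python sketch builds a JDBC connection URL from a known IP address. The address `203.0.113.10` comes from a range reserved for documentation, and the PostgreSQL subprotocol, port and database name are assumptions; substitute the values for your own database.

```python
import ipaddress


def jdbc_url(host_ip: str, port: int, database: str,
             subprotocol: str = "postgresql") -> str:
    """Build a JDBC URL of the form jdbc:<subprotocol>://<ip>:<port>/<database>.

    The address is validated first so that a typo in a DR script fails
    early instead of producing an unusable connection string.
    """
    ipaddress.IPv4Address(host_ip)  # raises ValueError if not a valid IPv4 address
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"jdbc:{subprotocol}://{host_ip}:{port}/{database}"


# Hypothetical reserved address for the DR database server.
DR_DATABASE_IP = "203.0.113.10"

url = jdbc_url(DR_DATABASE_IP, 5432, "orders")
print(url)
```

Because the address is reserved ahead of time, the same configuration value can be checked into the DR scripts and reused each time the database server is rebuilt.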

If you use domain names in configuration settings, plan to map disaster recovery IP addresses to domain names. The Amazon Route 53 service can help with this by mapping domain names to the IP addresses of DR servers.
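One way to script that mapping is to prepare a Route 53 change batch that points an A record at the DR server. The Python sketch below only builds the request body. The record name, IP address and 60-second TTL are illustrative assumptions, and the structure follows the `ChangeBatch` argument accepted by the boto3 Route 53 client's `change_resource_record_sets` call.

```python
from typing import Any, Dict


def failover_change_batch(record_name: str, dr_ip: str, ttl: int = 60) -> Dict[str, Any]:
    """Build a Route 53 change batch that creates or updates an A record
    so that `record_name` resolves to the DR server's IP address."""
    return {
        "Comment": "Point record at DR server",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or overwrite it if it exists
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": dr_ip}],
                },
            }
        ],
    }


# Hypothetical domain name and reserved DR address.
batch = failover_change_batch("db.example.com.", "203.0.113.10")
print(batch["Changes"][0]["ResourceRecordSet"])
```

With a real client, the batch would be submitted as `route53.change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, using your own hosted zone ID. A short TTL helps clients pick up the new address sooner after a failover.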

Plan, test and update

The main focus of AWS disaster recovery is often to get business operations back online as quickly as possible. This urgency causes some IT teams to overlook security. But if your business is subject to industry or government regulations, it has probably invested time, money and resources in compliance. Don't let your DR plan undermine those efforts.

Data copied to the cloud should be subject to security controls comparable to those found on premises. Files stored in S3 buckets should have the same effective access controls as copies of those files in on-premises storage systems.

IT teams should also plan for identity management issues. Will you replicate Active Directory in the cloud? If so, consider using federation between Active Directory and AWS with Active Directory Federation Services and Security Assertion Markup Language.

If you are lucky, you will never have to execute your DR plan outside of testing operations, but no one should count on that. The cloud makes it easy to start servers, allocate storage and configure network settings, which in turn makes it practical to test your DR plan regularly.

Also, don't assume that because your DR deployment scripts worked in the past, they will continue to work in the future. IT environments are too dynamic to assume that level of consistent reliability.
