Large infrastructure-as-a-service vendors like AWS offer consistency, reliability and redundancy, making it easy for some enterprises to assume cloud deployments are impervious to failure. But, while AWS may seem bulletproof, it's not idiot-proof.
Even though AWS engineers its infrastructure to be almost impervious to failure, outages happen, and companies still make bad decisions that put their environments at risk. No enterprise should skip putting together an AWS backup strategy.
Zones, regions and AWS redundancy
AWS provides a highly redundant and fault-tolerant design, but that doesn't mean customers are automatically protected against all failures. IT teams must understand AWS' distributed infrastructure and the public cloud provider's concept of availability zones (AZs) and regions.
Most AZs are a collection of multiple data centers located geographically near one another, although typically not on the same campus. Each AZ has redundant power, networking and connectivity -- allowing it to continue service even if a single data center suffers a catastrophe.
Regions are clusters of AZs in the same geographic area, such as US-East or US-West. Many services, such as Amazon Aurora, automatically replicate data across several AZs within a region, creating even more redundancy.
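To put a rough number on what spreading a workload across AZs buys, here is a back-of-the-envelope sketch. It assumes zone failures are statistically independent, which real outages don't always respect, and the 99.9% per-zone figure is purely illustrative, not a published AWS number. The function names are hypothetical.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    if not 0.0 <= per_zone <= 1.0:
        raise ValueError("per_zone must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # All zones must fail at once for the workload to be down.
    return 1.0 - (1.0 - per_zone) ** zones


def downtime_minutes_per_month(availability: float, days: int = 30) -> float:
    """Expected minutes of downtime in a month of `days` days."""
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    # 0.999 is an assumed per-zone availability, chosen only for illustration.
    for n in (1, 2, 3):
        a = combined_availability(0.999, n)
        print(f"{n} zone(s): {downtime_minutes_per_month(a):.4f} min/month")
```

The benefit only materializes if the application can actually run from any one of the zones, which is the point the rest of this section makes.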
AWS doesn't guarantee specific uptime in its SLA, but it is rare for an AZ to experience downtime. A glance at CloudHarmony's monthly availability figures for AWS services shows that most AZs had 100% uptime over the past 30 days -- the worst-affected zone was offline for only a couple of minutes. Still, it's possible to place the only instance of a critical resource or IT service in a single AZ and have an outage there take out an entire application.
Enterprises should also consider automation, which can reduce human errors when it comes to backup and disaster recovery (DR). For example, more than five years ago, Netflix concluded that its use of Elastic Load Balancing to distribute traffic across zones wasn't optimal and that automation would help. At the time, Netflix was engineering its systems to work across multiple zones, but found that its inter-region migration process required too many manual steps.
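One way to catch the single-zone trap is to audit where each service's instances actually run. The sketch below is a minimal illustration that assumes you already have an inventory of (service, availability zone) pairs. That inventory format and the function name are hypothetical, not something AWS provides in this shape.

```python
from collections import defaultdict


def single_zone_services(placements):
    """Return names of services whose instances all sit in one zone.

    `placements` is an iterable of (service_name, availability_zone) pairs,
    a hypothetical inventory format chosen for this sketch.
    """
    zones_by_service = defaultdict(set)
    for service, zone in placements:
        zones_by_service[service].add(zone)
    # A service is at risk if every instance shares the same zone.
    return sorted(s for s, zones in zones_by_service.items() if len(zones) < 2)


if __name__ == "__main__":
    inventory = [
        ("web", "us-east-1a"),
        ("web", "us-east-1b"),
        ("queue", "us-east-1a"),
        ("queue", "us-east-1a"),
    ]
    print(single_zone_services(inventory))
```

In this example inventory, the queue tier runs two instances, but both sit in the same zone, so a single AZ outage would still take it down.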
Craft an AWS backup strategy
Cloud compute instances are stateless, which means that protecting native AWS resources becomes a data backup problem. The approach depends on the type of data, which falls into four categories: files, volumes, images and applications.
The most general-purpose approach is to use Elastic Block Store (EBS) snapshots to save block or file system volumes. Snapshots are incremental backups of changed blocks stored in S3 that can be restored into an EBS volume in the same region or copied across regions and restored to a DR site. EBS snapshots can also back up EC2 application images, which can then be copied across regions and used to spawn new EC2 instances.
EBS snapshots work for backing up files or self-managed databases, but native AWS databases have built-in backup features. For example, Amazon Relational Database Service (RDS) automatically creates daily volume snapshots of database instances and can retain them for up to 35 days.
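The snapshot-then-copy flow described above might look roughly like the following sketch. It assumes the two arguments are boto3 EC2 clients, one per region. The function name and the description strings are illustrative, and error handling, tagging and encryption settings are left out.

```python
def snapshot_and_copy(source_ec2, dest_ec2, volume_id, source_region):
    """Snapshot an EBS volume, then copy the snapshot to a DR region.

    `source_ec2` and `dest_ec2` are expected to be boto3 EC2 clients for the
    source and DR regions, for example:
        boto3.client("ec2", region_name="us-east-1")
    """
    snap = source_ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"DR backup of {volume_id}",
    )
    snapshot_id = snap["SnapshotId"]

    # Wait until the snapshot has completed before attempting the copy.
    source_ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # The copy request is issued against the destination region's client.
    copy = dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"DR copy of {snapshot_id} from {source_region}",
    )
    return snapshot_id, copy["SnapshotId"]
```

Passing the clients in, rather than creating them inside the function, keeps the sketch easy to exercise against stand-in objects before pointing it at a real account.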
Likewise, Amazon Aurora automatically stores six copies of data across three AZs; it automatically attempts to recover databases in a healthy AZ with no data loss. IT teams can combine the features of RDS logs and EBS snapshots to enable database recovery to any point in time over a five-week span. They can also manually initiate RDS or Aurora snapshots that are kept indefinitely.
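As a rough illustration of point-in-time recovery, the sketch below wraps the RDS restore call with a client-side sanity check on the requested time. It assumes `rds` is a boto3 RDS client. The function name, the instance identifiers and the default 35-day window are assumptions for this example, and the real restorable range depends on the retention period configured for the instance.

```python
from datetime import datetime, timedelta, timezone


def restore_to_point_in_time(rds, source_id, target_id, restore_time,
                             retention_days=35, now=None):
    """Restore a new DB instance from `source_id` as of `restore_time`.

    `restore_time` must be a timezone-aware datetime. The window check is
    only a local guard; RDS itself decides what is actually restorable.
    """
    now = now or datetime.now(timezone.utc)
    if restore_time > now:
        raise ValueError("restore_time is in the future")
    if now - restore_time > timedelta(days=retention_days):
        raise ValueError("restore_time is outside the backup retention window")

    # The restore creates a new instance; it does not overwrite the source.
    return rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier=source_id,
        TargetDBInstanceIdentifier=target_id,
        RestoreTime=restore_time,
    )
```

Because the restore produces a separate instance, the application's connection settings would still need to be pointed at the new endpoint as part of the DR runbook.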
IT teams can build a reliable and resilient application infrastructure purely on AWS, but using multiple clouds or a private data center as part of a backup and DR strategy makes sense in certain situations.
A large, region-wide AWS failure would create capacity constraints in other regions in the same geography, as customers would spin up new instances in other regions. We don't know how much spare capacity AWS has on hand, but it's conceivable that customers in such a scenario could hit limits.
Additionally, some services, like RDS and S3, can automatically replicate across regions. While this is generally beneficial, it means a fatal software bug or accidental or malicious data deletion can instantly propagate around the globe and take out an entire application. Having workloads split across AWS and Azure is a safeguard, as the bug likely won't hit both simultaneously.
AWS doesn't include native services to replicate data to other cloud stacks, but third parties such as CloudBasic, CloudEndure and Zerto do. These software-as-a-service products act as data brokers with agents that run in each cloud environment and automate volume snapshots, replication and, if necessary, restoration onto database services or compute instances on another cloud.
An alternative is to use on-premises infrastructure in an AWS backup strategy; most organizations have an existing backup platform for legacy systems. This local infrastructure extends to AWS using a virtual private network or AWS Direct Connect circuit. In this scenario, the developer treats AWS instances or databases as if they were on-premises servers and installs the appropriate backup agent software to manage them from the existing backup servers and admin console.
Recommendations for AWS DR
When designing an AWS backup strategy, decide how committed your enterprise is to AWS. AWS-based businesses can build a DR system that automatically replicates databases and machine images to remote regions that won't be affected by local outages. From there, it takes some process automation using AWS APIs and pre-disaster testing to ensure that IT teams can successfully clone complex applications to another infrastructure.
Here are five best practices for AWS customers designing a DR strategy:
- Designate a backup location for AWS-based applications and configure infrastructure that can be rapidly and automatically deployed. Typically this will be another AWS region, but it could also be another cloud, such as Azure or Google, or an internal virtual infrastructure stack like vSphere, Azure Pack or OpenStack.
- Use any existing on-premises backup systems to protect AWS instances and deploy backup agents to EC2 instances and databases.
- Investigate multicloud automation software if using non-AWS infrastructure for a DR site.
- Configure native AWS databases to use built-in replication features to another region.
- Configure load balancers and DNS via Amazon Route 53 to redirect traffic to DR sites. Remember: ELB doesn't work outside AWS, so rely on DNS or a virtual appliance, like A10 or F5.
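For the DNS redirection in the last recommendation, one common pattern is a pair of Route 53 failover records: a primary tied to a health check and a secondary pointing at the DR site. The sketch below only builds the change batch. The function name, record name, IP addresses and health check ID are placeholders, and it assumes plain A records rather than alias records.

```python
def failover_change_batch(record_name, primary_ip, secondary_ip,
                          primary_health_check_id, ttl=60):
    """Build a Route 53 ChangeBatch with PRIMARY/SECONDARY failover records."""

    def record(role, ip, health_check_id=None):
        rrset = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{record_name}-{role.lower()}",
            "Failover": role,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        # Only the primary is tied to a health check in this sketch.
        if health_check_id:
            rrset["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Primary/secondary failover for DR",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


# Hypothetical usage with a boto3 Route 53 client:
#   route53.change_resource_record_sets(
#       HostedZoneId="<your hosted zone ID>",
#       ChangeBatch=failover_change_batch(
#           "app.example.com.", "192.0.2.10", "198.51.100.10",
#           "<your health check ID>"),
#   )
```

A short TTL keeps cached answers from pinning clients to the failed site for long, at the cost of more DNS queries.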