A string of power problems beset Amazon Web Services (AWS) last week at its Ashburn, Virginia data center. Culprits ranged from human error to a vehicle crash that took out a power pole. The company says it's plain bad luck, that it's working hard to keep things running smoothly, and that users should also protect themselves using AWS tools and services.
"Sometimes these incidents just bunch themselves together, and there's no explanation for the timing on these. We take operational performance very seriously," Amazon spokeswoman Kay Kinton said via email.
I think [these outages] are indicative of the almost infinite ways a complex system like a data center can be impacted.
Mitch Garnaat, author of the Python-based boto project
Kinton said the events were unrelated and that the failures affected a small subset of a single Availability Zone each time. "It's important to remember that we provide customers the building blocks to be resilient to any failure in a single availability zone," she said.
Following up on vows to keep users informed about operational issues, AWS released detailed reports about each incident.
The curious case of Amazon's many outages
On May 5, AWS workers were switching power from one utility substation to another when an uninterruptible power supply (UPS) failed to switch over to backup power, knocking out a rack of servers. Three hours later, "human error caused the backup generator to lose power, which resulted in the same subset of racks losing power." On May 9, a power distribution panel short-circuited and went into ground fault, an extraordinary equipment failure, since industrial power panels are usually thoroughly overbuilt and rated against accidental shorts. AWS reported it had to spend extra time making sure it could disengage the panel safely.
"Before restoring power to the impacted instances, facility engineering had to find and correct the ground fault…Restoring power without having taken this precaution would have put personnel at risk and run the risk of impacting the other hosts in this Availability Zone," said the incident reports.
May 9 was the longest outage of the week, lasting for about eight hours. And finally, a May 11 outage was due to a vehicle hitting a utility pole and cutting off power to the data center. Another piece of equipment, a transfer switch this time, failed and dropped out yet another set of racks for about 30 minutes.
Is AWS just unlucky, or is its data center on the fritz?
Many users were not pleased, but the number of affected customers was small.
"I think it's indicative of the almost infinite ways a complex system like a data center can be impacted," said Mitch Garnaat, longtime AWS user and author of the Python-based boto project. Garnaat said that cloud providers need to let customers know that a situation like an equipment failure is under control, along with providing failover systems that work well.
Amazon said it is taking steps to redesign its power systems to reduce the number of servers in danger from a single equipment failure and added that the changes will be rolled out over the next several months behind the scenes.
It also asked users to learn about and operate some failsafe systems themselves. After the second incident, it urged customers to be vigilant about their AWS environments and said users with the right architecture could have avoided any outage entirely.
"We also want to remind users to take advantage of the Amazon EC2 features designed to help…Applications architected across multiple Availability Zones are able to withstand instance failures within a single Availability Zone," said the incident reports released to the public.
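The multi-AZ principle the incident reports describe amounts to spreading an application's instances across Availability Zones so that losing any single zone leaves capacity running elsewhere. A minimal, hypothetical Python sketch of that placement logic (no real AWS calls; the zone names are assumptions for illustration):

```python
# Hypothetical sketch: round-robin placement of instances across
# Availability Zones, so a single-zone failure only takes out a fraction
# of the fleet. Zone names are illustrative, not real API output.

def place_instances(zones, count):
    """Assign each of `count` instances to a zone, round-robin."""
    return [zones[i % len(zones)] for i in range(count)]

def surviving(placements, failed_zone):
    """Instances still running after one zone fails."""
    return [zone for zone in placements if zone != failed_zone]

zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # assumed zone names
fleet = place_instances(zones, 6)   # two instances land in each zone
remaining = surviving(fleet, "us-east-1a")
# Losing one of three zones leaves two-thirds of the fleet serving traffic.
```

The point of the sketch is the arithmetic AWS is appealing to: with instances in N zones, one zone failure costs roughly 1/N of capacity rather than the whole application.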
Users may not like being told they should fend for themselves on disaster preparedness, but that appears to be part of the price for getting everything else AWS offers.
Carl Brooks is the Technology Writer at SearchCloudComputing.com. Contact him at firstname.lastname@example.org.