A second AWS disruption in one week has enterprises looking for answers about the root cause of the recent DynamoDB glitches and what steps will be taken to ensure they don't recur.
The initial AWS outage on Sunday, caused by a problem with DynamoDB's metadata service, didn't register much on the corporate radar because it occurred outside of business hours. The second problem, on Wednesday, affected only a relatively narrow group of users who needed to launch new Elastic Compute Cloud (EC2) instances in the U.S.-East region, but those users were "dead in the water," according to one engineer at a stealth startup in the Northeast who requested anonymity.
"[New] EC2 instances have a 100% failure rate since about 9:30 this morning," the engineer said. "Talking with AWS support confirms they see this problem."
Even enterprise IT users who weren't impacted said they find the repeated issues concerning.
"It is a bit worrisome to have two days like this in one week," said EJ Brennan, a freelance developer in Massachusetts who works with large enterprise clients.
A consultant working with large enterprises in New York also said his clients were as yet unaffected, but he wants to know how the problem is being solved.
"This is unusual for them," said Mark Szynaka, a cloud architect for CloudeBroker. "Since it looks to be related to the same issue … I do want to hear a root cause analysis and what steps were taken to prevent this from happening again."
Updates on the AWS Service Health Dashboard Wednesday shed some light on the problems, but did not divulge the root cause of the ongoing glitches with the DynamoDB metadata service or exactly what mitigations were taking place. The updates said only that Amazon was investigating increased latency and errors for the DynamoDB metadata service and rolling out the remaining mitigations developed in response to the errors encountered earlier in the week.
Meanwhile, also in Northern Virginia, there was a note on the Service Health Dashboard at 8:01 a.m. Pacific Time (PDT) that reported increased Elastic Block Store (EBS) API error rates, and increased errors for new instance launches in the U.S.-East-1 Region; a note on the Auto Scaling service in Northern Virginia issued at 8:25 a.m. PDT appeared to tie the problems back to DynamoDB once again.
Amazon did not comment by press time, but it did provide some details on the causes of the Sunday morning AWS outage in a summary posted to its website. On Sunday, storage servers attached to the DynamoDB metadata service were hit by a brief network disruption but didn't recover as expected, in part because a new DynamoDB feature called Global Secondary Indexes adds more data to the storage servers, the post said.
"With a larger [data] size, the processing time inside the metadata service for some membership requests began to approach the retrieval time allowance by storage servers," the post said. "We did not have detailed enough monitoring for this dimension…and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests."
The heavy load also meant it was impossible to add capacity to the storage server farm for DynamoDB without disruption, which led to outages in other services that rely on DynamoDB, such as EC2 Auto Scaling.
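The timeout dynamic the post describes can be illustrated with a rough sketch. The model below is hypothetical: the linear cost function, the function names and every number in it are assumptions chosen for illustration, not figures or code from Amazon's postmortem. It shows only how a request that grows with data size can creep toward a fixed time allowance and then begin to fail.

```python
# Illustrative model only. All names and numbers are hypothetical
# and are not taken from Amazon's postmortem.

def processing_time_ms(membership_size_kb: float,
                       base_ms: float = 50.0,
                       ms_per_kb: float = 2.0) -> float:
    """Assumed linear cost of serving one membership request."""
    return base_ms + ms_per_kb * membership_size_kb


def headroom_ms(membership_size_kb: float, timeout_ms: float = 500.0) -> float:
    """Time left before the caller's allowance runs out (negative means a timeout)."""
    return timeout_ms - processing_time_ms(membership_size_kb)


def request_succeeds(membership_size_kb: float, timeout_ms: float = 500.0) -> bool:
    """A request succeeds only if it finishes within the allowance."""
    return headroom_ms(membership_size_kb, timeout_ms) >= 0


if __name__ == "__main__":
    # As the assumed membership data grows, headroom shrinks and then goes negative.
    for size_kb in (50, 150, 220, 300):
        status = "ok" if request_succeeds(size_kb) else "TIMEOUT"
        print(f"{size_kb:>4} KB  headroom={headroom_ms(size_kb):7.1f} ms  {status}")
```

In a model like this, tracking the headroom figure rather than only success or failure is what would flag the trend before requests start timing out, which is the kind of monitoring the post says was missing.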
Amazon said it had taken several steps to keep such errors from happening again, including increasing the capacity of the metadata service.
[UPDATE 2 p.m. ET] -- An Amazon Web Services spokesperson pointed to a specific part of the postmortem published this morning as the reason for the ongoing problems. The passage refers to support cases opened Monday in response to tables being stuck in the updating or deleting stage or to higher-than-normal error rates.
"We did not realize soon enough that…some customers [had] disproportionately high error rates," the postmortem says. "The issue turned out to be a metadata partition that was still not taking the amount of traffic it should have been taking."
The postmortem said this issue had been closed out Monday; it now appears these issues have continued.