AWS Auto Scaling groups are a wonderful feature; the automated systems manage server downtime and scale services automatically for users. When attached to an Elastic Load Balancer, an Auto Scaling group makes it easy to ensure an application is always up and running.
Admins can specify that an Amazon Web Services (AWS) Auto Scaling group use the Elastic Load Balancer's (ELB) health check, which will make sure the service is running on the server -- not just that the server itself is running. This makes for a quick and automated replacement of any servers that are misbehaving, killing the bad servers and replacing them with good clean servers.
It's important to use ELB health checks, not just Elastic Compute Cloud (EC2) health checks. I've had issues where servers were still running, but the service on them had died and wouldn't restart. The ELB would disconnect from that server because it was no longer serving requests, but the AWS Auto Scaling group didn't replace it because the server was still running. Eventually, all of the servers had the same issue and the service stopped working. Then I got an alert from Pingdom notifying me that the Web service was not working. The AWS Auto Scaling group continued to think all of the servers were fine -- it didn't detect that the actual Web service had died and failed to restart.
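As a rough illustration of the setting described above, the sketch below switches a group from EC2 health checks to ELB health checks. It assumes a boto3 Auto Scaling client (for example, `boto3.client("autoscaling")`) is passed in; the group name `web-asg` and both function names are hypothetical.

```python
def elb_health_check_params(group_name: str) -> dict:
    """Build the arguments that tell a group to trust the ELB health check.

    "ELB" makes the group replace instances the load balancer reports as
    unhealthy; the default, "EC2", only looks at the instance status.
    """
    return {
        "AutoScalingGroupName": group_name,
        "HealthCheckType": "ELB",
    }


def use_elb_health_checks(autoscaling_client, group_name: str) -> None:
    """Apply the setting through an Auto Scaling client supplied by the caller."""
    autoscaling_client.update_auto_scaling_group(
        **elb_health_check_params(group_name)
    )
```

With boto3 installed and credentials configured, a call such as `use_elb_health_checks(boto3.client("autoscaling"), "web-asg")` would apply the change to the hypothetical group.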
It's best to use Auto Scaling groups for every production service, even if they don't need to actually scale automatically. Most of my AWS Auto Scaling groups simply say, "Keep X number of servers always running." This means that if there is an issue and one of the servers dies, it will be killed and automatically replaced. It doesn't mean I have to increase the number of servers based on load automatically. But it does make it easier to automate some simple DevOps tasks such as rebooting a server.
What went wrong?
A little-discussed fact about AWS EC2 pricing is that users are billed a full hour for any partial hour a server runs. That means if a user starts a server and then kills it within five minutes, the user is still billed for the full hour. That seems acceptable, but if a user kills a server and replaces it with a new server of the exact same type and location, the replacement starts its own billable hour, which doubles the bill for that hour.
Initially, I launched a server, was charged for one server, killed that server after five minutes and then replaced it. But I was charged for two servers up until the one-hour mark after the first server was launched (Figure 1). When you compound that billing with an error in an AWS Auto Scaling group that constantly kills and relaunches servers, the costs pile up.
In my case, there was a problem with the Auto Scaling group configuration -- a server was continuously being killed and relaunched in the same region where there was an issue. Every five minutes, a new server was launched to replace the old one, resulting in a charge of 12 instance hours every hour -- even though there was only ever one instance running at any given time. And that one wasn't even working properly.
I didn't notice until our next billing statement came with an additional $1,200 in charges because of this. At that point, I contacted AWS Support. I was fairly upset when I found the issue, but Amazon fixed it and gave me a credit for the extra hours that my broken Auto Scaling group caused. AWS also examined the issue and gave me a credit for the two months in which the Auto Scaling group had been running out of control.
In hindsight, I should have set up a notification on the Auto Scaling group, and I should have verified that Auto Scaling actions could not happen more than once every 15 minutes. With those changes, there would have been, at most, four launches an hour and therefore four times the normal charges. This is still bad, but not as bad as 12 times. I also should have verified that the servers in all regions were starting up correctly.
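The 12x and 4x figures follow from simple arithmetic, sketched below under the billing model this article describes: every launch is billed in whole hours. The function name and its single-instance, fixed-interval assumption are illustrative only.

```python
import math


def charge_multiplier(replacement_interval_minutes: float) -> float:
    """Average instance-hours billed per clock hour for ONE logical server.

    Assumes the server is killed and relaunched at a fixed interval and that
    each launch is billed in whole hours, as described in the article.
    """
    if replacement_interval_minutes <= 0:
        raise ValueError("interval must be positive")
    # Whole hours billed for each instance's (possibly short) lifetime.
    hours_billed_per_launch = math.ceil(replacement_interval_minutes / 60)
    # Average number of launches in one clock hour.
    launches_per_hour = 60 / replacement_interval_minutes
    return hours_billed_per_launch * launches_per_hour


print(charge_multiplier(5))   # 12.0 -> replaced every five minutes
print(charge_multiplier(15))  # 4.0  -> at most one action every 15 minutes
print(charge_multiplier(60))  # 1.0  -> a server that simply keeps running
```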
How to prevent Auto Scaling failures
First, subscribe to notifications on Auto Scaling groups -- even if it's just using an email address, as paging may be a little extreme. Administrators should also watch those notifications in case the group goes berserk. If something does happen and the AWS Auto Scaling group continuously spins up and replaces servers, an admin can disable an availability zone or prevent the group from executing any actions. Bumping up the "cool down" time to 15 minutes may also be a good idea to prevent a similar error from completely spinning out of control.
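The sketch below shows one way these safeguards might be scripted. It assumes a boto3 Auto Scaling client is passed in and that an SNS topic (with an email subscription) already exists; the function names, group name and topic ARN are hypothetical. Whether the group-level default cooldown limits every kind of replacement depends on how the group is configured, so treat the 15-minute value as the article's suggestion rather than a guarantee.

```python
# Launch and terminate events, including failures, so churn is visible.
CHURN_EVENTS = [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
    "autoscaling:EC2_INSTANCE_TERMINATE",
    "autoscaling:EC2_INSTANCE_TERMINATE_ERROR",
]


def add_guardrails(client, group_name, topic_arn, cooldown_seconds=900):
    """Subscribe an SNS topic to group events and set a 15-minute cooldown."""
    client.put_notification_configuration(
        AutoScalingGroupName=group_name,
        TopicARN=topic_arn,
        NotificationTypes=CHURN_EVENTS,
    )
    client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        DefaultCooldown=cooldown_seconds,
    )


def pause_group(client, group_name):
    """Stop the group from executing any actions while an admin investigates.

    Omitting ScalingProcesses suspends all of the group's processes.
    """
    client.suspend_processes(AutoScalingGroupName=group_name)
```

Once the underlying problem is fixed, the matching `resume_processes` call on the same client turns the group back on.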
Lastly, make sure the health check grace period gives servers enough time after they spin up before the ELB health check can mark them as failing to start correctly. If the service normally takes five minutes to spin up, give it 15 minutes. If you keep at least two servers running behind the ELB, the running server should be able to handle the load while the new server is starting up.
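To make the five-minutes-to-15-minutes rule of thumb concrete, the sketch below derives a grace period from the normal startup time and pairs it with a two-server minimum. The 3x safety factor is inferred from the example above and is an assumption, as are the function and group names; the resulting dictionary is meant to be passed to a boto3 Auto Scaling client's `update_auto_scaling_group` call.

```python
def startup_settings(group_name, normal_startup_minutes,
                     safety_factor=3, min_servers=2):
    """Build group settings that give new servers time to come up.

    Assumes the group's maximum size is already at least min_servers;
    otherwise the update would be rejected.
    """
    if normal_startup_minutes <= 0 or safety_factor < 1:
        raise ValueError("startup time must be positive and factor >= 1")
    grace_seconds = int(normal_startup_minutes * safety_factor * 60)
    return {
        "AutoScalingGroupName": group_name,
        # Seconds to wait after launch before health checks count.
        "HealthCheckGracePeriod": grace_seconds,
        # Keep a second server so one can carry the load during a restart.
        "MinSize": min_servers,
    }


settings = startup_settings("web-asg", normal_startup_minutes=5)
print(settings["HealthCheckGracePeriod"])  # 900 seconds, i.e. 15 minutes
```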
It's always a good idea to provision extra capacity, as users may need to take down some servers while fixing the issue. Keep in mind that AWS Elastic Beanstalk uses Auto Scaling groups internally, so sign up for notifications on them too, if they're set up.