
Paying the price when an AWS Auto Scaling group runs amok

An AWS Auto Scaling group continuously launched and killed instances, racking up extra charges for one company. Closely managed configurations can prevent the issue.

AWS Auto Scaling groups are a wonderful feature; the automated systems manage server downtime and scale services automatically for users. When attached to an Elastic Load Balancer, an Auto Scaling group makes it easy to ensure an application is always up and running.

Admins can specify that an Amazon Web Services (AWS) Auto Scaling group use the Elastic Load Balancer's (ELB) health check, which makes sure the service is running on the server -- not just that the server itself is running. This makes for quick, automated replacement of any misbehaving servers: the bad servers are killed and replaced with good, clean ones.
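
For example, with the AWS SDK for Python (boto3), switching an existing group from the default EC2 check to the ELB check is a single API call. This is a minimal sketch; the group name is hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Use the ELB health check so instances that stop serving requests are
# marked unhealthy and replaced, not just instances that stop running.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-prod-asg",  # hypothetical group name
    HealthCheckType="ELB",
)
```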

It's important to use ELB health checks, not just Elastic Compute Cloud (EC2) health checks. I've had issues where servers were still running, but the service on them had died and wouldn't restart. The ELB would disconnect from that server because it was no longer serving up requests, but the AWS Auto Scaling group didn't replace it because the server was still running. Eventually, all of the servers had the same issue and the service stopped working. Then I got an alert from Pingdom notifying me that the Web service was down. The AWS Auto Scaling group continued to think all of the servers were fine -- it didn't detect that the actual Web service had died and failed to restart.

It's best to use Auto Scaling groups for every production service, even if they don't need to actually scale automatically. Most of my AWS Auto Scaling groups simply say, "Keep X number of servers always running." This means that if there is an issue and one of the servers dies, it will be terminated and automatically replaced; it doesn't mean I have to automatically increase the number of servers based on load. A fixed-size group also makes it easier to automate some simple DevOps tasks, such as rebooting a server.
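
Here is a minimal sketch of such a fixed-size group in boto3, assuming a launch configuration and a classic ELB already exist; all of the names are hypothetical.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# "Keep two servers always running": min, max and desired are identical,
# so the group never scales on load -- it only replaces unhealthy instances.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-prod-asg",        # hypothetical
    LaunchConfigurationName="web-prod-launch",  # hypothetical
    MinSize=2,
    MaxSize=2,
    DesiredCapacity=2,
    LoadBalancerNames=["web-prod-elb"],         # hypothetical classic ELB
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
)
```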

What went wrong?

A little-discussed fact about AWS EC2 pricing is that users are billed for a full hour for any partial hour a server runs. That means if a user starts a server and then kills it within five minutes, he is still billed for the full hour. That seems acceptable, but if a user kills a server and replaces it with a new server of the exact same type and location, this move doubles the bill for that hour.

Initially, I launched a server, was charged for one server, killed that server after five minutes and then replaced it. But I was charged for two servers up until the one-hour mark after the first server was launched (Figure 1). When you compound that billing with an error in an AWS Auto Scaling group that constantly kills and relaunches servers, the costs pile up.

Figure 1. AWS Auto Scaling group configuration failure: Auto Scaling kills an instance after five minutes and spins up another.

In my case, there was a problem with the Auto Scaling group configuration -- a server was continuously being killed and relaunched in the same region where there was an issue. This means that every five minutes the old server was terminated and a new one launched, resulting in a charge of 12 instance hours every hour -- even though there was only ever one instance running at any given time. And that one wasn't even working properly.
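
The arithmetic behind that figure is straightforward; this quick sketch uses illustrative numbers, not values from my actual bill.

```python
# Under per-hour EC2 billing, each launch is billed as a full instance hour.
minutes_between_replacements = 5
launches_per_hour = 60 // minutes_between_replacements  # 12 launches per hour

billed_instance_hours = launches_per_hour  # 12 instance hours billed per hour
actual_instance_hours = 1                  # only one instance running at a time

print(f"{billed_instance_hours} instance hours billed for "
      f"{actual_instance_hours} hour of actual capacity")
```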

I didn't notice until our next billing statement came and there was an additional $1,200 in charges because of this. At that point, I contacted AWS Support. I was fairly upset when I found the issue, but Amazon fixed it and gave me a credit for the extra instance hours over the two months that the broken Auto Scaling group had been running out of control.

In hindsight, I should have set up a notification on the Auto Scaling group, and I should have verified that Auto Scaling actions could not happen more than once every 15 minutes. With those changes, there would have been, at most, four times the normal charges -- still bad, but not as bad as 12 times. I should also have verified that the servers in all regions were starting up correctly.

How to prevent Auto Scaling failures

First, subscribe to notifications on Auto Scaling groups -- even if it's just to an email address, as paging may be a little extreme. Administrators should also pay attention to those notifications so they notice if a group goes berserk. If something does happen and the AWS Auto Scaling group continuously spins up and replaces servers, an admin can disable an availability zone or prevent the group from executing any actions. Bumping up the cooldown time to 15 minutes is also a good way to keep a similar error from completely spinning out of control.
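
Both steps can be scripted with boto3; this is a sketch, and the group name and SNS topic ARN are hypothetical placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Send launch and terminate events -- including failures -- to an SNS topic
# that the operations team subscribes to by email.
autoscaling.put_notification_configuration(
    AutoScalingGroupName="web-prod-asg",  # hypothetical
    TopicARN="arn:aws:sns:us-east-1:123456789012:asg-alerts",  # hypothetical
    NotificationTypes=[
        "autoscaling:EC2_INSTANCE_LAUNCH",
        "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
        "autoscaling:EC2_INSTANCE_TERMINATE",
        "autoscaling:EC2_INSTANCE_TERMINATE_ERROR",
    ],
)

# Raise the default cooldown to 15 minutes so scaling actions can fire
# at most about four times an hour.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-prod-asg",
    DefaultCooldown=900,  # seconds
)
```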

Lastly, make sure the health check gives servers enough of a grace period after they spin up before Auto Scaling decides they did not start correctly. If the service normally takes five minutes to spin up, give it 15 minutes. As long as there are at least two servers running behind the ELB, the remaining server should be able to handle the load while the new server is starting up.
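
In Auto Scaling terms, that is the group's health check grace period. A minimal sketch of bumping it to 15 minutes, again with a hypothetical group name:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Give new instances 15 minutes to boot and pass the ELB health check
# before Auto Scaling marks them unhealthy and replaces them.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-prod-asg",
    HealthCheckGracePeriod=900,  # seconds
)
```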

It's always a good idea to provision extra capacity, as users may need to take down some servers while fixing the issue. Keep in mind that AWS Elastic Beanstalk uses Auto Scaling groups internally, so if Beanstalk environments are in use, sign up for notifications on those groups as well.

Next Steps

Allow your apps to scale with AWS Auto Scaling

Load patterns, performance behavior keys in Auto Scaling puzzle

Deconstructing an AWS bill

This was last published in September 2015


Join the conversation


What issues has your enterprise encountered with Auto Scaling?
I liked your article, but there is another way to rack up lots of charges besides the failure case you describe. I've always wanted to set up an autoscaling rule that said "don't deprovision an EC2 instance if it has been running for less than 55 minutes of its current charging hour." If your load varies considerably, you can spin up an instance, then spin it down, then spin up another one -- all within the same hour. Once you've started an EC2 instance, you should never terminate it until it's close to the end of the hour.

I've asked, but there does not appear to be a way to do this currently.
