Limitations And Other Caveats of AWS ELBs

This week, while load testing infrastructure for a large new customer, we ran into some issues with the deployed ELBs and had an interesting conversation with Amazon engineers about the inner workings of ELBs and some limitations you may face. We thought it would be interesting to share the highlights here.

 

No Large, Sudden Spikes

This one is well known but may still surprise some people: ELB was not designed to withstand large, sudden spikes of traffic. It may seem counter-intuitive (and in fact it is to me) that a service designed for scalability has this limitation, but it is a hard fact.

In AWS's words:

I noted your primary question is if ELB imposes a restriction on data/seconds. This is not the case and in fact the only real limiting factor of ELB includes large sudden spikes, as ELB is designed to scale due to load spread over a period of time. Large sudden spikes generally require pre-warms.

Given this I think explaining how ELB distributes traffic would provide some clarification. ELB uses a least number of waiting connections algorithm, once the request reaches the ELB node in the AZ. At the AZ level traffic is distributed using DNS in a round robin fashion.
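
To make that two-level routing a bit more concrete, here is a minimal Python sketch of it: a round-robin "DNS" picks an ELB node per lookup, and each node then hands the request to the backend with the fewest outstanding requests. The `ElbNode` class, the zone and backend names, and the tie-breaking rule are all made up for illustration. This is a toy model of the description above, not how ELB is actually implemented.

```python
import itertools


class ElbNode:
    """A hypothetical ELB node living in one availability zone."""

    def __init__(self, zone, backends):
        self.zone = zone
        # Outstanding (in-flight) request count per backend instance.
        self.outstanding = {name: 0 for name in backends}

    def pick_backend(self):
        # Choose the backend with the fewest in-flight requests;
        # ties go to the first one listed (an arbitrary choice here).
        name = min(self.outstanding, key=self.outstanding.get)
        self.outstanding[name] += 1
        return name

    def finish(self, name):
        # Call this when a request completes.
        self.outstanding[name] -= 1


def make_dns_round_robin(nodes):
    """Return a resolver that cycles through the ELB nodes, one per lookup."""
    cycle = itertools.cycle(nodes)
    return lambda: next(cycle)


nodes = [
    ElbNode("az-a", ["a1", "a2"]),
    ElbNode("az-b", ["b1", "b2"]),
]
resolve = make_dns_round_robin(nodes)

# Six requests arrive and none of them has completed yet.
routes = []
for _ in range(6):
    node = resolve()
    routes.append((node.zone, node.pick_backend()))

for zone, backend in routes:
    print(zone, backend)
```

In this toy run the lookups alternate between the two zones, and inside each zone the requests alternate between the two backends because nothing has finished yet.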

Interesting. Please note the "pre-warm" concept. So when you expect a large, sudden spike of traffic, you just get on the phone with the AWS folks, ask them to pre-warm your ELB, and the problem is solved! Just one small caveat: you may not actually know when the large, sudden spike will come. Anyway, let's see what AWS had to say when asked what a large sudden spike is and what this pre-warming thing involves:

"Large sudden spikes" means the sudden increase in the number of request to the ELB. Indeed this happens when you are load testing with a tool which sends large number of request to the ELB in a particular period.

Since you have sent large number of request as part of your load test, the ELB sent an error response as it was not able to handle the sudden increase in the incoming request.

About this issue :
  This is a common issue most of our customer face during load test without pre-warming the ELB.

What is Prewarming?
  Configuring the load balancer to have the appropriate level of capacity based on the traffic that you expect.
For pre-warming, we get the following details from the customer.
 1) Start and end dates of your tests or expected flash traffic,
 2) Expected request rate per second
 3) Total size of the typical request/response that you will be testing.
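
Those three inputs are easy to put together before calling support. The sketch below (Python, with entirely hypothetical numbers and a made-up `PrewarmRequest` helper) just collects them and derives a rough peak bandwidth from the request rate and payload sizes. AWS did not ask for a bandwidth figure; it is only a convenient sanity check on your own estimate.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PrewarmRequest:
    """Hypothetical container for the details AWS Support asks for."""
    start: datetime             # 1) start of the test or expected flash traffic
    end: datetime               # 1) end of the test or expected flash traffic
    requests_per_second: int    # 2) expected request rate
    request_bytes: int          # 3) typical request size
    response_bytes: int         # 3) typical response size

    def peak_bytes_per_second(self) -> int:
        # Rough estimate: each request moves one typical request plus
        # one typical response through the load balancer.
        return self.requests_per_second * (self.request_bytes + self.response_bytes)

    def summary(self) -> str:
        mbit = self.peak_bytes_per_second() * 8 / 1_000_000
        return (
            f"Pre-warm from {self.start:%Y-%m-%d %H:%M} to {self.end:%Y-%m-%d %H:%M}: "
            f"{self.requests_per_second} req/s, "
            f"{self.request_bytes + self.response_bytes} bytes per request/response, "
            f"about {mbit:.0f} Mbit/s at peak"
        )


# Placeholder values, not figures from our actual test.
req = PrewarmRequest(
    start=datetime(2030, 1, 1, 9, 0),
    end=datetime(2030, 1, 1, 11, 0),
    requests_per_second=2000,
    request_bytes=1_000,
    response_bytes=50_000,
)
print(req.summary())
```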

Now, a large, sudden spike could happen during a load test, but it could just as well happen if, say, a plane crashes and everyone rushes to our customer's website (a news outlet) to read the news. Pre-warming does not seem to be an option there.

So we definitely wanted to understand exactly how a large, sudden spike is defined, and we asked again. AWS's answer:

>> "Large Spike", what does it mean, are we talking about users request being increased from let say 200 to 2000?
Honestly, I don't think there are clear cut numbers we can provide, however, an initial ELB size can handle probably a request count average of say 100 or so per second(these are my own best guesses, not official values). If you anticipate to have more than say 300 requests/second on a go, you may need the pre-warm that my colleagues talked about earlier to have the ELB initial size bumped up.

>> or how does it calculate and what would be considered as large spike ?
The design of the ELB is that there is a threshold beyond which the ELB scales up. For example if you create an ELB now it will be at its minimal size and once traffic starts flowing through it will scale up as traffic increases and at each ELB magnitude there is a defined threshold value. If the requests have surpassed the threshold value in a sudden behavior(spike) there will be a problem as the ELB scales gradually. You may be interested in the actual value of the thresholds and unfortunately I do not have the exact values. When an ELB scales from one level to another there is need to give underlying host a few minutes to reconfigure. To ensure that there is no outage during this scale up period we recommend the multi-AZ set up.

I wish I could give the exact figures for your planning(which I believe is very useful) but I will be lying and we may not be able to provide these as AWS(sadly). My advice would be if you anticipate some significant traffic whether slightly(as long as you expect higher than normal) please let us know and we can advise on the state of your ELB whether it can handle the expected traffic.

I am happy to clarify further should there be need to do so.

So this was really useful and frank advice from the AWS people, and it gives very good insight into when an ELB may not be a good option.
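
To see why a threshold-plus-delay design copes with a ramp but not with a spike, here is a small Python simulation of the behaviour the engineer describes. Everything in it is an assumption made for illustration: the starting capacity of 100 requests/second echoes the engineer's unofficial guess, while the doubling factor, the 50% threshold and the two-minute reconfiguration delay are invented, since AWS would not share the real values.

```python
def simulate(load_per_minute, initial_capacity=100, growth=2.0,
             threshold=0.5, reconfigure_minutes=2):
    """Toy model of a load balancer that scales in steps.

    load_per_minute: offered load in requests/second, one entry per minute.
    Returns (rejected_requests, final_capacity).
    """
    capacity = initial_capacity
    pending = None   # minutes left until a scale-up in progress completes
    rejected = 0
    for load in load_per_minute:
        # Finish a scale-up that was started earlier.
        if pending is not None:
            pending -= 1
            if pending == 0:
                capacity = int(capacity * growth)
                pending = None
        # Start a new scale-up once load crosses the threshold.
        if pending is None and load > capacity * threshold:
            pending = reconfigure_minutes
        # Anything above current capacity is rejected for this minute.
        rejected += max(0, load - capacity) * 60
    return rejected, capacity


# Same final load of 300 req/s, reached gradually versus all at once.
gradual = [40, 60, 80, 100, 140, 180, 240, 300]
spike = [40, 40, 40, 300, 300, 300, 300, 300]

print("gradual:", simulate(gradual))
print("spike:  ", simulate(spike))
```

With these made-up parameters the gradual ramp is served in full, while the spike loses requests during every minute in which the model is still reconfiguring.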

 

Do Not Load Test from a Single Source

Another interesting piece of advice we received was actually about how we were doing the test.

We were using a large M3 instance in another AWS region to generate the traffic (with curl-loader). This turned out to be problematic too. Let's have the AWS people clarify:

If you are using a single point to conduct the load test this could route traffic to primary a single backend instance (at ELB level). It is advised to conduct load testing from multiple end points. To assist with this we have an article for best practices in planning for load test using distributed load with "bees with machine guns". Please reference the link below:

http://aws.amazon.com/articles/1636185810492479

Afterwards we tested using the blitz.io service, which generates the load from a swarm of different servers, and all went fine.
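
One plausible way to picture the single-source problem is the toy Python model below. It assumes, and this is our assumption rather than something AWS spelled out, that each load generator resolves the ELB name once and keeps using the address it got for the rest of the test. The node names and request counts are invented.

```python
import itertools
from collections import Counter


def run_load_test(num_clients, requests_per_client, elb_nodes):
    """Count requests per ELB node when every client caches one DNS answer."""
    dns = itertools.cycle(elb_nodes)   # round-robin DNS answers
    hits = Counter()
    for _ in range(num_clients):
        cached_node = next(dns)        # resolved once, reused for the whole test
        hits[cached_node] += requests_per_client
    return hits


elb_nodes = ["node-az-a", "node-az-b"]

# Same total of 10,000 requests, from one machine versus a swarm of ten.
single = run_load_test(1, 10_000, elb_nodes)
swarm = run_load_test(10, 1_000, elb_nodes)

print("single source:", dict(single))
print("swarm:        ", dict(swarm))
```

Under that assumption the single generator sends everything to one node, while the swarm spreads the same load evenly across both.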

 

So thanks to AWS Support for the very detailed explanations, and we hope this can help someone else!