Monday, February 11, 2013

Handling timeouts with Amazon Elastic Load Balancer and autoscaling

Introduction

I had to deploy a scalable web service around an application that our company developed. I created a small Flask server to handle HTTP requests, pass them through the application and return the results. Now, how to make it scale?

So here comes the lego:

  • Put Flask under a WSGI container - uwsgi was chosen
  • Run that container on a single server with multiple processes/threads. I've also added nginx as a reverse proxy in front of uwsgi
  • Add servers when the request rate goes up, remove servers when the request rate goes down
  • Put this setup behind an HTTP load balancer to distribute requests

At the beginning Amazon Elastic Beanstalk looked very appealing - like it has it all. But our application behind Flask is written in C++, and making it run on an Elastic Beanstalk instance would be a serious challenge by itself; and using a custom AMI basically defeats most of the benefits provided by this AWS service.

So I went with a more custom scenario - an AWS autoscaling group with my own instances and metrics + Elastic Load Balancer.
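To make the "add servers when the request rate goes up" idea concrete, here is a rough Python sketch of a step-scaling decision. The thresholds, group sizes and the function name are all hypothetical; this only illustrates the idea and is not the actual AWS policy I used.

```python
def desired_capacity(current, requests_per_second, *,
                     per_instance_high=20.0, per_instance_low=5.0,
                     min_size=1, max_size=10):
    """Return the instance count a simple step policy would ask for.

    All thresholds are made-up illustration values (requests per second
    per instance); `current` must be at least 1.
    """
    load = requests_per_second / current
    if load > per_instance_high:
        target = current + 1      # too busy: add one server
    elif load < per_instance_low:
        target = current - 1      # mostly idle: remove one server
    else:
        target = current          # within the comfortable band
    # Never leave the [min_size, max_size] range of the group.
    return max(min_size, min(max_size, target))
```

With these made-up numbers, a single instance receiving 50 requests per second would ask for a second one, while an idle single instance stays at the minimum of one.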

Bootstrap went smoothly and the whole system was up in a couple of days. Instances went up and down depending on the load, but during stress testing we ran into weird connection timeout/drop issues that took a while to understand.

Now come the timeouts

Too many availability zones

It was tempting to enable all availability zones for the load balancer, even though we keep only one instance online when the cluster is idle. I thought the load balancer would be smart enough to notice that it has only one instance and not try routing requests to other availability zones, which do not have any servers at all. Well, in that respect the Amazon load balancer is dumb - if you enable 5 availability zones, the DNS endpoint will always resolve to 5 rotating IPs, no matter how many servers you actually have.

So the first piece of advice is to make sure that the number of availability zones enabled in the ELB is no greater than the minimum number of servers you keep online in the scaling group.
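A quick sanity check for this rule could look like the Python sketch below. It works on plain lists of zone names that you would have to collect yourself; the function names and the zone labels in the usage note are made up for illustration, and nothing here calls an AWS API.

```python
def zones_without_instances(enabled_zones, instance_zones):
    """Return enabled zones in which no instance is currently running.

    enabled_zones:  zones switched on for the load balancer
    instance_zones: one entry per running instance, naming its zone
    """
    running = set(instance_zones)
    return sorted(zone for zone in set(enabled_zones) if zone not in running)


def zone_count_ok(enabled_zones, min_instances):
    """True if the group minimum can cover every enabled zone."""
    return len(set(enabled_zones)) <= min_instances
```

For example, with zones "zone-a", "zone-b" and "zone-c" enabled and a single instance in "zone-b", the first function reports "zone-a" and "zone-c" as empty, and the second one returns False for a minimum of one instance.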

Client app gets connections closed in its face

The symptom was that our stress-load client got (random) "Got empty reply from server"-like errors from Python's urllib:
Failed connecting to web_api (): ('http protocol error', 0, 'got a bad status line', None)
On the server side we got plenty of "SIGPIPE" errors from uwsgi:
write(): Broken pipe [proto/uwsgi.c line 145] during GET /.... (127.0.0.1)
IOError: write error
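To tell a connection dropped by something in the middle apart from the client's own timeout, a stress client can classify the exception it gets. The sketch below uses Python 3's http.client, which is an assumption on my side: the errors above came from an older urllib, where the same situation shows up as the "bad status line" error. The function names, return labels and the default timeout are made up for illustration.

```python
import http.client
import socket


def classify_failure(exc):
    """Map an exception from an HTTP request to a short label."""
    # The peer closed the connection without sending a status line, or
    # reset it: this is what a dropped connection looks like to the client.
    if isinstance(exc, (http.client.BadStatusLine, ConnectionResetError)):
        return "dropped"
    # Our own socket timeout fired before anything came back.
    if isinstance(exc, socket.timeout):
        return "client_timeout"
    # Anything else at the socket level (refused, unreachable, ...).
    if isinstance(exc, OSError):
        return "network_error"
    return "other"


def probe(host, port, path="/", timeout=90.0):
    """Send one GET and report "ok:<status>" or a failure label.

    The client timeout is deliberately longer than the timeouts being
    investigated, so a drop is not mistaken for a client-side timeout.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()
        return "ok:%d" % response.status
    except Exception as exc:  # broad on purpose: this is a diagnostic probe
        return classify_failure(exc)
    finally:
        conn.close()
```

Logging the label together with the elapsed time of each request makes it easy to see whether the drops cluster around a particular number of seconds.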

After quite a lot of digging, we found that connections were being dropped in two places:

First, ELB has an internal "feature" of closing all incoming HTTP connections that do not get a response within 60 seconds. I.e. if your client executes GET ... and waits for more than 60 seconds - ELB will close the connection. This timeout is currently not configurable through the Amazon API and is not even advertised in the Amazon docs. Rumor has it that you can still get this timeout adjusted per load balancer by writing to the Amazon support people.

The second "criminal" was uwsgi - it has a default 30 second timeout for the sockets it creates. I.e. after a socket is idle for 30 seconds - the kernel just closes it, and when uwsgi later wants to write the long-awaited response to the socket - "oh my God! SIGPIPE! Socket closed! Client disconnected!". No, pal, I have news for you: it's not the client, it's your own socket that you (the uwsgi master process?) created. Fixed by passing --http-timeout 60 in the uwsgi startup script (to actually match the timeout of the load balancer).
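The arithmetic behind the two timeouts is simple, but it helps to see it spelled out. The Python sketch below takes how long a request needs and the idle timeout of every layer in front of the application, and lists the layers that would give up first. The 60 and 30 second values are the ones from this post; the function name and the layer labels are made up for illustration.

```python
def layers_that_give_up(request_seconds, layer_timeouts):
    """Return the layers whose timeout is shorter than the request.

    layer_timeouts maps a layer name to its idle timeout in seconds.
    The result is ordered by timeout, so the first entry is the layer
    that drops the connection first.
    """
    expired = [(limit, name) for name, limit in layer_timeouts.items()
               if request_seconds > limit]
    return [name for limit, name in sorted(expired)]


# Defaults described in this post, and the setup after the fix.
before_fix = {"elb": 60, "uwsgi": 30}
after_fix = {"elb": 60, "uwsgi": 60}
```

A 45 second request is cut off by uwsgi with the defaults and survives after the fix, while a 70 second request is still dropped by both layers: raising the uwsgi timeout only helps up to the load balancer's 60 seconds.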

And that is the story of load balancers, uwsgi and connection issues. Those default 60 and 30 second values probably go unnoticed by the majority of web apps, which strive for sub-second response times. Our application's requests take long to process by their nature, so we hit them all.