Simple clustering with AWS and (free) RightScale
Here at StyleFeeder, we do a lot of things for the sake of performance. We recently decided to take a set of processes that we had running on a few large EC2 instances over at Amazon Web Services and consolidate them into a couple of clusters.
First, you may ask, why use AWS at all? We don’t use it for everything, but this particular set of processes handles image storage, resizing, and delivery through a content delivery network. We have many millions of images. If we wanted to stick them on a big file system, we would run out of inodes, so that’s out. We could put them in a sharded MySQL database, and we have a big sharded MySQL infrastructure for a lot of our data anyway, but that’s not how we started out, and it’s not exactly using the database for database-y things. To do this ourselves, we would have had to install a distributed file system of some kind, which seemed like a lot of work, so we decided to use AWS’s S3 for storage.

Once your images are in S3, there are certain things it makes sense to do in EC2, since an EC2 instance can talk to an S3 bucket pretty fast. Vendor lock-in? You bet, but hey, it’s giving us a pretty good value. To get the images out of S3 and on their way to our users’ computers and other shopping-enabled gadgets, we have some EC2 instances that resize them, store the resized versions for future use, and serve them up.

The resizer/cacher logs indicate that although we don’t serve any given image to all that many users on average, about 96% of requests are for an image we’ve already served at least once. If the CDNs could just keep them all around forever, our servers wouldn’t be working as hard as they do. Our actual origin hit rates at the CDNs are something like 50%. What’s up with that? They can’t handle sparse sets of content? I don’t remember a disclaimer about that. Can you hear me, CDN people? I’m talking to you!!
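(To make that resize-and-cache step a little more concrete, here’s a minimal sketch of what one of those EC2 workers does, assuming the AWS SDK for Java and plain old ImageIO. The class and method names are just for illustration, not our actual code.)

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;

import javax.imageio.ImageIO;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;

public class ImageResizer {

    /** Pull the original out of S3, scale it down, and hand back JPEG bytes to cache and serve. */
    public static byte[] fetchAndResize(AmazonS3 s3, String bucket, String key, int targetWidth)
            throws Exception {
        // Reading the original from S3 is fast from inside EC2, which is why the resizers live there.
        S3Object object = s3.getObject(bucket, key);
        BufferedImage source = ImageIO.read(object.getObjectContent());

        // Scale proportionally to the requested width.
        int targetHeight = source.getHeight() * targetWidth / source.getWidth();
        BufferedImage resized = new BufferedImage(targetWidth, targetHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = resized.createGraphics();
        g.drawImage(source, 0, 0, targetWidth, targetHeight, null);
        g.dispose();

        // The caller stores these bytes (the "cacher" half) and serves them on subsequent requests.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(resized, "jpg", out);
        return out.toByteArray();
    }
}
```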
But I digress. Back to AWS. AWS is a good value, of course, only as long as we’re efficiently using EC2 resources. That’s where the clusters come in. We’ll talk about this in terms of what you do in RightScale, rather than AWS alone. RightScale used to offer a lot of features that AWS just plain didn’t have. AWS has been filling those gaps, but we’re not planning to ditch RightScale any time soon, because they still make AWS resource management a lot easier than the raw services in the Amazon interface. If you’re a hard-core command-liner, this blog post and this manual tell you everything you need to know. We’ve gotten into the habit of doing these kinds of things with the free features of RightScale, until we get to the point where scripts save us real time or money. Here’s what we did:
- Built up a medium-sized resizer/cacher to do any and all of the miscellaneous things our various big boxes were doing.
- ‘Bundled’ it into an AMI.
- Created a ‘Server Template’ based on that image. Huh? Why do I need a ‘Server Template’? What’s wrong with the image? You can’t use raw images as the basis for servers in a ‘Deployment’ (see below). You need to have a server template. Does it have to be that way? I can’t see why, but hey, this is a free service (RightScale, that is), so who’s complaining?
- Wrote some ‘RightScripts’ so our server can start up and add itself into the mix in a fully functional state (starting Apache and whatnot).
- Created a ‘Deployment’ and added the server template to it.
- Created more server templates, based on the same image, and added them to the deployment.
- Created a ‘Load Balancer’ and registered the servers in the deployment with it ‘on boot’ (and ‘now’ for the ones that were already running).
- Put a CNAME in our DNS for the load balancer. Huh? Couldn’t we just take an elastic IP address (one of those ‘permanent’ IPs Amazon gives you) and assign the thing to that, so it takes over right away for the big old instance that was handling this? No, no we couldn’t. This is a pay service, and that sucks, but with a short time-to-live it only sucks for a few minutes, so we’re going to overlook it. AWS seems to want to be able to scale the load balancer, or move it around, whenever they want, which I guess you might need in some circumstances. Note to people running Java: watch out for the infamous Java DNS caching problem if you have JVMs that talk to one of these load balancers. If Amazon switches the IP underneath you, your JVMs will be talking to the void unless you’ve configured the JRE properly.
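For reference, the JRE tweak we have in mind looks something like this; a minimal sketch, and the 60-second TTL is just an illustrative number, not a recommendation:

```java
import java.security.Security;

public class DnsCacheConfig {
    public static void configure() {
        // By default the JVM can cache successful DNS lookups indefinitely (always, when a
        // security manager is installed), so it never notices the load balancer's IP moving.
        // Cache successful lookups for 60 seconds instead (illustrative value).
        Security.setProperty("networkaddress.cache.ttl", "60");
        // Don't hang on to failed lookups for long either.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }
}
```

You can get a similar effect with the sun.net.inetaddr.ttl system property on Sun JREs, as long as it’s set before the first lookup happens.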
Now we can start all the nodes in our clusters, or just some of them, or whatever. If we’re feeling really ambitious, we can set these to auto-scale, but we’re already saving quite a bit of money and serving things faster than we were, so that will be for another day.
How does it decide which node gets the traffic? The documentation seems to say that it’s round-robin between availability zones, and then based on load within them. On this page Amazon says “Elastic Load Balancing metrics such as request count and request latency are reported by Amazon CloudWatch.” So, based on load, but measured in a black-box-y way.

“Elastic Load Balancing automatically checks the health of your load balancing Amazon EC2 instances. You can optionally customize the health checks by using the elb-configure-healthcheck command.” You can do this in the RightScale interface as well. You can either accept the default, which checks for the presence of a TCP listener on port 80 (target="TCP:80"), or you can give it something like "HTTP:80/path/to/my/image.jpg" that has to return a "200 OK" when all is well. The default seems to work surprisingly well for these particular CPU-intensive activities that we’re clustering. We don’t see one server with a load of 0.3 while another is at 4. We do see some occasional differences, but they seem to even out pretty fast. We’ll be more precise if the differences start to get out of hand.
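If you’d rather script the health check than click through the RightScale screens or run the elb-* tools, the same call is available through the ELB API. Here’s a minimal sketch using the AWS SDK for Java; the load balancer name and the interval/threshold numbers are purely illustrative:

```java
import com.amazonaws.services.elasticloadbalancing.AmazonElasticLoadBalancing;
import com.amazonaws.services.elasticloadbalancing.model.ConfigureHealthCheckRequest;
import com.amazonaws.services.elasticloadbalancing.model.HealthCheck;

public class HealthCheckSetup {

    /** Swap the default TCP:80 check for an HTTP GET that has to come back "200 OK". */
    public static void customize(AmazonElasticLoadBalancing elb, String loadBalancerName) {
        HealthCheck check = new HealthCheck()
                .withTarget("HTTP:80/path/to/my/image.jpg") // the example target from above
                .withInterval(30)           // seconds between checks (illustrative)
                .withTimeout(5)             // seconds to wait for a response (illustrative)
                .withHealthyThreshold(3)    // consecutive successes before back "in service"
                .withUnhealthyThreshold(2); // consecutive failures before "out of service"

        elb.configureHealthCheck(new ConfigureHealthCheckRequest()
                .withLoadBalancerName(loadBalancerName)
                .withHealthCheck(check));
    }
}
```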