Simple clustering with AWS and (free) RightScale

Here at StyleFeeder, we do a lot of things for the sake of performance. We recently decided to take a set of processes that we had running on a few large EC2 instances over at Amazon Web Services and consolidate them into a couple of clusters.

First, you may ask, why use AWS at all? We don’t use it for everything, but this particular set of processes handles image storage, resizing, and delivery through a content delivery network. We have many millions of images. If we wanted to stick them on a big file system, we would run out of inodes, so that’s out. We could put them in a sharded MySQL database, and we have a big sharded MySQL infrastructure for a lot of our data anyway, but that’s not how we started out, and it’s not exactly using the database for database-y things. To do this ourselves, we would have had to install a distributed file system of some kind, which seemed like a lot of work, so we decided to use AWS’s S3 for storage.

Once your images are in S3, there are certain things it makes sense to do in EC2, since an EC2 instance can talk to an S3 bucket pretty fast. Vendor lock-in? You bet, but hey, it’s giving us a pretty good value. To get the images out of S3 and on their way to our users’ computers and other shopping-enabled gadgets, we have some EC2 instances that resize them, store the resized versions for future use, and serve them up.

The resizer/cacher logs indicate that although on average we don’t serve a given image to very many users, we’re serving it for the Nth time, with N > 1, about 96% of the time. If the CDNs could just keep them all around forever, our servers wouldn’t be working as hard as they do. Yet our actual origin hit rates at the CDNs are something like 50%. What’s up with that? They can’t handle sparse sets of content? I don’t remember a disclaimer about that. Can you hear me, CDN people? I’m talking to you!!
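For the curious, the resize step itself is nothing exotic. Here’s a minimal sketch using plain Java 2D (not our actual code; the bilinear scaling, target width, and JPEG output are arbitrary choices), assuming the original image has already been pulled out of S3 onto local disk:

import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class Resizer {

    // Scale an image to a target width, preserving its aspect ratio.
    public static BufferedImage resizeToWidth(BufferedImage src, int targetWidth) {
        int targetHeight = (int) Math.round(
                src.getHeight() * (targetWidth / (double) src.getWidth()));
        BufferedImage dst = new BufferedImage(targetWidth, targetHeight,
                BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, targetWidth, targetHeight, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage original = ImageIO.read(new File(args[0])); // fetched from S3
        ImageIO.write(resizeToWidth(original, 300), "jpg", new File(args[1]));
    }
}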

But I digress. Back to AWS. AWS is a good value, of course, only as long as we’re using EC2 resources efficiently. That’s where the clusters come in. We’ll talk about this in terms of what you do in RightScale, rather than AWS alone. RightScale used to offer a lot of features that AWS just plain didn’t have. AWS has been filling those gaps, but we’re not planning to ditch RightScale any time soon, because they still make AWS resource management a lot easier than the raw services in the Amazon interface. If you’re a hard-core command-liner, this blog post and this manual tell you everything you need to know. We’ve gotten into the habit of doing these kinds of things with the free features of RightScale until we get to the point where scripts save us real time or money. Here’s what we did:

  • Built up a medium-sized resizer/cacher to handle any and all of the miscellaneous things our various big boxes were doing.
  • ‘Bundled’ it into an AMI (Amazon Machine Image).
  • Created a ‘Server Template’ based on that image. Huh? Why do I need a ‘Server Template’? What’s wrong with the image? You can’t use raw images as the basis for servers in a ‘Deployment’ (see below). You need to have a server template. Does it have to be that way? I can’t see why, but hey, this is a free service (RightScale, that is), so who’s complaining?
  • Wrote some ‘RightScripts’ so each server can start up and add itself into the mix in a fully functional state (starting Apache and whatnot).
  • Created a ‘Deployment’ and added the server template to it.
  • Created more server templates, based on the same image, and added them to the deployment.
  • Created a ‘Load Balancer’, and registered the servers in the deployment with it ‘on boot’, and, for the ones that were already running, ‘now’.
  • Put a CNAME in our DNS for the load balancer. Huh? Couldn’t we just take an Elastic IP address (one of those ‘permanent’ IPs Amazon gives you) and assign the thing to that, so it takes over right away for the big old instance that was handling this? No, no we couldn’t. This is a pay service, and that sucks, but with a short time-to-live it only sucks for a few minutes, so we’re going to overlook this. AWS seems to want to be able to scale the load balancer, or move it around, whenever they want, which I guess you might need in some circumstances. Note to people running Java: watch out for the infamous Java DNS caching problem if you have JVMs that talk to one of these load balancers. If Amazon switches the IP underneath you, your JVMs will be talking to the void unless you’ve configured the JRE properly (see the snippet just after this list).
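By default (at least when a security manager is installed), the JRE caches successful DNS lookups forever, which is exactly wrong for a load balancer whose IP can change out from under you. A minimal sketch of the fix, assuming a 60-second TTL is acceptable for your app (the same property can also be set in the JRE’s java.security file):

import java.security.Security;

public class DnsCacheConfig {
    public static void main(String[] args) {
        // Cache successful DNS lookups for 60 seconds instead of forever,
        // so the JVM notices when Amazon remaps the load balancer's IP.
        Security.setProperty("networkaddress.cache.ttl", "60");

        // ... bring up the rest of the application after this point ...
    }
}

The call has to happen early, before anything resolves a name, because the caching policy is read the first time InetAddress does a lookup.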

Now we can start all the nodes in our clusters, or just some of them, or whatever. If we’re feeling really ambitious, we can set these to auto-scale, but we’re already saving quite a bit of money and serving things faster than we were, so that will be for another day.

How does it decide which node gets the traffic? The documentation seems to say that it’s round robin between availability zones, and then based on load within them. On this page Amazon says “Elastic Load Balancing metrics such as request count and request latency are reported by Amazon CloudWatch.” So, based on load, but measured in a black-box-y way.

“Elastic Load Balancing automatically checks the health of your load balancing Amazon EC2 instances. You can optionally customize the health checks by using the elb-configure-healthcheck command.” You can do this in the RightScale interface as well. You can either accept the default, which checks for the presence of a TCP listener on port 80 (target="TCP:80"), or give it something like "HTTP:80/path/to/my/image.jpg" that returns a “200 OK” when all is well. The default seems to work surprisingly well for these particular CPU-intensive activities that we’re clustering. We don’t see one server with a load of 0.3 while another is at 4. We do see some occasional differences, but they seem to even out pretty fast. We’ll get more precise if the differences start to get out of hand.
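If you go the HTTP route, the balancer only cares about getting a 200 back, so any cheap handler will do. Here’s a minimal sketch (the servlet and its /healthcheck path are our own illustration, not anything ELB requires):

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Map this to /healthcheck and set the ELB target to "HTTP:80/healthcheck";
// the balancer only looks at the response status.
public class HealthCheckServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        resp.setStatus(HttpServletResponse.SC_OK);
        resp.setContentType("text/plain");
        resp.getWriter().write("OK");
    }
}

A smarter version would return a 500 when some dependency the box needs is down, so the balancer stops sending it traffic.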

FoxyProxy Cloudera Config

When you have smashed your head into the table trying to get the included .pac file to work for Cloudera’s EC2 Hadoop setup and want something that works properly in FoxyProxy, simply use the following URL patterns (available in text below the graphic for your cut/paste pleasure):

[Screenshot: FoxyProxy configured with the URL patterns below for Cloudera’s EC2 Hadoop setup]

As promised,

*://10*
*ec2*.amazonaws.com*
*ec2.internal*

Moving to another cloud

We are in the process of migrating one of our backend data-processing servers from a legacy hosting company in NYC to Contegix.  What’s unusual about this transition is that we’re moving the machine onto Contegix’s new cloud platform rather than to a traditional server.  We’ve noticed a few things already.  When we were copying over a huge backup of our databases, we noticed that they were transferring across the network from NYC to St. Louis at 93Mbps, which is not frigging bad!  As I write this, we’re loading over 100GB of data into a MySQL server on our new Contegix cloud machine at ~30K blocks/second (as measured by vmstat), which means that this thing has lightning-fast I/O… not surprising since the storage is on an EqualLogic SAN (Update: we later saw this increase to ~70K blocks/second).

The differences between this cloud platform and EC2 (which we still use for some other needs) are striking.  The application that we will host on this new VM sometimes needs a lot of memory.  With Contegix, we can grow that all the way up to 128GB with 32 cores.  Amazon doesn’t even come close: their max is 15GB.  Sure, you can figure out how to distribute your application over a bunch of hosts.  But sometimes you just need 20GB of memory and all the problems go away.  Plus, we don’t have to compete for these resources: they’re guaranteed to us.

I also like the fact that the machine doesn’t disappear into oblivion when it reboots, which is a feature (?) of EC2 instances.  We can also grow our storage on this platform well past any point I care to think about.  Plus, we get all the Contegix support that we want if we choose to do crazy things with this host.

The virtualization technology is VMware ESX, which is darn cool stuff (having just set it up on an integration server here a week or so ago, I have to say that I like what I’ve seen so far).  We’ve already seen our VM get hot-migrated to another physical box in order to maximize the resources available to us.  Things got slow for a little bit, but then they got lightning fast.  I think we were copying data into the machine at that point and saw no impact on open connections, etc.  Don’t ask me why, but I’m still surprised that this works reliably.

So far so good.  We’ll report back with more later.

Warning of the Day

I would like to nominate this stack trace for the Warning of the Day award:

java.lang.NumberFormatException: For input string: "Fuck"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:403)
    at java.lang.Long.valueOf(Long.java:491)
    at java.lang.Long.decode(Long.java:634)

It was caused by the following IP address from the request header of an actual StyleFeeder user:

Fuck.You.Iran.Government, 10.94.117.124

The latter IP address (masked here to protect the innocent) geolocated to the Netherlands, in case you’re wondering. Some kind of anonymizing proxy, maybe?
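For anyone wondering how a string like that ends up inside Long.decode: headers like X-Forwarded-For are client-supplied, comma-separated, and completely untrustworthy. A purely illustrative defensive version (not our actual code) that only accepts entries that at least look like IPv4 addresses:

import java.util.regex.Pattern;

public class IpHeaderParser {

    private static final Pattern IPV4 =
            Pattern.compile("\\d{1,3}(\\.\\d{1,3}){3}");

    // Return the first IPv4-looking entry in a comma-separated header
    // value, or null if there isn't one.
    public static String firstIpv4(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        for (String part : headerValue.split(",")) {
            String candidate = part.trim();
            if (IPV4.matcher(candidate).matches()) {
                return candidate; // "Fuck.You.Iran.Government" never gets here
            }
        }
        return null;
    }
}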

This user proceeded to view one item…

OiOi Sophisticated Baby Bags Giraffe Print Messenger

…then started the signup process, but unfortunately didn’t go through with it.

More about Iranian shoppers in a moment.

[Edit: The registration service is in the Netherlands, but the address was actually registered to a company somewhere else in the world (and not in Iran either). The challenge of accurate geolocation looms.]

final

I saw this blog post referenced on Hacker News today and thought I’d pounce on one of Stephan’s opinions, which is that the final keyword in Java shouldn’t be used except on fields. I can’t disagree strongly enough with this. I first came across this advice in the O’Reilly Hardcore Java book about five years ago. The book dedicates a whole chapter to the final keyword, which initially surprised me; I didn’t think that the author could possibly have that much to say about it! However, the main idea is “enforced documentation,” which is an argument that really sold me on the concept. I find that whenever I see final on a method parameter, a local variable or whatnot, I have one less question about how that object reference will be used, especially when I’m reading someone else’s code. I know whether the developer intended the variable in question to be reassigned. I find this to be immensely helpful.

Stephan’s argument is that it hurts readability, but I don’t find that to be the case at all. In fact, we make adding final part of the automated clean-up step in our Eclipse setups at StyleFeeder. Does final help prevent bugs? Occasionally, yes, but those cases are rare enough that I hesitate to put bug prevention forth as the main argument in its favor.

The Perl-ish argument against this is to “avoid putting bars on the windows”, since you never know how things may need to be used in the future. However, that simply doesn’t hold up as a solid argument 99% of the time. The two common use cases for sprinkling final into your Java code are on method parameters (which you should normally not be assigning to) and local variables. In the case of local variables, they are local and the impact of putting final on them is entirely contained within that scope. I have yet to see a reason why this would yield any unwanted side effects. Of course, making classes and methods final is a Big Decision and not what I’m talking about here.
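To make the “enforced documentation” point concrete, here’s a trivial, purely illustrative example:

import java.math.BigDecimal;

public class Pricing {

    // final on parameters and locals is enforced documentation: the reader
    // (and the compiler) knows these references are never reassigned below.
    public static BigDecimal applyDiscount(final BigDecimal price,
                                           final BigDecimal rate) {
        final BigDecimal discount = price.multiply(rate);
        // discount = BigDecimal.ZERO;  // would not compile
        return price.subtract(discount);
    }
}

Six months from now, whoever reads this method has three fewer questions to answer before changing it.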

doing a tail on rotating log files

We learned something today: not something new, but new to us. If you do a tail -f to view the end of a log file, all goes great until your logging system rotates the file out. Then you’re stuck wondering whether the program has halted, your ssh connection died, or you hit Ctrl-S and froze your terminal.

But if you use tail -F (note the big F) it will check to see if the file has been rotated out and another put in its place, and will resume tailing on the new file. Happy tails to you!

The wasteland between Harvard and MIT

While sitting at my desk, if I turn my head to the right, I can see the famous MIT dome and part of the roof of the Stata Center.  If I turn my head to the left, I can see parts of the Harvard empire.  StyleFeeder is located between two of the most famous universities in the world and within a stone’s throw of one of the birthplaces of the Internet (the other two being CERN and UIUC).  The amount of bandwidth running through the fiber cables buried beneath the sidewalk under my feet as I walk around Central Square makes this one of the most wired places on the planet.  Indeed, Forbes ranks Boston fifth in the US for “wiredness.”

One would think that the options for getting blazingly fast Internet access at the StyleFeeder offices would be plentiful and cheap, right?

I live one town over from Cambridge and I get very reliable, very fast (20Mbps down / 2Mbps up) Internet access via the monopoly cable provider available to me as part of a bundle that probably breaks out to ~$40-50/month.  It’s actually great.  I don’t have major complaints.

Zip back over to our StyleFeeder office in Cambridge and the best we have available to us is a crappy Verizon DSL connection (ostensibly-but-not-really 7Mbps down and something stupidly slow upstream) for $200/month.  Frigging wonderful.

Occasionally, Comcast drops off flyers advertising service in our building.  Our entire company is under strict orders not to let any Comcast employee seen on our premises leave until they can get us a Comcast rep on the phone who is both able and willing to sell us Internet service.  As much as we’d like to take up Comcast’s offer to pay them for reliable, fast Internet service, they historically have not returned our phone calls and generally ignore us. One of my friends who lives literally a block away from our office has good Internet service from Comcast in her house, so this must be possible.

In the meantime, I’m left staring at MIT wondering why on earth I can’t tap into the wires that are running underneath our building.

DNS performance redux

It’s no secret that we have spent a lot of time on performance at StyleFeeder, mainly because it’s one of those things that you end up addressing when you’re scaling, but also because it can yield very tangible results for user metrics.  It turns out that people really like using fast websites.  Go figure.

About a month or two ago, we looked in more detail at the chain of events that occurs when a browser hits our site to see if there were any possibilities for making things better.  This is increasingly hard for us to do given that we’re already serving stuff up pretty quickly, especially when you consider the size of our dataset and the sparse nature of our requests.

Anyway, we noticed some room to improve our DNS numbers, which I mentioned in a post last April.  Shortly thereafter, I met up with Jeremy Hitchcock from Dynect and he said that Dynect was definitely a lot faster than anybody else.  How could I resist a claim like that?  Especially if it was true, since it had some obvious benefits for StyleFeeder :)

So, I ran some tests using Pingdom for the entire month of June, 2009 that compared Dynect, Enom, DNSMadeEasy and Jerky.  Huh?  Jerky?  What’s that?  It’s our control point; more specifically, it’s a personal server owned by Eric: an old, underused single Pentium 4 with ~2GB of memory sitting on a Cogent network in a datacenter in Waltham, MA.

I mentioned my intent to do this testing to Dharmesh a few months ago, and he gave me permission to test OnStartups.com and hubspot.com as part of my experiment, so those are the domains I used to compare the services.

The short story is that I took the detailed logs from Pingdom, processed them with R and made some pretty graphs.  The raw numbers are here, but you can click through the graphic to see it in more detail.  Basically, I threw away the slowest 0.5% of the requests and made histograms of the frequency of the request times.  Note that the X axis has the same scale in all cases, so you can compare easily.
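The processing itself happened in R, but the trimming step is simple enough to sketch in a few lines of Java (illustrative only; the 0.5% figure is the one mentioned above):

import java.util.Arrays;

public class TrimSlowest {

    // Sort the response times and drop the slowest fraction (e.g. 0.005)
    // before bucketing what's left into a histogram.
    public static double[] dropSlowest(double[] timesMs, double fraction) {
        double[] sorted = timesMs.clone();
        Arrays.sort(sorted);
        int keep = (int) Math.floor(sorted.length * (1.0 - fraction));
        return Arrays.copyOfRange(sorted, 0, keep);
    }

    public static void main(String[] args) {
        double[] trimmed = dropSlowest(new double[] {31, 35, 37, 40, 2500}, 0.005);
        System.out.println(Arrays.toString(trimmed)); // the 2500ms outlier is gone
    }
}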

Click for the full DNS performance graphic

As you can see, not only does Dynect have the fastest response times by a wide margin, but their times also exhibit a much smaller standard deviation than anything else that I tested.  (I should note that I didn’t test UltraDNS because they are stupidly expensive and I had a bad experience with them in the past, but they would be worth looking at if you are doing a bakeoff.)

Enom’s DNS service is included with their domain registration service, while DNSMadeEasy costs money.  All in all, Jerky did pretty well considering what it is.  Obviously, you don’t get redundancy out of a single box, and the commercial DNS vendors give you other features that are very much worthwhile.

But the real news here is that Dynect demolished the other participants in this test with a 37ms mean response time, less than half that of the first runner-up.  It’s not every day that you can chop 40-70ms off of your mean response time, so when those opportunities arise, it’s definitely worthwhile.  Companies like StyleFeeder that have tons of new users who don’t have these DNS entries in their resolver caches will definitely benefit from the speedup.

If you don’t like my test, I’m happy to run it again under different conditions, so let me know if you think there’s a flaw in how I processed the data.

StyleFeeder on the 4th

Last year, Kilby took some great shots of the Boston fireworks and MIT selected one of them for use on their homepage this weekend.  The photo was taken from StyleFeeder’s office in Central Square in Cambridge and shows the famous MIT dome underneath some fancy pyrotechnics.  You can see more from this year on his Flickr photostream.

[Photo: Kilby’s shot of fireworks over the MIT dome, as featured on the MIT homepage]

10 Steps to a Better Data Feed

Here at StyleFeeder we work with a staggering number of merchant data feeds from affiliate networks and other partners.  The data quality of these feeds varies quite a bit, sometimes by sins of commission (data where it doesn’t belong), sometimes by sins of omission (leaving out important information).  In an effort to get the word out, we’ve produced a top 10 list for retail merchants creating product data feeds.  This is not a comprehensive list but a quick overview.

1.  STOP YELLING!
It’s easy for us to capitalize your product names and categories for emphasis on the page, but very hard to do the opposite and get the original data back.

2.  Item names should say what they are
If they’re pants, put that in the name.  If they’re wedge sandals, put that in the name.  If it’s a notebook computer, put that in the name.  This matters especially when there are multiple items in the shot, like a belt displayed with a top and pants.

3.  Keywords should be descriptive and product-specific
Not a repeat of the item name, not all the words from the long description with delimiters between, not keywords about the store that don’t apply to the product.

4.  The longer the description, the better
People browsing your affiliates’ sites want information, and the more you give them the more will click through and the more will buy.

5.  Categories should be set, and contain the relevant information
If things have gender relevance, include that in the category names.  If you sell different types of items, the category should reflect what type each one is.

6. Brand fields should be filled, and consumer friendly
Customers like to search and browse by brand, and we can’t do this if the field is blank.  Also, customers don’t know your brand with “Corp” or “Inc” or things like that tacked on the end, so fill your feed with the name they know.

7.  Use pricing fields in a standard way
If all of your items have a sale price filled in, it’s probably not a sale and it should probably be in the regular price column.  Save the sale price column for specials.

8.  If an item doesn’t have a working image, leave the image URL blank
We see 404 errors, “image not available” images, and store logo images.  If you can’t leave them blank, use the same “noimage.gif”-type URL for all the broken ones so we can code around it.

9.  Use identifier columns
UPC when available, ISBN for books, Manufacturer Part Numbers that you’d use to order the item from the manufacturer.  And consistent SKUs for your own store that stay the same over time for the same item.

10.  Talk with your affiliates!

This goes without saying, but just like shoppers are the customers of your products, your affiliates are the customers of your feed.   We may have good ideas, we may have terrible ideas, but either way we may tell you something you haven’t thought of.
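Several of these rules are mechanically checkable before a feed ever goes out the door.  Here’s a purely illustrative sketch (the field names and rules are simplified, and this is not our actual pipeline):

public class FeedLint {

    // Rule 1: a name that contains letters but no lowercase is probably YELLING.
    static boolean isAllCaps(String s) {
        return s != null && !s.equals(s.toLowerCase())
                && s.equals(s.toUpperCase());
    }

    public static void lint(String name, String brand, String imageUrl) {
        if (isAllCaps(name)) {
            System.out.println("Rule 1: item name is all caps: " + name);
        }
        if (brand == null || brand.trim().length() == 0) {
            System.out.println("Rule 6: brand field is blank for: " + name);
        } else if (brand.matches(".*\\b(Inc|Corp)\\.?$")) {
            System.out.println("Rule 6: brand has a corporate suffix: " + brand);
        }
        if (imageUrl != null && imageUrl.endsWith("noimage.gif")) {
            System.out.println("Rule 8: placeholder image URL for: " + name);
        }
    }

    public static void main(String[] args) {
        lint("GIRAFFE PRINT MESSENGER", "OiOi Inc", "http://example.com/noimage.gif");
    }
}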