Happy 4th From StyleFeeder

green fireworks over MIT dome

Find more about jewelry store.

Happy 4th of July again from StyleFeeder. Once again we were able to use our great view to observe the Boston 4th of July fireworks display. Hope your 4th was as good as ours.

Hadoop for the lone analyst, part 2: patching and releasing to yourself

We left off, in part one of this series, at the point where we had Hadoop running with the Cloudera distribution, version 0.20.1+152. That’s Apache Hadoop release 0.20.1, plus 152 patches that Cloudera’s copious experience tells them they need for work in the real world. But perhaps we’re using, say, Hadoop streaming, and we read something like this, or any one of a dozen comments on the mailing list, which tell us we might need a patch that isn’t among the 152, and we can’t wait for Cloudera to have a need for that same patch. What then? Our approach will be to apply more patches and build the rpms again, so we can continue to use the nifty scripts, and take advantage of the tested base patch set, to which we have become accustomed. Once again, we will need to know a little about a lot of things, so a detailed HOWTO, assuming relatively little expertise in any one area, is called for.

To recap what we had done, and our current base technology stack, we had this:

  • A local CentOS 5 virtual machine, with some extra Python, jdks and other libraries, as detailed in part 1.
  • An Amazon Web Services account, some Cloudera scripts and configuration files, and the ability to run the hadoop-ec2 script to launch and terminate clusters, of pretty much whatever size we want.

We are using Hadoop 0.20.x, because 0.18.x is missing some features we want, and versions 0.21 and beyond are not fully baked. Not only are they not fully baked: as of version 0.21, the core of Hadoop has been split into three projects, and there isn’t even a script (well, there’s a script in patch form!) to build them all together and run them. That’s too much adventure for a lone analyst! Version 0.20 will do just fine.

Now we’ll need a few more things, starting with an rpm/yum repository. This is pretty easy. If you have never looked under the hood of these, look in /etc/yum.repos.d, at any of the .repo files. Grab a ‘baseurl’ link and look at it in a web browser. Look at a few existing ones to get a sense for the naming conventions, which mostly have to do with chip architecture: {noarch,i386,i586,i686,x86_64}. You will need a location to run an Apache httpd server, or something like it. We’re not going to explain how to do that, because it’s too big a topic, but it’s not hard at a basic level, and you can follow these instructions to set up password protection for the root of the directory tree where you’re going to put the rebuilt rpms. Even if you were the type of person who wanted to fork Cloudera’s distribution publicly for no good reason (and we hope you’re not that type of person!), you will typically have reasons that your favorite combination of patched features should not or cannot be released to the world. For example, we want to do this compression thing that Kevin Weil and the guys at Twitter figured out, and the licensing is reciprocal, and therefore not Apache-compatible. So for that and other reasons, we’ll want to keep our little Frankenstein to ourselves. If you know what you’re doing in this area of intellectual property, you won’t need these admonitions. If you don’t, please believe me: you won’t want the public shaming you’ll get, along with probably other nasty consequences, if you start forking and publishing willy nilly.

Final details on the rpm repository: you can use the gpgcheck=0 (i.e. do not check) option for gpg keys for now, what with the password setup. Not a good idea long term, but for the moment we’re focusing on getting this up and running. (Proper operation of cryptographic programs is another big topic we do not want to get into.) To get a certain URI to function as an rpm repository, you need to run the ‘createrepo’ command, with the file-system directory corresponding to that URI as its only argument. This should be somewhere under the level at which you’re password-protecting everything. We have a couple different private repositories going now, one for jdks, one for patched Hadoop. The ‘createrepo’ command creates metadata files about the rpms that are in the repository. Every time you add one or more rpms, run ‘createrepo’ again, with the same argument as before. It’s not the most aptly named of commands. It should be called something like ‘create-repo-or-overwrite-repo-metadata’.

Now we have a destination for our rpms, and we need to create them. Which ones? It depends on what you’re using. If we go to a running Cloudera-style EC2 Hadoop cluster and do something like this: “yum search hadoop | grep hadoop-0.20 | cut -d’ ‘ -f1”, we get a full list, for which we would be able to find the rpms at this location, like so:


Doing ‘yum info’ on all of those, we find that by default the name node only has the first item, and the slave nodes also have conf-pseudo, datanode, and tasktracker. Those are all ‘noarch’ packages, so unless we have a reason to use pipes, native, debug, or libhdfs, we won’t need specific chip architectures to get this working.

If you’ve never worked with rpmbuild, it’s not very difficult. There are lots of HOWTOs. Long story short, as root you need to do “yum install rpm-build”, and then as yourself (DO NOT DO THIS AS ROOT), create a file called ~/.rpmmacros containing something like this:
%_topdir /mnt/usr/rpmbuild
%_tmppath %{_topdir}/tmp

Inside whatever you have for %_topdir, “mkdir {BUILD,INSTALL,RPMS, SOURCES,SPECS,SRPMS}”, and inside %_topdir/RPMS, “mkdir {noarch,i386,i586, i686,x86_64}”.

Next, you will want to see if you can add a patch and rebuild the Cloudera distribution. One thing at a time. Let’s rebuild the distribution with no changes, and make sure that works. Find the download link from this page (currently this). You need a 64-bit CentOS 5 for this. Remember I said we would have a use for that in part one? 64-bit anythings are memory hogs, so we do this on EC2. On our hadoop builder machine, my .bashrc file looks like this:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-sun
export JAVA32_HOME=/usr/lib32/jvm/jdk1.6.0_14
export JAVA64_HOME=/usr/lib/jvm/java-1.6.0-sun
export JAVA5_HOME=/usr/lib/jvm/java-1.5.0-sun
export FORREST_HOME=/mnt/usr/apache-forrest-0.8
export ANT_HOME=/mnt/usr/ant/apache-ant-1.7.1
export PYTHON_HOME=/usr/lib/python2.5
export ECLIPSE_HOME=/mnt/usr/eclipse/eclipse-europa
export JDIFF_HOME=/mnt/usr/jdiff-1.1.1
#export XERCES_HOME=/mnt/usr/xerces-c/xerces-c_2_8_0-x86-linux-gcc_3_4
export XERCES_HOME=/mnt/usr/xerces-c/xerces-c_2_8_0-x86_64-linux-gcc_3_4
export FINDBUGS_HOME=/mnt/usr/findbugs-1.3.9

That list includes all the libraries you need to build the regular Apache project. You will have to find them on the internet and download them. Look at the .spec for what else you’ll need: you’ll need git; you’ll need to do “yum install ant ant-nodeps”. This will install (for now), ant 1.6.5 at /usr/bin/ant. Note that this is not the ant we are using for the build. We’ve been very naughty and just moved /usr/bin/ant to /usr/bin/ant-1.6.5, so we can have /usr/bin first in our PATH some of the time, but always get the ant we really need, and also have rpmbuild work. Yum will whine at us at some point in the future, but we are not going to worry about that for now. Is that enough of a web of dependencies for you? It’s a good thing we’re doing this on EC2, so we don’t have to live with this slightly freakish box too much of the time.

You should also download the source distributions of automake 1.10 and autoconf 2.60. The system automake and autoconf on CentOS 5, for now, are 1.9 and 2.59 respectively, and various parts of the Apache and Cloudera builds/rpms need one or the other of these pairs of tools. It’s easy: untar the source file, cd into the directory, “./configure; make; make install” (as root). This will deposit the more advanced versions of automake, autoconf and aclocal in /usr/local/bin. When the builds complain at you for something in the autotools, rejigger your PATH by putting /usr/bin or /usr/local/bin first.

So now we have an intact base CentOS 5 system, with alternate jdks (Sun 1.6 64 bit, Sun 1.6 32 bit, and Sun 1.5), alternate Python (2.5, could be 2.6) and alternate GNU autotools. The Cloudera guys helped us out here when we got stuck while trying to do this on Fedora 8. That’s an example of what we meant in part one, about rat holes you could go down, if you don’t want to follow these directions step by step. I’m fairly confident this all could be made to work on Fedora, or Debian, or Ubuntu as well (and certainly with less messing around with Python), but now that we have it working on CentOS, we’re not going to try any of that. It turns out CentOS 5 is what Cloudera uses to do their builds, and that is good enough for us.

Get a shell with an environment as in the .bashrc file, set an environment variable in your shell called “FULL_VERSION” to whatever the full patched version number is (in our case this would be “export FULL_VERSION=0.20.1+152”). Cd to the untarred distribution directory, and do cloudera/do-release-build. You may have to monkey with your PATH, as indicated above. If you don’t set FULL_VERSION, it will build, but there will be an odd directory in the build tree, and rpmbuild will complain later because you don’t have the Sqoop documentation. So remember to do that. When you have the built source distribution file, you are ready for the rpm work.

Download the source rpm from here. As yourself, do “rpm -i hadoop-0.20-0.20.1+152-1.src.rpm”. That will unpack various files into the subdirectories of %_topdir. Copy your built source distribution file over what it puts in %_topdir/SOURCES. cd to %_topdir/SPECS, and check out hadoop.spec. The command you need to run is “rpmbuild -ba –target noarch hadoop.spec”, but it didn’t just work for us. Our java/jdk package names (remember the jpackage process we mentioned in part one?) are not the same as Cloudera’s, so we had to edit the dependency/prerequisite “jdk >= 1.6” to be something that fit with our naming convention. We had some problems with a Python rpm script that wanted to use the system Python, and wouldn’t work if it ran in a shell that had the PYTHONHOME and corresponding PATH item from our .bashrc settings as above. So we commented out the PYTHONHOME thing, which we had needed for running the cluster and building, but now foils us on the rpm side. There’s a lot of re-arranging your PATH in this process: it’s kludgy, we know. But we got a new shell and it just worked, and built all the rpms we need, under %_topdir/RPMS. Awesome!

Now you’re ready to patch, so go back to where you were doing the building. We made a recursive copy of the untarred distribution directory, and cd’d down into that. Under the root level there is a ‘cloudera’ directory, with some scripts, documentation, and all the patches they have applied. We could have re-applied all the patches from scratch to a virgin Apache distribution, but we really just needed to add one thing the first time, so we downloaded that patch file, copied, renamed and modified their apply-patches script (let’s call this ‘apply-one-patch’), like so:

#!/bin/sh -x
set -e
if [ $# != 3 ]; then
echo usage: $0 ' '
exit 1
TARGET_DIR=`readlink -f $1`
PATCH_DIR=`readlink -f $2`


# We have to git init...
git init-db
for PATCH in `ls -1 $PATCH_DIR/$PATCH_FILE` ; do
git apply --whitespace=nowarn $PATCH

and then ran it with our patch file as the last argument. TARGET_DIR needs to be the ‘src’ directory under the untarred distribution root. Plain old unix ‘patch’ would no doubt work as well. Now we need a naming convention for our version number. Guessing that things will just work with increments, on both the building and launching side, we go with “export FULL_VERSION=0.20.1+153”. When I’m not a busy lone analyst with patterns to find, I’ll read those scripts and figure out some more sensible thing, but I’m trying to get back to work here! The last official release increment from Cloudera went from +133 to +152, so we’re not too worried about a future name collision. Then we execute ‘do-release-build’, and it seems to work. If you’ve done all this, then under the ‘build’ directory, there should now be a gzipped tarball of your new distribution.

Next, we copy that tarball into the %_topdir/SOURCES directory, and then go back into %_topdir/SPECS. Editing hadoop.spec, we tell it our new version number. Then once again “rpmbuild -ba –target noarch hadoop.spec”.

Now everything should be built, so you can copy the rpms to the ‘noarch’ subdirectory of our rpm repository root on your password-protected http server, and run ‘createrepo’ again over there.

At this point, you just need a way for a cluster in the process of launching to find that repository. You can either put a .repo file in a private ami (your password would need to be in that .repo file, so the image would need to be private), or make it part of the script to put that file everywhere as the cluster is launching. Locally, edit your ec2-clusters.cfg file if needed, and fire away. Do “hadoop-ec2 login mycluster” and do “yum info hadoop-0.20”. You should see a “hadoop-0.20.noarch” package with your new version number.

The big data world is your oyster, and now you can fine-tune your tools. We hope this helps our fellow lone analysts out there.

Better Categories in Data Feeds

Category data is useful in creating navigation structures for affiliate sites that consume data feeds.  Sadly, there is a major lack of standards and practices in the filling in of category fields.  Here are a few pointers:

1. Don’t reinvent the wheel

There are examples of rich category hierarchies out there, be it Amazon/Zappos/PopShops/GoldenCAN that can be leveraged as a starting point.

2. Work from top to bottom

Even if your store is specific to a vertical, include that vertical at the beginning of your categories.  Longer and more descriptive is better, just work from general to specific.  Think Shoes>Women>Sandals>Slide rather than just Slides.  It avoids ambiguity and is much richer.  Slides could also be found in a playground.

3. Make your levels consistent

If your store is 90% womens’ products and 10% mens’ don’t have rich categories for the former and just Mens’ for the latter.  Make the hierarchy the same on both sides, even if the latter may be missing categories (Apparel>Men>Dresses is not so commonly found as the reverse).

4. Think about the customer

If your item is specific to Women/Men/Boys/Girls/Toddlers/Infants/Dogs/Cats/etc put that in there, so segmented affiliates can extract what they need and things can be tagged appropriately.

5. New! and Sale! are not descriptive of the products

Having a separate new products feed, sale products feeds, top sellers feed, or other specialized feeds is a fine idea.  So is offering RSS for specials and freshness.  But using those things as the only category for an item does that item a dis-service.  The item’s category no longer describes what the item is.  These should therefore be avoided.

6. Don’t leave them out

Even if you only sell one sort of thing, put the same value for the category for all of them.  More data is always better than less.



Here at StyleFeeder, we spend a lot of time figuring out what our users are doing, and trying to figure out what they want. One of the tools we have brought to bear on these questions is Hadoop. Among the technical tools these days, Hadoop is like the prettiest girl in school, and it’s easy to think you should be bringing her to every conceivable dance. You shouldn’t: there are plenty of problems that Hadoop can’t solve, or for which there are better tools. But there are some problem spaces where it excels: web analytics and preparation for search, to name two. This post is informed by our use of it for web analytics.

This is a long piece, but I figured we might as well get this all up in one place. To skip straight past the blather and into the HOWTO, go here.


One of the great promises of Hadoop is that an analyst working alone can operate what formerly required a dedicated cluster with special-purpose analytical software, and a team of sysadmins and systems programmers who know C and possibly Assembler. This analyst is not an end-user, in the traditional sense, but a person who knows some unix and a scripting/declarative language or two, say, Python and SQL. Such a person, in a big company, used to have to requisition hardware in, say, January, and try out maybe one pet theory, if he’s lucky, in April. Now he can try out an idea in a week, or a day, or a few hours. He can try out his theories almost as fast as can think of them. That’s what we’re after. Within a defined but still very large problem space, Hadoop seems to deliver on that promise, and we’re pretty excited about it. We don’t have red tape like that, but we don’t want to spend a lot of money on a team of cluster operators either.


At StyleFeeder, our richest sources of information about user behavior are MySQL tables full of events, and web server logs. We could dump these all into a MySQL database, but MySQL is nobody’s favorite choice for a data warehousing platform, in part because the joins and the aggregate functions are so slow and primitive. We could dump everything into a database that’s better suited to analytics, but that would probably involve license fees and a commitment to a finicky platform with a much more particular sweetspot in terms of problem space. We’re not really sure where we might need to go with these analytics, so we want to keep our options open. Hadoop is a sort of Swiss army knife for big data, whether structured or not, so it looks like it will do for now. Finally, we don’t have to lease or buy a bunch of hardware: we can run everything on Amazon Web Services EC2 or EMR (Elastic Map Reduce), and just pay for what we’re using.


The rest of this post will be very specific, and include a lot of “obvious” things. You need to know a little about a lot of things to get the setup just right, and we’re conceiving of the audience for this as the analyst who doesn’t necessarily know how best to install all the software he can use competently. So bear with us on all that. This is really a catalog of the things we wish we had had somebody to explain to us, as we fumbled our way through.

We used Cloudera’s scripts to launch an EC2 cluster, and ran some “Hello, world”-type programs. We counted the incidence of all the words in Shakespeare and the Bible–it works! We subscribed to a few Apache Hadoop mailing lists. As you might expect in a very popular, pre-1.0 project, it’s a community in a state of rollicking ferment. Note to self: we may need to apply some patches, if we run into trouble down the road. Is that a problem? Short answer, no, but it’s not going to be perfectly straightforward.

These Cloudera Python scripts have been contributed to the trunk of Hadoop, but they are not yet in the 0.20 or 0.21 branches. They are somewhat generalized, in that form, relative to the versions we are using, so for example the configuration directory is called “.hadoop-cloud” rather than “.hadoop-ec2”, with associated actual changes. In this discussion, we’re talking about the separately released Cloudera scripts, which have been around for a while. There are also bash ec2 scripts that seem to peter out at version 0.19 in the Apache distribution, no doubt because they have been superseded by the Python scripts. The version we’ve been using is available here: http://archive.cloudera.com/docs/_getting_started.html.

There’s a lot of talk on the Hadoop mailing lists about administering what are obviously dedicated Hadoop clusters. These are people who are much more committed to running Hadoop than we are, and who have it running constantly. That’s not really what we’re after, so we’re going to pay attention to what they’re doing, but not really do it the same way.


This will probably get out of date quickly, but here goes. One of the mildly frustrating things about doing all of this has been that most of the examples out there are for version 0.18.3 and lower, and they don’t “just work” with more evolved versions of Hadoop. There’s a dearth of HOWTOs for version 0.20 and beyond, so this will be one.

  • Hadoop runs on a few platforms, but most of the action in the community is on Debian/Ubuntu or Redhat/Fedora/CentOS. Let’s stick with what we’re going to be able to get help with. If you want to do exactly what we did (and trust me, diverging in small ways from this can lead you down some rat holes), start here: have a local CentOS 5 machine, or (as in our case), a VMWare-based virtual machine running CentOS 5. 32-bit will do (that’s what we have), but we will have some uses for a 64-bit machine. In our case, we run a 32-bit machine locally, and fire up an EC2 instance when we need to do some 64-bit builds.
  • Install python 2.5 or 2.6. CentOS 5 needs the system-installed rpm of /usr/bin/python 2.4, or the package manager will complain. There are a couple of rpm packages out there for python 2.5 or 2.6 on CentOS 5. Long story short, they didn’t work for us. As root, download the Python source to /usr/local/src; ./configure –prefix=/usr/local/python2.5; make; make install. In your own (non-root) .bashrc or equivalent, set PYTHONHOME to /usr/local/python2.5, and PATH to $PYTHONHOME/bin:$PATH.
  • Install boto and simpleson, which are Python libraries. easy_install simplejson seems to do the trick, at least for now, but the EC2 launcher scripts, although they say “boto 1.8 or higher” will complain like stuck pigs if you get boto 1.9. So download the source for boto 1.8, temporarily set root’s environment (PYTHONHOME, PATH) as above, and do ‘python setup.py’ build and install.
  • If you don’t have one, set up an AWS/EC2 account, and get your ssh keys sorted out for that, so you can shell out to your EC2 instances.
  • The README in the EC2 launcher scripts tells you in detail how to configure your local environment to launch a cluster. Use the fedora ami that Cloudera makes available. If you want something special for your data processing, like setting the timezone to Central European Time, or whatever, you can launch an instance off that ami, make whatever changes you need, and bundle it as a private or public image for yourself. Then change the id of the ami in the config file. The Cloudera amis have Sun java 1.6 on them, which you will need. Hadoop is not pre-installed, but rather installed by the scripts from their yum repository. We have our own internal procedures for installing java, involving the jpackage process, and if you’ve got something of your own, you can probably just launch an appropriate image, install a couple of things and be off to the races. YMMV. The Cloudera images work fine.
  • When you have a cluster configured in a local text file (let’s call it ‘mycluster’, for future reference), do “hadoop-ec2 launch-cluster –env REPO=testing –env HADOOP_VERSION=0.20 –user-packages ‘s3cmd lua lua-devel lzo-devel lzop’ mycluster 3”. Supply whatever user packages you want. It’s going to run “yum -y install packagename” on all the nodes. There are reasons for the ones on that list, which we’ll see later. This will launch a name node (the central controller) and 3 slaves. Since the data in Hadoop is replicated 3 times by default, this seemed like a good test cluster size. Why not run locally? We did, for a while, and we still do sometimes, but certain things just don’t work the same. If you’re new to Hadoop, you can get very frustrated by trying out things that work locally in tests and then fail in the cluster. Try out a small cluster until you get the hang of it. For tests, use an instance size of m1.small, but switch that to at least c1.medium when you start crunching real data. Why version 0.20? Version 0.18.3 is the stable one right now, but by last December version 0.20 was already close to being declared stable by Yahoo, and lots of people have it in production. You get Sqoop, you get more advanced versions of Hive and Pig, and the API is closer to what it will be going forward. Odds are that 0.20 has some features you’ll want.
  • Try out a few things. First, from your local box, run “hadoop-ec2 update-slaves-file mycluster”. This will update a file called $HADOOP_HOME/conf/slaves on the name node. Then do “hadoop-ec2 login mycluster”. This will give you a shell on the namenode. Then do “slaves.sh -o StrictHostKeyChecking=no uptime”. This will get ssh’s known-host checking out of the way once and for all (or at least until you terminate this cluster and launch another one). From this point on (on this namenode), you can do “slaves.sh uptime” to see how hard the slaves are working. Go into the /etc/passwd file and change the ‘hadoop’ user’s shell from /sbin/nologin to /bin/bash. Then do “su hadoop -“. Now try “hadoop dfsadmin -report”. You’ll see how many nodes you have, and how much disk space is available. You’ll want to keep close tabs on that, when you start slinging real data around. “hadoop job” will show you a few useful commands, that will allow you to see how far the jobs are from completion. There’s a web interface, but we find ourselves frequently having to restart Firefox and a proxy to get that to work. We’re probably just doing something stupid in that area, but the command line always works, so we’ve gotten used to relying on that.
  • Inspect things a bit more. $HADOOP_HOME (/usr/lib/hadoop, if you’re following along with these procedures) has files that will show you that what you’re running is (for now) Hadoop version 0.20.1+152, which means Apache Hadoop version 0.20.1, plus 152 patches that Cloudera has applied or backported to the Apache release. Thank you, Cloudera! Follow the trail in /etc/yum.repos.d/cloudera-testing.repo, and unpack the rpm if you’re curious. Just reading the list of what’s involved makes you think you wouldn’t want to be running this without them.
  • Try streaming, or Pig, or Hive, or java-based map-reduce. A trivial streaming example will be the easiest to get working. Take a text file, put it in hdfs (“hadoop fs -copyFromLocal filename.txt /user/test/input/filename.txt”), and use “cat” as your mapper and “wc -l” as your reducer.
  • Modify your environment a little, to make things tolerable. My .bashrc contains this section:

    alias hcat='hadoop fs -cat'
    alias hcp='hadoop fs -cp'
    alias hget='hadoop fs -get'
    alias hgetmerge='hadoop fs -getmerge'
    alias hls='hadoop fs -ls'
    alias hmd='hadoop fs -mkdir'
    alias hmv='hadoop fs -mv'
    alias hput='hadoop fs -copyFromLocal'
    alias hrm='hadoop fs -rm'
    alias hrmr='hadoop fs -rmr'


Cloudera has a great example of web server log file processing with Pig and streaming here: http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/. It illustrates a bunch of very useful techniques, including streaming with Perl and Perl modules, basic Pig, and Pig User-Defined Functions. There are two things that don’t work about it. We spent some time getting it to really work, to make sure we understood what all was involved, and we were better off for it.

First, a typo/omission. This line:

DEFINE iplookup `ipwrapper.sh $GEO`
ship ('ipwrapper.sh')

needs to have a ‘ship’ statement for ‘geo-pack.tgz’ (available on that page for download), like so:

DEFINE iplookup `ipwrapper.sh $GEO`
ship ('ipwrapper.sh')
ship ('geo-pack.tgz')

/user/tmp is an hdfs location, and those files need to be in the directory you are launching the script from. You’ll need to set this up by taking the GeoLiteCity.dat file they link to and putting it into hdfs (hadoop fs -copyFromLocal). Everything you need for the DEFINE needs either to be there in hdfs already (‘cache’, which then makes it available locally to every node), or to be sent from a local file system location as you run the script (‘ship’).

This example also uses the Piggybank, which is a collection of user-defined functions for Pig. If you’re oriented towards Hive, you may not want to bother with this example. Since we’re a java-enabled group, Pig is a reasonable tool for us, and the Piggybank has some very useful functions, beyond what’s built into Pig itself. The Cloudera distribution comes with Pig by default, but not the Piggybank. So we checked out branch-0.4 of Pig, built the Piggybank, and then started extending and adapting functions from it. For example, our Apache logs have an extra field at the end, so the combined Apache log parser was useful, but needed an extension to work for us. The DateExtractor thing is there in the PiggyBank and works, but the IsBotUA function, as the author says, has not been contributed yet, so that was a good exercise for us. After detailed study of our web logs, we have our own ideas about what constitutes a bot anyway, so we would have had to modify whatever they might have done.

While just playing around, we did our extensions directly in the Apache tree, but eventually we set up a whole java/Maven dependency thing, because we have a bunch of other java that we’ve built up over time, that we want to apply to our logs. So we combined that existing code base with our new Pig examples, set up a “mvn compile assembly:single” sort of build process, and took the resulting fat jar (jar with all the dependencies included) for the Pig ‘register’ statement. That gives us our code, with the Piggybank included, for easy reference in Hadoop/Pig scripts and MR jobs. In either Pig or Hive, which are basically SQL-query-optimizer-like (Pig) and SQL-like (Hive) interfaces to map-reduce, you can use built-in keywords to take advantage of scripts that will be integrated into the processing through Hadoop’s streaming API. Pig’s ‘stream’ and Hive’s ‘transform’ play comparable roles, within the shells or scripts. We haven’t really gotten into Hive, just because we have such a big java code base, and it seemed more natural to hook in through either raw java MR or Pig. If we were more pythonic or perlish, we’d probably be doing MR in Python/Perl, and using Hive for joins. Hive has a ‘JOIN’ keyword, as does Pig, and Pig’s COGROUP + FLATTEN gives you much the same thing as well. A declarative join is one of the most helpful things that Pig and Hive provide, but if you don’t like theirs, you can just write a special-purpose one yourself in MR, whether streaming or not. It might seem, because of the basic looping constructs, that Pig has more of an actual programming language built in than Hive, but that’s not really true. Even to do simple loops or list processing of anything beyond the lists that Pig can create for you, we have had to create a few Python scripts that call Pig. This is by design on the Pig side, according to mailing list chatter: they don’t want to be re-inventing the wheel. Cloudera’s example shows how you can call Perl/Python/Bash/whatever scripts from inside Pig, when you need to. We’re not trying to knock Pig here–it does a lot of useful things for us. We’re just pointing out that it won’t answer all your questions on its own. You’ll get used to cobbling together whatever pieces are comfortable for you pretty quickly, if you have a reason to. If you have been processing log files for a while with Perl, Python, C, Java, or Language X, there are typically ways to adapt what you have, and plug it into Hadoop.


There are some great tutorials out there for processing Apache logs with Hadoop. We went about looking at our logs in the following way.

  • We grabbed some of our Apache logs from S3, where we keep them. These were in .bz2 format, and that requires a little work. The main configuration file for Hadoop (although it’s now deprecated because it’s been split into three separate ones), is called hadoop-site.xml. You will get a message about this deprecation constantly. Ignore it. The config is embedded in the hadoop-ec2 scripts, in the file hadoop-ec2-init-remote.sh. Search for “Gzip” and you’ll find the line that sets up the compression handlers. Add “,org.apache.hadoop.io.compress.BZip2Codec”. The next cluster you start up will be BZip2-enabled. To make configuration changes on running clusters, we have been editing the file on the name node and then doing something like “slaves.sh scp user@namenode:/usr/lib/hadoop/conf/hadoop-site.xml /usr/lib/hadoop/conf/hadoop-site.xml”, which seems to work fine. The BZip2 Codec didn’t actually work for us, though, so we re-zipped the files with gzip.
  • A word on compression in Hadoop: it just works, behind the scenes, and Hadoop takes care of it for you. To prove this to yourself, try a cat/wc example, as above, with a gzipped text file. That doesn’t work with zcat, but it works just fine with cat. There are a lot of nuances, and fancy ways you can set up compression with different codecs, but your gzipped files are usable in Hadoop, and that was good enough for us, to get started.
  • A compressed daily file for a single web server for us is smaller than the default hdfs block size of 64MB. Processing files that small apparently leads to all sorts of suboptimal behavior in Hadoop, so when we started to get mysterious errors on 3 months worth of log files, which weren’t showing up on a single-file test, we catted a bunch of compressed files together, so they would be closer to the size that Hadoop wanted. The failing jobs just started working! Yikes! Rollicking ferment, like we said, in the community, apparently means a certain amount of trail and error for us! In the end we have an etl job taking our Apache logs and putting the partially transformed data into tab-delimited format. We’re doing that with a regular expression wrapped inside a Pig UDF, based on the one in the Piggybank. If you prefer to work with Python and Hive, or whatever, presumably you can plug in the same regex there.
  • There are ways to make Hadoop use an ‘S3’ file system, or an EBS volume. We’ll try those when we’re feeling adventurous. For now we just transfer the files from S3 with s3cmd, and load them into hdfs. We figure we’re getting smoother and faster processing afterwards, at the cost of longer startup for our cluster. That’s just a hunch that we haven’t tested, though.
  • While we were at it, we wrote some more Pig UDFs to do things like turning Apache date formats into seconds since the UTC/GMT unix epoch, so we could compare records in the Apache logs with records from the database, and have our session/visit analysis hold up on Daylight Savings Time change days, without too many coding gymnastics. MySQL date formats are of course missing any stored notion of time zone, so the best way to deal with them in MySQL is to record everything as some sort of longish int, before or after the epoch. Apache logs typically do not have this limitation, but we’re going to the lowest common denominator.


There are some different ways to get data from MySQL into Hadoop. The Sqoop (sql-to-hadoop) project (in the contrib area of mapreduce) provides ways to do this with both jdbc and mysqldump, if you have a running MySQL instance. You could certainly get Sqoop to do what we did, but we actually went about this in a slightly different way. We are using Hadoop as a batch processor, and we already had our mysqldump files archived in S3. When we want to do longitudinal studies of any kind, we need a lot of that data. We didn’t spend much time thinking about standing up a MySQL that Hadoop could talk to, because that was clearly going to be a bottleneck, but rather decided to parse the stored mysqldump files (h/t Aaron Kimball of Cloudera for this suggestion). How to do that? A quick Google search led us to ‘mydp’, a little C program with configuration scripts in embedded lua, written by a fellow called Cimarron Taylor, here: http://cimarron-taylor.appspot.com/html/0901/090116-mydp.html. With the exception of this part of his Makefile:

CFLAGS = -g -I/usr/include/lua51
LIBS = -lfl -llua51 -llualib51 -lm -ldl

which we needed to change to

CFLAGS = -g -I/usr/include
LIBS = -lfl -llua -lm -ldl

this compiled and ran beautifully on CentOS as is. He’s developing on a Mac, and the references to lua libraries that work are a bit different from the way they need to be on CentOS. When I’m feeling more energetic, maybe I’ll create an rpm for this, or send him a gnutools patch or something, but for now, I can compile the binary for 32 or 64-bit Redhat (with lua-devel installed), and that’s good enough. Taking his lua lexer callback example, which outputs one row per data item, and swapping in this workhorse function:

function values(...)
for i=1, select('#', ...) do
local v = select(i,...)
if v == nil then
io.write(v, '\t')

we get tab-delimited output, which is a good starting point for Hadoop processing. This part of the pipeline just dumps the contents of the table into hdfs with Hadoop streaming. If you did the cat/wc example described in part 1, this will be easy for you.

So now the data world is our oyster. We can look at web server logs and database info, and join across them. We can load in a year’s worth of data and get some quick counts of various things in a few minutes, with the right size cluster. We’ll go into more detail on customizations and techniques in future posts.

A note: as you’re working, watch the disk space. If it gets a bit short, you can add slave nodes with “hadoop-ec2 launch-slaves” (and all the same arguments you gave for the initial cluster launch). You might think you should then run “start-balancer.sh” to even things out. Maybe, but in our experience, at least as of now, and if you’re not careful, it will kill the existing nodes. Just add some slaves and let your jobs gradually balance hdfs. Watch memory on the name node. We have had some jobs fail, and broken them into smaller bits, run them again, and had them succeed. The script could use a patch to make the name node a bigger instance than the slaves.

That’s all for now. There’s a lot more to go into, in future posts.

StyleFeeder @ Sunrise

Sunrise over Central Square 12/21/2009

Sunrise over Central Square 12/21/2009

Sometimes an early morning is called for, and sometimes it pays off.  The remnants of the storm on 12/20 left a painted sky, and it made the early work that much more pleasant.

Google analyzes itself

I was looking at our Google Webmaster Tools (aka GWT) account yesterday and noticed that one of the new “Labs” features is “Site performance.”  Well, that’s the kind of thing that gets my attention.  I clicked on it and noticed that GWT (not to be confused with GWT, another unrelated Google product) was complaining that StyleFeeder.com was slow.  Huh?  One of the helpful suggestions that GWT gave me was that I should make sure to enable gzip compression on our pages.  Well… it is.  In fact, GWT was complaining that the Google Analytics javascript snippet that everybody puts on their pages isn’t compressed.

So, Google, perhaps you could follow your own advice and gzip that sucker?

google webmaster tools says analytics is a problem

Overheard at StyleFeeder: Hadoopable

Hadoopable (adj.): whether a job can be run on Hadoop or not; can be broken into map and reduce stages

How not to name your brand

In honor of 80%20 Shoes, signs you’ve chosen a bad brand name for your shoes, in the age of Google:

  1. Your name contains a URL entity.  When you put %20 in the URL it represents a space character.  Oops.  If your name doesn’t URL encode to your name, you’re doing something wrong.
  2. When someone enters your name in Google, it thinks they’re doing math.  Type 80%20 into Google, and it thinks you’re doing 80 modulo 20.  The result is zero!
  3. Nobody knows how to pronounce it so you can’t tell your friends, and when you do they can’t search for it.

We’re a finalist for the MITX Awards, 2009

I just got an email that informed me we’re a finalist for the MITX Awards this year in the “applied technology” category, which is great news.  You may recall that we won at last year’s award ceremony.  The award nomination this year is for our new product browser and some of the underlying geotargeting technology, which you should check out if you have not seen it already.

CDN Evaluation Criteria

A friend asked me about StyleFeeder‘s experience using CDNs, so I sent him the list of criteria that we use to evaluate the various content delivery networks that we have tried.  We’re currently using Akamai, Cloudfront and Panther for various types of content.  I’ve talked to pretty much everybody in the CDN space over the years and I think that this list of questions is pretty solid.  If you think otherwise, I’d be happy to update this list with any new ideas you have.

Note that if you’re streaming large audio or video files, this list may not be a good one for you.  The questions are biased towards StyleFeeder’s needs, namely the fact that we have tens of millions of small product images floating around.

CDN Evaluation Criteria:

  1. Do you support HTTP compression?  Does content have to exceed a minimum size in order to be compressed?  (i.e. content less than 2Kb is not compressed)
  2. Do you allow us to override the Expires headers being sent by the origin?
  3. Will you obey long Expires headers (5 years) and cache content accordingly?
  4. Once a piece of content has been cached on your network, under what conditions would your CDN re-request that content from the origin?
  5. Do you allow full and partial (based on regexes) cache flushes?
  6. Can you demonstrate that you’re better than one of the “value” CDNs?
  7. How many nodes do you have?
  8. Do you have a lazy-loading caching proxy scheme (like Akamai, Panther, Voxel) to allow for easy deployment?
  9. Is there an API we can use to add or flush content?

Let me know if you have other questions that should be on the list!