Hadoop for the lone analyst, part 2: patching and releasing to yourself

We left off, in part one of this series, at the point where we had Hadoop running with the Cloudera distribution, version 0.20.1+152. That’s Apache Hadoop release 0.20.1, plus 152 patches that Cloudera’s copious experience tells them they need for work in the real world. But perhaps we’re using, say, Hadoop streaming, and we read something like this, or any one of a dozen comments on the mailing list, which tell us we might need a patch that isn’t among the 152, and we can’t wait for Cloudera to have a need for that same patch. What then? Our approach will be to apply more patches and build the rpms again, so we can continue to use the nifty scripts, and take advantage of the tested base patch set, to which we have become accustomed. Once again, we will need to know a little about a lot of things, so a detailed HOWTO, assuming relatively little expertise in any one area, is called for.

To recap what we had done, our current base technology stack looked like this:

  • A local CentOS 5 virtual machine, with some extra Python, jdks and other libraries, as detailed in part 1.
  • An Amazon Web Services account, some Cloudera scripts and configuration files, and the ability to run the hadoop-ec2 script to launch and terminate clusters, of pretty much whatever size we want.

We are using Hadoop 0.20.x, because 0.18.x is missing some features we want, and versions 0.21 and beyond are not fully baked. Not only are they not fully baked: as of version 0.21, the core of Hadoop has been split into three projects, and there isn’t even a script (well, there’s a script in patch form!) to build them all together and run them. That’s too much adventure for a lone analyst! Version 0.20 will do just fine.

Now we’ll need a few more things, starting with an rpm/yum repository. This is pretty easy. If you have never looked under the hood of these, look in /etc/yum.repos.d, at any of the .repo files. Grab a ‘baseurl’ link and look at it in a web browser. Look at a few existing ones to get a sense for the naming conventions, which mostly have to do with chip architecture: {noarch,i386,i586,i686,x86_64}. You will need a location to run an Apache httpd server, or something like it. We’re not going to explain how to do that, because it’s too big a topic, but it’s not hard at a basic level, and you can follow these instructions to set up password protection for the root of the directory tree where you’re going to put the rebuilt rpms. Even if you were the type of person who wanted to fork Cloudera’s distribution publicly for no good reason (and we hope you’re not that type of person!), you will typically have reasons that your favorite combination of patched features should not or cannot be released to the world. For example, we want to do this compression thing that Kevin Weil and the guys at Twitter figured out, and the licensing is reciprocal, and therefore not Apache-compatible. So for that and other reasons, we’ll want to keep our little Frankenstein to ourselves. If you know what you’re doing in this area of intellectual property, you won’t need these admonitions. If you don’t, please believe me: you won’t want the public shaming you’ll get, along with probably other nasty consequences, if you start forking and publishing willy nilly.
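Back to the mechanics for a second: to see what existing repo definitions look like on the CentOS box, something like this will do (the exact file names will vary with your install):

ls /etc/yum.repos.d/
grep -h baseurl /etc/yum.repos.d/*.repo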

Final details on the rpm repository: you can use the gpgcheck=0 (i.e. do not check) option for gpg keys for now, since the repository is already behind a password. That's not a good idea long term, but for the moment we're focusing on getting this up and running. (Proper operation of cryptographic programs is another big topic we do not want to get into.) To get a URI to function as an rpm repository, you run the ‘createrepo’ command, with the file-system directory corresponding to that URI as its only argument. This should be somewhere under the level at which you’re password-protecting everything. We have a couple of different private repositories going now, one for jdks, one for patched Hadoop. The ‘createrepo’ command creates metadata files about the rpms that are in the repository. Every time you add one or more rpms, run ‘createrepo’ again, with the same argument as before. It’s not the most aptly named of commands; it should be called something like ‘create-repo-or-overwrite-repo-metadata’.
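To make that concrete: assuming the rpms are going to live under a password-protected docroot like the one below (the path is just an example, not anything canonical), the whole dance is:

createrepo /var/www/html/private/hadoop/noarch
# ...copy in some new or rebuilt rpms, then regenerate the metadata...
createrepo /var/www/html/private/hadoop/noarch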

Now we have a destination for our rpms, and we need to create them. Which ones? It depends on what you’re using. If we go to a running Cloudera-style EC2 Hadoop cluster and do something like this: "yum search hadoop | grep hadoop-0.20 | cut -d' ' -f1", we get a full list, for which we would be able to find the rpms at this location, like so:

hadoop-0.20.noarch
hadoop-0.20-conf-pseudo.noarch
hadoop-0.20-conf-pseudo-desktop.noarch
hadoop-0.20-datanode.noarch
hadoop-0.20-debuginfo.i386
hadoop-0.20-docs.noarch
hadoop-0.20-jobtracker.noarch
hadoop-0.20-libhdfs.i386
hadoop-0.20-namenode.noarch
hadoop-0.20-native.i386
hadoop-0.20-pipes.i386
hadoop-0.20-secondarynamenode.noarch
hadoop-0.20-source.noarch
hadoop-0.20-tasktracker.noarch

Doing ‘yum info’ on all of those, we find that by default the name node only has the first item, and the slave nodes also have conf-pseudo, datanode, and tasktracker. Those are all ‘noarch’ packages, so unless we have a reason to use pipes, native, debug, or libhdfs, we won’t need specific chip architectures to get this working.
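If you want to double-check that on a running node, a quick query of the rpm database will tell you what is actually installed there:

rpm -qa | grep hadoop-0.20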

If you’ve never worked with rpmbuild, it’s not very difficult. There are lots of HOWTOs. Long story short, as root you need to do “yum install rpm-build”, and then as yourself (DO NOT DO THIS AS ROOT), create a file called ~/.rpmmacros containing something like this:
%_topdir /mnt/usr/rpmbuild
%_tmppath %{_topdir}/tmp

Inside whatever you have for %_topdir, "mkdir {BUILD,INSTALL,RPMS,SOURCES,SPECS,SRPMS}", and inside %_topdir/RPMS, "mkdir {noarch,i386,i586,i686,x86_64}".

Next, you will want to see if you can add a patch and rebuild the Cloudera distribution. One thing at a time. Let’s rebuild the distribution with no changes, and make sure that works. Find the download link from this page (currently this). You need a 64-bit CentOS 5 for this. Remember I said we would have a use for that in part one? 64-bit anythings are memory hogs, so we do this on EC2. On our Hadoop builder machine, the .bashrc file looks like this:

# .bashrc

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-sun
export JAVA32_HOME=/usr/lib32/jvm/jdk1.6.0_14
export JAVA64_HOME=/usr/lib/jvm/java-1.6.0-sun
export JAVA5_HOME=/usr/lib/jvm/java-1.5.0-sun
export FORREST_HOME=/mnt/usr/apache-forrest-0.8
export ANT_HOME=/mnt/usr/ant/apache-ant-1.7.1
export PYTHONHOME=/usr/lib/python2.5
export ECLIPSE_HOME=/mnt/usr/eclipse/eclipse-europa
export JDIFF_HOME=/mnt/usr/jdiff-1.1.1
#export XERCES_HOME=/mnt/usr/xerces-c/xerces-c_2_8_0-x86-linux-gcc_3_4
export XERCES_HOME=/mnt/usr/xerces-c/xerces-c_2_8_0-x86_64-linux-gcc_3_4
export FINDBUGS_HOME=/mnt/usr/findbugs-1.3.9
export PATH=$JAVA_HOME/bin:$ANT_HOME/bin:$PYTHONHOME/bin:$PATH

That list includes all the libraries you need to build the regular Apache project. You will have to find them on the internet and download them. Look at the .spec for what else you’ll need: you’ll need git, and you’ll need to do "yum install ant ant-nodeps". This will install (for now) ant 1.6.5 at /usr/bin/ant. Note that this is not the ant we are using for the build. We’ve been very naughty and just moved /usr/bin/ant to /usr/bin/ant-1.6.5, so we can have /usr/bin first in our PATH some of the time, but always get the ant we really need, and also have rpmbuild work. Yum will whine at us at some point in the future, but we are not going to worry about that for now. Is that enough of a web of dependencies for you? It’s a good thing we’re doing this on EC2, so we don’t have to live with this slightly freakish box too much of the time.
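In concrete terms, the naughtiness amounts to something like this, as root (the ant-1.6.5 name is just what we chose to park it under):

yum install ant ant-nodeps          # pulls in ant 1.6.5 as /usr/bin/ant
mv /usr/bin/ant /usr/bin/ant-1.6.5  # park it out of the way of the 1.7.1 ant on our PATH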

You should also download the source distributions of automake 1.10 and autoconf 2.60. The system automake and autoconf on CentOS 5, for now, are 1.9 and 2.59 respectively, and various parts of the Apache and Cloudera builds/rpms need one or the other of these pairs of tools. It’s easy: untar the source file, cd into the directory, “./configure; make; make install” (as root). This will deposit the more advanced versions of automake, autoconf and aclocal in /usr/local/bin. When the builds complain at you for something in the autotools, rejigger your PATH by putting /usr/bin or /usr/local/bin first.
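The sequence for each of the two tools looks roughly like this (the tarball name is just what the autoconf source distribution happens to be called; repeat for automake-1.10):

tar xzf autoconf-2.60.tar.gz
cd autoconf-2.60
./configure
make
make install    # as root; lands in /usr/local/bin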

So now we have an intact base CentOS 5 system, with alternate jdks (Sun 1.6 64 bit, Sun 1.6 32 bit, and Sun 1.5), alternate Python (2.5, could be 2.6) and alternate GNU autotools. The Cloudera guys helped us out here when we got stuck while trying to do this on Fedora 8. That’s an example of what we meant in part one, about rat holes you could go down, if you don’t want to follow these directions step by step. I’m fairly confident this all could be made to work on Fedora, or Debian, or Ubuntu as well (and certainly with less messing around with Python), but now that we have it working on CentOS, we’re not going to try any of that. It turns out CentOS 5 is what Cloudera uses to do their builds, and that is good enough for us.

Get a shell with an environment as in the .bashrc file above, and set an environment variable called FULL_VERSION to whatever the full patched version number is (in our case, "export FULL_VERSION=0.20.1+152"). Then cd to the untarred distribution directory and run cloudera/do-release-build. You may have to monkey with your PATH, as indicated above. If you don’t set FULL_VERSION, it will still build, but there will be an odd directory in the build tree, and rpmbuild will complain later because you don’t have the Sqoop documentation. So remember to do that. When you have the built source distribution file, you are ready for the rpm work.
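Spelled out, and assuming you untarred the distribution somewhere like /mnt/build (that path is ours, not anything official), the no-changes rebuild is just:

export FULL_VERSION=0.20.1+152
cd /mnt/build/hadoop-0.20.1+152
cloudera/do-release-build
ls build/    # the gzipped source distribution tarball shows up under here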

Download the source rpm from here. As yourself, do "rpm -i hadoop-0.20-0.20.1+152-1.src.rpm". That will unpack various files into the subdirectories of %_topdir. Copy your built source distribution file over what it puts in %_topdir/SOURCES. cd to %_topdir/SPECS, and check out hadoop.spec. The command you need to run is "rpmbuild -ba --target noarch hadoop.spec", but it didn’t just work for us. Our java/jdk package names (remember the jpackage process we mentioned in part one?) are not the same as Cloudera’s, so we had to edit the dependency/prerequisite "jdk >= 1.6" to be something that fit our naming convention. We also had problems with a Python rpm script that wanted to use the system Python, and wouldn’t work in a shell that had the PYTHONHOME and corresponding PATH item from our .bashrc settings above. So we commented out the PYTHONHOME line, which we had needed for running the cluster and building, but which got in the way on the rpm side. There’s a lot of re-arranging your PATH in this process: it’s kludgy, we know. But we got a new shell, and it just worked, building all the rpms we needed under %_topdir/RPMS. Awesome!
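For what it’s worth, our fix-up came down to something like the following before the build went through (the replacement package name is a stand-in; use whatever your jpackage-style JDK rpm is actually called):

cd /mnt/usr/rpmbuild/SPECS
sed -i 's/jdk >= 1.6/java-1.6.0-sun >= 1.6.0/' hadoop.spec   # placeholder package name
rpmbuild -ba --target noarch hadoop.spec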

Now you’re ready to patch, so go back to where you were doing the building. We made a recursive copy of the untarred distribution directory, and cd’d down into that. Under the root level there is a ‘cloudera’ directory, with some scripts, documentation, and all the patches they have applied. We could have re-applied all the patches from scratch to a virgin Apache distribution, but we really just needed to add one thing the first time, so we downloaded that patch file, copied, renamed and modified their apply-patches script (let’s call this ‘apply-one-patch’), like so:

#!/bin/sh -x
set -e
if [ $# != 3 ]; then
    echo "usage: $0 <target dir> <patch dir> <patch file>"
    exit 1
fi
TARGET_DIR=`readlink -f $1`
PATCH_DIR=`readlink -f $2`
PATCH_FILE=$3

cd $TARGET_DIR

# We have to git init...
git init-db
for PATCH in `ls -1 $PATCH_DIR/$PATCH_FILE` ; do
    git apply --whitespace=nowarn $PATCH
done

and then ran it with our patch file as the last argument. TARGET_DIR needs to be the ‘src’ directory under the untarred distribution root. Plain old unix ‘patch’ would no doubt work as well. Now we need a naming convention for our version number. Guessing that things will just work with increments, on both the building and launching side, we go with “export FULL_VERSION=0.20.1+153”. When I’m not a busy lone analyst with patterns to find, I’ll read those scripts and figure out some more sensible thing, but I’m trying to get back to work here! The last official release increment from Cloudera went from +133 to +152, so we’re not too worried about a future name collision. Then we execute ‘do-release-build’, and it seems to work. If you’ve done all this, then under the ‘build’ directory, there should now be a gzipped tarball of your new distribution.
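Putting that whole patch-and-rebuild step together, it looks something like this (the directory names and the patch file name are placeholders for our own setup):

./apply-one-patch /mnt/build/hadoop-0.20.1+153/src /mnt/patches MAPREDUCE-1234.patch
cd /mnt/build/hadoop-0.20.1+153
export FULL_VERSION=0.20.1+153
cloudera/do-release-build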

Next, we copy that tarball into the %_topdir/SOURCES directory, and then go back into %_topdir/SPECS. Editing hadoop.spec, we tell it our new version number. Then once again "rpmbuild -ba --target noarch hadoop.spec".
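In our case that came down to roughly the following (the tarball name is a guess at what do-release-build produces, and the sed is blunt; the point is that every 0.20.1+152 in the spec needs to become 0.20.1+153):

cp /mnt/build/hadoop-0.20.1+153/build/hadoop-0.20.1+153.tar.gz /mnt/usr/rpmbuild/SOURCES/
cd /mnt/usr/rpmbuild/SPECS
sed -i 's/0\.20\.1+152/0.20.1+153/g' hadoop.spec
rpmbuild -ba --target noarch hadoop.spec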

Now everything should be built, so you can copy the rpms to the ‘noarch’ subdirectory of your rpm repository root on your password-protected http server, and run ‘createrepo’ again over there.
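With a placeholder host and path, that copy-and-reindex step is just:

scp /mnt/usr/rpmbuild/RPMS/noarch/hadoop-0.20*.rpm repo.example.com:/var/www/html/private/hadoop/noarch/
ssh repo.example.com createrepo /var/www/html/private/hadoop/noarch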

At this point, you just need a way for a cluster that is in the process of launching to find that repository. You can either bake a .repo file into a private AMI (your password has to be in that .repo file, which is why the image must be private), or make dropping that file onto every node part of the launch script; a sketch of such a .repo file is below. Locally, edit your ec2-clusters.cfg file if needed, and fire away. Do "hadoop-ec2 login mycluster", then "yum info hadoop-0.20". You should see a "hadoop-0.20.noarch" package with your new version number.
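A minimal sketch of such a .repo file might look like this (repository id, credentials, host and path are all placeholders):

[our-hadoop]
name=Privately patched Hadoop
baseurl=http://myuser:mypassword@repo.example.com/private/hadoop/noarch/
enabled=1
gpgcheck=0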

The big data world is your oyster, and now you can fine-tune your tools. We hope this helps our fellow lone analysts out there.