Cool File Viewer

As part of our effort around product data feeds, we get a lot of delimited text files of varying size and content delivered on a daily basis. Previously, we primarily used OpenOffice.org Calc to view these files, since it has a fairly flexible text import facility and can deal with CSV files pretty well. Some of the larger files could take minutes to open and, more alarmingly, minutes to close, and this placed major time constraints on us when a large number of new files came in and needed evaluation.

A survey around the web showed a few different approaches to dealing with delimited files. Most involved configuring a desktop database like MS Access or a local install of MySQL to import the file, but that would mean creating a table and running a loader. Really, all we want to do is view the file and confirm that the important fields are being filled in and that the values don’t abuse the column semantics. Other approaches mirrored our own, using OpenOffice.org or MS Excel as a viewer; there were also a few specialized tools for opening delimited files without a spreadsheet. We tried a couple of these, but they didn’t impress us with their speed or usability.

Our answer came from halfway around the world, in New Zealand. Kiwi Log Viewer, a tool designed to view web server logs on Windows, can open tab-delimited files incrementally, providing a grid view of the file in a really speedy manner. The problem is that most of our files are pipe-delimited, and Windows isn’t so friendly at running ‘sed’ on a file before passing it into a program. Seeing that we were 90% of the way to what we wanted, we wrote to their support line, and sure enough they’ve added a configurable delimiter to the features of their next version, currently available in beta. While we’re big Open Source fans here, a responsive software company that cares about its customers is the next best thing. If you’re looking for a good program to view delimited files or log files on Windows, check them out. There is a free (as in beer) version as well if you want to try it first.

MySQL Analysis Tools

I’ve been having odd performance issues with some of my MySQL queries since moving over to InnoDB. I picked up the new O’Reilly title High Performance MySQL to try to track down the problem. The book in turn recommends a couple of pretty cool monitoring/reporting tools that summarize many of the MySQL variable displays in a friendlier format.

Innotop (more info here) is sort of like the friendly unix top command, but for database status instead. There are pages to show buffer statuses, deadlocks, i/o status, current queries, and lots more. All of them update on screen at configurable intervals.

MySQLReport from hackmysql.com runs a few status commands and presents the results on screen in a nicely grouped format. This guide summarizes the sections, which include more detail on many of the same things that innotop covers. I find the sections on SELECT types and InnoDB buffer pool use especially useful.

Using the command type summary, we discovered an inordinate number of com_rollback calls in our main database, which we were able to reduce by using this technique. The root cause was Hibernate’s love of transactions, combined with connection pooling. A simple driver parameter seems to clear it up.
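
As an illustration of the kind of change involved (this particular parameter is an assumption on my part, not necessarily the exact one from the linked technique): Connector/J’s useLocalSessionState=true lets the driver track autocommit/transaction state on the client side and skip redundant round trips, such as the no-op rollbacks a pool issues when a connection is returned. A minimal sketch with placeholder host and credentials:

import java.sql.Connection;
import java.sql.DriverManager;

public class PoolFriendlyConnection {
    // Hypothetical setup: useLocalSessionState=true lets Connector/J
    // avoid sending rollbacks when no transaction state has changed.
    public static Connection open() throws Exception {
        String url = "jdbc:mysql://db.example.com:3306/mydb"
                + "?useLocalSessionState=true";
        return DriverManager.getConnection(url, "appuser", "secret");
    }
}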

Happy 4th from StyleFeeder engineering

Boston was a key city in the War for Independence, and to celebrate that there is a big fireworks display over the Charles River every year. This year the Erics of StyleFeeder (Savage and Kilby) were looking for a place to watch the show. Down by the river was a mass of humanity, even bigger than in previous years, so we were worried about missing the show. Then it occurred to us: doesn’t our office have a view down towards the river without many tall buildings in the way? This proved to be the case, and we had a spectacular view of the show directly over the famous MIT Dome.

(photo by Eric Savage)

(photo by Eric Kilby)

The rest of the pictures can be found at these links via Flickr.  Remember, Creative Commons licensing is your friend, so feel free to use these as you wish.

Is that server running a bit slow?

top - 12:59:09 up 6 days, 15:51,  4 users,  load average: 1050.04, 753.62, 356.
Tasks: 150 total,  24 running, 126 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.2%us, 87.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2059260k total,  2049260k used,    10000k free,     2412k buffers
Swap:  4208608k total,   523792k used,  3684816k free,   116432k cached

*shakes head in disbelief*

LinkShare Golden Link Awards and Symposium

A few weeks ago we got a pointer to the LinkShare developer contest, open to any of their publishers who are using their web services in interesting ways. I wrote up an entry, mailed it in, and didn’t think too much about it for a while. Then a couple of weeks ago we got a message saying that we’re finalists for their Technology Genius Award, and that we should plan on having someone attend the festivities.

Fast forward to this week and I boarded the LimoLiner bus and headed off to New York. A few hours later I was checking into the Sheraton, and not long after that I was boarding another bus for the Plaza Hotel and the Golden Link Awards ceremony. After spending some time in a very dressy crowd, feeling like a fish out of water, I made my way to my seat at a table near the front of the room and after a nice lobster appetizer and steak dinner it was time for the show, hosted by Susie Essman (best known from Curb Your Enthusiasm).

Well into the program, it was time for the Technology Genius Award, and the butterflies started up in my stomach. And the award goes to…. StyleFeeder! Went up, shook some hands, made a speech, and sat back down, all in a blur. After the show was over, got on another bus and returned to the hotel with trophy in tow. People kept asking if it was an Oscar, and I kept answering in the affirmative.

The next morning, up bright and early, I got on another bus to head down to Chelsea Piers on the Hudson River, site of this year’s LinkShare Symposium. The morning was filled with speakers, headlined by James Surowiecki, the author of The Wisdom of Crowds. After lunch and more presentations, it was networking time, in which I was somewhat out of my element. During that time I ran into Adam Weiss of LinkShare, who had helped me with some technical issues back in the spring, and he introduced me to Jessica Kingman, who is our account manager over there. They both had good suggestions about people I should meet and talk to, and I made my way around to several of the advertisers’ booths, exchanging cards and collecting conference swag. I even won another contest, taking away a nice 9-bottle wine cellar courtesy of our fellow Bostonians at SmartBargains.com. The conference finished around 6, and I headed back to the hotel with the guys from Buzzillions.com to drop off our bags.

After a long New York style evening of after, after-after, and after-after-after parties and some much needed rest, I boarded the LimoLiner back to Boston with a shiny trophy nestled in my bag. It was worth the trip, and thanks to LinkShare and all their friendly people for making it a great one.

Facebook Gotchas

We just did a little refresh of our Facebook profile box, and I learned a couple of things along the way that would have been nice to know ahead of time:

  • I knew that Facebook caches all referenced images, but I didn’t know that they will resize any referenced image larger than 400px down to 400px.
  • If an image is inaccessible for any reason (404, timeout, etc.), it will be replaced with a blank image and cached, so you will need to change the image’s name between iterations.
  • FBML includes <fb:narrow> and <fb:wide> for providing different content to the two available profile columns, and this works even with the Ajax-y reloading that happens when a user moves the box. However, Facebook rewrites your CSS so that you don’t mess up the page, and the drag-n-drop doesn’t re-apply it, so don’t put CSS inside these two tags. Instead, put your HTML in the tags with different classes/ids and keep the CSS outside (a sketch follows this list).
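
To make that last point concrete, here’s a rough sketch (the class names and widths are made up for illustration):

<!-- Keep the CSS outside the narrow/wide tags; Facebook rewrites it once. -->
<style>
  .sf_box_narrow { width: 180px; }
  .sf_box_wide   { width: 380px; }
</style>
<fb:narrow>
  <div class="sf_box_narrow">narrow-column markup here</div>
</fb:narrow>
<fb:wide>
  <div class="sf_box_wide">wide-column markup here</div>
</fb:wide>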

Xconomy cloud computing event

I had the pleasure of speaking on a panel about cloud computing today in Boston. The panel was part of a conference on the subject hosted by Xconomy in Akamai’s wonderful facilities in Kendall Square. One of the things that jumped out was that there is absolutely no consensus on what cloud computing is, an observation made by Josh Coates, who pointed to the currently fluffy state of the Wikipedia page on the subject.

John Landry did a great job moderating the panel and keeping things lively. He had some particularly good observations in his opening remarks about some of the key forces behind cloud computing (no matter what your definition is), including open source software, cheap hardware and virtualization (plus two more that I can’t remember). However, I think the most important factor that will determine your ability to join any kind of cloud computing program is your system architecture. It’s one thing to be able to spin up 100 virtualized Linux boxen on EC2, but it’s quite something else to integrate those dynamically into a running system. If you can’t do that, then you’re at a disadvantage (I made the point that we at StyleFeeder have 100 databases in production and that we can very easily move those around to scale up our data tier). John also talked about the gradual move away from traditional relational databases to key/value stores, which prompted some good discussion.

The conference was full of people who were clearly interested in the subject matter, but it seemed like many of them hadn’t yet taken the plunge. Contrast that with another talk I gave this morning at the Yale Entrepreneurial Institute: how many of those students had heard of EC2? Pretty much all of the ones doing software startups. Not surprising. I bet most of them are using virtualized systems of some flavor.

Cloud computing seems to be on everybody’s radar screens these days, even if nobody seems to have a clear idea of what it is.

(Elias and Yoav were also at the cloud computing event, but Elias left early because he’s lame.)

Two must-have Java tools

I’ve been absolutely loving JavaRebel, a java agent that reloads modified classes at runtime, thereby obviating the need to restart some long-running piece of code (say, an app server or some other kind of daemon). It’s not, unfortunately, open source, but it is inexpensive and the time savings so far are considerable. I ran it for three weeks using their free trial version expecting to find some kind of showstopping bug… but that never materialized. Integrating it is really easy… just a simple change to the command line that runs your java code. The only caveat I’ve found that requires any thought at all has to do with the modification of static fields, but it’s really not a big deal. I reckon that this saves me 15 minutes per day, which… well, it’s quite a lot. And I don’t remember the last time that I found software that could do that for me.
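
For reference, the change is along these lines (the paths and main class are placeholders, and the exact flags may differ by version, so check their docs):

java -noverify -javaagent:/path/to/javarebel.jar -cp app.jar com.example.Main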

I also had a little incident with a 1.5Gb heap dump yesterday. I wanted to analyze it after one of our app servers coughed it up (right before it crashed hard) to find out what the problem was. I tried jhat, which seemed to require more memory than could possibly fit into my laptop (with 4Gb). I tried YourKit, which also stalled trying to read this large dump file (actually, YourKit’s profiler looked pretty cool, so I shall probably revisit that). I even tried firing up jhat on an EC2 box with 15Gb of memory… but that also didn’t work. Finally, I ran across the Eclipse Memory Analyzer. Based on my previous two experiences, I didn’t expect this one to work… but, holy cow, it did. Within just a few minutes, I had my culprit nailed (a big memory leak in XStream 1.2.2) and I was much further along than I had been previously. I actually downloaded the standalone version… I don’t need memory analyzers very often, so I didn’t want to saddle my regular Eclipse with something new. I highly recommend this tool: it worked like a charm with a very big file and was also very easy to use.
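
(If you want to give jhat every possible chance first, it accepts -J options that pass flags through to its own JVM, something like jhat -J-Xmx12g heap.hprof, assuming you actually have that much RAM to give it.)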

StyleFeeder won an MITX award!

Last night, we had a bit of a (good) surprise by winning an award from MITX in the “Collaboration and social networking” category. Hey, it’s not every day that something like this happens, so we have to savor these moments!

MITX award, 2008

Our friends at HubSpot also won (Hey, Dharmesh and Yoav!). I ran into Brian Halligan (CEO of HubSpot) and well-known troublemaker Ed Lyons after the event (Ed then proceeded to lecture Katie Rae of Microsoft about the pitfalls of SharePoint, but let’s just pretend that part didn’t happen).

Brian Halligan and Ed Lyons

One thing I’m particularly glad about is that we received our award before the fire alarms went off, which prompted a mass exodus out in front of the hotel overlooking the river. Hey, if you have to evacuate your hotel due to a fire threat, that’s a nice place to do it (our previous experience with flames involved a bit more action).

Congrats to the other winners from all of us at StyleFeeder!

Generating Primary Keys

A primary key for each row of a table is virtually a requirement of database design. Occasionally, the data for a table provides a natural primary key (e.g. username or email for an account table). More commonly, one needs to generate primary key values for a table. Yet tools for this in MySQL/Java are limited. MySQL offers auto_increment, but it has issues with replication, can become a bottleneck for insert-heavy tables, and doesn’t provide globally unique ids, and displaying these ids publicly may expose sensitive information. Java offers java.util.UUID, which gives pseudo-random 128-bit values. The chance of a collision is minuscule, but non-zero. More troubling is the size of the string representation: 36 characters. Since InnoDB uses the primary key index as the storage structure for the data and uses primary keys as data pointers in secondary indexes, long keys not only waste space but make the database less efficient.

After evaluating these options and a few ideas of our own for primary key generation, we settled on a simple algorithm motivated by group theory. The advantages of this algorithm are numerous:

  • Short Keys (6 characters yield 57 billion unique keys using only alphanumeric characters)
  • Universal Uniqueness (no guessing to which table a key value refers)
  • Pseudo-randomness (keys don’t follow an obvious pattern)
  • No Duplicate-Checking (keys are guaranteed to be unique until a limit is reached)
  • Block Generation (keys are generated in blocks to minimize lock contention)

Our generator uses one tiny bit of group theory: if k and n are coprime (aka relatively prime), the sequence of numbers generated by successively adding k (mod n) will not repeat within the first n values. This leads to the following algorithm for generating unique keys:

  • Pick a size n
  • Pick a value k which is coprime with n
  • To generate the next key: nextKey = (lastKey + k) % n

You’re guaranteed not to see duplicates until you’ve generated n keys. The sequence you’d see with n=5 and k=3 is { 0, 3, 1, 4, 2, 0, 3, … }.
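
Here’s a minimal sketch of that core step in Java (the names are illustrative, not from our actual code). With n = 62^6, everything fits comfortably in a long, since lastKey < n and k is only a few times larger than n:

public class KeySequence {
    private final long n;   // size of the key space, e.g. 62^6
    private final long k;   // stride, which must be coprime with n
    private long lastKey;   // most recently issued key

    public KeySequence(long n, long k, long lastKey) {
        this.n = n;
        this.k = k;
        this.lastKey = lastKey;
    }

    // Step the sequence; guaranteed not to repeat until n keys are issued.
    public synchronized long nextKey() {
        lastKey = (lastKey + k) % n;
        return lastKey;
    }
}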

Note that the choices of n and k are quite important: once chosen, they can never change. However, selecting reasonable values is not difficult. For n, select a character set and string length, then set n to the number of possible unique strings. To get the 57 billion figure above, use a string length of 6 and a character set of [0-9a-zA-Z] (62 characters); 57 billion is simply the number of unique 6-character alphanumeric strings (62^6). If you grow to the point that you are worried about key collisions, switch to 7-character strings (where n=62^7, approx. 3.5 trillion). Note that converting a key from its number value to its string value is simply a conversion from base 10 to base 62 (or however many characters you are using).
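
The base conversion is only a few lines of Java; a sketch (the alphabet ordering here is an arbitrary choice):

// Encode a non-negative key as a fixed-width base-62 string.
public static String toBase62(long key, int width) {
    String alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
            + "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    char[] out = new char[width];
    for (int i = width - 1; i >= 0; i--) {
        out[i] = alphabet.charAt((int) (key % 62));
        key /= 62;
    }
    return new String(out);
}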

For k, we need a value that is coprime with n. To achieve pseudo-randomness, k should also not be too small (the same order as n is a good choice). Note that this “randomness” is quite weak in a mathematical sense, but it was sufficient for our purposes. One way to select such a k is to multiply together prime numbers larger than the character set size, so that none of them can be a factor of n. For our example, a reasonable choice would be k=67*71*73*79*83*89. If you don’t have your own prime number generator, consult the bear.
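
It’s cheap to double-check a candidate k against n before committing to it; for example:

import java.math.BigInteger;

public class CoprimeCheck {
    public static void main(String[] args) {
        long n = 62L * 62 * 62 * 62 * 62 * 62;  // 62^6 = 56,800,235,584
        long k = 67L * 71 * 73 * 79 * 83 * 89;  // primes > 62, so none divide 62^6
        BigInteger gcd = BigInteger.valueOf(n).gcd(BigInteger.valueOf(k));
        System.out.println("coprime: " + gcd.equals(BigInteger.ONE)); // true
    }
}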

To put this algorithm into practice, one needs to ensure that keys are generated serially. We did this by creating a table with a single row and a single column storing the last key value. When we want to generate a key (or block of keys), we start a SERIALIZABLE transaction, read the last key value, generate key(s) per the above algorithm, then write back the last key value we generated and close the transaction. Since computing the next key is much faster than running a transaction, we minimize contention by generating keys in blocks and serving them out of memory via a synchronized HashMap. This causes key values to occasionally be permanently lost when a webapp is shut down, but the lossage is too small to be of any real concern.
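
A rough sketch of the block reservation (the keysource table and column names are made up for illustration, and error handling is minimal):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class KeyBlockReserver {
    private final long n;  // key space size, e.g. 62^6
    private final long k;  // stride coprime with n

    public KeyBlockReserver(long n, long k) {
        this.n = n;
        this.k = k;
    }

    // Reserve blockSize keys in one short SERIALIZABLE transaction and
    // return the last key issued before this block; the caller then
    // produces the block's keys in memory by stepping (last + k) % n.
    public long reserveBlock(Connection conn, int blockSize) throws SQLException {
        conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
        conn.setAutoCommit(false);
        try {
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT last_key FROM keysource FOR UPDATE");
            rs.next();
            long last = rs.getLong(1);
            long end = last;
            for (int i = 0; i < blockSize; i++) {
                end = (end + k) % n;
            }
            st.executeUpdate("UPDATE keysource SET last_key = " + end);
            conn.commit();
            return last;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}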

We’ve been using this system for many months now and have yet to run into any problems. It satisfies all of our current needs and has the advantage that it can easily scale, either by using longer character strings or by increasing the key generation block size. Furthermore, it seems to be extremely lightweight, exerting minimal pressure on our database. We would love to hear what other solutions for primary key generation are used. How does ours compare?