KDD 2008: Mining Social Networks

One of the most popular sessions at KDD was the Social Networks panel. Social network data has long existed and been found useful for discovery and prediction, but only recently have the scale, availability and value of social network data reached such heights. One of the best-known sets of social network data is telephone call records. Telephone companies found that by looking at call records, they could better identify fraudsters—people who hijacked telephone equipment to get free calls. But how social network data can best be utilized is not always obvious. With the call record data, the fraudsters did not call each other (were not directly connected), but rather were often found to call the same seedy number, say a pay phone in the back of a certain bar. Other sorts of “anomaly detection” discussed by the panel included identifying “bad” stock brokers (who were often found to work at the same branch office) and computer network intrusion detection.
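To make that guilt-by-association idea concrete, here is a minimal sketch of the sort of two-hop flagging described above. Everything in it (the call records, the threshold, the names) is my own illustration, not anything the panelists presented:

import java.util.*;

public class SharedDestinationFlagger {

    // A call record is just (caller, callee) for this sketch.
    static class Call {
        final String caller, callee;
        Call(String caller, String callee) { this.caller = caller; this.callee = callee; }
    }

    // Flag callers who share a destination with at least minFraudsters known fraudsters.
    static Set<String> flagSuspects(List<Call> calls, Set<String> knownFraudsters, int minFraudsters) {
        // For each destination number, collect everyone who called it.
        Map<String, Set<String>> callersByDestination = new HashMap<String, Set<String>>();
        for (Call c : calls) {
            Set<String> callers = callersByDestination.get(c.callee);
            if (callers == null) {
                callers = new HashSet<String>();
                callersByDestination.put(c.callee, callers);
            }
            callers.add(c.caller);
        }

        Set<String> suspects = new HashSet<String>();
        for (Set<String> callers : callersByDestination.values()) {
            int fraudsterCount = 0;
            for (String caller : callers) {
                if (knownFraudsters.contains(caller)) fraudsterCount++;
            }
            // A "seedy" destination: several known fraudsters call it,
            // so its other callers become suspects too.
            if (fraudsterCount >= minFraudsters) {
                for (String caller : callers) {
                    if (!knownFraudsters.contains(caller)) suspects.add(caller);
                }
            }
        }
        return suspects;
    }
}

The interesting part is that the flagged callers never call a known fraudster directly; they are connected only through the shared destination.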

Historically, the applicability of social network data has been limited to specialized applications. One question that was asked repeatedly during the session was: “Where’s the money?!?!” One of the largest on-line gold mines is search advertising, yet social network data provides relatively little promise for enhancing search ads because such ads are already quite focused. The panelists pointed out that there is much money to be made in non-search ads, product recommendations and direct marketing. Given information about on-line users’ relationships, one can better predict what ads they will click on, how they will respond to direct marketing, and, ultimately, what products they will buy. If Alice and Bob are friends and Alice owns an iPhone, chances are good that Bob will be buying one soon, and iPhone marketing is more likely to convert Bob than Chris, someone with no iPhone-owning friends.
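As a toy version of the Alice-and-Bob argument, here is a minimal sketch that scores users by the fraction of their friends who already own a product; the scoring rule and names are my own illustration, not something presented at the session:

import java.util.*;

public class FriendPropensity {

    // Score each user by the fraction of their friends who already own the product.
    // Higher-scoring users are, by the panel's argument, better targets for the ad.
    static Map<String, Double> scoreUsers(Map<String, Set<String>> friendsOf, Set<String> owners) {
        Map<String, Double> scores = new HashMap<String, Double>();
        for (Map.Entry<String, Set<String>> e : friendsOf.entrySet()) {
            String user = e.getKey();
            Set<String> friends = e.getValue();
            if (owners.contains(user) || friends.isEmpty()) continue;  // already converted, or no signal
            int owningFriends = 0;
            for (String f : friends) {
                if (owners.contains(f)) owningFriends++;
            }
            scores.put(user, (double) owningFriends / friends.size());
        }
        return scores;  // Bob (friend of iPhone-owning Alice) outscores Chris (no owning friends)
    }
}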

Research on social networks has largely been limited to problems of prediction and detection. But any marketer or retailer will naturally wonder: can we modify the network? Can we create relationships that will cause rapid adoption of our product(s)? The panelists had relatively little to say in response to questions of this sort, other than that it is a difficult problem—social psychologists have been working on it for decades. I wonder if it is so difficult because it requires significant interaction between a learning system and a real social network. Simulation isn’t enough—if one could accurately simulate the effects of changes to the network, one could easily determine the optimal changes. Yet your typical machine learning/data mining researcher can’t actively make changes to a live social network; and a static social network data set, even if it includes a log of changes, is of limited value. So, the initial data mining work on modifying social networks will likely happen behind closed doors. But maybe this is a problem that is too interdisciplinary and context-dependent to be treated as a data mining problem. On the other hand, one need look no further than the likes of Disney, Nintendo and Apple to see the rewards of proper seeding of the social network. Maybe one day we’ll be able to prove theorems about what certain businessmen have known for years…

Video of the Week: My Hands are Bananas

Not new, but still worthy of VOTW status.

Hello from Vegas: StyleFeeder hits KDD 2008

I’m representing StyleFeeder at this year’s KDD conference, held in Las Vegas, Nevada. It might seem odd mixing seductive showgirls and stodgy statisticians, but I think it’s an excellent location choice. Gambling concepts such as probability, expected value and exploration vs. exploitation are at the core of Machine Learning, Data Mining and Statistics.

KDD played host to the 2nd Recommender System/Netflix Prize Workshop. Gavin Potter showed us that users, movies, and even rating sessions (i.e., the date) impart significant biases on ratings, so much so that a model which simply captures these biases and completely ignores user-movie affinity yields a lower error score than the original CineMatch algorithm. After some discussion of the fact that minimum-error recommenders tend to yield popular and somewhat uninteresting recommendations, Oscar Celma and Pedro Cano presented a study of this effect on music. They found that a collaborative filtering similarity metric was strongly biased toward popular music, whereas content-based and expert-based similarity metrics made it easier to explore “the long tail.” Next, a member of the Gravity Team, Gabor Takacs (who I later learned is the author of the best “big board” tic-tac-toe player in the world), provided a detailed description of their methods for the Netflix Prize. Their approach is an SVD-like matrix factorization that incorporates incremental training, regularization, user/item bias, positivity constraints, and neighbor-based correction.
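For intuition, here is a minimal sketch of the kind of bias-only model Potter described: predict the global mean plus a per-user and per-item offset fit by stochastic gradient descent, with no user-movie affinity term at all. The learning rate, regularization constant and epoch count are placeholders, not values from the talk:

import java.util.*;

public class BiasOnlyModel {

    static class Rating {
        final int user, item; final double value;
        Rating(int user, int item, double value) { this.user = user; this.item = item; this.value = value; }
    }

    final double globalMean;
    final double[] userBias, itemBias;

    BiasOnlyModel(List<Rating> train, int numUsers, int numItems) {
        userBias = new double[numUsers];
        itemBias = new double[numItems];

        double sum = 0;
        for (Rating r : train) sum += r.value;
        globalMean = sum / train.size();

        double learningRate = 0.005, regularization = 0.02;   // placeholder hyperparameters
        for (int epoch = 0; epoch < 30; epoch++) {
            for (Rating r : train) {
                double err = r.value - predict(r.user, r.item);
                userBias[r.user] += learningRate * (err - regularization * userBias[r.user]);
                itemBias[r.item] += learningRate * (err - regularization * itemBias[r.item]);
            }
        }
    }

    // No user-item interaction term at all -- just the biases.
    double predict(int user, int item) {
        return globalMean + userBias[user] + itemBias[item];
    }
}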

Based on discussions and other presentations, it sounded like a combination of matrix factorization and neighborhood-based methods was the most common successful approach to the Netflix Prize competition. Everyone at the workshop seemed to agree that Netflix did a surprisingly good job of selecting a goal for the competition: Netflix requires a 10% improvement over their CineMatch algorithm and the current top team has a 9.15% improvement. The difference seems small enough that the 10% goal must be reachable, but progress has slowed considerably, with an improvement of only 0.72 percentage points since the first progress prize was awarded last October.

As the main conference has started, it has become quite clear what the “hot” topic of the year is: social network modeling. Sessions on the topic have been packed and some top figures in the community have presented papers on the subject…

Startup Signage

This isn’t technology related, but every startup on the planet should do what we did for our signage. It’s cost effective and looks great. Check out the photos.

Are there really 46 megabytes of underpants in the world?

When you get data feeds from hundreds of vendors, it’s always surprising how big or small some of them can be. Sometimes there is a rational explanation, like when a feed includes a separate product entry for every size or color. Sometimes it’s just astounding how much variety there is in certain product categories that I wouldn’t give two thoughts to. I was pulling a new feed this morning, and the title of this post came to mind.

Amazon’s new Elastic Block Store

I have a little piece on the wonderful Xconomy site about Amazon’s new Elastic Block Store (EBS) that you should check out.

Memcached vs MySQL

I recently had lunch with Dan Weinreb, who I met at the Xconomy cloud computing event back in June. We talked about many topics, mostly scalable database architectures, but also about caching. He mentioned that he was doing some stuff with memcached lately, which I found very interesting. Now, memcached certainly has some nice features, but I mentioned to him that I found its performance to be surprisingly lackluster. But people still rave about it and use it in really big installations (e.g. Facebook). Yes, we do use memcached in production at StyleFeeder, but it’s not in widespread use. Instead, we rely on sharding our data across 100 MySQL databases. This works really well for a number of reasons, not least of which is the fact that we cannot fit all of our data in memory cost effectively. We also have stringent performance requirements for our site, which means that we need to have very simple data access paths. Most pages on our site can be loaded with a single database query.
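For a sense of how simple that access path can be, here is a minimal sketch of the kind of key-to-shard routing involved. The modulo scheme, host names and credentials are illustrative, not our actual setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ShardRouter {

    private static final int NUM_SHARDS = 100;   // matches the "100 MySQL databases" above

    // Stable mapping from an entity id to one of the shards (assumes non-negative ids).
    static int shardFor(long entityId) {
        return (int) (entityId % NUM_SHARDS);
    }

    // Open a connection to the shard that owns this entity.
    // In practice this would come from a per-shard connection pool.
    static Connection connectionFor(long entityId) throws SQLException {
        int shard = shardFor(entityId);
        String url = "jdbc:mysql://db" + shard + ".example.com/styledb";   // hypothetical hosts
        return DriverManager.getConnection(url, "app", "secret");
    }
}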

Dan mentioned that someone he knows did some basic benchmarks that clocked in at around 700 requests per second.  I wanted to see what our numbers were like.

(Before I share these numbers, I want to emphasize that I’m not ready to hang my hat on these numbers yet, but I figured I’d share them for comments.)

100,000 get requests executed serially:

Memcached: Requests per second: 684
MySQL: Requests per second: 884

Surprising, eh?  This is for the same data coming out of one of our shards and the same data coming out of memcached.
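For reference, the shape of the test was just a tight serial loop. Here is a minimal sketch of that kind of benchmark, assuming the spymemcached Java client and a plain JDBC connection; the keys, table and connection details are made up:

import java.net.InetSocketAddress;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import net.spy.memcached.MemcachedClient;

public class SerialGetBenchmark {

    private static final int REQUESTS = 100000;

    public static void main(String[] args) throws Exception {
        // Hypothetical endpoints -- substitute your own.
        MemcachedClient memcached = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        Connection mysql = DriverManager.getConnection("jdbc:mysql://localhost/styledb", "app", "secret");
        PreparedStatement select = mysql.prepareStatement("SELECT value FROM items WHERE id = ?");

        long start = System.currentTimeMillis();
        for (int i = 0; i < REQUESTS; i++) {
            memcached.get("item:" + (i % 1000));        // serial gets against memcached
        }
        report("memcached", start);

        start = System.currentTimeMillis();
        for (int i = 0; i < REQUESTS; i++) {
            select.setInt(1, i % 1000);                  // the same keys, straight out of MySQL
            ResultSet rs = select.executeQuery();
            rs.next();
            rs.close();
        }
        report("mysql", start);

        memcached.shutdown();
        mysql.close();
    }

    private static void report(String label, long startMillis) {
        long elapsed = System.currentTimeMillis() - startMillis;
        System.out.println(label + ": " + (REQUESTS * 1000L / elapsed) + " requests per second");
    }
}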

I have more unanswered questions: instead of doing this serially, what happens when I have 20 concurrent threads pulling data out?  Does the memcached client library make a big difference?

I also wonder in what cases it makes sense to use memcached.  If you’re like us and have more data than you can reasonably hold in memory, you probably can’t use memcached unless you’re able to hit your main data store without a big penalty.  If you have an amount of data that can fit in memory, you should use something like Whirlycache (only relevant if you’re running on a JVM), which did 2,500 requests per millisecond for the same test.

If you simply need to share data across a wide range of nodes, does memcached even make sense at that point?  Perhaps in the case of a more dynamic architecture, memcached and libketama are pretty key.  Rigging that machinery manually with a MySQL backend is possible, but not the kind of thing you’d want to focus on unless you’re doing systems work.
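For the curious, the trick libketama brings to the table is consistent hashing: each node is placed at many points on a hash ring, so adding or removing a node only remaps a small fraction of keys. A minimal sketch, using a stand-in hash function rather than ketama’s MD5-based one:

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {

    private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();
    private final int replicas;

    ConsistentHashRing(int replicas) { this.replicas = replicas; }

    // Place each node at several points on the ring so keys spread evenly.
    void addNode(String node) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // A key maps to the first node clockwise from its hash position.
    String nodeFor(String key) {
        if (ring.isEmpty()) return null;
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    // Stand-in hash; ketama uses MD5.
    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff;
    }
}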

I’m curious to hear what people think, because there’s certainly a lot of conventional wisdom behind memcached that I can’t understand right now.  Francois seems to be in the same camp as me.

StyleFeeder’s Funny Video of the Week

If there’s one thing that we have here at StyleFeeder, it’s… well, a lot of data. If there’s another thing, I’d have to say that we all have a healthy sense of humor. Because we basically sit in front of teh intrawebs all day long, we are exposed to many humorous diversions, which we intend to share with you for your enjoyment. So, each Friday, you can come back here for our favorite video of the week.

Several of us are big fans of Flight of the Conchords. I actually discovered them from Erlack via the recommendation system on StyleFeeder… yes, really, that’s totally true. Anyway, enjoy and please put a note on your calendar to come back next Friday to enjoy the video that distracted us the most during the week.

How Big is that Process?

You might be tempted, as I was, to think that top’s “VIRT” and ps’s “VSZ” values represent the entire size of a process, including all shared libraries and swapped-out data. Today, we learned that this can’t possibly be the case:

top - 14:41:53 up 86 days,  3:22,  3 users,  load average: 13.57, 6.67, 4.00
Tasks: 162 total,   3 running, 159 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.7%us,  5.7%sy,  0.0%ni, 45.0%id, 37.1%wa,  0.0%hi,  0.4%si,  0.0%st
Mem:   8102704k total,  8056796k used,    45908k free,     9004k buffers
Swap:  4008208k total,  1274816k used,  2733392k free,  2075736k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 1472 mysql     15   0 19.8g 4.0g 7440 S    1 52.1   1374:57 mysqld             

You can’t have a 19.8 gig process on a machine with only 12 gigs of memory (8 GB of RAM plus 4 GB of swap)! An anonymous commenter on “Virtual Threads: Understanding memory usage on Linux” suggests a plausible explanation: memory fragmentation.
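One way to convince yourself that VIRT is counting address space rather than resident memory: map a big sparse file and never touch it. This little sketch of mine (the sizes are arbitrary, and it needs a 64-bit JVM) makes VIRT jump by several gigabytes in top while RES barely moves:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class VirtDemo {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("virtdemo", ".dat");
        f.deleteOnExit();
        RandomAccessFile raf = new RandomAccessFile(f, "rw");

        long gig = 1L << 30;
        raf.setLength(8 * gig);                      // sparse file; uses almost no disk

        // Map 8 GB of it in 1 GB chunks (a single MappedByteBuffer tops out near 2 GB).
        // The mappings inflate VIRT, but since we never touch the pages, RES stays tiny.
        List<MappedByteBuffer> mappings = new ArrayList<MappedByteBuffer>();
        FileChannel channel = raf.getChannel();
        for (long offset = 0; offset < 8 * gig; offset += gig) {
            mappings.add(channel.map(FileChannel.MapMode.READ_WRITE, offset, gig));
        }

        System.out.println("Mapped " + mappings.size() + " GB; compare VIRT vs RES in top, pid "
                + new File("/proc/self").getCanonicalFile().getName());
        Thread.sleep(120000);                        // leave time to look at top
    }
}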

Modified in the Past, Previously Known as the Future

Commons Configuration is a huge timesaver—flexible, simple configuration management with the highly useful feature of automatic file reloading. But, lately, that highly useful feature hadn’t been working for me. I’d make a change to the config file in a deployed application and see no change to the values in the application. WTF? What’s crazy is that it’d be inconsistent—some days it would work, others it wouldn’t. Yesterday the problem came to a head—I needed the reloading functionality to work, but it just wasn’t happening. So, I put together some simple test code and, voilà, it worked! Huh? Back to the deployed application: no workie. Back to the test case: works like a charm. Egad!

Some JavaDoc digging led me to realize that the problem might be something screwy with the file last-modification times. Sure enough, some of the deployed config files’ Unix mtimes were in the future. Digging into the Commons code revealed that a file is only reloaded if its current mtime is ahead of its initial mtime. So, Configuration wouldn’t reload the file until the present had caught up with the future. Ack! But where were these future timestamps coming from? Ah, of course… (with gritted teeth) Maven. It turns out that the Maven packaging process was storing UTC timestamps in the war file. Yet, when we exploded the war file using unzip, those timestamps were not converted back to our local (EDT) time, so any file modified around the time of packaging ended up four hours in the future.

The solution turned out to be simple: export TZ="America/New_York". This caused Maven to store our local, rather than UTC, times. Unzip’s behavior was unaffected by the setting of TZ, so the deployed files ended up with the correct timestamps. A little test clarifies why Maven was doing this. Maven uses Java code to perform the packaging, and the Java File.lastModified() method returns a UTC-based timestamp. What’s a bit appalling is that my local timezone is specified on my local system in the usual way: via /etc/localtime. Basic Unix utilities like ls, stat and unzip don’t seem to have any problem doing the conversion. But I don’t think Maven can be blamed, since it’s probably doing the reasonable thing and using java.util.TimeZone.getDefault() to determine the time zone. A bit of searching reveals that the blame lies squarely on Sun’s head for using a poor hack to determine the timezone on Linux. Amazingly, the bug describing this problem has been outstanding for over two years. Sun, is it really that hard to parse an /etc/localtime file?
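For the curious, the little test amounts to something like the following (the file path argument is just whatever file you want to inspect). On an affected JVM, the default zone comes back as GMT rather than the zone described by /etc/localtime, which is exactly what ends up baked into the war:

import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TzCheck {
    public static void main(String[] args) {
        File f = new File(args[0]);
        long mtime = f.lastModified();   // epoch millis; timezone-independent

        System.out.println("JVM default time zone: " + TimeZone.getDefault().getID());

        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss zzz");
        System.out.println("mtime in default zone:     " + fmt.format(new Date(mtime)));

        // The same instant rendered in the zone we actually live in.
        fmt.setTimeZone(TimeZone.getTimeZone("America/New_York"));
        System.out.println("mtime in America/New_York: " + fmt.format(new Date(mtime)));
    }
}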