KDD 2008: Data Integration

One of the KDD conference sessions I was excited about attending was “Data Integration.” I know, I know, it doesn’t have the sexy ring of “Social Networks,” but it’s a very real problem for StyleFeeder, as for just about any company that stores gigabytes or terabytes of data. In our case, we collect data feeds from numerous retailers, resulting in a combined catalog of fourteen million products. This creates a number of challenges, particularly around integrating the product entries: each retailer uses a slightly different data format; titles, descriptions, tags, etc. vary in quality; and there is no canonical identifier for detecting identical products sold by different retailers.

From a machine learning perspective, it is a rich data set containing a number of interesting problems. One I’ve already alluded to is duplicate detection: identifying when the same product is being sold by different retailers. Since quality varies widely, we also need to find and remove entries so poor that they would pollute search results (anomaly detection). And some retailers engage in “tag stuffing” while others pad their listings with tangential or irrelevant descriptions, so getting good search results requires cleaning up the text entries and discarding sections unrelated to the product.
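To make the duplicate detection problem a bit more concrete, here’s a minimal sketch of the classic blocking-plus-similarity approach (an illustration, to be clear, not our production pipeline): group products under a cheap key derived from the title so we never compare all pairs across fourteen million entries, then score title similarity within each block. The sample catalog, IDs and threshold are all made up.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

def normalize(title):
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", " ", title.lower())).strip()

def blocking_key(title):
    """Cheap blocking key: the first two normalized tokens."""
    return " ".join(normalize(title).split()[:2])

def find_duplicates(products, threshold=0.85):
    """products: iterable of (product_id, title). Returns likely duplicate
    pairs, comparing only within blocks to keep the pair count manageable."""
    blocks = {}
    for pid, title in products:
        blocks.setdefault(blocking_key(title), []).append((pid, title))
    duplicates = []
    for group in blocks.values():
        for (id_a, t_a), (id_b, t_b) in combinations(group, 2):
            score = SequenceMatcher(None, normalize(t_a), normalize(t_b)).ratio()
            if score >= threshold:
                duplicates.append((id_a, id_b, round(score, 3)))
    return duplicates

catalog = [
    ("r1-001", "Nine West 'Tatiana' Leather Pump"),
    ("r2-587", "Nine West Tatiana Leather Pumps"),
    ("r3-102", "Classic Cotton Crewneck T-Shirt"),
]
print(find_duplicates(catalog))  # flags the two Nine West listings
```

The choice of blocking key and threshold does most of the work here: a key that is too coarse makes the within-block comparisons blow up quadratically, while one that is too fine scatters true duplicates across different blocks.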

I was curious to see whether any of the KDD talks would address issues like these. The short answer is “yes,” but only to a limited degree. The presentations tended to focus on relatively simple problems (such as applying a set of rewrite rules to detect duplicate URLs), on highly structured problems (such as automatic record linkage), or on complex models that would be difficult to apply to a large data set. I didn’t see any work that dealt with data as unstructured and “messy” as ours.

On the other hand, there were certainly some interesting ideas. Peter Christen presented a machine learning approach to linking records from disparate data sources. He proposes bootstrapping a Support Vector Machine from highly confident examples: record pairs that agree almost exactly on every field serve as seed matches, pairs that agree on almost nothing serve as seed non-matches, and this avoids the need for hand-labeled training data. My vote for most interesting paper of the session was “Unsupervised Deduplication using Cross-Field Dependencies” by Hall, Sutton and McCallum. The key idea is that context can help determine whether two items are the same. For paper references, two different conferences may share the same abbreviation, but the words in the paper titles can be used to disambiguate them. Similarly, two different products may have the same name, but description, brand and retailer can be used to tell them apart.
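Christen’s bootstrapping idea is easy to sketch. Suppose each candidate record pair has been reduced to a vector of per-field similarities in [0, 1]: pairs that score high on every field become positive seeds, pairs that score low everywhere become negative seeds, and an SVM trained on those seeds labels everything in between. Averaging the fields to pick seeds is my simplification of his seed-selection procedure, and the thresholds, scikit-learn dependency and toy data are all illustrative:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is available

def bootstrap_svm(pair_features, hi=0.9, lo=0.1):
    """pair_features: (n_pairs, n_fields) array of per-field similarities in
    [0, 1]. No labels are supplied; seeds come from the unambiguous pairs."""
    means = pair_features.mean(axis=1)
    seed_mask = (means >= hi) | (means <= lo)      # confident pairs only
    seed_y = (means[seed_mask] >= hi).astype(int)  # 1 = match, 0 = non-match
    clf = SVC(kernel="rbf")
    clf.fit(pair_features[seed_mask], seed_y)      # needs seeds of both classes
    return clf.predict(pair_features)              # labels for every pair

pairs = np.array([
    [0.98, 0.95, 0.99],  # near-identical on every field -> positive seed
    [0.05, 0.10, 0.02],  # dissimilar on every field     -> negative seed
    [0.90, 0.20, 0.85],  # ambiguous: left for the SVM to decide
])
print(bootstrap_svm(pairs))
```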
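And here is a back-of-the-envelope version of the cross-field intuition. Hall, Sutton and McCallum build an unsupervised generative model; this sketch only shows why the extra fields matter: two products with identical titles can still be told apart once brand and description weigh in. The field names, weights and example records are invented:

```python
from difflib import SequenceMatcher

def field_sim(a, b):
    """String similarity in [0, 1]; missing fields contribute nothing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_product_score(p, q, weights=None):
    """Weighted combination of per-field similarities across title,
    brand and description (weights are arbitrary for illustration)."""
    weights = weights or {"title": 0.4, "brand": 0.3, "description": 0.3}
    total = weight_sum = 0.0
    for field, w in weights.items():
        s = field_sim(p.get(field), q.get(field))
        if s is not None:
            total += w * s
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0

a = {"title": "Tatiana Pump", "brand": "Nine West",
     "description": "Leather pointed-toe pump"}
b = {"title": "Tatiana Pump", "brand": "Steve Madden",
     "description": "Canvas slip-on sneaker"}
# Identical titles, but the other fields pull the score well below a match.
print(same_product_score(a, b))
```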