StyleFeeder is hiring: Senior Developer / Architect

We’re hiring hackers. If you’ve got what it takes, read through this and contact us.

Note: no recruiters or third parties, please.

Who We Are

We’re StyleFeeder, and we’re trying to improve how people shop. Online shopping is currently focused on impulse buying, price seeking, and moving inventory. We think it should be about connecting people to products that are the best choice for them, based entirely on individual preferences. This is a big, exciting challenge, and we want to meet it.

Who We’re Looking For

We’re looking for a rockstar developer who wants to challenge themselves, us, and the status quo by pushing the boundaries of what users can do. We need someone full of ideas and talent who can create, refine, and execute concepts from beginning to end, prototype to production. Take a look at our site and ask two questions: “Could I have built this?” and “Can I make this better?” If the answer to both questions is yes, we should talk.

What We Use

Java, Eclipse, Hibernate, MySQL, Linux, Apache for starters. We also use Subversion, Spring, Flex, Struts, Cocoon, JSP, JavaScript, CSS, XML, the list goes on. We look past the buzzwords like AJAX and Web 2.0 and find the good stuff hidden in the marketing wrapper. We move fast, so we need someone who has real experience and proficiency in most of those areas, and is capable of bringing themselves up to speed quickly on the rest.

Talk To Us

Send us your resume and a brief writeup about a project that excited you, and why it did so. If you have a blog or any whitepapers or articles you can share with us, we’d love to see those too. Email us at dev/at/stylefeeder/com

Singing the praises of Nginx

We recently switched our static content webserver over to Nginx, easily the most impressive webserver I’ve seen in years. We’re running it on a machine with 8GB of memory (along with some other stuff), but the nginx process is using a ridiculously small 1.4MB. In other words, it barely registers in any measurable way. I’ve seen reports of it handling tens of thousands of concurrent connections on a single box, well beyond the point where the venerable Apache will scale.

One unique feature of Nginx is the notion of mirror-on-demand, which I recently explained to a friend as caching, but without a removal policy. This is exactly what we wanted our static asset server to do. We’re using it as a way to get content from our backend app server and then cache it forever in a filesystem that our nginx servers can read from. It turns out that both Apache and Squid simply cannot do this 100% of the time. Various combinations of Cache-Control headers and non-conditional GET requests will cause requests to go to the backend, which, in our case, was exactly what we were trying to avoid. I was also looking for HTTP compression for text/* mime-types. Squid doesn’t do compression, period. It does, however, do de-compression, so you can get gzipped content from a backend server and then serve it up uncompressed, which may be useful in some arcane cases.
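
For the curious, the setup looks roughly like this (a hand-wavy sketch, not our actual config: the directive names come from the nginx docs, while the paths and backend name are made up). A request is served straight off disk if the file exists; otherwise nginx fetches it from the backend exactly once, stores it, and serves every future request from the filesystem:

    http {
        gzip       on;
        gzip_types text/plain text/css application/x-javascript;

        server {
            listen 80;
            root   /var/www/static;

            # Serve from disk; on a miss, hand the request to the backend.
            location / {
                error_page 404 = @fetch;
            }

            # Fetch from the backend app server exactly once, store the
            # response on disk, and let the location above serve it forever.
            location @fetch {
                proxy_pass         http://backend.example.com;
                proxy_store        on;
                proxy_store_access user:rw group:rw all:r;
            }
        }
    }

Note what is not there: no expiry, no removal policy. If disk space ever became a concern, cleanup would be up to something like a cron job.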

I should mention here that our first choice was actually to use Lighttpd with mod_cache, which is capable of doing this kind of “only retrieve this content from the backend exactly once” caching. However, within literally minutes of deploying, we ran into corruption issues with some JavaScript and CSS content that we hadn’t detected during our testing. I saw many instances of corrupted cache content during the time that we used Lighttpd and mod_cache and quickly decided that I needed to explore other options.

Our experience with nginx was great from the start. Most of the docs are in Russian, but the English docs are very well done and quite adequate all around. Configuration is a cinch and deployment was easy as pie. The only hiccup was that I noticed that nginx does not set a Content-Length header on content that has undergone HTTP compression (gzip). This is minor for our purposes.

In short, nginx rocks hard.

Who To Trust?

One of the beauties, and challenges, of the online world is access to an immense variety of people and their opinions. If you live in a major metro area and are interested in that new Bistro that just opened around the corner, chances are that you’ll be able to find opinions and reviews at places like Chowhound, Zagat and Fodor’s. But, what to do with these reviews? There could easily be hundreds of them. How do you know which ones to trust?

My wife was a restaurant nut for quite some time, spending hours every day on Chowhound. But, even though she spent sooooo much time reading post-after-post-after-post, there was still one thing she struggled with: evaluating restaurant prospects based on reviews. One restaurant might have a stellar record—not a single negative review—but no reviews from people whose taste she knew well. Another restaurant might have a few complaints, but the people she knew best seemed to like it. And, the reviews themselves rarely helped matters. They tended to be vague: “The tuna was excellent!” What does that mean? Was it seasoned properly? Was it done correctly—rare center with a flavorful crust? If you gave two people the exact same tuna dish, would they even be able to agree on whether it was seasoned properly and done correctly? Probably not. Everyone has their own slightly different definitions of what constitutes a great dining experience.

A VC thinks that we need an algorithm to find our taste neighbors. I couldn’t agree more. Even my wife, who was willing to put in a lot of work in order to find her taste neighbors, couldn’t keep track of so much information! It’s easy to remember whether there’s someone who shares your taste exactly. But what about the ones that don’t? How do you keep track of which restaurants you disagreed on, in what aspects, and to what degree? Even asking a human to determine “taste similarities” for a handful of people is a daunting task. But, it’s an easy task for a computer. The hard part is telling the computer what to do with all this information in order to produce a “taste similarity” measure that matches our intuition. This problem largely motivated my PhD thesis work. See the introduction for a short summary.

There are two distinct tasks in determining “taste neighbors,” or StyleTwins (as we call them here at StyleFeeder). The first is sentiment extraction, or converting a free-form, written review into a structured opinion. The second is taking the set of structured opinions across a variety of users and restaurants (or products, or movies, etc.) and producing “interesting” information. “Interesting” information might include predicted opinions, recommendations, taste neighbors, similar restaurants, etc. Sentiment extraction is certainly an interesting problem, but users tend to be quite willing to provide structured feedback if you ask them for it. Much more challenging than acquiring structured opinion information is analyzing it and producing interesting information such as a user’s StyleTwins.

There are many different approaches to modeling rating data. We interpret a user’s ratings as a partial ordering of the items (e.g. restaurants) that are being rated. We model each user and item as a vector of features and parameters. The preference score of an item for a particular user is simply the dot-product between the user’s and item’s vectors. A set of ordered thresholds per user is used to map these real-valued “scores” to the discrete rating values. We learn parameters for users and items simultaneously by minimizing a loss function that penalizes mistakes in the model (e.g. the model predicts a rating of “2” on an article of clothing even though the user actually gave it “4” stars).

An advantage of this approach is that it provides us with a representation of users and items which can be used for comparisons. We represent each user or item as a vector in 189-dimensional space. 89 of those dimensions correspond to features we have identified for the users and/or items—information such as gender and age (for users) and tags and categories (for items). The remaining 100 dimensions are reserved for the learning process to identify and represent information about users and items that isn’t so obvious. They might correspond to (for the restaurant example) decor, service, type of cuisine or level of seasoning. Or, they might incorporate other subtle aspects of the restaurant experience which may be difficult to get from a review, but can be determined from a user’s pattern of ratings.

These 189-dimension vectors can be compared to each other easily. Vectors which point in the same direction are highly compatible; vectors that point in opposite directions are incompatible. A real-valued compatibility score can be computed by taking the dot-product of two vectors. So, finding the best StyleTwins for a user is as simple as computing a number of dot-products and sorting the resulting scores.
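
To make the mechanics concrete, here is a minimal sketch in Java (hypothetical names, not our production code) of the two operations just described: scoring an item for a user via a dot-product, and discretizing that score into a rating via the user’s ordered thresholds:

    // Minimal sketch (hypothetical names): dot-product scoring plus
    // per-user thresholds, as described in the text above.
    public final class PreferenceModel {

        /** Real-valued preference/compatibility score: a dot-product. */
        static double score(double[] u, double[] v) {
            double s = 0.0;
            for (int i = 0; i < u.length; i++) {
                s += u[i] * v[i];
            }
            return s;
        }

        /**
         * Map a score to a discrete rating using the user's ordered
         * thresholds (one fewer threshold than rating levels).
         */
        static int predictRating(double score, double[] thresholds) {
            int rating = 1;
            for (double t : thresholds) {
                if (score > t) {
                    rating++;
                }
            }
            return rating;
        }

        public static void main(String[] args) {
            double[] user = {0.3, -1.2, 0.8};
            double[] item = {1.0, 0.1, 0.5};
            double[] thresholds = {-1.0, -0.3, 0.3, 1.0}; // 5 rating levels
            double s = score(user, item); // 0.3 - 0.12 + 0.4 = 0.58
            System.out.println(s + " -> " + predictRating(s, thresholds)); // 0.58 -> 4
        }
    }

Finding StyleTwins is the same score() call applied to a pair of user vectors, followed by a sort over the resulting scores.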

Here at StyleFeeder, our focus is on helping you discover products serendipitously, without asking you to describe exactly what you want. The core problem that we’re working on, taking your opinions (in the form of ratings) and providing product recommendations based on them, is common to many domains: products, restaurants, movies, etc. But, people are part of the equation too. You’ll be much more inclined to trust a review or suggestion from someone if they share your taste/style. And, it can be as exciting to find a person who shares your taste as it is to find a new restaurant that fits your taste to a “T”.

Why Recommendations?

If you’re a user of the StyleFeeder site, you’ve likely noticed the recommendations link at the top of every page. Click on it and (if you are logged in) you will see a list of products: your recommendations. If you’ve never rated anything on StyleFeeder, your recommendations won’t be very interesting—it will simply be a list of the most highly rated products on the site. Begin rating products and updating your recommendations, and something interesting will happen: the products you see will start to become more relevant to your tastes. We use what is known as a collaborative filtering algorithm to determine your recommendations. By comparing how you rate products to how others rate them, we can algorithmically find products you might like. The more products you rate, the better we understand your preferences and the better recommendations we can make.

But, why? Why are we doing this? How could a computer possibly understand “fashion” and “taste”? How can the computer possibly know my taste as well as or better than my best friend? In a sense, it can’t. Our recommendation engine can’t “look” at an outfit and give you feedback. But, the engine has an advantage over your best friend: data, and lots of it. The engine can’t “see” the products that are added to the StyleFeeder site, but it has access to the opinions of thousands of users. Each of those users has expressed their taste by rating and bookmarking products. The recommendation engine can identify products you’ll like by comparing your ratings and StyleFeed to those of others. Whereas your best friend may only be able to browse a few sites a day, the StyleFeeder recommendations engine incorporates and filters thousands of new items daily. It’s also constantly on the lookout for new users whose style is compatible with yours: your StyleTwins. We’re not suggesting that you ditch your best friend, but the StyleFeeder recommendations engine gives you something that neither your best friend nor all the fashion magazines can provide: a product finding service that’s personal to your tastes.

Genes, Features and Recommendations, Oh My!

The Music Genome Project was created seven years ago with the goal of establishing a set of characteristics (such as “Intricate Rhythms”, “Acoustic Guitar Riffs”) of music which can be used to compare and contrast music in a systematic way. These characteristics are called “genes” by the founder, Tim Westergren, but I’m going to use the term “features” to describe them. Why? Because features are a classic concept in Machine Learning, the field whose goal is to automatically extract information and make predictions based on large sets of data. Tim Westergren’s decision to establish features, and then to go through the laborious process of extracting/identifying them in a large body of music, was a stroke of genius. With such features in hand, any number of Machine Learning algorithms can be applied to user feedback data in order to classify, cluster, organize, or even to… recommend music. Pandora is Westergren’s venture to do exactly that.

Hand-labeling information about songs has its advantages, but it also has its drawbacks. According to Wikipedia, labeling a single song takes a musical expert 20-30 minutes. Is this overhead necessary? Read/WriteWeb compares Pandora to other recommendation engines, prompting the question, “Can tags become genes?” They note that del.icio.us‘s “related tags” serve as a sort of recommendation. But, is there more that one can do with tags?

Our answer here at StyleFeeder is an emphatic “Yes!” StyleFeeder is a social shopping bookmarking site, allowing users to keep track of their shopping “finds”. We provide recommendations based on users’ bookmarks and ratings. Recently, we updated our system to include additional user and item information to enhance our recommendations: item tags, as well as user gender, age and sign-up source and selections. Much like Westergren’s “genes”, these “features” (as we call them) allow us to more quickly provide relevant recommendations. They also have a subtle advantage—no hand-labeling by an expert required. Are tags noisy? Sure. But, the large majority of them are accurate and relevant.

One challenge of being a product bookmarking site is that, unlike Amazon.com, we don’t have a boat-load of meta-data to go along with each product. But, there are still many ways of gathering information about a product. With each product bookmark, the user provides us with an image and a web page. Both the image and the web page text are sources of information about the product. The key is knowing how to extract this information. Automatic feature extraction is a sub-field of Machine Learning and has seen much progress of late. Here at StyleFeeder, we can apply techniques for extracting features from text/web (e.g. Blei/Bagnell/McCallum ’02) and images (e.g. Tieu/Viola ’99). Westergren’s Pandora site may be able to take advantage of techniques developed specially for sound/audio (e.g. Mierswa/Morik ’05). It’s not trivial to apply these approaches, but they show great promise for enhancing a wide range of learning applications, especially recommendation systems.
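
As a toy illustration of what “tags as features” means in practice (the names and vocabulary here are made up; this is not our production code), an item’s tags can be encoded as part of its feature vector like so:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy sketch: binary bag-of-tags encoding of an item's tags into the
    // "identified features" portion of its vector. Vocabulary is made up
    // and assumed to be lowercase.
    public final class TagFeatures {
        private final Map<String, Integer> tagIndex =
            new HashMap<String, Integer>();

        TagFeatures(List<String> vocabulary) {
            for (String tag : vocabulary) {
                tagIndex.put(tag, tagIndex.size());
            }
        }

        /** 1.0 in each dimension whose tag the item carries, else 0.0. */
        double[] encode(List<String> itemTags) {
            double[] features = new double[tagIndex.size()];
            for (String tag : itemTags) {
                Integer i = tagIndex.get(tag.toLowerCase());
                if (i != null) {
                    features[i] = 1.0;
                }
            }
            return features;
        }
    }

Vectors like these slot directly into the “identified features” dimensions described in the “Who To Trust?” post above; the learning process fills in the rest.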

Maximum Margin Matrix Factorization

What the…? That jumble of letters full of alliteration (abbreviated MMMF) is the foundation of the collaborative filtering algorithm we use here at StyleFeeder to provide recommendations. I didn’t invent it. That credit goes to Nathan Srebro, a fellow graduate student of mine during my time at MIT CSAIL, and Tommi Jaakkola, the professor who advised both of our graduate student tenures. What I can take credit for is figuring out how to make MMMF fast and practical, so that it can work on real-world data instead of just toy data sets. Want to see the nitty-gritty details? You’ll find a boat-load of them in two theses (Srebro’s and mine) and a handful of internationally-recognized conference publications.

I won’t bore you with all the details here. Rather, I’ll try to convince you that this new algorithm advances the state of commercial recommendations systems in a number of different ways. Why introduce yet another recommendations algorithm? Because commercial recommendation systems utilize rating information poorly. Ask yourself the question: what is a rating? When a user rates an item, (s)he is placing that item within a partial ordering. The order with respect to other, similarly-rated items is undetermined. What the user is declaring is that this item is “better” than lower-rated items and “worse” than higher-rated items. What can make ratings difficult to understand and interpret is that the organization of the partial ordering is user-dependent. Netflix VP Jim Bennett noted this in a Tech Review article:

…there are many people who rate a movie with only one star or five stars. And there are some people who just rate everything with three stars.

The way in which people use ratings varies widely. In addition to the two cases Bennett illustrates, it is also common to see pessimists and optimists (those who mainly use the lower and upper ends of the rating scale, respectively). Many approaches have been proposed to deal with this variability, including centering each user’s ratings (subtracting the mean) and modeling each rating value as its own (e.g. Gaussian) distribution. The difficulty with such approaches is that they impose a distance on rating values—e.g. modeling power is wasted trying to represent two 3-star items identically even though they may not be identical from the user’s point of view. What is needed is a way to model the one thing the user is clearly providing—the boundaries between the rating levels.

As early as 1980, this problem of estimating a partial ordering was studied in the statistics community by Peter McCullagh. It has since become known as the “ordinal regression” problem. In ordinal regression, there is a single user and many items. Some items have been rated; the goal is to predict the unobserved ratings. Information about the items is used to generalize from one set of data (observed) to the other (unobserved). This is a simplification of the modern-day recommendations scenario, in which there are many users and ratings are not the only observations upon which we can base predictions. However, to build a house, one must have mastery of plumbing (whether directly or via a subcontractor). Similarly, before one can best utilize user rating information, one must master the ordinal regression problem.

The approach McCullagh took was to learn a mapping of items to points on the real number line. Thresholds (one fewer than the number of rating levels) were used to translate the real values to ratings. A visualization of this can be seen on slide 9 of my thesis defense slides. McCullagh conducted learning by maximizing a certain cumulative distribution. A mapping and thresholds were learned so that most items fell within the corresponding rating region. I.e. the emphasis was on boundaries between ratings (thresholds), not on penalizing distances between items with the same rating. Unlike other approaches to ordinal regression and rating learning, no distance metric was imposed on rating values. McCullagh’s insights were invaluable. However, he was a bit ahead of his time, both in terms of the applicability of his techniques and in terms of the field’s understanding of how to apply regularization to avoid overfitting the model. Still, McCullagh made a key insight which elegantly handles Netflix VP Jim Bennett’s observation about the variability in how users utilize ratings. By treating ratings as separate categories and using thresholds to define the rating level boundaries, the model is flexible enough to handle a wide variety of usage patterns.
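
For reference (the notation here is mine, not McCullagh’s), the “proportional odds” model he proposed ties an item’s real-valued score to the rating levels through ordered thresholds:

    \log \frac{P(y \le k \mid x)}{P(y > k \mid x)} = \theta_k - \beta \cdot x,
    \qquad \theta_1 \le \theta_2 \le \cdots \le \theta_{R-1}

Here x is an item’s feature vector, \beta the learned weights, and \theta_1, ..., \theta_{R-1} the learned thresholds; an item’s predicted rating is determined by which pair of thresholds its score \beta \cdot x falls between.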

While working on MMMF, Srebro and I spent some time studying ordinal regression techniques, including what was considered state-of-the-art at the time. We found that the insights that McCullagh had over 25 years ago were still highly valuable, but needed updating. The basic structure he proposed—mapping items to real numbers and using thresholds to define rating levels—has proven highly effective, as it encapsulates the structure of the problem without imposing unnecessary assumptions. To understand this, it is necessary to understand that ordinal regression is a generalization of binary classification. The problem of predicting ratings when there are only two rating levels is exactly that of binary classification, the goal of which is to understand the boundary, the difference between items which yield the (e.g.) good/bad labels. Binary classification has been studied extensively in the Machine Learning literature, to the point that fast, highly effective methods are well known. Tong Zhang and Frank Oles give an excellent summary of such modern methods in their 2001 journal paper. Additional background may be found on the linear classifier Wikipedia page.

Srebro and I studied extensions of binary classification to ordinal regression. A key observation we made was that the “loss functions” proposed for ordinal regression were not a bound on the empirical loss; such a bound was a key insight in the design of modern classification algorithms. So, we developed a loss function which is a bound on the absolute error (the difference between the observed rating and what the algorithm predicted). The results were phenomenal. In experiments we conducted on the “1 Million” MovieLens data set, our approach significantly outperformed the technique that was considered state-of-the-art at the time (2005). The improvement was such that the other approaches we tested almost never made a better prediction. p-values testing the significance of the out-performance of our approach were minuscule: 6e-17 to 7e-22. Using a loss function which bounds the error was clearly key to modeling user ratings.
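
To give a feel for the shape of such a loss, here is a rough sketch in Java (hypothetical names, not our production code): charge one hinge penalty per threshold that the score lands on the wrong side of. Since each misordered threshold contributes a penalty of at least one, the sum upper-bounds the absolute difference between observed and predicted ratings:

    // Rough sketch of an all-thresholds hinge loss (hypothetical names):
    // one hinge term per threshold, charged when the score sits on the
    // wrong side of it. The sum upper-bounds the absolute rating error.
    public final class AllThresholdLoss {

        static double hinge(double z) {
            return Math.max(0.0, 1.0 - z); // standard hinge, margin 1
        }

        /**
         * @param score      real-valued score for a (user, item) pair
         * @param thresholds ordered per-user thresholds, length R - 1
         * @param rating     observed rating, in 1..R
         */
        static double loss(double score, double[] thresholds, int rating) {
            double total = 0.0;
            for (int k = 0; k < thresholds.length; k++) {
                // Thresholds below the observed rating should sit below
                // the score; thresholds at or above it, above the score.
                double sign = (k + 1 < rating) ? +1.0 : -1.0;
                total += hinge(sign * (score - thresholds[k]));
            }
            return total;
        }
    }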

MMMF extends our ordinal regression framework to handle multiple users. The key challenge in extending ordinal regression to multiple users is in determining how best to find similarities between users. Similarities in how users rate allow MMMF to predict unobserved ratings. A popular approach to rating prediction is to use Singular Value Decomposition (SVD) to find a low-rank approximation to the observed rating matrix (where unobserved entries are filled in with, e.g., user-average ratings). Reconstructing a matrix using the n largest singular values (found via SVD) yields the rank-n least-squares approximation to the original matrix. There are (at least) two problems with this approach: (1) the least-squares loss function models ratings poorly, both because it establishes a distance between rating levels and because it makes the model focus too much on outliers, and (2) the optimization space is non-convex (i.e. there is no guarantee of finding the globally best solution). Our insight was to use a particular regularization term: the trace norm, or sum of the singular values of a matrix. Similar to the way the L1-norm has the effect of encouraging null weights, the trace norm encourages null singular values in the parameter matrix. The net effect is a low-rank solution without sacrificing convexity in the optimization problem. Experiments we performed on the MovieLens and EachMovie data sets bear out this observation. We compared against the nine different algorithms (including an improved variant of the SVD approach) which Ben Marlin tested in his Master’s thesis and found that MMMF substantially outperformed all of them, on both data sets, according to both evaluation criteria that Marlin proposed.
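
To put the pieces together in symbols (notation mine, following the linked theses): writing X_{ij} for user i’s score on item j, \theta_i for user i’s thresholds and r_{ij} for the observed ratings, MMMF solves the convex problem

    \min_{X,\,\theta} \;\; \sum_{(i,j) \in O} \mathrm{loss}\big(X_{ij};\; \theta_i,\; r_{ij}\big) \;+\; \lambda\,\|X\|_{\Sigma}

where O is the set of observed (user, item) pairs and \|X\|_{\Sigma} is the trace norm. The “fast” variant exploits the identity

    \|X\|_{\Sigma} \;=\; \min_{X = U V^{\top}} \tfrac{1}{2}\big(\|U\|_F^2 + \|V\|_F^2\big)

to optimize directly over modest-sized factors U (users) and V (items), which are exactly the per-user and per-item vectors of doubles discussed below.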

No recommendation framework is perfect and MMMF is no different. There are always challenges in how to incorporate additional information and make best use of the available information. And, since we are talking about commercial systems, efficiency and scalability are serious concerns.

  • Features/efficiency trade-off: additional information about users and items will provide more predictive power, but will also increase overhead and CPU requirements; a trade-off must be struck between efficiency and the number of features to be used.
  • Non-linear prediction: with classification, it is relatively easy to convert a linear classifier to one that learns non-linear decision boundaries via the kernel trick. The kernel trick is not easily applicable to MMMF because parameters of both users and items are learned. In the typical classification scenario, only “user” parameters are learned and the kernel is applied in the forward direction to item features. To use a kernel with MMMF, one would need to calculate the derivative of the inverse of the kernel function at every optimization step; for some kernels, this would simply require additional CPU and memory overhead; for other kernels, this is impossible.
  • Integrating multiple user judgments: MMMF deals with ratings, but often other user judgments are available, such as purchases or web page visits; StyleFeeder users each have a bookmark list, providing some insight into their tastes; as do many other systems, we currently take a simple approach to this—treating bookmarked items as equivalent to a 5-star rating; however, MMMF can easily be modified to include other judgments as overlaid classification problems using the same set of parameters, incurring minimal additional overhead; we are currently exploring the best way to apply this idea.
  • Providing explanations: users like to know how their actions affect recommendations; recommendations based on simple rules provide less accurate recommendations, but can easily be understood by the user; MMMF better utilizes the ratings provided by the user, but makes somewhat complex decisions that cannot be explained easily; however, the aspects of MMMF which contribute to recommendations are easily isolated; we are currently investigating the best ways to present these to the user.
  • “You’ve Got Recommendations”: MMMF represents user preferences with continuous values, so compared to rule-based systems, it is less clear when a product with a high predicted rating will be a valuable recommendation; some experimentation will be necessary to determine when a predicted rating is “large enough” to warrant highlighting to a user.

At StyleFeeder, we use Fast MMMF as our algorithm for recommendations. Besides its ability to predict ratings so effectively, it has a number of advantages for a commercial recommendation system. We also incorporate a number of modifications which further improve upon MMMF’s predictive power and enhance the user experience with respect to recommendations. Both the advantages and the modifications are described below.

  • Representation of users and items as a (parameter) vector of doubles: avoids the need to access the “training” data when making a recommendation or rating prediction.
  • Incremental training: any parameter vector (user or item) can be updated individually; user feedback can be incorporated and new recommendations can be made in seconds—once you’ve signed up for an account at StyleFeeder, go to the recommendations page, rate some items, and reload the page; you’ll see updated recommendations in seconds; incremental training also allows learning to be distributed across machines (a sketch of such an update follows this list).
  • Compatibility with different rating systems: since MMMF treats ratings as a partial ordering, it can easily incorporate different numbers of rating levels (5-star vs. 10-star), even within a single data set.
  • Incorporation of features: any information about users and/or items which can be encoded as a vector of doubles can be used to enhance recommendations; at StyleFeeder, we use gender, age, country and sign-up source as user features and tags as item features; this additional information allows us to provide relevant recommendations quickly relative to the total amount of data we have.
  • Personalized search: representation of users and items as vectors of doubles allows us to quickly re-rank search results according to the real-valued scores the system generates for each user-item pair; your search results at StyleFeeder are personalized; if you have rated few or no items, your search results are re-ranked according to overall popularity; as you rate more items, the ranking changes according to your taste.
  • StyleTwins: the compact user representation makes comparing a pair of users a fast computation; part of our goal at StyleFeeder is to connect people with compatible styles so that they can help each other shop more effectively; we call the top matches for a user his/her StyleTwins.
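
As promised above, here is a rough sketch in Java of an incremental update (hypothetical names, plain subgradient descent, regularization omitted for brevity; not our production code). Because the loss decomposes over a user’s ratings, when a user rates new items only that user’s vector and thresholds need to move; every item vector stays fixed:

    // Rough sketch: nudge one user's vector and thresholds toward the
    // items (s)he has rated, holding all item vectors fixed. Uses the
    // all-thresholds hinge loss sketched earlier; regularization omitted.
    public final class IncrementalUpdate {

        static void updateUser(double[] user, double[] thresholds,
                               double[][] ratedItems, int[] ratings,
                               double learningRate, int steps) {
            for (int step = 0; step < steps; step++) {
                double[] uGrad = new double[user.length];
                double[] tGrad = new double[thresholds.length];
                for (int n = 0; n < ratedItems.length; n++) {
                    double score = dot(user, ratedItems[n]);
                    for (int k = 0; k < thresholds.length; k++) {
                        double sign = (k + 1 < ratings[n]) ? +1.0 : -1.0;
                        // The hinge is active when the margin is violated.
                        if (sign * (score - thresholds[k]) < 1.0) {
                            for (int d = 0; d < user.length; d++) {
                                uGrad[d] -= sign * ratedItems[n][d];
                            }
                            tGrad[k] += sign;
                        }
                    }
                }
                for (int d = 0; d < user.length; d++) {
                    user[d] -= learningRate * uGrad[d];
                }
                for (int k = 0; k < thresholds.length; k++) {
                    thresholds[k] -= learningRate * tGrad[k];
                }
            }
        }

        static double dot(double[] u, double[] v) {
            double s = 0.0;
            for (int i = 0; i < u.length; i++) {
                s += u[i] * v[i];
            }
            return s;
        }
    }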

Here at StyleFeeder, we’ve taken MMMF, an algorithm that has proven to be highly effective at predicting preferences based on a user’s rating history, and turned it into a functional, commercial recommendations system. Challenges remain in gathering and incorporating information on users and items, providing users with explanations of their recommendations, determining the correct cut-off point for notifying users of recommendations, and integrating multiple sources of information. But, we’ve already had a great deal of success. MMMF allows us to take maximum advantage of user ratings, while still providing us with the flexibility of other recommendations engines. Quick updates allow us to provide relevant recommendations to a new user after only a few minutes of rating products. Personalized search means that users can find relevant products even with generic search queries, such as “shoes”. And, the ability to match up users with similar tastes allows us to provide value to users beyond lists of products. We continue to innovate, focusing on maximizing the value to the user while requiring as little explicit user input as possible.