## Maximum Margin Matrix Factorization

What the…? That alliterative jumble of letters (abbreviated MMMF) is the foundation of the collaborative filtering algorithm we use here at StyleFeeder to provide recommendations. I didn’t invent it. That credit goes to Nathan Srebro, a fellow graduate student of mine during my time at MIT CSAIL, and Tommi Jaakkola, the professor who advised both of us through our graduate tenures. What I can take credit for is figuring out how to make MMMF fast and practical, so that it works on real-world data instead of just toy data sets. Want to see the nitty-gritty details? You’ll find a boat-load of them in two theses (Srebro’s and mine) and a handful of internationally recognized conference publications.

I won’t bore you with all the details here. Rather, I’ll try to convince you that this new algorithm advances the state of commercial recommendation systems in a number of different ways. Why introduce yet another recommendation algorithm? Because commercial recommendation systems utilize rating information poorly. Ask yourself the question: what is a rating? When a user rates an item, (s)he is placing that item within a partial ordering. The order with respect to other, similarly-rated items is undetermined. What the user is declaring is that this item is “better” than lower-rated items and “worse” than higher-rated items. What can make ratings difficult to understand and interpret is that the organization of the partial ordering is user-dependent. Netflix VP Jim Bennett noted this in a Tech Review article:

> …there are many people who rate a movie with only one star or five stars. And there are some people who just rate everything with three stars.

The way in which people use ratings varies widely. In addition to the two cases Bennett illustrates, it is also common to see pessimists and optimists (those who mainly use the lower, and upper, ends of the rating scale, respectively). Many approaches have been proposed to deal with this variability, including centering each user’s ratings (subtracting the mean) and modeling each rating value as its own (e.g. Gaussian) distribution. The difficulty with such approaches is that they impose a distance on rating values—e.g. modeling power is wasted trying to represent two 3-star items identically even though they may not be identical from the user’s point of view. What is needed is a way to model the one thing the user is clearly providing: the boundaries between the rating levels.

As early as 1980, this problem of estimating a partial ordering was studied in the statistics community by Peter McCullagh. It has since become known as the “ordinal regression” problem. In ordinal regression, there is a single user and many items. Some items have been rated; the goal is to predict the unobserved ratings. Information about the items is used to generalize from one set of data (observed) to the other (unobserved). This is a simplification of the modern-day recommendations scenario, in which there are many users and ratings are not the only observations upon which we can base predictions. However, to build a house, one must have mastery of plumbing (whether directly or via a subcontractor). Similarly, before one can best utilize user rating information, one must master the ordinal regression problem.

The approach McCullagh took was to learn a mapping of items to points on the real number line. Thresholds (one fewer than the number of rating levels) were used to translate the real values to ratings. A visualization of this can be seen on slide 9 of my thesis defense slides. McCullagh conducted learning by maximizing a certain cumulative distribution: a mapping and thresholds were learned so that most items fell within the corresponding rating region. That is, the emphasis was on the boundaries between ratings (the thresholds), not on penalizing distances between items with the same rating. Unlike other approaches to ordinal regression and rating learning, no distance metric was imposed on rating values. McCullagh’s insights were invaluable. However, he was a bit ahead of his time, both in terms of the applicability of his techniques and in terms of the field’s understanding of how to apply regularization to avoid overfitting the model. But McCullagh made a key insight which elegantly handles Netflix VP Jim Bennett’s observation about the variability in how users utilize ratings: by treating ratings as separate categories and using thresholds to define the rating level boundaries, the model is flexible enough to handle a wide variety of usage patterns.
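To make the threshold idea concrete, here is a minimal sketch (not code from any of the papers; the threshold values are made up for illustration) of how per-user thresholds translate a real-valued item score into a rating level:

```python
def score_to_rating(score, thresholds):
    """Return the 1-based rating level whose region contains `score`.

    `thresholds` must be sorted ascending; R-1 thresholds carve the real
    line into R rating regions, as in McCullagh's model.
    """
    rating = 1
    for t in thresholds:
        if score > t:
            rating += 1
    return rating

# A "pessimist" and an "optimist" can assign the same underlying score to an
# item yet report different ratings, because their thresholds differ.
pessimist = [-0.5, 1.0, 2.5, 4.0]   # regions skewed toward low ratings
optimist = [-4.0, -2.5, -1.0, 0.5]  # regions skewed toward high ratings

print(score_to_rating(0.2, pessimist))  # 2
print(score_to_rating(0.2, optimist))   # 4
```

The same score lands in different rating regions for the two users, which is exactly the kind of usage variability Bennett described.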

While working on MMMF, Srebro and I spent some time studying ordinal regression techniques, including what was considered state-of-the-art at the time. We found that the insights McCullagh had over 25 years ago were still highly valuable, but needed updating. The basic structure he proposed—mapping items to real numbers and using thresholds to define rating levels—has proven highly effective, as it encapsulates the structure of the problem without imposing unnecessary assumptions. To understand this, it is necessary to understand that ordinal regression is a generalization of binary classification. The problem of predicting ratings when there are only two rating levels is exactly that of binary classification, the goal of which is to learn the boundary: the difference between items which yield the (e.g.) good/bad labels. Binary classification has been studied extensively in the machine learning literature, to the point that fast, highly effective methods are well known. Tong Zhang and Frank Oles give an excellent summary of such modern methods in their 2001 journal paper. Additional background may be found on the linear classifier Wikipedia page. Srebro and I studied extensions of binary classification to ordinal regression. A key observation we made was that the “loss functions” proposed for ordinal regression were not bounds on the empirical loss; bounding the loss was a key insight in the design of modern classification algorithms. So, we developed a loss function which is a bound on the absolute error (the difference between the observed rating and what the algorithm predicts). The results were phenomenal. In experiments we conducted on the “1 Million” MovieLens data set, our approach significantly outperformed the technique that was considered state-of-the-art at the time (2005). The improvement was such that the other approaches we tested almost never made a better prediction. *p*-values testing the significance of our approach’s out-performance were minuscule: 6e-17 to 7e-22. Using a loss function which bounds the error was clearly key to modeling user ratings.
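To illustrate the bounding property, here is a simplified sketch (not the published formulation; thresholds and values are invented for the example) of an ordinal hinge loss that sums a hinge penalty over every threshold and, as a result, upper-bounds the absolute error between the true and predicted rating:

```python
def hinge(z):
    """Standard hinge: zero when the margin z >= 1, linear otherwise."""
    return max(0.0, 1.0 - z)

def ordinal_hinge_loss(score, true_rating, thresholds):
    """Sum hinge losses over all R-1 thresholds.

    Thresholds below the true rating should be exceeded by `score` (with
    margin); thresholds at or above it should not. Each threshold the score
    lands on the wrong side of contributes at least 1, so the total loss is
    at least the absolute rating error.
    """
    loss = 0.0
    for r, t in enumerate(thresholds, start=1):
        sign = 1.0 if r < true_rating else -1.0
        loss += hinge(sign * (score - t))
    return loss

def predict(score, thresholds):
    return 1 + sum(score > t for t in thresholds)

thresholds = [1.5, 2.5, 3.5, 4.5]   # 5 rating levels
score, true_rating = 1.0, 4          # model badly underestimates this item
abs_err = abs(true_rating - predict(score, thresholds))
assert ordinal_hinge_loss(score, true_rating, thresholds) >= abs_err
```

Minimizing this kind of loss therefore drives down the rating error itself, rather than an arbitrary squared distance between rating values.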

MMMF extends our ordinal regression framework to handle multiple users. The key challenge in extending ordinal regression to multiple users is determining how best to find similarities between users. Similarities in how users rate allow MMMF to predict unobserved ratings. A popular approach to rating prediction is to use Singular Value Decomposition (SVD) to find a low-rank approximation to the observed rating matrix (where unobserved entries are filled in with, e.g., user-average ratings). Reconstructing a matrix using the n largest singular values (found via SVD) yields the rank-n least-squares approximation to the original matrix. There are (at least) two problems with this approach: (1) the least-squares loss function models ratings poorly, both because it establishes a distance between rating levels and because it makes the model focus too much on outliers, and (2) the optimization space is non-convex (i.e. there is no guarantee of finding the globally best solution). Our insight was to use a particular regularization term: the trace norm, or sum of the singular values of a matrix. Similar to the way the L1-norm encourages null weights, the trace norm encourages null singular values in the parameter matrix. The net effect is a low-rank solution without sacrificing convexity in the optimization problem. Experiments we performed on the MovieLens and EachMovie data sets bear out this observation. We compared against the nine different algorithms (including an improved variant of the SVD approach) which Ben Marlin tested in his Master’s thesis and found that MMMF substantially outperformed all of them, on both data sets, according to both evaluation criteria that Marlin proposed.
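A quick numerical sketch of why the trace norm encourages low rank (toy data only; the matrix and regularization weight are invented, and this is the trace norm’s proximal operator rather than the full MMMF optimization): soft-thresholding the singular values pushes the small ones to exactly zero, just as L1 soft-thresholding zeroes small weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 8))  # rank-2 signal
X += 0.01 * rng.standard_normal((6, 8))                        # small noise

U, s, Vt = np.linalg.svd(X, full_matrices=False)
trace_norm = s.sum()                    # trace norm = sum of singular values

lam = 0.1                               # regularization weight (illustrative)
s_shrunk = np.maximum(s - lam, 0.0)     # soft-threshold each singular value
X_low_rank = U @ np.diag(s_shrunk) @ Vt

print(np.linalg.matrix_rank(X))         # full rank: noise fills all directions
print(int(np.sum(s_shrunk > 0)))        # only the two signal directions survive
```

The shrinkage step is convex-friendly: unlike picking the top n singular values (a hard, non-convex choice), penalizing their sum keeps the overall objective convex while still producing a low-rank parameter matrix.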

No recommendation framework is perfect and MMMF is no different. There are always challenges in how to incorporate additional information and make best use of the available information. And, since we are talking about commercial systems, efficiency and scalability are serious concerns.

- Features/efficiency trade-off: additional information about users and items will provide more predictive power, but will also increase overhead and CPU requirements; a trade-off must be struck between efficiency and the number of features used.
- Non-linear prediction: with classification, it is relatively easy to convert a linear classifier to one that learns non-linear decision boundaries via the kernel trick. The kernel trick is not easily applicable to MMMF because parameters of both users and items are learned. In the typical classification scenario, only “user” parameters are learned and the kernel is applied in the forward direction to item features. To use a kernel with MMMF, one would need to calculate the derivative of the inverse of the kernel function at every optimization step; for some kernels, this would simply require additional CPU and memory overhead; for other kernels, this is impossible.
- Integrating multiple user judgments: MMMF deals with ratings, but often other user judgments are available, such as purchases or web page visits; StyleFeeder users each have a bookmark list, providing some insight into their tastes; as do many other systems, we currently take a simple approach to this—treating bookmarked items as equivalent to a 5-star rating; however, MMMF can easily be modified to include other judgments as overlaid classification problems using the same set of parameters, incurring minimal additional overhead; we are currently exploring the best way to apply this idea.
- Providing explanations: users like to know how their actions affect recommendations; recommendations based on simple rules provide less accurate recommendations, but can easily be understood by the user; MMMF better utilizes rating provided by the user, but makes somewhat complex decisions that cannot be explained easily; however, the aspects of MMMF which contribute to recommendations are easily isolated; we are currently investigating the best ways to present these to the user.
- “You’ve Got Recommendations”: MMMF represents user preferences with continuous values, so compared to rule-based systems, it is less clear when a product with a high predicted rating will be a valuable recommendation; some experimentation will be necessary to determine when a predicted rating is “large enough” to warrant highlighting to a user.

At StyleFeeder, we use Fast MMMF as our algorithm for recommendations. Besides its ability to predict ratings so effectively, there are a number of advantages it has for a commercial recommendation system. We also incorporate a number of modifications which further improve upon MMMF’s predictive power and enhance the user experience with respect to recommendations. Both the advantages and modifications are described below.

- Representation of users and items as a (parameter) vector of doubles: avoids the need to access the “training” data when making a recommendation or rating prediction.
- Incremental training: any parameter vector (user or item) can be updated individually; user feedback can be incorporated and new recommendations can be made in seconds—once you’ve signed up for an account at StyleFeeder, go to the recommendations page, rate some items, and reload the page; you’ll see updated recommendations in seconds; incremental training also allows learning to be distributed across machines.
- Compatibility with different rating systems: since MMMF treats ratings as a partial ordering, it can easily incorporate different numbers of rating levels (5-star vs. 10-star), even within a single data set.
- Incorporation of features: any information about users and/or items which can be encoded as a vector of doubles can be used to enhance recommendations; at StyleFeeder, we use gender, age, country and sign-up source as user features and tags as item features; this additional information allows us to provide relevant recommendations quickly relative to the total amount of data we have.
- Personalized search: representation of users and items as vectors of doubles allows us to quickly re-rank search results according to the real-valued scores the system generates for each user-item pair; your search results at StyleFeeder are personalized; if you have rated few or no items, your search results are re-ranked according to overall popularity; as you rate more items, the ranking changes according to your taste.
- StyleTwins: the compact user representation makes comparing a pair of users a fast computation; part of our goal at StyleFeeder is to connect people with compatible styles so that they can help each other shop more effectively; we call the top matches for a user his/her StyleTwins.
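To show what incremental training looks like in spirit, here is a hedged sketch (illustrative names and values throughout; this is not the production StyleFeeder code) of refitting a single user’s parameter vector by gradient descent on an ordinal hinge loss, while all item vectors stay fixed:

```python
import numpy as np

def update_user(user_vec, rated_items, ratings, thresholds, lr=0.05, steps=100):
    """Refit one user's vector against only his/her own ratings.

    Item vectors are held fixed, so this touches no other user's data and
    can run in isolation (or be distributed across machines).
    """
    for _ in range(steps):
        grad = np.zeros_like(user_vec)
        for item_vec, y in zip(rated_items, ratings):
            score = user_vec @ item_vec
            for r, t in enumerate(thresholds, start=1):
                sign = 1.0 if r < y else -1.0
                if sign * (score - t) < 1.0:      # hinge term is active
                    grad -= sign * item_vec
        user_vec = user_vec - lr * (grad / len(ratings))
    return user_vec

def predict(user_vec, item_vec, thresholds):
    score = user_vec @ item_vec
    return 1 + sum(score > t for t in thresholds)

# Toy example: two items, a 3-level rating scale, fixed shared thresholds.
items = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
thresholds = [1.5, 2.5]
u = update_user(np.zeros(2), items, [3, 1], thresholds)
print(predict(u, items[0], thresholds), predict(u, items[1], thresholds))
```

Because only the one user’s vector changes, feedback can be folded in within seconds, which is the property behind the rate-and-reload behavior described above.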

Here at StyleFeeder, we’ve taken MMMF, an algorithm that has proven to be highly effective at predicting preferences based on a user’s rating history, and turned it into a functional, commercial recommendations system. Challenges remain in: gathering and incorporating information on users and items, providing users with explanations of their recommendations, determining the correct cut-off point for notifying users of recommendations, and integrating multiple sources of information. But, we’ve already had a great deal of success. MMMF allows us to take maximum advantage of user ratings, while still providing us with the flexibility of other recommendation engines. Quick updates allow us to provide relevant recommendations to a new user after only a few minutes of rating products. Personalized search means that users can find relevant products even with generic search queries, such as “shoes”. And, the ability to match up users with similar tastes allows us to provide value to users beyond lists of products. We continue to innovate, focusing on maximizing the value to the user while requiring as little explicit user input as possible.


My Style twins don’t appear at all similar in tastes?

Why aren’t these updated with our own recommendations and knowledge?

Hi,

Can MMMF be used for a vector space model search engine, such as SVD or NNMF (non-negative matrix factorisation)? If so, could you please point out any publication where MMMF is used for a search engine? I have Googled, but with no luck at all.

Hi Ruth,

MMMF can certainly be used for search. Here at StyleFeeder, we use it to provide personalized search results, based on the ratings and bookmarks the user has provided.

By “vector space model”, I’m guessing that you mean a dimensionality-reduction of the term-document matrix. We don’t use MMMF for that, but you certainly could. You’d need a different loss function (e.g. one based on the multinomial model). With that and the trace norm regularization term, you’d get a convex objective which yields a low-rank solution given a sufficiently large weight on the regularization term. Do contact me if you have further interest/questions.

Jason

Hi,

Can MMMF be used for recommendation based on binary data (ie: i like/dislike this item) instead of rating data ? If so, could you point out any relevant publication.

I have trouble to figure that out.

Thank you.

Hi Manu,

Yes, MMMF certainly can be used for binary data. MMMF can be applied to data with any number of ordinal labels >= 2. In both the original NIPS MMMF paper, as well as the ICML Fast MMMF paper, you may have noticed that we used `R` to represent the number of ordinal labels. You can simply set R=2 for binary data. This should simplify many of the expressions. For my MMMF MATLAB code, pass 2 for the value of `l` for the gradient/objective function.
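For intuition (an illustrative sketch, not the MATLAB code): with R = 2 there is a single threshold, and the per-rating ordinal hinge loss collapses to the standard binary hinge loss on +1/-1 labels.

```python
def hinge(z):
    """Standard binary hinge loss on a margin z."""
    return max(0.0, 1.0 - z)

def ordinal_loss_r2(score, rating, threshold=0.0):
    """Ordinal hinge loss with R=2 levels: rating 2 = "like", 1 = "dislike".

    With a single threshold, this is exactly binary classification:
    the sign plays the role of the +1/-1 class label.
    """
    sign = 1.0 if rating == 2 else -1.0
    return hinge(sign * (score - threshold))

# Matches the binary hinge on signed margins:
assert ordinal_loss_r2(0.7, 2) == hinge(0.7)
assert ordinal_loss_r2(-0.3, 1) == hinge(0.3)
```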

Jason

I had some good success at eBay/Shopping.com using ranking SVMs. So let me ask a few questions.

Is there now an open source implementation that others can use, or do I need to formulate this myself? I assume you have a high-performance implementation…did you end up solving the dual or the primal? Also, how would you compare using MMMF vs., say, using a combination of SVD + a Transductive SVM (that is, guess the labels, then use a transductive SVM to solve…and, yes, I know MMMF is convex and this is not…but I can do this with existing code)? Overall, nice to see this; it is encouraging.