Monday, March 02, 2009

Chasing the Netflix prize

As I'm naturally interested in algorithms, especially those related to Machine Learning or Artificial Intelligence, I'm posting some results here from my research into the Netflix prize so far. My results are not very impressive yet, but I'm gaining lots of new insights into the behaviour of Machine Learning algorithms and into how their parameters relate and what they do. On the left you can see an attempt at lowering the Root Mean Square Error, or RMSE. That is the key metric to minimize in the Netflix prize. RMSE is calculated as follows:
  1. Predict the rating a user would give a certain movie, with one decimal of accuracy.
  2. Compare the predicted rating with the real rating that we know (training data).
  3. Multiply the error by itself (square it) and add that to a running sum.
  4. Divide the sum of squared errors by the number of predictions made.
  5. Take the square root of that average.
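The five steps above can be sketched in a few lines of Python (the function name and the sample ratings are my own illustration, not contest code):

```python
import math

def rmse(predictions, actuals):
    """Root Mean Square Error over paired predictions and known ratings."""
    # Steps 2-3: squared error per prediction, accumulated into a sum
    sse = sum((p - a) ** 2 for p, a in zip(predictions, actuals))
    # Step 4: mean of the squared errors; step 5: square root of that average
    return math.sqrt(sse / len(predictions))

# Illustrative predictions vs. known ratings on the 1-5 star scale
print(rmse([3.8, 2.1, 4.5], [4, 2, 5]))  # about 0.316
```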
Netflix's own system initially had a performance of 0.9514 RMSE. Their challenge is to improve this by 10%, down to 0.8563 RMSE.

In the picture above, you can see a blue line, which is a very potent reduction of RMSE on the training set. It actually follows a perfect logarithmic regression. However (and I don't have enough probe points in red in the graph to demonstrate this), the performance on data that is unknown to the algorithm (a probe set to test against, or the qualifying set) is gradually getting worse, much worse. That means the RMSE for the probe and qualifying sets grows larger as training progresses. In other words, the parameters are overfitting the training set very heavily, yielding negative returns on the other data sets.
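One standard answer to exactly this kind of overfitting is to watch the probe RMSE during training and stop once it no longer improves. A minimal sketch of that idea (the function name and the `patience` parameter are my own illustration, not something I actually run):

```python
def best_epoch(probe_rmse, patience=2):
    """Pick the epoch with the lowest probe RMSE, giving up after the
    probe score has failed to improve for `patience` epochs in a row."""
    best, best_i, bad = float("inf"), 0, 0
    for i, r in enumerate(probe_rmse):
        if r < best:
            best, best_i, bad = r, i, 0  # new best: reset the counter
        else:
            bad += 1
            if bad >= patience:          # probe RMSE keeps rising: stop
                break
    return best_i, best

# Training RMSE would keep falling here, but the probe turns around at epoch 2
print(best_epoch([0.98, 0.95, 0.93, 0.94, 0.96]))  # (2, 0.93)
```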

Looking at the projection of the red line, the RMSE of the separated probe (known ratings, but excluded from the training set) is not converging at any point. My conclusion is that I need to think of something else.

In some later iterations that I'm not showing here, it's interesting to see that the RMSE on the training set decreases logarithmically, whilst the RMSE on the probe/qualifying set is neither linear nor logarithmic. The probe RMSE starts off roughly linear, but may turn logarithmic after some time. At the moment, I'm at position 1120 on the leaderboard, but I've only scratched the surface of this problem; I submitted those results based on a Matrix Factorization (MF) algorithm with only 16 features. As a comparison, some other people have submitted results using over 600 features, calculated over 120 epochs (training passes per feature) or so.
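For reference, a Matrix Factorization of the kind described, training one feature at a time with stochastic gradient descent as popularized for this contest by Simon Funk, can be sketched roughly as below. The hyperparameters, the flat 0.1 initialization and the function names are illustrative assumptions, not my actual settings:

```python
def train_mf(ratings, n_users, n_movies, n_features=16,
             epochs=120, lr=0.001, reg=0.015):
    """Funk-style matrix factorization trained by SGD, one feature at a time.
    ratings is a list of (user, movie, rating) triples."""
    U = [[0.1] * n_features for _ in range(n_users)]   # user feature matrix
    M = [[0.1] * n_features for _ in range(n_movies)]  # movie feature matrix
    for f in range(n_features):        # train feature f...
        for _ in range(epochs):        # ...for a fixed number of epochs
            for u, m, r in ratings:
                pred = sum(U[u][k] * M[m][k] for k in range(n_features))
                err = r - pred
                uf, mf = U[u][f], M[m][f]
                # Regularized gradient step on this feature only
                U[u][f] += lr * (err * mf - reg * uf)
                M[m][f] += lr * (err * uf - reg * mf)
    return U, M

def predict(U, M, u, m):
    """Predicted rating: dot product of user and movie feature vectors."""
    return sum(a * b for a, b in zip(U[u], M[m]))
```

The regularization term `reg` is what keeps the feature values from growing without bound, which is one common lever against the overfitting described above.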
