Sunday, April 19, 2009

Temporal factors in Netflix data

Like many other teams, I'm now looking at modeling temporal effects in the Netflix data set. I'm using a home-grown SVD system, initially based on Funk's, to which I quickly added some other parameters to make the RMSE sink faster. I'm now at spot 268 with 0.8975, and I'm re-running my machine to get lower than that with some new additions.

Time is a bit of a hurdle, but I've got some ideas to get this right. The score above comes from a model with sub-optimal parameter settings, left over from playing around with some other experiments. After that run I was too lazy to run it again the proper way.

So... I'm also going to look at user temporal differences. Here's an example of the weirdest user in the set (the one with the most time on his hands). Interestingly, this user, bot or agent started out with a "proper" average of 3.4055 in 2001-2002, but then suddenly started using Netflix differently. If you look for the posts by a psychologist on Netflix, you'll understand what I mean. Some guys like to use 1.0 as "movie has no discernible, interesting features whatsoever" (and blockbusters may well fall into that), such that 1.0 doesn't really mean "bad", it just means "not very interesting, and I switched the tele off". That basically means the user has collapsed the 1.0-3.0 part of the scale into a single 1.0 and uses the higher ratings only to mark better-than-average movies.

+-------+------------+---------+------------+------------+
| count | avg_bucket | globavg | start_date | end_date   |
+-------+------------+---------+------------+------------+
|   402 |     3.4055 |  1.9082 | 2001-09-23 | 2002-09-23 |
|  9698 |     2.3414 |  1.9082 | 2002-09-23 | 2003-09-23 |
|  4022 |     1.3993 |  1.9082 | 2003-09-23 | 2004-09-23 |
|  3296 |     1.1271 |  1.9082 | 2004-09-23 | 2005-09-23 |
|   235 |     1.1319 |  1.9082 | 2005-09-23 | 2006-09-23 |
+-------+------------+---------+------------+------------+
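
For illustration, here is a minimal Python sketch of how yearly buckets like these could be computed for a single user. It is not the query that produced the table. The function names (`bucket_start`, `yearly_buckets`) and the input format (a list of `(date, rating)` pairs) are made up for this example, and treating each bucket as half-open (start date included, end date excluded) is an assumption, since the table doesn't say which side the boundary falls on.

```python
from collections import defaultdict
from datetime import date


def bucket_start(d, month=9, day=23):
    """Return the start of the yearly bucket (anchored on 23 September) containing d."""
    start = date(d.year, month, day)
    return start if d >= start else date(d.year - 1, month, day)


def yearly_buckets(ratings):
    """Group one user's (date, rating) pairs into yearly buckets.

    Assumption: buckets are half-open, [start_date, end_date).
    """
    ratings = list(ratings)
    if not ratings:
        return []
    # The user's average over all ratings, repeated on every row like globavg.
    glob_avg = sum(r for _, r in ratings) / len(ratings)

    groups = defaultdict(list)
    for d, r in ratings:
        groups[bucket_start(d)].append(r)

    rows = []
    for start in sorted(groups):
        vals = groups[start]
        rows.append({
            "count": len(vals),
            "avg_bucket": sum(vals) / len(vals),
            "globavg": glob_avg,
            "start_date": start,
            "end_date": date(start.year + 1, start.month, start.day),
        })
    return rows
```

Buckets without any ratings simply don't show up in the result, which matches the way empty periods are absent from the tables in this post.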

So, one of my experiments was actually to re-center the ratings around the user average using some epsilon function. The idea was that the same rating means different things for different users, so rescaling each user's ratings to the global semantics sounded like a good idea. Unfortunately, it didn't work out well at all. Maybe I'll get back to that idea later though.
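
To make the re-centering idea concrete, here is a rough Python sketch of the simplest possible variant: shift each user's ratings so that the user's mean lands on the global mean, then clip back to the 1-5 scale. This is an assumption for illustration only and not the epsilon function mentioned above. The name `recenter` and its parameters are hypothetical.

```python
def recenter(user_ratings, global_mean, lo=1.0, hi=5.0):
    """Shift one user's ratings so their mean matches global_mean.

    Hypothetical stand-in for the re-centering experiment: a plain
    mean shift followed by clipping to the [lo, hi] rating scale.
    """
    if not user_ratings:
        return []
    user_mean = sum(user_ratings) / len(user_ratings)
    shift = global_mean - user_mean
    # Clipping means the shifted mean may not hit global_mean exactly.
    return [min(hi, max(lo, r + shift)) for r in user_ratings]
```

For a user like the first one above, a single shift for the whole history would not help much, because the user's own average moves from 3.4 to 1.1 over the years. A per-bucket shift would be the obvious next thing to try.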

Looking again at the results above, and at many others, I do see some kind of trend in people using the ratings differently over time. Here's another:

+-------+------------+--------+---------+------------+------------+
| count | avg_bucket | stddev | globavg | start_date | end_date   |
+-------+------------+--------+---------+------------+------------+
|  3452 |     3.3566 | 0.6829 |  3.2761 | 2003-09-23 | 2004-09-23 |
|  1968 |     3.1443 | 0.5545 |  3.2761 | 2004-09-23 | 2005-09-23 |
|   191 |     3.1780 | 0.6862 |  3.2761 | 2005-09-23 | 2006-09-23 |
+-------+------------+--------+---------+------------+------------+

Again, a user who's slightly changing how he uses the system. Probably, at some point he decides he has handed out a lot of 4's and 5's, after which 2's and 3's become more prominent. This takes some time of course. And what about this one:

+-------+------------+--------+---------+------------+------------+
| count | avg_bucket | stddev | globavg | start_date | end_date   |
+-------+------------+--------+---------+------------+------------+
|   238 |     5.0000 | 0.0000 |  5.0000 | 2004-09-23 | 2005-09-23 |
+-------+------------+--------+---------+------------+------------+

LOL. We can reasonably assume that people choose movies that they like to watch (so in general, their average should be higher than 3). But this is ridiculous :).

Another interesting thing. Netflix has told us that they've frobbed the data here and there for anonymization purposes. But watch this:

+-------+------------+------------+------------+-----------------------------------+
| count | avg_bucket | start_date | end_date   | title                             |
+-------+------------+------------+------------+-----------------------------------+
|     5 |     4.6000 | 2001-06-21 | 2001-09-23 | Lord of the Rings: The Two Towers |
|     1 |     4.0000 | 2001-09-23 | 2001-12-21 | Lord of the Rings: The Two Towers |
|     4 |     4.5000 | 2002-03-21 | 2002-06-21 | Lord of the Rings: The Two Towers |
|    11 |     4.7273 | 2002-06-21 | 2002-09-23 | Lord of the Rings: The Two Towers |
|    33 |     4.8788 | 2002-09-23 | 2002-12-21 | Lord of the Rings: The Two Towers |
|   651 |     4.7496 | 2002-12-21 | 2003-03-21 | Lord of the Rings: The Two Towers |
|   441 |     4.8413 | 2003-03-21 | 2003-06-21 | Lord of the Rings: The Two Towers |
|  8342 |     4.4839 | 2003-06-21 | 2003-09-23 | Lord of the Rings: The Two Towers |
| 17065 |     4.3618 | 2003-09-23 | 2003-12-21 | Lord of the Rings: The Two Towers |
| 13057 |     4.3979 | 2003-12-21 | 2004-03-21 | Lord of the Rings: The Two Towers |
| 13027 |     4.4752 | 2004-03-21 | 2004-06-21 | Lord of the Rings: The Two Towers |
| 10855 |     4.4567 | 2004-06-21 | 2004-09-23 | Lord of the Rings: The Two Towers |
| 15302 |     4.5075 | 2004-09-23 | 2004-12-21 | Lord of the Rings: The Two Towers |
| 22276 |     4.4670 | 2004-12-21 | 2005-03-21 | Lord of the Rings: The Two Towers |
| 17546 |     4.4310 | 2005-03-21 | 2005-06-21 | Lord of the Rings: The Two Towers |
| 19335 |     4.4759 | 2005-06-21 | 2005-09-23 | Lord of the Rings: The Two Towers |
| 12574 |     4.5484 | 2005-09-23 | 2005-12-21 | Lord of the Rings: The Two Towers |
|   655 |     4.5603 | 2005-12-21 | 2006-03-21 | Lord of the Rings: The Two Towers |
+-------+------------+------------+------------+-----------------------------------+

Lord of the Rings: The Two Towers came out in cinemas on 18-12-2002 or so. Whoops. That's some 50-60 ratings (possibly more) with dates before the release date of the movie. Any team that's working with temporal effects should take that into account: the date of a movie's first rating doesn't mean much. Possibly, these dates should all be set to at least the release date of the movie first. Since DVDs aren't immediately available on Netflix, some more time would elapse before you could see them there. Not sure how much. The DVD came out August/November 2003, but rental outlets probably get them sooner, somewhere in between? Let's say March or so, it's impossible to tell. In any case, the rating dates aren't really useful for temporal calculations in a case like this (not with any kind of precision anyway). Moreover, if you're combining production year and rating dates, you should not assume that all rating dates fall after the movie's production year.
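
The "set to at least the release date" idea could look something like the Python sketch below. The function name `clamp_to_release` is made up for this example, and the release date has to come from an external source, since the sketch assumes the data set only gives a year per movie.

```python
from datetime import date


def clamp_to_release(rating_dates, release_date):
    """Move every rating date that precedes release_date up to release_date.

    Returns the clamped dates plus the number of dates that were moved,
    which is a quick check on how suspicious a movie's dates are.
    """
    clamped = [max(d, release_date) for d in rating_dates]
    n_early = sum(1 for d in rating_dates if d < release_date)
    return clamped, n_early


# Hypothetical usage with the approximate cinema release date from the text.
two_towers_release = date(2002, 12, 18)
dates = [date(2001, 7, 1), date(2002, 11, 30), date(2003, 9, 1)]
clamped, n_early = clamp_to_release(dates, two_towers_release)
```

Clamping only removes the impossible dates. It does not recover when the rating was really given, so the early ratings remain imprecise for any temporal model.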
