Tuesday, November 17, 2009

Sane Value Decomposition

Oh no! Another article about SVD! Well, this one is not specifically about SVD itself; it describes some of my theories about the parametrization of SVD, why so many people get different results, and possibly why I didn't win! :).

To explain my theories, we first need to understand SVD in some reasonable way. The SVD formula goes:
A = U Σ V.T

This means that the above (for the purposes of this explanation square) matrix is decomposed into three equally sized matrices U, Σ and V. U and V are orthogonal matrices; the diagonal matrix Σ contains the so-called singular values.

Put more intuitively, the U matrix describes the rows of the original matrix, whereas the V matrix describes the columns. Take row i of U and column j of the transposed V matrix (V.T), multiply them element by element, weight each product by the corresponding value on the diagonal of Σ, and sum: that gives you the value of cell (i,j) in the original A. So SVD on a matrix A gives an exact representation of the values in A, just in a different, 'decomposed' way.
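As a minimal sketch of that cell-wise reconstruction (using numpy; the small random matrix here is just a stand-in for whatever data you have):

import numpy as np

A = np.random.randint(1, 6, size=(5, 5)).astype(float)  # random 1-5 'ratings'
U, s, Vt = np.linalg.svd(A)

i, j = 2, 3
# A[i, j] is the sum over all dimensions k of U[i, k] * s[k] * Vt[k, j]
reconstructed = sum(U[i, k] * s[k] * Vt[k, j] for k in range(len(s)))
print(A[i, j], reconstructed)  # the two values match up to float precision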

But we don't have to keep everything fully dimensional: having all three matrices around means we need roughly three times more storage than the original one, and that's not the point. The idea is to reduce the dimensions of each matrix and keep only the first x of them around. x should still be a reasonable fraction of the rank (the row/column size), though, because if x is too low relatively speaking, you're not getting good results. I will show how this all works in more detail using Python, a language that's very quickly becoming my favourite for experiments!
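And a small sketch of the truncation itself, here with numpy and an arbitrarily chosen k:

import numpy as np

A = np.random.randint(1, 6, size=(20, 20)).astype(float)
U, s, Vt = np.linalg.svd(A)

k = 5  # keep only the first k dimensions
A_k = np.dot(U[:, :k] * s[:k], Vt[:k, :])

# The higher k is, the smaller the reconstruction error gets
rmse = np.sqrt(np.mean((A - A_k) ** 2))
print(rmse)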

Example:

This is a square matrix with random values between 1 and 5, of n x m = 5 x 5 dimensions, so 25 numbers in total.
[ [3,5,2,4,4],
[3,2,5,5,3],
[4,4,5,4,3],
[2,3,4,5,2],
[1,5,5,2,2] ]
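For reference, a decomposition like the one listed further below can be reproduced with numpy roughly as follows (the signs of individual columns may flip, which is normal for SVD, and the output formatting differs slightly):

import numpy as np

A = np.array([[3, 5, 2, 4, 4],
              [3, 2, 5, 5, 3],
              [4, 4, 5, 4, 3],
              [2, 3, 4, 5, 2],
              [1, 5, 5, 2, 2]], dtype=float)

U, sigma, Vt = np.linalg.svd(A)
print(sigma)  # the singular values, largest first
print(U)      # the row factors ('mw' below)
print(Vt.T)   # the column factors ('cw' below), one row per column of A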
One thing I discovered (which many sites probably mention in less readable terms :) is that when the matrix becomes relatively large, the first singular value on the diagonal approaches the average of the matrix multiplied by the square root of the total number of cells (in the case above, the average is 3.48):
Σ(1) ≈ avg * sqrt( number_of_cells )
The second singular value for larger, irregular matrices is approximated by:
Σ(2) ≈ sqrt( 2 * Σ(1) )
So there certainly is a pattern here. Each singular value in the decomposition acts as a gain factor. So, if the matrix is really, really large, the singular values end up in the order of thousands, and the values in the first columns of U and V correspondingly end up in the order of 1/1000, in the case of this particular range for the Netflix data (1-5 across the matrix).
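A quick, rough check of that rule of thumb (the sizes and the 1-5 range are arbitrary choices here):

import numpy as np

for n in (10, 100, 1000):
    A = np.random.randint(1, 6, size=(n, n)).astype(float)
    sigma1 = np.linalg.svd(A, compute_uv=False)[0]
    estimate = A.mean() * np.sqrt(A.size)
    print(n, sigma1, estimate)  # the two values get closer as n grows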

Here are the actual singular values when the matrix is decomposed:
sigma = mat( [ 17.76224954, 3.5139895, 3.16968094, 1.60407331, 0.73105457 ] )
And here are U and V:
mw = mat( [[-0.44379232, -0.33273609, -0.80072792,  0.11314317, -0.19587877],
           [-0.4616133,   0.58960592,  0.15588499, -0.08007446, -0.63919166],
           [-0.5053614,   0.05135939,  0.03629005, -0.69454331,  0.50819749],
           [-0.41956887,  0.28391073,  0.09075436,  0.69668699,  0.49974748],
           [-0.39816248, -0.67705869,  0.57007135,  0.11412068, -0.21225762]] )

cw = mat( [[-0.33638547,  0.24667399, -0.32741104, -0.73031105,  0.43063272],
           [-0.47365356, -0.80039862, -0.13379567,  0.1795791,   0.29131499],
           [-0.52873629,  0.082443,    0.81168849, -0.18044758, -0.14913599],
           [-0.50662787,  0.5372694,  -0.21592459,  0.61450025,  0.17445861],
           [-0.3553354,  -0.05530591, -0.4116298,  -0.15564457, -0.82280841]] )
Notice how in the U and V matrices the numbers stay at roughly the same magnitude within each dimension (read the rows; each column further to the right is one dimension deeper). This can be done because, if you follow the implementation of SVD correctly, the singular value is the gain factor. So choosing the correct 'gain' is essential for the proper functioning of the model, or you need to adjust the parameters in each dimension.

What some people did for Netflix, for example, is to assume a gain of 1.0 (the singular values were not used) and then initialize each dimension to some 'low' value. Naturally, these 'low' values really are low in the beginning, but once the model starts to hit the deeper dimensions, you may suddenly end up with a gain factor that is above the corresponding singular value of the actual matrix Σ. That means the model immediately becomes unstable and will start producing very high values.

Let's look at a regular matrix:
[[4 4 4 4 4]
[5 5 5 5 5]
[3 3 3 3 3]
[2 2 2 2 2]
[1 1 1 1 1]]

U:
[[-0.54446573 -0.21442291 0.76382282 -0.25982498 0.08152026]
[-0.65208362 -0.46955287 -0.55122106 0.17692309 0.13842191]
[-0.3619848 0.49755586 0.15572615 0.71006801 -0.30489008]
[-0.31846257 0.36154464 -0.26944015 -0.60360543 -0.57526478]
[-0.21422565 0.59604242 -0.12602146 -0.1807017 0.74182633]]
sigma:
[ 17.71717246 1.04811144 0. 0. 0. ]
VT:
[[-0.53913501 0.50275905 0.61561055 -0.13109677 0.24577241]
[-0.40915723 -0.26153467 0.04808419 0.872681 -0.01748586]
[-0.35699995 -0.56822929 -0.1744093 -0.31507809 0.64805378]
[-0.37793419 -0.44513203 0.23939904 -0.33777929 -0.69829541]
[-0.52119151 0.39724791 -0.72868447 -0.08872685 -0.17804491]]
Whoops, what happened? We now only have two dimensions left; from the third dimension onwards all values are 0. This is because the matrix is so regular that it can be described perfectly with two factors (a quick check of this follows below the list). Hmmm.... So if there is good regularity in the data, we need fewer numbers to represent the matrix! Of course, the above is taken to the extreme, but looking at recommendation systems, there certainly are regularities:
  • Some users are very constant in their ratings and generally use a scale of 4-5 for everything.
  • Nobody wants to rent bad items, those are generally avoided, so the likelihood that someone gets an unexpectedly bad item is relatively low. That means that relatively speaking, ones and twos occur less often than 3-5.
In essence, you also don't get a lot of information if a row or column contains only 5's. The information is actually in the irregularities (or in temporal trends, but that's another story! :).
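Here is the quick check promised above: keeping only the leading dimensions of the regular matrix reproduces it exactly (a numpy sketch):

import numpy as np

A = np.array([[4, 4, 4, 4, 4],
              [5, 5, 5, 5, 5],
              [3, 3, 3, 3, 3],
              [2, 2, 2, 2, 2],
              [1, 1, 1, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A)
k = 2  # keep only the leading dimensions; the rest have (near-)zero singular values
A_k = np.dot(U[:, :k] * s[:k], Vt[:k, :])
print(np.allclose(A, A_k))  # True: the regular matrix is reproduced exactly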

More on the theory... Suppose we stick with the irregular 5x5 matrix above. It's basically random. I wanted to find out how fast I can converge to this 5x5 matrix using both some clever initialization techniques and gradient descent. (Gradient descent is where you assume some random factors mw/cw, calculate your prediction using those factors, compare it with the real number, and adjust the factors by the error you get from the comparison. This way, you can "train" the decomposed matrices U and V individually without having to store the entire matrix A in memory, which you don't have anyway, except for a small percentage of it, say 1%.)
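Here is a minimal sketch of such a training loop (the learning rate, epoch count and feature count are arbitrary, untuned choices, and all features are trained at once rather than dimension by dimension):

import numpy as np

A = np.array([[3, 5, 2, 4, 4],
              [3, 2, 5, 5, 3],
              [4, 4, 5, 4, 3],
              [2, 3, 4, 5, 2],
              [1, 5, 5, 2, 2]], dtype=float)

n_features = 5   # full rank for this 5x5 example
lrate = 0.01     # arbitrary, untuned
mw = np.random.uniform(-0.1, 0.1, (A.shape[0], n_features))  # row factors
cw = np.random.uniform(-0.1, 0.1, (A.shape[1], n_features))  # column factors

for epoch in range(5000):
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            err = A[i, j] - np.dot(mw[i], cw[j])  # prediction error for this cell
            old_mw_i = mw[i].copy()
            mw[i] += lrate * err * cw[j]          # nudge the row factors...
            cw[j] += lrate * err * old_mw_i       # ...and the column factors

rmse = np.sqrt(np.mean((A - np.dot(mw, cw.T)) ** 2))
print(rmse)  # should drop well below the ~1.41 starting error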

When you look at the matrix, the first feature / component that you calculate needs to make the jump from 0 to somewhere in the 1-5 region. And all numbers 1-5 are positive. This is good, because you know that the two factors need to be either both positive or both negative. Make your choice. A good approximation for each feature further on is the sqrt( average ) of the respective column and row. This is because, in the final calculation, the value in the cell is hopefully already quite well approximated by:
Aij ≈ Ri1 * Cj1 * Σ1
In practice, for slightly larger matrices that are totally random, the root mean square error of each such estimate vs. the real value will be around 1.41 or so. Of course, this is a function of the range used in the actual matrix A. It also shows that if the matrix is highly irregular, you're not converging very quickly to an error rate where the model becomes useful for making nice approximations.
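A rough way to see where that 1.41 comes from, assuming a uniformly random 1-5 matrix and the Ri1 * Cj1 * Σ1 style estimate from above (for uniform 1-5 values the spread around the average is about sqrt(2) ≈ 1.41):

import numpy as np

A = np.random.randint(1, 6, size=(200, 200)).astype(float)

row_avg = A.mean(axis=1, keepdims=True)
col_avg = A.mean(axis=0, keepdims=True)
estimate = np.sqrt(row_avg * col_avg)   # the Ri1 * Cj1 * Sigma1 style estimate

rmse = np.sqrt(np.mean((A - estimate) ** 2))
print(rmse)  # roughly 1.41 for uniformly random 1-5 values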

Learning matrices is a tricky operation. Even for the matrix from above, which has full information available, is square and is relatively easy to learn, the model becomes unstable if the learning rate is too high and at some point the RMSE increases exponentially. Ideally, you start from a good initial point, so that gradient descent only needs to settle into the minimum, but this is only possible in the starting situation. A good initialization of a model uses the following formulas (a code sketch follows a bit further below):
  1. Calculate GLOB_AVG from entire matrix
  2. Σ1 = GLOB_AVG * sqrt( num_cells )
  3. Ri1 = sqrt( Ravg / Σ1 )
  4. Cj1 = sqrt( Cavg / Σ1 )
  5. lrate_first_cycle == slow (it's almost there) == 1 / Σ1
  6. INIT = ( GLOB_AVG / ( 2 * sqrt( 2 * GLOB_AVG * sqrt( num_cells ) ) ) )
  7. Rix (x>1) = rand( -INIT, INIT )
  8. Cjx (x>1) = rand( -INIT, INIT )
  9. lrate_other_cycles == depends on the data :)
I make that last remark because I don't yet have sufficient empirical insight into how the learning rate should be calculated. It also depends on the objective. If you have full data available and you just want to decompose quickly, you might go for fast, high learning rates. For others, accuracy is more important, and so is not having too many factors lying around; then a fast enough, but much slower learning rate is more appropriate.
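A minimal sketch of that initialization recipe (the names GLOB_AVG and INIT follow the list above; the fully observed random matrix and the feature count are just assumptions for illustration):

import numpy as np

A = np.random.randint(1, 6, size=(50, 50)).astype(float)  # stand-in training matrix
n_features = 10

glob_avg = A.mean()                                    # 1. global average
sigma1 = glob_avg * np.sqrt(A.size)                    # 2. first 'gain'
R = np.zeros((A.shape[0], n_features))                 # row factors (Ri)
C = np.zeros((A.shape[1], n_features))                 # column factors (Cj)
R[:, 0] = np.sqrt(A.mean(axis=1) / sigma1)             # 3. first row feature
C[:, 0] = np.sqrt(A.mean(axis=0) / sigma1)             # 4. first column feature
lrate_first = 1.0 / sigma1                             # 5. slow rate for the first cycle
init = glob_avg / (2 * np.sqrt(2 * glob_avg * np.sqrt(A.size)))          # 6. INIT
R[:, 1:] = np.random.uniform(-init, init, (A.shape[0], n_features - 1))  # 7.
C[:, 1:] = np.random.uniform(-init, init, (A.shape[1], n_features - 1))  # 8.
# 9. the learning rate for the later cycles is left open, as discussed above

# Check the starting point: Aij ≈ Ri1 * Cj1 * Sigma1
pred = np.outer(R[:, 0], C[:, 0]) * sigma1
print(np.sqrt(np.mean((A - pred) ** 2)))               # RMSE of the initial model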

I'm still looking into matrices where the data is imbalanced, e.g. 17000 ratings in a single row vs. sometimes only 1-5 ratings in a column. The problem here is that such a row is updated 17000 times versus an update frequency of 5 for a column. Obviously, this can very quickly unbalance the model, and maybe trying out different learning rates, one for columns and one for rows, is an answer. Some practical experiments there have however shown that this destabilizes the model even further... so what can you do?? :).
