Sunday, May 31, 2009

Ciudad de Mexico

Well, our day started early. I had expected to sleep well into the morning, but for some reason it was impossible to get any good sleep. The hotel here in Santa Fé does offer a great breakfast; it's mostly geared towards business people. Santa Fé is quite a new part of Mexico City and hosts a number of high-rise buildings for the high-tech companies setting up shop here. Very close by is the largest shopping centre in MC right now, Centro de Santa Fé, with some very fashionable and expensive shops, e.g. boutiques of Ermenegildo Zegna and Armani, just to name a few.

Anyway, after breakfast we managed to sort out a taxi and drove through MC along Reforma to get to the city centre. Reforma, just before Polanco (the area of the city with the fancy restaurants, expensive clubs, etc.), is also the avenue lined with houses of government officials, embassies, you name it. So the odd expensive house can be seen left and right. Just past this area is Chapultepec, the park with the auditorium. Due to time constraints, we weren't able to visit it.

We soon arrived at the Zócalo. It's a large plaza in the middle of the city, surrounded by some of the best-known buildings, so take out your camera and shoot. Sadly the golden hour had just passed, so the lighting was already creating some difficult situations. Right on the Zócalo is the Catedral Metropolitana, the oldest cathedral in the Americas:

It's impressive. The glass in the windows is not leaded or coloured, which is a bit of a shame; otherwise it'd be great to bring a tripod next time for a High Dynamic Range session. The altar and other views are equally amazing.

The other buildings around the square are very old, and this area of the city is very appealing and certainly to be recommended. You do get the occasional tout for taxis or other tourist stuff, and there's probably some pickpocketing going on, but I don't get the idea that the area is unsafe. As a matter of fact, you see plenty of Mexicans and other tourists walking around this part of town with their cameras in full view. Police are also present in abundance.

Well, taking a little stroll from the plaza towards the west, we passed some very old houses and streets. There's a tower here, the Torre Latinoamericana, which goes up some 44 floors and was built in 1956. It isn't the prettiest tower in the world :) (don't mention this blog whilst I'm up there), but it provides a superb view of the city, which also shows its immense size. Wherever you look, the faintest houses in the distance fade into the horizon mist. Mexico City is huge, and the metropolitan region is roughly the same size as that of São Paulo.

And standing on top of that tower, we spotted another beautiful building right next to us, the Palacio de Bellas Artes, an opera house. A place of great architecture, something to definitely see in more detail. Since we're trying to see as much as we can, we've decided to come back another time.
So we moved on through the park towards other places. I enjoyed that little walk. There are plenty of shops, restaurants and stalls on the sides to grab some water or food. Other than that, it's quite stereotypical, although I found this park nicer than many other parks I've seen in other big cities.

At the other end of the park was a taxi stand and since we needed to prepare for the rest of our journey, we decided to check out the shopping center in Santa Fé. This is the largest in the city now and very well developed. There are three floors spanning the entire area with everything you'd generally expect in a mall, so I won't digress too much.

We'll be travelling on to Cancún tomorrow, which I think is a bit of a shame. This city offers so much to visit and see, and if you also visit the really interesting villages and temples around Mexico City, I'm sure it's possible to stay for at least a month. Anyway, not to worry. Can't complain about Cancún, can you?

Saturday, May 30, 2009

unas papapas, unas pepeiras


Just arrived in Mexico City after a long 11-hour flight. This was taken on the way from the airport to the hotel. Tomorrow I should have time to go around the city and take some cool pictures. Tonight it's just getting some good food from the restaurant here, turning on the telly, enjoying some wine and getting ready for tomorrow.

I haven't seen face masks around the city here; only paranoid tourists just arriving at the airport use them. Some general precautions are necessary to reduce the chances of infection from the virus:
  1. Don't stay around sneezing people that look sick :). Don't get into crowded areas or public transport.
  2. Wash hands regularly.
  3. Don't touch your face with your hands when going through public places.
  4. Eat healthily and get plenty of vitamins. Make sure to sleep properly.
The weather in Mexico is cloudy at the moment. Actually, the weather in Holland is better right now... The temperature is a constant 27 degrees or so, so that's cool. Not as hot as Brazil though, plus Recife is on the coast and this is further inland. The city here is huge; there aren't too many high-rise buildings, so just like London it sprawls across the countryside in ridiculous proportions.

Tuesday, May 12, 2009

Why RBMs are so strangely weird

I'm getting quite obsessed with RBMs for some strange reason. There's a strange simplicity to the RBM, a very elegant way of learning through contrastive divergence, and a strange ability to model many different things. Current research certainly shows that RBMs have limitations, but let's try to expand on that here.

An RBM is a very strange kind of neural network. Artificial neural networks as we commonly know them push the signal in a forward direction only, but RBMs work in both a forward and a backward direction. In a sense, you could say this is a little similar to our minds: when we observe something, we use the details of the input signal to build up the actual observation, but at the same time we use information from our experience to enrich, or anticipate, what is being observed. I reckon that if we were to rely only on the observed state, that state wouldn't be nearly as rich as our mentally induced state, which blends our experience with our observations.
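
To make that forward/backward idea concrete, here's a minimal sketch in C++ of one contrastive divergence (CD-1) step for a tiny binary RBM. The sizes, the learning rate and the omission of bias terms are my own simplifications for illustration, not taken from any particular implementation:

#include <cmath>
#include <cstdlib>
#include <vector>

const int NUM_VISIBLE = 6;
const int NUM_HIDDEN = 3;
const float LEARNING_RATE = 0.1f;    // assumed value, tune as needed

float sigmoid( float x ) { return 1.0f / (1.0f + std::exp( -x )); }
float randomUniform() { return (float)std::rand() / (float)RAND_MAX; }

// weights[v][h]; bias terms are left out to keep the sketch short.
void cd1Step( std::vector< std::vector<float> >& weights, const std::vector<float>& visible0 )
{
    std::vector<float> hidden0( NUM_HIDDEN ), visible1( NUM_VISIBLE ), hidden1( NUM_HIDDEN );

    // Forward ("recognition") pass: visible -> hidden, then sample binary states.
    for ( int h = 0; h < NUM_HIDDEN; h++ ) {
        float act = 0.0f;
        for ( int v = 0; v < NUM_VISIBLE; v++ ) act += visible0[v] * weights[v][h];
        hidden0[h] = ( randomUniform() < sigmoid( act ) ) ? 1.0f : 0.0f;
    }

    // Backward ("fantasy") pass: hidden -> reconstructed visible probabilities.
    for ( int v = 0; v < NUM_VISIBLE; v++ ) {
        float act = 0.0f;
        for ( int h = 0; h < NUM_HIDDEN; h++ ) act += hidden0[h] * weights[v][h];
        visible1[v] = sigmoid( act );
    }

    // Second forward pass, this time on the reconstruction.
    for ( int h = 0; h < NUM_HIDDEN; h++ ) {
        float act = 0.0f;
        for ( int v = 0; v < NUM_VISIBLE; v++ ) act += visible1[v] * weights[v][h];
        hidden1[h] = sigmoid( act );
    }

    // CD-1 update: positive phase (data) minus negative phase (reconstruction).
    for ( int v = 0; v < NUM_VISIBLE; v++ )
        for ( int h = 0; h < NUM_HIDDEN; h++ )
            weights[v][h] += LEARNING_RATE * ( visible0[v] * hidden0[h] - visible1[v] * hidden1[h] );
}

The point is simply that the same weights are used in both directions: the network recognizes a hidden representation from the data and then fantasizes the data back from it.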

Here's a video that might blow your mind, or not... It's a presentation by Giulio Tononi, which I found very compelling. In this theory, the question isn't the quantity of neurons required to become conscious, or where consciousness is localized within the brain; it's more a theory about the most effective organization of neurons within a network for that network to exhibit consciousness. (text)

Here's where I really paid attention. Apparently, a network that has all neurons connected to each other is total crap. And a network that is very large and has a high number of purely local connections can be good at something, but it doesn't have the right properties for consciousness either. The best arrangement is a network with specialized neurons, connected in patches, with the occasional long-range connection to other parts of the mesh. Much of the work is about quantifying consciousness. If that quantification is in step with actual consciousness, one can keep searching for more effective ways of building neural nets or machines.

The point about "patchiness" suggests that blindly connecting neurons together isn't the most effective way to build a network. A highly regular, fully connected network makes the whole system act like one general on/off machine, losing its specificity of function. Neurons that aren't connected enough make it behave like a number of independent classifiers, which isn't good either.

Most NNs and RBMs are built around having some number of neurons or elements in a layer, evenly connected to the elements of other layers, and then learning a kind of "weight" from one element to another. Putting more neurons into a layer generally makes the network more effective, but the improvement tapers off asymptotically.

I wonder whether it's possible to develop a theory, complementary to this theory of the quantity of consciousness, that as a kind of by-product lets a neural network shape its own topology, or that at least provides better rules for constructing networks. One good guess would be to observe the biological growth and connection-shaping of a brain, or of simpler structures, and then look for the patterns that evolve in the generation of such a network.

Finally, the most interesting words of the hypothesis:

Implications of the hypothesis

The theory entails that consciousness is a fundamental quantity, that it is graded, that it is present in infants and animals, and that it should be possible to build conscious artifacts.

This is a huge implication, and to understand it, one should go back to the start of this post. Consciousness == experiencing things. As said before, our observations carry detail, which is processed on its own but also completed by previous experiences. Thereby, our actual experiences are not just the observations we make, but the sum of those observations plus memories, evoked emotions, and so on. In a way, you could say that what we observe causes us to feel aroused, or to have some kind of feeling, and seeing similar things again later may cause us to experience the actual observations plus the previous experiences (memory) at the same time. It's very likely that not all experiences are consciously lived, in the sense that we're not aware of all the possible experiences we could have; very likely there are many experiences just below the surface of consciousness, present as a kind of potential or stochastic possibility, waiting to be activated by changes in the temporal context.

For example, rapid changes in our direct observations can cause instant changes in behaviour. This implies that, next to observing the world like a thoughtless camera consuming light rays and sound waves, we're also experiencing the world as a kind of stochastic possibility. The easiest way to demonstrate this is the idea of movement, of intent, of impact and likely effect.

The phrase "I'm standing at a train station and I see a train coming towards me" contains a huge amount of information: the recognition of the train in the first place, the experience that it's moving towards you because it's getting larger, the knowledge that the train runs on the tracks you're standing next to, the knowledge that train stations are places where trains stop, and your intent to get on the train. Just spelling out how much knowledge we apply to such a simple situation shows how much we take our consciousness for granted as the most normal thing on earth, which it certainly is not.

Well, so why are RBMs so strange in this sense? Because old-school neural networks don't have these properties. RBMs can both recognize things and "fantasize" them back. There are certainly limitations at present. In previous posts I've argued that we perhaps shouldn't limit the definition to "consciousness == when humans think or experience". With a broader definition of consciousness, one can also consider machines or A.I.s that are extremely effective in one particular area of functioning and might just be conscious in that area without having any kind of consciousness of things around them. The definition of consciousness here is a dangerous one, though, since it shouldn't be confused with behaviour, which it certainly is not.

Food for thought...

Sunday, May 10, 2009

If You Liked This, You’re Sure to Love That

An article about the Netflix prize was posted in the NY Times some time ago. It sparked some new ideas that I'm investigating right now, related to association rule mining.

A bit of lore in data mining is the beer-and-diapers story: fathers buying diapers on Thursday or Friday night also bought beer for the weekend, which apparently caused some supermarkets to put diapers and beer closer together. So much for the lore. In every basket there are regularities with a certain probability: if there's salad in the basket, there may well be tomatoes too, and baskets with milk have a high probability of also containing cereal, for example.

Association rules for basket research are based on nominal properties: an item is either there or it's not. The Netflix problem is more of a continuous one. The question is whether there is a strong correlation, or an implication with some level of confidence, between ratings of different movies, and each movie has five possible ratings. Also, there may be a strong correlation between a 1.0 on one movie and a 1.0 on another, whilst there may be little evidence for A@5.0 -> B@4.0. So the support and confidence may differ per rating. What we're trying to find, for each movie/rating combination (which is 17,770 * 5 * 5 * 17,770 * size(datatype) worth of storage), is rules of the form:

movieA@ratingX -> movieB@ratingY

The idea is also to filter out rules that don't show a strong correlation. The biggest problem in association rule mining is memory. Because the matrix of 17,770 movies x 17,770 movies x 5 ratings (and sometimes x 5 again) is very large, you can't really use a data type that consumes 4 bytes per entry, because that alone requires over 6 GB of memory. So in my current implementation I squeeze a float into a single byte, thereby losing a lot of precision, and pop it back out in the prediction phase. Since I have 4 GB, I can just about run this.
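
As a small sketch of that byte squeezing (assuming predictions stay within the 1.0 to 5.0 rating range, which is what the divide-by-63-and-add-1.0f scheme below relies on):

#include <cassert>

// Pack a prediction in [1.0, 5.0] into one byte: (5.0 - 1.0) * 63 = 252, which fits.
unsigned char packRating( float prediction )
{
    assert( prediction >= 1.0f && prediction <= 5.0f );
    return (unsigned char)(( prediction - 1.0f ) * 63.0f );
}

// Unpack it again, losing precision down to roughly 1/63 of a star.
float unpackRating( unsigned char packed )
{
    return ((float)packed / 63.0f ) + 1.0f;
}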

Association rules are highly dependent on observation frequencies. It's necessary to take into account how often two movies are rated together versus how often a movie is rated in isolation. The higher the ratio of movies being rated together, and the stronger the observed rating combination between A and B, the better the predictive strength of the rule.
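
To be precise about the terms (standard association-rule definitions, my phrasing): the support of a rule A -> B is the fraction of all users who rated both A and B (at the given rating levels), and the confidence is support(A and B) / support(A), i.e. the fraction of users who rated A that also rated B that way. The ratio mentioned above is essentially that confidence.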

So, let's say user A rated movie 1 at 1.0. If he also rated movie 2 at 1.0, and many more users did the same, the rating of movie 2 can be strongly derived from this association rule. It's also possible that many users rated movie 1 a 5.0 but movie 2 a 4.0, or movie 1 a 4.0 and movie 2 a 5.0. These correlations all suggest slightly different things. What I really wanted to do was store the confidence of these rules as well, but memory doesn't play nicely here and I had to think of something else.

In a sense, association rules are similar to clustering. The main difference is that association rules derive explicit strong correlations between some movie A and movie B, whereas in clustering that is never determined explicitly. At the moment, kNN and clustering produce better results: I'm getting an RMSE of 1.47, for example, versus reported results of around 0.95 for kNN and clustering approaches, which go down as far as 0.920 with further improvements.

The imprecision is very likely caused by incorrect filtering, or by precision loss due to the memory shortage and the float compression. Pseudo-code for now looks as below. I've basically loaded all ratings twice: once grouped by movieId (which customers rated this movie, and at what?) and the whole set again grouped by customerId (which movies did a customer rate, and at what?). The most important, and most wasteful, array is the one that holds floats compressed into unsigned chars (losing loads of precision there). It's dimensioned by MAX_MOVIES x MAX_MOVIES x NUM_STARS: given movie 1 in the first dimension, it tells you whether there's a strong correlation factor for movie 2 in the second dimension, given that movie 1 was rated at x stars, where x is the third index into this huge array. The retrieved byte is divided by 63 and 1.0f is added. Thus each predictor is compressed into a single byte, reducing precision but allowing everything to fit in roughly 1.5 GB of memory.

Having the training data twice allows one to access data very quickly. A problematic issue is how to store calculated confidence for each movie.

start:
clearMovieResultArray

for all movies i {
    clearMovieSupportArray
    clearMovieCountArray

    for all customers c in i->ratings {
        for all other movies k in c->ratings {
            if ( k->movieId == i ) continue;
            support[ k->movieId ][ i->rating[c] ][ k->rating ]++;
            count[ k->movieId ][ i->rating[c] ]++;
        }
    }

    for all movies m {
        // Skip movies that are rarely rated together with movie i.
        if ( ( freq-movies-rated-together / i->ratingCount ) < SUPPORT_THRESHOLD ) continue;

        for all possible rating stars j (NUM_STARS) {
            float sum = 0.0f;
            for ( int k = NUM_STARS-1; k >= 0; k-- ) {
                // Normalize movie support and accumulate the expected rating.
                support[ m ][ j ][ k ] = support[ m ][ j ][ k ] / count[ m ][ j ];
                sum += support[ m ][ j ][ k ] * (k+1);
            }
            // Compress the expected rating [1.0 .. 5.0] into a single byte.
            MovieResult[ i ][ m ][ j ] = (unsigned char)(( sum - 1.0f ) * 63 );
        }
    }
}


predictRating( short movieId, int custId )
{
    float sum = 0.0f;
    int numSupport = 0;

    for all movies m that this customer rated {
        if support exists {
            // Decompress the stored byte back into a rating prediction.
            genRating = ((float)MovieResult[ m->movieId ][ movieId ][ m->Rating ] / 63 ) + 1.0f;
            sum += genRating;
            numSupport++;
        }
    }

    if ( sum > 0.0f ) {
        sum /= numSupport;
    } else {
        sum = ALL_DEFAULT_RATING;
    }

    if ( sum > MAX_RATING ) sum = MAX_RATING;
    if ( sum < MIN_RATING ) sum = MIN_RATING;

    return sum;
}

Wednesday, May 06, 2009

Tikhonov Regularization

This post is about Tikhonov regularization, which is used in statistical and predictive methods. For systems where the predictions lie nicely around the real values, like so:


things are swell, because the predictions fall on either side of the line. You have a good chance of finding a very good solution here, as it looks like a well-posed problem. The (re)estimation errors are quite regularly positive and negative, yielding very good values for a linear regression over your found values, so that you can predict new values you haven't initially trained for (but for which you do know the actual value).

Well, the Netflix prize is facing an ill-posed problem. This basically means that a number of different solutions are possible, or that the differences between solutions become so small that numerical instability arises when computing a solution. I've seen this happen in some of my implementations when doing the blending.

So what's done is that a regularization factor is introduced, which penalizes larger factors and thereby conditions the problem better (reducing the number of solutions). Here's an example of what part of such a graph might look like:

Okay, so in this case we may have a prediction function that is pretty complicated, and for a large number of values (which are actually connected to other dimensions) it looks like this over part of the range and domain. If you'd grab this part of the graph and try to refit it with different approaches into a better overall estimate, chances are very high you'd run into numerical instability here. Notice how many black points are below the line, while the red line (the real values) lies way above them. Even though the distances between the points and the line aren't necessarily disturbing, trying to do a regression on this sub-space of the graph would give you some very bad factors. It may be better to just follow the red line along the bottom a bit, because if you'd regress towards different factors, you'd get the same problem elsewhere along the road.

Now, for a little bit of philosophy, read up on Occam's razor. It basically states: "The simplest explanation for a phenomenon is most likely the correct explanation."
Other methods for inferring evolutionary relationships use parsimony in a more traditional way. Likelihood methods for phylogeny use parsimony as they do for all likelihood tests, with hypotheses requiring few differing parameters (i.e., numbers of different rates of character change or different frequencies of character state transitions) being treated as null hypotheses relative to hypotheses requiring many differing parameters. Thus, complex hypotheses must predict data much better than do simple hypotheses before researchers reject the simple hypotheses. Recent advances employ information theory, a close cousin of likelihood, which uses Occam's Razor in the same way.
Thus, it's better to use models with few parameters than models with many, many different parameters. Find the models that describe the problem well, then merge them together.

Back to Tikhonov... If you're also a contender in the Netflix prize, check out the following source, which is heavily stripped down in complexity, improved in readability and derived from the linalg source developed by Gravity. It's probably not usable without their linalg sources, since it uses PseudoInverse.cpp for example:

Link to main.cpp

The regfactor and regfactorbias also exist in linalg, and they are key to bringing down the RMSE. Remember that we potentially have numerical instability, which is why we want to introduce regularization. The regularization conditions the problem, after which better solutions are found; better solutions provide better fits, reducing the error. You'd prefer these regularization terms to be as close to 0.0f as possible, because as the reg term goes towards 0.0f you approach the unregularized least-squares solution. It'd be great to get there, but the problem itself is ill-posed, so we need at least some regularization.
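
To put that in a formula (my notation, just to make the idea concrete): ordinary least squares minimizes ||Ax - b||^2, while Tikhonov regularization minimizes ||Ax - b||^2 + lambda * ||x||^2, which has the closed-form solution x = (A^T A + lambda * I)^-1 A^T b. The lambda on the diagonal is exactly the penalty on large factors: set lambda to 0.0f and you're back at the unregularized least-squares solution; make it larger and the weights get pulled towards zero, trading a bit of bias for a much better conditioned problem.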

I did rip out the fancy parts like cross-validation (used to find optimal values) and RMSE reporting, so that you can follow the code better. The pseudo-inverse is basically where the linear least-squares solver is executed.

The returned array is an array of weights: one to multiply with a bias of 1.0f, another weight for algo1, another for algo2, and so on. If you like, you can easily add other algorithms and this thing will give you reasonable weights in return.
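
For readers who'd rather see the idea without the linalg dependency, here's a small self-contained C++ sketch of regularized least squares for blending two predictors plus a bias of 1.0f. The tiny Gaussian elimination, the example data and the regularization value are my own simplifications; this is not the Gravity/linalg code:

#include <cstdio>
#include <vector>

const int NUM_FEATURES = 3;        // bias, algo1, algo2
const float REG_FACTOR = 0.01f;    // assumed regularization strength

// Solve (A^T A + lambda*I) w = A^T y with plain Gaussian elimination.
std::vector<float> ridgeWeights( const std::vector< std::vector<float> >& features,
                                 const std::vector<float>& targets, float lambda )
{
    float M[NUM_FEATURES][NUM_FEATURES] = {};
    float b[NUM_FEATURES] = {};

    // Accumulate the normal equations from the training rows.
    for ( size_t r = 0; r < features.size(); r++ ) {
        for ( int i = 0; i < NUM_FEATURES; i++ ) {
            for ( int j = 0; j < NUM_FEATURES; j++ )
                M[i][j] += features[r][i] * features[r][j];
            b[i] += features[r][i] * targets[r];
        }
    }
    for ( int i = 0; i < NUM_FEATURES; i++ ) M[i][i] += lambda;   // the Tikhonov term

    // Gaussian elimination without pivoting (fine for this tiny, well-conditioned system).
    for ( int i = 0; i < NUM_FEATURES; i++ ) {
        for ( int k = i + 1; k < NUM_FEATURES; k++ ) {
            float factor = M[k][i] / M[i][i];
            for ( int j = i; j < NUM_FEATURES; j++ ) M[k][j] -= factor * M[i][j];
            b[k] -= factor * b[i];
        }
    }
    std::vector<float> w( NUM_FEATURES );
    for ( int i = NUM_FEATURES - 1; i >= 0; i-- ) {
        float s = b[i];
        for ( int j = i + 1; j < NUM_FEATURES; j++ ) s -= M[i][j] * w[j];
        w[i] = s / M[i][i];
    }
    return w;
}

int main()
{
    // Each row: { 1.0f (bias), prediction of algo1, prediction of algo2 }; targets are the true ratings.
    std::vector< std::vector<float> > features = {
        {1.0f, 3.2f, 3.5f}, {1.0f, 4.1f, 4.4f}, {1.0f, 1.8f, 2.1f}, {1.0f, 4.9f, 4.6f}
    };
    std::vector<float> targets = { 3.0f, 4.0f, 2.0f, 5.0f };

    std::vector<float> w = ridgeWeights( features, targets, REG_FACTOR );
    std::printf( "bias=%.3f algo1=%.3f algo2=%.3f\n", w[0], w[1], w[2] );
    return 0;
}

The weights come back in the same order as the feature columns, so blending a new pair of predictions is just w[0] * 1.0f + w[1] * algo1 + w[2] * algo2.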