Tuesday, October 27, 2009

Drawing your country with openstreetmap

In a previous project, I worked with GIS data to show demographic data on Google Maps, with PostgreSQL as the underlying database. This post shows how you can use (part of) the OpenStreetMap database (verify the license!) to extract features from the dataset and paint them onto an SVG or PNG image. Using a PostGIS database loaded with that data, you can extract features (these are like Google Maps overlays, or GIS layers) and include them in the image. You want to include railroads and regular roads? Not a problem! You only want waterways and the general shape of the country? Can do! The image above was constructed by taking the water, waterways, railways and forest areas of Holland, then zooming in a bit to show the level of detail.

Since I mostly use Ubuntu, I'll explain the steps used to get things done.

Prerequisites:
  • Postgres 8.3 + Postgres 8.3 server-dev package
  • Postgis
  • cmake
  • qmake
  • libqt4-dev
  • bz2-dev library
  • geos library
  • gdal libraries (libgdal, gdal binaries *and* libgdal1-dev)
  • libxml2-dev header files and library
  • Benelux or other data: ( http://planet.openstreetmap.nl/ ) (no need to unpack! 170M packed, 1.8G unpacked )
Then you probably need to download the SVN version of a utility called osm2pgsql. This utility lets you load the Benelux data into your PostGIS database. You can get it using:

svn co http://svn.openstreetmap.org/applications/utils/export/osm2pgsql/

Then run make and make install. This probably doesn't copy default.style from the svn checkout directory to where it is expected, so:

# mkdir /usr/share/osm2pgsql
# cp default.style /usr/share/osm2pgsql

Now you're ready to load the Benelux data. The -d flag takes the database name; here I use gis, osm2pgsql's default, and the database must already exist with PostGIS enabled:

osm2pgsql --slim -c -l -d gis planet-benelux-latest.osm.gz

OK, so let's start visualizing this a bit. You'll want to get QGIS, compile it, install it, run it and then connect to your database:

# svn co https://svn.osgeo.org/qgis/branches/Release-1_1_0 qgis_1.1.0
# cd qgis_1.1.0
# mkdir build
# cd build
# ccmake ..
(verify the output and resolve any errors, then press 'g' to generate the build files)
# make
# make install

Then you should be able to run qgis from the command line: qgis. This starts up the application. If you have a couple of ESRI shapefiles lying around, you can load them and play around with them to see how things work. For the OpenStreetMap data we have downloaded, we can connect to the database using "Add PostGIS Layer", which lets you select the host, database and tables to load. It takes a while to get the data out of the database, but eventually you get there and it shows all of the Benelux in a particularly ugly colorful display :).

A better way to get your data out is to use multiple layers instead of one. There is a set of utilities that can be used to load ESRI shapefiles into the database, but also to get them back out. Since we can use WHERE clauses, disjunctions and conjunctions in SQL, it is simple to pick out what you're looking for, put it into a shapefile and load that into QGIS for visualization. Note that polygons are great for showing a bit of area and color, while lines are more useful for borders. The commands below assume the database is called gis, as above:

pgsql2shp -f forests gis "select osm_id, landuse, name, way from planet_osm_polygon where landuse='forest'"
pgsql2shp -f water gis "select name, way from planet_osm_polygon where \"natural\"='water'"
pgsql2shp -f waterwegen gis "select name, way from planet_osm_polygon where waterway is not null"
pgsql2shp -f railways gis "select name, way from planet_osm_line where railway='rail'"

These exported shapefiles can now be imported into QGIS. Then adjust the properties per layer, specifying whether the pen should draw a black outline around polygon areas (I didn't in this case). Lines only use a pen and have no fill color. Match wood/forest areas with green and water with blue, then pick particular styles for roads and railways and you're set!

Using the Print Composer function in QGis, you can now export what you've been creating. Make sure to use the Add Map button first, then export to either SVG (Inkscape?) or PNG.

Here's a page that explains the features in the database, but note that the features are not consistently used throughout the database, so you should always check your results:

http://wiki.openstreetmap.org/wiki/Map_Features

A more detailed zoom-in near the Amsterdam area, using the same restricted set of layers, and a comparison with Google Maps:


Notice how Google Maps has painted areas in a more generalized way, whereas the OpenStreetMap image is still drawn straight from the original vector data in the exported ESRI shapefiles. It should not be very difficult to start painting GIS data in a generalized style like the Google image above, and then overlay the original high-resolution data on top to indicate the position of cafés, cinemas, etc...

Good luck!

Thursday, October 22, 2009

Reasoning with(in) language

Natural language processing is ultimately very much about understanding what is being written and (re-)recognising words in the particular semantic contexts that apply. After playing around with the NLTK for a while, I've come to realize that the toolkit is very much geared towards the analysis of specific texts, or towards defining an EBNF representation of a particular grammar so that a particular category of text can be parsed more successfully, or so that specific parsers and analyzers can be researched. I haven't seen any mention of a generic classifier (NP/VP/DET) that understands what words are for, more or less like an incremental learner that just goes along and finds out what goes where. Discovering the requirements for such a process is the real question. What makes up language? Why can language appear so fluid and in so many forms, and how come we recognize language so very quickly after an utterance, even though that particular utterance has likely never been spoken before? Or is there indeed a difference in the speed of interpretation of familiar utterances versus non-familiar ones? That is a very interesting research question. For now, I'm pondering whether there are ways to discover the semantic information of words automatically, or possibly to let the computer express to us that something could not be recognized or didn't make sense, so that we could tell it how things actually work.

One thing that I think is going to take up a lot of time is explaining to computers how the world works. The advantage we have is our number of senses, which tell us a lot more about combinations of observations, how these join together (specifying the situation in greater detail) and the possible consequences that could ensue (whether the situation is dangerous or normal, etc.). Our language also contains words that allow us to express the observations of those senses in great detail.

What is needed for computers to start learning is a generic model of the world, in which objects (instances of types or classes) can interact with other objects (and thereby classes), so that by observing a particular instance the computer can generalize it towards a class, or a class higher up the tree, continuing to find consistencies or inconsistencies along the way. This yields a number of questions that the computer/agent must investigate for truth or falsehood. The result could be that the class that was initially constructed isn't entirely valid, or that the instance belongs to a totally different class that was not known until then.

Most models I've seen revolve around recognizing a stream of information that refers to elements in the world, plus a pre-designed reasoning model that the computer uses to act on those observations appropriately. But that requires up-front design, which more or less rules out computers learning automatically. What designers mean by "learning machines" is sometimes not learning new concepts; most of the time it is about learning to recognize particular situations so that the corresponding actions (which are finitely defined in most computer programs) can be chosen.

Well, I have no idea what this model should look like, but there is no reason why it could not resemble a generic model for multi-agent worlds. Such a setup has relations, which describe the possible relations one entity/instance could have with another and in what way. Then there are functions, which can be thought of as actions or manipulators of instances, functions that do things to entities. Of course, these functions shouldn't directly issue a particular action, but only manipulate the object "in the mind's eye". So there's a clear distinction between the observations made about objects in the real world and the representation of those objects in working memory. Only when conclusions made in working memory make sense should the computer start a 'world action' by invoking some kind of control function.
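As a rough illustration of that separation, here is a minimal C sketch. The entity type, the imagined manipulation and the acceptability test are all invented for the example and not part of any existing model.

#include <stdio.h>

/* A toy separation between the world and working memory: the agent
 * manipulates a copy of an entity "in the mind's eye" and only issues a
 * world action when the result of that manipulation seems acceptable.
 * The entity type and the acceptance test are assumptions for this sketch. */

struct entity {
    char name[32];
    int  position;      /* some observable property of the entity */
};

/* Observation: copy the real-world entity into working memory. */
static struct entity observe(const struct entity *world) {
    return *world;
}

/* Manipulation in working memory only; the world is not touched. */
static void imagine_move(struct entity *mental, int delta) {
    mental->position += delta;
}

/* A stand-in for "does this conclusion make sense?". */
static int acceptable(const struct entity *mental) {
    return mental->position >= 0 && mental->position <= 10;
}

/* The control function: the only place where the world is changed. */
static void world_action(struct entity *world, const struct entity *mental) {
    world->position = mental->position;
}

int main(void) {
    struct entity world = { "cup", 3 };

    struct entity mental = observe(&world);  /* into working memory   */
    imagine_move(&mental, 4);                /* try it out mentally   */

    if (acceptable(&mental))
        world_action(&world, &mental);       /* only now act for real */

    printf("%s is now at position %d\n", world.name, world.position);
    return 0;
}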

The picture above was inspired by the action of 'pointing'. Looking at the animal kingdom, there are no animals that actually point at things to teach something to others (well, disregarding pointer dogs :). So pointing is very specific to the human learning experience and probably one of the things that allowed us to learn in the first place. Pointing is about attracting a person's attention towards something that is happening, because it is considered important for learning, or because other people should start taking some kind of action. So it's about teaching, but also about group behaviour and a simple form of communication. It is quite remarkable that no other animal has developed this particular trait, it being such a simple one.

The expectation for many programs is that they are complete, in the sense that basic functionality should work once things are bootstrapped or "learned". By that I mean that the capabilities of the program are pre-determined and programmed in; the program only needs to find out when to execute those actions, which is generally determined by inspecting a pre-determined stream of information and learning after which sequence of patterns it needs to invoke which action, and so on.

It is quite interesting how we neglect the fact that we could also teach the program things by telling it. And even more importantly, how little research there has been so far into computer programs determining that their knowledge is insufficient to solve the case at hand, and into methods for completing that knowledge so it can be applied in the future. That, to me, means that we're still thinking of the computer program as a complete specification with prior requirements that is executing a particular job.

Saturday, October 17, 2009

Natural Language ToolKit

Due to my interest in the z-machine, I'm looking at natural language parsing and the complexities it entails. There's actually a very nice Python-based project that allows one to study natural language processing, called NLTK. NLTK is more like a set of tools for getting frequency distributions, frequency plots, extracting information, processing raw text, etc. Basically, within a single line of code you can specify a lot of characteristics about the text or words you're interested in, then you run a function on words within another selected set and you get the results you're looking for (at least, that's the idea). There are a lot of Python functions and objects prebuilt into the toolkit, offering many of the generic tools that you'd typically use, to which you can feed specific sets of data for processing or which you can parametrize with your specific intention. It's probably not immediately effective for a particular application, but this is research: first you need to find out what to do before you start working on some solution that you think might work. It's all about getting really deep into the matter very quickly, experimenting and looking at results. I haven't worked extensively with Python, but neither Java nor C allows for this enormously compact syntax for querying and manipulating data sets. Because the number of functions is pretty large, it may be a bit daunting to find out what objects or functions there are, or what they do. Therefore, I suggest installing Eclipse or another IDE and running PyDev within it for code completion purposes.

Get the latest distribution of NLTK from here. You can install NLTK on Ubuntu as follows:
$ sudo -s
# apt-get install python-numpy python-matplotlib prover9
# unzip nltk-2.0b3.zip
# cd nltk-2.0b3/
# python setup.py install
# python
Python 2.6...... (......)
>>> import nltk
>>> nltk.download()
The following shows a cumulative frequency plot of the words that occurred most often first:
There are many other things that can be done. This is a textual example of collocations in the Wall Street Journal sample (text7 in the NLTK book); collocations are words that frequently appear together.
>>> text7.collocations()
Building collocations list
million *U*; New York; billion *U*; Wall Street; program trading; Mrs.
Yeargin; vice president; Stock Exchange; Big Board; Georgia Gulf;
chief executive; Dow Jones; S&P 500; says *T*-1; York Stock; last
year; Sea Containers; South Korea; American Express; San Francisco
These examples are taken from the NLTK book. So, NLTK isn't really about making text accessible, nor is it some kind of engine that you can use for parsing and understanding text. It's a toolkit for experimenting with language processing, so that you can use that knowledge to roll your own stuff afterwards. A specific design goal is low-threshold access to the tools and functions of the toolkit, which should allow people without programming experience to apply machine processing to their own research. This hopefully puts it well within reach of anyone looking into natural language processing, filtering spam, building search engines, building translation engines and so forth.

Wednesday, October 07, 2009

Rolling your own heap manager in Linux

I was looking at ways to maintain a lot of data in memory, data which may get modified in any way through constant interaction, and at methods to maintain that state across process invocations. One way to do this is to pick out structures and objects and store them elsewhere, but that involves a lot of work. Since the process I am working on has full knowledge of what happens inside it, I've considered the possibility of dumping an entire data heap to disk, so that it can be read in later and immediately reused.

One of the things you can't do in that case is maintain file descriptors or other state-bound items tied to sockets and so on. So the heap may only hold data, and even with state one should be very careful, since state is often bound to the currently executing process. There are ways around those things too, however, so I decided to just go ahead and implement this heap writing/reading.

The idea is to set up a 2GB file as the standard and maximum size of a heap and then map the contents of that file into memory using mmap. Let's call this the file heap. The mmap call maps the file into the process's address space, growing the address space as necessary (this is a separate mechanism from the brk()/sbrk() calls that grow the regular heap). The 2GB looks contiguous to the process, although Linux may back it with completely different pages in physical memory. The idea thereafter is that regular processing uses other memory pages for transient state related to the current handling, while any modifications to the persistent data are done in the file heap.

So: data goes into the file heap, and any other necessary memory is allocated from the standard glibc heap (using the normal malloc() and free()). This way, any data allocated with a dedicated s_malloc() or s_free() call is automatically placed in the backing data file wherever space is available, and programming feels quite natural overall. It's just like dealing with normal memory.

When the program terminates through a standard quit call (or when it catches specific signals), msync() is called to flush all memory changes to the disk file synchronously; the mapping is then detached and the application exits. This should guarantee reasonable consistency between runs: the next run attaches to the file and all the data is there in the right place. For now, the application requires that the mapping is attached at the same base address, so that any pointers stored inside the file still point to something valid later. An alternative is to use specific structures and allocation functions that do this housekeeping themselves, but that increases size significantly.
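To make this more concrete, here is a minimal sketch of the idea in C: a file-backed heap mapped at a fixed base address, a trivial bump-style s_malloc(), and an msync() on shutdown. The file name, the fixed base address and the little header are assumptions made for this example; a real implementation would need a proper allocator, a free list and error handling.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HEAP_FILE  "data.heap"
#define HEAP_SIZE  (2UL * 1024 * 1024 * 1024)   /* 2 GB backing file */
#define HEAP_BASE  ((void *)0x600000000000UL)   /* assumed fixed base address */

/* Header stored at the start of the mapping, so the next run knows
 * how much of the heap was already used. */
struct heap_header {
    size_t used;
};

static void *heap;

static int heap_open(void) {
    int fd = open(HEAP_FILE, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, HEAP_SIZE) < 0)
        return -1;
    /* MAP_FIXED keeps pointers stored inside the heap valid across runs. */
    heap = mmap(HEAP_BASE, HEAP_SIZE, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);
    if (heap == MAP_FAILED)
        return -1;
    struct heap_header *hdr = heap;
    if (hdr->used == 0)
        hdr->used = sizeof(*hdr);        /* fresh heap: reserve the header */
    return 0;
}

/* Trivial bump allocator: hand out the next free chunk of the file heap. */
static void *s_malloc(size_t n) {
    struct heap_header *hdr = heap;
    if (hdr->used + n > HEAP_SIZE)
        return NULL;
    void *p = (char *)heap + hdr->used;
    hdr->used += n;
    return p;
}

/* Flush all changes to the backing file and detach. */
static void heap_close(void) {
    msync(heap, HEAP_SIZE, MS_SYNC);
    munmap(heap, HEAP_SIZE);
}

int main(void) {
    if (heap_open() != 0) {
        perror("heap_open");
        return 1;
    }
    /* On the very first run, allocate a persistent counter from the file
     * heap; afterwards it lives at a known offset right after the header. */
    struct heap_header *hdr = heap;
    long *runs;
    if (hdr->used == sizeof(*hdr))
        runs = s_malloc(sizeof(long));
    else
        runs = (long *)((char *)heap + sizeof(*hdr));
    (*runs)++;
    printf("this heap has been opened %ld time(s)\n", *runs);
    heap_close();
    return 0;
}

Compile it and run it a few times: the counter keeps increasing, because it lives in the mapped file rather than in the process's own heap.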

These methods and the custom malloc() and free() implementations should also allow the application to do garbage collection, maintain reference counts and other clever stuff that I can't think of right now. The good thing is that this doesn't require the application itself to keep everything properly organized; it deals with less complexity. The preallocated heap is simply filled up until it's full. Theoretically, it's also a good idea to pre-slice the heap into three different areas and work with three heaps instead. That way each heap knows the maximum size it can grow to and can arrange its memory with its own specialized algorithm.

Tuesday, October 06, 2009

Semantique

I've slowly started on a new experiment related to the other posts. I've read through documents and source code describing how the original adventure games were created; the z-machine is the most well-known specification. In its most basic form, an adventure game is a blob of data intermixed with code. The z-machine, the program that runs the adventure, is nothing more than a host for the adventure game (the story file): after loading the file, which is part data and part executable, it starts execution. In general, this causes the game to print the current location, while everything else is initialized to its default value. I said that these games also had code, in a way. This code is actually pretty neat, because it uses quite simple constructs from the virtual machine execution engine inside.

So the z-machine is a regular virtual machine that can execute opcodes and also look up object properties and values and so forth. Because the game is more or less static, in the sense that the set of objects contained in the game never changes (only their state changes, from the observer's point of view), each object remains in the same position in memory. So you could refer to the east door in room #3 as, say, object 63, and then use that identifier to reason about that object. Each object has a number of attributes that can be set, which allow you to specify whether some door is open or closed. There are also user-definable properties per object.

My interest is mostly in the fact that this model isn't widely used in programs. I intend to write a much more basic proof of concept, where I reserve 2G of memory and start loading concepts into this space. Then I can dump the contents of this space to disk and reload it later, where it will have the same state as before (in contrast to pickling or serialization processes on a per-object basis).

Using Ragel, it should not be too difficult to set up a parsing language which can accept standard sentences in close to natural language, such that it manipulates the state of that heap. Perhaps add a property dynamically to some class, or set an attribute of a certain object instance and so forth.

The idea is that the program, which is not a virtual machine in the strictest sense because it does not yet have opcodes, can be used to insert class descriptions at runtime using language, and that with more sentences you could then create specific instances of those "things". Later I'd like to make it possible to load "boot programs" that declare things of a certain type, and eventually opcodes as a set of simple operations that can be executed, so that the program can start reasoning at a slightly higher level without having all of its capabilities pre-programmed and compiled in. This probably requires dynamic compilation, address substitution and so on, but that will only make it more fun :).

At the front of this large chunk of memory sits a large dictionary of words that the program understands, just like in the z-machine. The idea is that these words refer to classes, instances or relations, describing their use and their meaning. When a word is later seen in a stream, it should then be possible to dynamically resolve what should happen and how the environment changes, by looking up the knowledge links and enforcing the restrictions imposed by the nouns (an object or class's capabilities) and the intention of the action (the verb). This probably requires coding a couple of rules, but those rules are also declared beforehand, dynamically compiled into another area and referred to in the same way.
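As a sketch of what such a dictionary could look like in C, loosely modelled on the z-machine's dictionary: the entry layout, the word kinds and the fixed table size are assumptions for this example, and in the actual design the payload would be an offset into the persistent heap described in the earlier post.

#include <stdio.h>
#include <string.h>

enum word_kind { WORD_CLASS, WORD_INSTANCE, WORD_RELATION };

struct dict_entry {
    char           word[16];   /* the surface form seen in the input stream */
    enum word_kind kind;       /* what the word refers to                   */
    unsigned long  payload;    /* offset of the class/instance/relation     */
};

#define DICT_MAX 1024

struct dictionary {
    unsigned int      count;
    struct dict_entry entry[DICT_MAX];
};

static int dict_add(struct dictionary *d, const char *w,
                    enum word_kind kind, unsigned long payload) {
    if (d->count >= DICT_MAX)
        return -1;
    struct dict_entry *e = &d->entry[d->count++];
    strncpy(e->word, w, sizeof(e->word) - 1);
    e->word[sizeof(e->word) - 1] = '\0';
    e->kind = kind;
    e->payload = payload;
    return 0;
}

/* Resolve a token from the input stream to its dictionary entry, if any. */
static struct dict_entry *dict_lookup(struct dictionary *d, const char *w) {
    for (unsigned int i = 0; i < d->count; i++)
        if (strcmp(d->entry[i].word, w) == 0)
            return &d->entry[i];
    return NULL;
}

int main(void) {
    static struct dictionary dict;              /* would live at the heap's base */
    dict_add(&dict, "door",  WORD_CLASS,    0x100);
    dict_add(&dict, "opens", WORD_RELATION, 0x200);

    struct dict_entry *e = dict_lookup(&dict, "door");
    if (e)
        printf("'door' -> kind %d, heap offset 0x%lx\n", e->kind, e->payload);
    return 0;
}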

As soon as the dynamic execution environment, programmed and thus driven through natural language, encounters an error, it will report the error in a human-friendly format (because it has access to property descriptions and other natural language material inserted earlier). This program won't be as powerful as a real programming language, but I'm only interested in its behaviour for reasoning. If you're interested, have a look at Appendix A of the standard rules zip file on this page, which gives an overview of how I imagine rules and code will be handled inside this application.

So in short, the z-machine and adventure games are used as inspiration for developing an environment in which (restricted) natural language drives the logic and what is happening. What the environment eventually does will still depend on the "program" (the sentences fed to it), so theoretically it can also be used as a chatbot, or perhaps even as a reasoning environment for scheduling problems...? Time will tell!

Sunday, October 04, 2009

On reasoning with ontologies in AI

The image here shows a very simple ontology of cities and countries. A larger graph like the one on the left can be used to make assertions about the location of cities and countries and the sizes of cities. I'll use the graph to show how much of artificial intelligence is about reasoning over and traversing graphs. One of the most-referenced techniques is "state-space search". Now, that sounds as if someone found a way to make a computer really smart, but once you start working out the details, you realize that such graphs are predefined either statically or dynamically. The static graphs are loaded from some datastore somewhere; the dynamic graphs (probably the more usual kind) are built on the fly. A big problem in AI graph search is knowing which branches to expand, because ideally you expand only those branches that have solutions on them; expanding any others is a waste of time and effort. Currently, AI uses heuristics to approach that problem.

Graphs may also be used in knowledge-based systems. The idea is that you insert a lot of knowledge in the form of statements (facts) and let the system derive new facts from what is given. So, for example, if you say that Chicago is a city, and that people live in cities, then it follows that people live in Chicago, so any question like "do people live in Chicago?" can be answered. When you approach the boundaries of the scope of the knowledge, performance degrades considerably.
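As a toy illustration of that kind of derivation, here is a minimal forward-chaining sketch in C; the triple representation and the single hard-coded rule are invented for this example and are nowhere near a real knowledge-based system.

#include <stdio.h>
#include <string.h>

/* Facts are simple (subject, predicate, object) triples; the one rule
 * "people live in cities" is hard-coded in derive(). */
struct fact {
    char subject[32], predicate[32], object[32];
};

#define MAX_FACTS 64
static struct fact kb[MAX_FACTS];
static int nfacts;

static void assert_fact(const char *s, const char *p, const char *o) {
    if (nfacts < MAX_FACTS) {
        snprintf(kb[nfacts].subject,   32, "%s", s);
        snprintf(kb[nfacts].predicate, 32, "%s", p);
        snprintf(kb[nfacts].object,    32, "%s", o);
        nfacts++;
    }
}

/* Apply the rule to every known fact and add what follows:
 * "X is-a city" yields "people live-in X". */
static void derive(void) {
    int n = nfacts;
    for (int i = 0; i < n; i++)
        if (strcmp(kb[i].predicate, "is-a") == 0 &&
            strcmp(kb[i].object, "city") == 0)
            assert_fact("people", "live-in", kb[i].subject);
}

int main(void) {
    assert_fact("Chicago", "is-a", "city");
    derive();

    /* Answer "do people live in Chicago?" by scanning the derived facts. */
    for (int i = 0; i < nfacts; i++)
        if (strcmp(kb[i].predicate, "live-in") == 0 &&
            strcmp(kb[i].object, "Chicago") == 0)
            printf("yes, people live in %s\n", kb[i].object);
    return 0;
}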

So what exactly is reasoning? If we look back at the adventure game of Zork, you could consider the flow of an adventure game as a finite state machine:


In adventure games, the players are the reasoners and the pre-defined stories are basically FSMs, created by programmers or story writers. Another way of describing the objective of an adventure game is to find the sequence of commands that leads to a successful finish state. Some game commands don't change the game state, but instead inspect the value of some property or attribute (commands like look, listen, smell, etc.) and return a pre-configured string. Successful manipulative commands always change the game state, but not necessarily in a way that is useful for getting to the end of the game. For example, you could open and close a milk bottle in the fridge repeatedly without any effect on reaching the objective. For brevity, commands that change the world state to something non-useful are not visualized in the above graph.

Reasoning in the above graph is really easy, because you simply follow the graph from start to finish (or vice versa) and you've solved the problem. It's a single path here. End of complexity?! Actually, real problems are much more complicated than that, because they have different pieces of state that are modified at different moments. Putting all the information into a single graph would create one that is too complicated to handle. A state machine can and should govern only one specific memory location (the integer describing the current state), and the state machine should be accompanied by rules on how that state may change. The following graph shows three different state machines, where some transitions carry restrictions on when they can and cannot be executed successfully.

We've just made a distinction per state machine based on the observed value of the memory location. So, if FSM_1 governs the locations, then each location is described by a single number. If FSM_2 governs having the key or not, then 6 means not having the key and 7 means having the key.

Reasoning has just become a bit harder in this case. Instead of just finding the reverse path, any reasoning program must start with a potential path and then, for each transition, find out whether there are constraints along it that might prohibit the change. Further complications arise when a given constraint depends on an event very early in the game, so following a path that eventually fails is very costly. This means that heuristics for evaluating whether a value can ever change in the desired direction become very important.
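As a tiny illustration of such coupled machines with guarded transitions, here is a C sketch. The rooms, commands and the key guard are invented for the example, in the spirit of the FSM_1/FSM_2 split above (6 meaning "no key", 7 meaning "has key"); a reasoning program would search this rule table backwards from the goal state, checking each guard along the way.

#include <stdio.h>
#include <string.h>

/* Two coupled state machines: FSM_1 holds the player's location,
 * FSM_2 holds the key state (6 = not having the key, 7 = having it). */

struct world {
    int location;   /* FSM_1 */
    int key;        /* FSM_2 */
};

struct transition {
    const char *command;
    int from, to;       /* FSM_1 states                          */
    int requires_key;   /* required FSM_2 state, or 0 for "any"  */
    int sets_key;       /* new FSM_2 state, or 0 for "unchanged" */
};

static const struct transition rules[] = {
    { "go east",   1, 2, 0, 0 },
    { "take key",  2, 2, 0, 7 },   /* moves FSM_2 from 6 to 7   */
    { "open door", 2, 3, 7, 0 },   /* guarded on FSM_2 being 7  */
};

/* Try to apply one transition; return 1 if it fired, 0 if it was blocked. */
static int apply(struct world *w, const struct transition *t) {
    if (w->location != t->from)
        return 0;                          /* FSM_1 is in the wrong state */
    if (t->requires_key && w->key != t->requires_key)
        return 0;                          /* the guard on FSM_2 fails    */
    w->location = t->to;
    if (t->sets_key)
        w->key = t->sets_key;
    return 1;
}

int main(void) {
    struct world w = { 1, 6 };             /* room 1, no key */
    const char *plan[] = { "go east", "open door", "take key", "open door" };

    for (unsigned i = 0; i < sizeof(plan) / sizeof(plan[0]); i++) {
        int fired = 0;
        for (unsigned j = 0; j < sizeof(rules) / sizeof(rules[0]); j++)
            if (strcmp(rules[j].command, plan[i]) == 0 && apply(&w, &rules[j]))
                fired = 1;
        printf("%-10s -> %s (location %d, key %d)\n",
               plan[i], fired ? "ok" : "blocked", w.location, w.key);
    }
    return 0;
}

The first "open door" is blocked because the guard on the second machine fails; after "take key" the same command succeeds.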

The above also shows how software is written. The software specification comes up with a set of rules, which are essentially generators for several state diagrams. The programmer adds his own variables to that, then glues the various state diagrams together through "if" statements, making the diagrams depend on one another. Add in a couple of state modifiers (which move one machine from state A to B) and the party is complete.

Reasoning in this model means generating a deconstruction of the program or specification and inverting it. So a computer that can reason about programs should be able to look at the debug info of a computer program and make statements about how to achieve goal states. A bit like:
  1. For your program to return with code 2, the following paths are possible: "x,y,z,...".
  2. For path 'x' to be chosen, variable i must be equal to 3, j must contain "bird" and k must be less than 5.
  3. For path 'y' to be successful, .....
This shows that in step 2, the reasoning mechanism establishes sub-goals from which reasoning needs to continue. In this case, the sub-goals refer to some location in memory and a desirable value for that location. Let's assume that k should never be less than 5, because the knowledge base dictates that this is meaningless (in that case the program contains an error, since it compares the value against something meaningless). That sub-goal then disappears from the board altogether, leaving only i and j as possible candidates. Reasoning becomes easier if the reasoning system has access to the inverted program, which I imagine to be a description of the program in reverse.

A genuinely new look at computer-based reasoning could research how conditional jumps are used and whether they are useful as a basis for reasoning. Can they be used to model the complexities of actual reasoning itself? The idea is that, without looking at the contents of fields and understanding their semantics, as long as we understand the structure of a state machine (the way states change), we may be able to reason with it. After all, a name that refers to something is just a pointer to some memory about the properties and classification of that thing, just like a computer may have a pointer to a piece of memory that holds other pointers to other capabilities.

But what if we only have a start and an end goal and a very large ontology? Then we don't have a program to analyze, only fragments of diagrams that are not connected to one another. The added complexity here is that we should use the rules within the ontology to find ways to connect ontologies and subdiagrams together such that it still makes sense.

Is this different from other logic languages like Prolog? Hmmm... I don't know yet. A good test would be to see whether using such a model has different consequences. Most programming languages require tests to be written for certain things, and it gets very repetitive to insert lots of facts about things; on top of that, an ontology depends a lot on proper classification. It's probably more about a different way of looking at things, like a complete separation of terms (nouns and verbs) or something.

Saturday, October 03, 2009

Lessons from Zork

If you're in your 30s like me and, as a child, you got a Commodore, an MSX or one of the other computers of that generation, you might remember playing text-based adventure games. I was hooked on them at age 9 and they taught me English (well, the dictionary helped too :). Because of those games, I scored 9/10 in all my English classes. We had never heard of the Internet back then, but there were already English-spoken movies with subtitles on TV, so that helped a bit as well. However, this post isn't meant to recollect those stories; it's about the design of the interpreters of those times. Just recently, I've become very interested in different designs and approaches to parsing, because it plays a central role in Natural Language Processing and, in that sense, in artificially intelligent programs that can deal with input in natural language. The Zork text adventures were the start of a whole range of games in the genre, and one of their essential features is to accept input from a human giving orders, process it and reply with the result of that command.

Zork actually had a very interesting design for handling this. The story itself was loaded into a z-machine, which executed it based on user input. If you want to experiment with such interpreters, have a look at frotz. There's actually a very low-key community out there that is still playing adventure games or, as they call it now, reading interactive fiction. They publish their work here and there. Here's an archive for reference.

Initially, you may think that writing a work of interactive fiction or a text adventure is a lot of coding work, but there's actually a very cool application you can use to produce them, called Inform. There is no real coding involved; everything is typed in natural language. Since a computer eventually needs to run your story, the only not-so-natural thing is that you declare properties and attributes about things that you personally take for granted. Because it is not obvious to a computer that a bed can support a person and that you can enter it, that needs to be declared explicitly in the story (the story eventually becomes the program). Using Inform, a lot more people can write stories and adventures.

At this point, it's probably better to show an example:
The troll can be conscious or unconscious.
Check attacking the unconscious troll with something:
say "The unconscious troll cannot defend himself: He dies.";
say "Almost as soon as the troll breathes his last breath, a cloud
of sinister black fog envelops him, and when the fog lifts, the
carcass has disappeared.";
remove the troll from play instead.
So notice how the troll is given attributes and properties, and how some sort of reasoning process is inserted just by applying natural language. The sequence of words (the phrase) used to manipulate the behaviour of the story (the program) is fixed. The descriptions, what you put between the double quotes, are just sequences of characters and remain unintelligible to the computer. The story is first compiled into a slightly lower-level language, which compilers that target z-code can then process:
[ TakeDrink;
canteen--;
if (canteen lt 0) {
thirst=0; UpdateThirst();
} else {
thirst=4;
"Better watch for an oasis !";
}
];
This lower-level language is the language in which the older text adventures were coded. They used a lot of variables (often global) and contained very simple code that influenced gameplay. The limit was basically the programmer's skill. It probably feels a bit like scripting.

In Inform, by declaring items as a kind (a category of what something is, e.g. "The spoon is a thing"), they instantly get a couple of capabilities, and the program knows what can and cannot be done to them, because the Inform compiler combines the declaration with a default rule set (which may be overridden in the text). The (intended) interaction of the player with the environment is always communicated through verbs and nouns. In the case of Inform, the standard rules file already provides a large number of default rules, actions and behaviours for all these verbs and possible interactions. Per verb, the program maintains a number of rules regarding impossible actions, or actions that are temporarily impossible due to some state in the interpreter (maybe the player should eat spinach before they can open the door?). For further reading, the architecture of Inform is described here.

The interpreter for these programs executes so-called opcodes, basically just like the Java VM, so it is a kind of virtual machine. An interpreter is like an implementation of a virtual processor and memory on your computer that can do things at a slightly higher level of execution and protection (this is another way of saying that you're grouping CPU opcodes together in a useful way). In contrast to a function or method in a programming language, the interpreter makes no assumptions about the context in which it is used; it's only a dumb operational thing. Functions or methods in general programming languages exist within a very specific application context.
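To show the general shape of such a dispatch loop, here is a toy interpreter in C. The opcode set and the little program are invented for the example; the real z-machine has a far richer instruction set plus an object table, a dictionary and packed strings.

#include <stdio.h>

/* A toy stack machine: push two numbers, add them, print the result. */
enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

static const int program[] = {
    OP_PUSH, 2,
    OP_PUSH, 40,
    OP_ADD,
    OP_PRINT,
    OP_HALT
};

int main(void) {
    int stack[64];
    int sp = 0;      /* stack pointer   */
    int pc = 0;      /* program counter */

    for (;;) {
        switch (program[pc++]) {
        case OP_PUSH:                    /* push the next word as a value  */
            stack[sp++] = program[pc++];
            break;
        case OP_ADD:                     /* pop two values, push their sum */
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_PRINT:                   /* print the top of the stack     */
            printf("%d\n", stack[sp - 1]);
            break;
        case OP_HALT:
            return 0;
        }
    }
}

Running a story file then boils down to such a loop reading opcodes from the loaded file and updating the machine's memory, stack and program counter.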

Because Zork and the z-machine were initially developed for computers with very limited resources (64K? 128K?), a lot of optimization went into cramming the strings together (5-bit characters) and there are limits on the interpreter's memory addresses. Nowadays those limits can be relaxed heavily, which could possibly allow some very nice programs to be created with such virtual interpreters. Read on here for more information on these z-machines and their relevance to AI.

The text in the link above mentions Prolog. I've come to both respect and hate Prolog. Prolog is also interpreted, but it is a very general-purpose language and very much related to programming in the general sense. Prolog always attempts to prove that your statements unify with something (become true) and does everything in its power to make it so. It is strongly related to first-order predicate logic and the research surrounding it.

Now... you may know that when a program's behaviour needs to be extended or modified, you generally need to stop the program, change the sources, recompile, kill the running application and restart it with the modified version. In that sense, a program on a computer is static. The architecture of hardware and software never really considered that programs might need to change at runtime, because programs executed tasks that were entirely thought out before the implementation: the implementation takes place, you run it, and it takes care of business for a while.

An interpreter is an ideal environment for running experiments in which the program can change at runtime and must be able to modify its behaviour and its view of the world. If this interpreter can connect to programs online, natural language can also be used as a means to communicate ontologies with other programs or humans, much in the same way that the declarations above are used to extend the ontology of an adventure game (since that is essentially what such a declaration does). The difference is that the adventure game's ontology is compiled beforehand and from that point onwards is static.

The bad thing about AI is that it is still feeling the pull of the early computer years and the way we think about computers, or rather about what processing is. To most computer-savvy people, it's the most normal thing to kill applications, recode them, restart them and then observe what they do. I think we should probably regard computers somewhat differently to make more progress in AI.

Possibly because of the argument above, ontologies for the web are also mostly expressed as static files. The idea seems to be that knowledge is temporarily static and then doesn't change? Also, even though they look meaningful to us (though interspersed with nonsensical brackets, dashes and other signs), the computer just sees xyzxyzxyz and absdfoip; it's all nonsense to it, except for the way it appears and can be recognized later. It is by reasoning over those forms of appearance that it can pretend to process them semantically. The gotcha is that when we look at the files they look meaningful, but that's because the words give us a short replay of the vision, hearing or feeling related to the terms.

The truth is that the computer has no knowledge at all about the meaning of the terms; it just executes some other code when it sees xyzxyzxyz again, or when it knows that abcabcabc was in some way related to xyzxyzxyz. If you want to do yourself a favour and understand ontologies properly, recode an entire ontology in such nonsense terms; it quickly becomes clear how easy it is to be misled about how much computers really know.

Conclusion: the design of the z-machine interpreter is a very interesting concept for further AI research. One should not only consider interpreters relevant at runtime, but also integrate interpreters with compiler constructs, so that a phrase of input doesn't necessarily only modify the state of the interpreter, but may also modify the program itself (adding knowledge to the knowledge base). This allows one machine to talk to another using natural language (easier to debug) and requires only one interface implementation, since humans use the same channel of input. The interpreter should also be able to sync its state to disk and start up later with that memory image, so that it can continue executing from where it left off last time. An adventure game interpreter is coded with a goal in mind and executes until it somehow reaches an end state (you could visualize the interpreter as a Markov chain or even a finite state machine), where knowledge is fixed and the transition from one knowledge element to another, or from one contextual state to another, is determined by rules.

Ongoing question: this leaves one interesting question to ponder next: for an environment in which knowledge may be received and modified, and the states relating to that knowledge manipulated (the context), who or what will set the goals for this interpreter to achieve, and what do those goals look like? Can the interpreter determine its own goals and negotiate with us or other computers to try to achieve them?

Thursday, October 01, 2009

CSV file parsing with ragel

I wanted to get my feet wet a bit more with Ragel and get acquainted with the way it works. Some good examples demonstrating the syntax are here. It's definitely a very impressive piece of software. Setting up a toy language is a bit too much work for just toying around, so I decided to find out how to parse a CSV file instead. I started from the sample that does command-line parameter parsing and adjusted it to read a 3-column CSV file. It's probably not the most efficient code and leaves things to be improved:


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_BUF_LEN 1023

struct csvline {
char *f1;
char *f2;
int f3;
};

struct csvdata {
int cs;
int buflen;
char buffer[ MAX_BUF_LEN + 1 ];
int field;
struct csvline line;
};

void print_data( struct csvdata *data, int lineno );

%%{
machine csv;
access data->;

# Append to the buffer.
action append {
if ( data->buflen < MAX_BUF_LEN )
data->buffer[ data->buflen++ ] = fc;
}

# Terminate a buffer.
action term {
if ( data->buflen < MAX_BUF_LEN )
data->buffer[ data->buflen++ ] = 0;

switch( data->field ) {
case 0:
data->line.f1 = (char *)calloc( data->buflen, sizeof( char ) );
strncpy( data->line.f1, data->buffer, data->buflen );
data->field++;
break;
case 1:
data->line.f2 = (char *)calloc( data->buflen, sizeof( char ) );
strncpy( data->line.f2, data->buffer, data->buflen );
data->field++;
break;
case 2:
data->line.f3 = atoi( data->buffer );
data->field++;
break;
default:
// ignore
break;
}
}

# Clear out the buffer
action clear { data->buflen = 0; }

# Helpers that collect strings
LF = "\n";
string = [^,]* >clear $append;
string2 = [^,]* >clear $append %term;
comma = "," %term;
main := ( string comma )+ string2 ? LF;
}%%

%% write data;

void csv_init( struct csvdata *data ) {
data->buflen = 0;
%% write init;
}

void csv_exec( struct csvdata *data, const char *d, int len )
{
const char *p = d;
const char *pe = d + len;

%% write exec;
}

int csv_finish( struct csvdata *data )
{
if ( data->cs == csv_error )
return -1;
if ( data->cs >= csv_first_final )
return 1;
return 0;
}

#define BUFSIZE 2048

int main( int argc, char **argv )
{
struct csvdata csvdata;
FILE *csvfile;
int lineno = 0;
char buf[ MAX_BUF_LEN + 1 ] = {"\0"};

if (( csvfile = fopen( "test.csv", "r" ) ) == NULL ) {
fprintf( stderr, "Could not open file test.csv\n" );
return -1;
}

while ( fgets( buf, MAX_BUF_LEN, csvfile ) != NULL ) {
// One more line to process
memset( &csvdata, 0x00, sizeof( csvdata ));
csv_exec( &csvdata, buf, strlen( buf ) );
if ( csv_finish( &csvdata ) != 1 ) {
fprintf( stderr, "error occurred in line: %d\n", lineno );
} else {
print_data( &csvdata, lineno );
}
lineno++;
}

return 0;
}

void print_data( struct csvdata *data, int lineno ) {
fprintf( stdout, "[line %d] f1: %s", lineno, data->line.f1 );
fprintf( stdout, ", f2: %s", data->line.f2 );
fprintf( stdout, ", f3: %d\n", data->line.f3 );
}


The following test.csv file was used:


test1,test2,4000
more,data,5032335
and,even,111
more,data,1213
errorhere
invalid line


And this is how to compile and visualize:


compile.sh:
-----------

#!/bin/bash

ragel main.rl
ragel -V main.rl > test.dot
gcc -o main main.c

visualize.sh:
-------------

#!/bin/bash

dot -Tpng -otest.png test.dot
eog test.png