Wednesday, October 07, 2009

Rolling your own heap manager in Linux

I was looking at ways to maintain a lot of data in memory, which may get modified in any way through constant interaction and then methods to maintain that state across process invocations. One of the ways to do this is by picking out structures and objects and store them elsewhere, but that involves a lot of work. Since the process I am working has full knowledge of what happens inside it, I've considered the possibility to dump an entire data heap to disk, so that it can be read in later and immediately reused.

One of the things that you can't do in that case is maintain file descriptors or other state-bound items that are tied to sockets and so on. So, the heap may only maintain data and even with state one should be very careful, since state is often bound to the currently executing process. There's ways around those things too however, so I decided to just go along and do this heap writing/reading.

The idea is to set up a 2GB file as the standard and maximum size of a heap and then map the contents of that file into memory using mmap. Let's call this the file heap. The mmap call will automatically increase the memory address space when necessary (read up on brk() and sbrk()). This 2GB looks contiguous to the process (although Linux may have this in totally different pages in physical memory ). The idea thereafter is that the process handling uses different memory pages, storing current state that is related to current handling and that any data modifications are done using the file heap.

So, data -> file heap, any other necessary memory allocations are allocated from the standard glibc heap (using standard malloc() and free() ). This way, any data allocated with a specific s_malloc() or s_free() call will automatically be added to the internal data file in places that are available to it and programming will feel quite natural overall. It's just like dealing with normal memory.

When the program terminates through a standard quit call (or when it catches specific signals), the msync() call is called, syncing all memory changes to the disk file synchronously, the memory is detached and the application exits. This should guarantee ok consistency between runs. Another run attaches to the file and all the data is there in the right place. For now, the application requires that this mmap is attached to the same base address, so that any pointers inside the file still point to something valid later. An alternative is specific linux structs and allocation functions that maintain this householding, but that increases size significantly.

These methods and custom malloc() and free() implementations should allow the application to also do garbage collections, maintain reference counts and other clever stuff that I can't think of right now. The good thing is that this doesn't require the application to keep everything properly aligned, it deals with less complexity. The preallocated heap is then filled up until it's full. Theoretically, it's also a good idea to pre-slice the heap into three different areas and then work with three heaps instead. This means that all three heaps have good knowledge about the maximum size they should be having and they can arrange their memories with their own specialized algorithms.

No comments: