Saturday, September 19, 2009

Supercharging reliable packet transfer capacity

Lighttpd is a very light, high-performance web server that uses a different architecture for handling connections. It is used by Meebo and YouTube, although probably with modifications. Apache is the best-known web server around and has served reliably for years, but the rise in network capacity shows that the software, rather than the hardware, is now more likely to be the bottleneck in communications. Apache, with its threaded architecture, doesn't scale as well or use hardware resources as efficiently as lighttpd can. The reason lies in how operating system resources are used: the number of file handles, threads, and so on. Two issues in particular stand out. The first is copying data: packets arriving at the network card are copied into kernel memory and then copied again into user memory. That is one copy too many, and people are researching how to eliminate it, so that data arriving at the network card can be placed straight into user memory when it becomes available. The second is the overhead of thread/process context switches and the effect a high number of threads and processes has on overall system performance: the kernel spends more and more time figuring out which thread or process to run, and its housekeeping effort grows.
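
To make the copy issue concrete, here is a minimal sketch (C on Linux) of one kernel-side answer that already exists on the send path: sendfile(2) moves file data to a socket without the detour through user memory. The function name send_zero_copy is mine, purely for illustration, and error handling is trimmed.

/* Sketch: sending a file over a socket without copying it through
 * user space, using Linux sendfile(2). The file descriptors are
 * assumed to have been set up elsewhere. */
#include <sys/sendfile.h>
#include <sys/types.h>

/* Moves up to `count` bytes from in_fd to sock_fd inside the kernel;
 * a read()/write() loop would copy every byte into user memory and
 * back out again. */
ssize_t send_zero_copy(int sock_fd, int in_fd, size_t count)
{
    off_t offset = 0;
    size_t total = 0;
    while (total < count) {
        ssize_t n = sendfile(sock_fd, in_fd, &offset, count - total);
        if (n <= 0)
            return n;       /* error, or nothing left to send */
        total += (size_t)n;
    }
    return (ssize_t)total;
}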

In communications where sockets are associated with threads (let's assume for the remainder of this post that threads and processes are interchangeable), the threads typically block on particular operations within the communication cycle. Writing to a socket may succeed immediately, or it may block the sender until either the socket is disconnected or the remote end has received the data and the sending buffer frees up. If a thread blocks on send or receive, it is put into a blocking state, signalling the kernel that it only needs to be woken up when a buffer empties or data arrives on the socket. Having, say, 10,000 threads around sounds like a lot, but at any moment only a couple of them are actually runnable on the CPU. The advantage of this approach is that the programming model is clear and the program is easy to understand. The disadvantage is that the model does not scale with the hardware capacity that is actually installed.
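
As a sketch of what this looks like in code, here is the thread-per-connection model in C. All names are illustrative and error handling is trimmed; the point is that each connection gets a thread that spends most of its life asleep inside recv().

#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

static void *handle_client(void *arg)
{
    int fd = (int)(intptr_t)arg;
    char buf[4096];
    ssize_t n;

    /* The thread sleeps inside recv() until data arrives or the peer
     * closes; the kernel keeps one blocked thread around per socket. */
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0)
        send(fd, buf, (size_t)n, 0);    /* echo back; may also block */

    close(fd);
    return NULL;
}

/* For each accepted connection, spawn a thread that mostly blocks. */
void spawn_handler(int client_fd)
{
    pthread_t tid;
    pthread_create(&tid, NULL, handle_client, (void *)(intptr_t)client_fd);
    pthread_detach(tid);
}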

In looking at some specific problems I am facing, I also noticed that the model above does not scale at all for clients that send out notification packets. These are characterized by a client sending a small amount of data and closing the connection immediately. The client can then pick up the next notification to be sent, send it right away, and keep cycling like that. There may also be a large number of clients sending data at the same time.

For these situations, TCP state transitions look like this:

client state    client sends    server sends    server state
ESTABLISHED     FIN             --              ESTABLISHED
FIN_WAIT1       --              ACK             CLOSE_WAIT
FIN_WAIT2       --              --              CLOSE_WAIT

...... server processes ......

FIN_WAIT2       --              FIN             LAST_ACK
TIME_WAIT       ACK             --              CLOSED

The "CLOSE_WAIT" state is maintained until the server has accepted the socket (which is already in CLOSE_WAIT now), reads the data, then closes the socket itself. A more serious problem arises when the number of sockets in CLOSE_WAIT equals the size of the backlog queue of the listening socket. Let's say this is 128, which is a reasonable number for Linux. The number of sockets in CLOSE_WAIT becomes 128, the server doesn't accept any more connections and it only allows more connections when the server picks up one of the pending connections by calling accept(), handles it and closes it. If the number of clients is relatively high, this server becomes a bottle-neck for any communications up-stream.

Lighttpd supports a different architecture that is available on newer kernels. Actually, it supports a couple of constructs for handling TCP traffic: kqueue, epoll, select, poll and so on. The idea behind these architectures is that a thread should not be created just to wait for data to come in; instead, one thread asks the kernel which sockets are in a state where a read or write will succeed, and then initiates exactly those actions. You use fewer threads to do the same work with better use of resources, become more efficient overall, and can therefore handle more connections and more communication, increasing the throughput the server can comfortably sustain.
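
Here is a minimal single-threaded epoll loop as a sketch of this style (Linux-specific; kqueue is the BSD equivalent). lfd is assumed to be a listening socket that has already been set up, as in the earlier sketch, and error handling is again trimmed.

#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_EVENTS 64

void event_loop(int lfd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev, events[MAX_EVENTS];

    ev.events = EPOLLIN;
    ev.data.fd = lfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

    for (;;) {
        /* One thread waits on all sockets at once; only the sockets
         * that are actually readable are returned. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {
                /* New connection: register it and go back to waiting. */
                int cfd = accept(lfd, NULL, NULL);
                fcntl(cfd, F_SETFL, O_NONBLOCK);
                ev.events = EPOLLIN;
                ev.data.fd = cfd;
                epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &ev);
            } else {
                char buf[4096];
                ssize_t r = recv(fd, buf, sizeof buf, 0);
                if (r <= 0) {   /* peer closed or error */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                }
                /* else: process buf[0..r) without ever blocking */
            }
        }
    }
}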

A very good discussion of performance is here. It also links to a very good presentation.
