If I want to use a simple counting semaphore, what happens when I want to wait on both reads and writes?
The theory goes that both READ and WRITE post to the semaphore for their particular semantic states: READ posts when it has data, WRITE posts when its buffer is empty or has room for more data. The problem is that in practice this is useless. If you register for both READ and WRITE, you will thrash with notifications that it's OK to write, because the write buffers are almost always empty. So you only want to poll for WRITE when you actually have something to write, which is almost always what you end up doing. But if you are running reads and writes on one thread it all gets rather complicated. What you do is wait on your internal write buffer until you have something to put into the file's write buffers; in effect you perform a read wait on your internal buffers, then place that file into the write poll set and keep it there until you have emptied your internal buffer. So you only poll on WRITE when there is something to read from your internal write buffers and transfer into the file write buffers. This is how the MetaWrap server Poll system works, and it quite happily runs at thousands of transactions per second.
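A minimal sketch of that idea in C, assuming a hypothetical per-connection structure (conn) that tracks how many bytes are queued in its internal write buffer:

#include <poll.h>
#include <stddef.h>

/* Hypothetical connection state: out_len is the number of bytes queued
   in our internal write buffer for this descriptor. */
struct conn {
    int    fd;
    size_t out_len;
};

/* Build the poll set: every connection is watched for READ, but a
   connection is only watched for WRITE while its internal write buffer
   is non-empty, so an idle (always-writable) socket cannot thrash us
   with "OK to write" wakeups. */
void build_pollfds(const struct conn *conns, struct pollfd *pfds, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        pfds[i].fd      = conns[i].fd;
        pfds[i].events  = POLLIN;
        if (conns[i].out_len > 0)
            pfds[i].events |= POLLOUT;
        pfds[i].revents = 0;
    }
}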
The other approach is to split reads and writes into separate semaphores. I would need to perform a wait on multiple semaphores together as one atomic Wait operation, be able to detect which semaphore is actually responsible for the Wait exiting, and then determine which file on that semaphore is the correct one without having to stampede through the lot. This is not as hard as it sounds, but it is ugly and slightly insane.
Other interfaces seem to provide the ability to wait on multiple types of IO operation (POLLIN | POLLOUT | POLLERR | POLLHUP); see the following.
Poll
poll is a variation on the theme of select. It specifies an array of nfds structures of type
struct pollfd {
    int fd;           /* file descriptor */
    short events;     /* requested events */
    short revents;    /* returned events */
};
and a timeout in milliseconds. A negative value means infinite timeout. The field fd contains a file descriptor for an open file. The field events is an input parameter, a bitmask specifying the events the application is interested in. The field revents is an output parameter, filled by the kernel with the events that actually occurred, either of the type requested, or of one of the types POLLERR or POLLHUP or POLLNVAL. (These three bits are meaningless in the events field, and will be set in the revents field whenever the corresponding condition is true.) If none of the events requested (and no error) has occurred for any of the file descriptors, the kernel waits for timeout milliseconds for one of these events to occur. The possible bits in these masks are defined in <sys/poll.h>.
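A minimal usage sketch, assuming fd is an already-open descriptor (for example a connected socket):

#include <poll.h>
#include <stdio.h>

/* Wait up to five seconds for fd to become readable or writable. */
int wait_for_io(int fd)
{
    struct pollfd pfd;
    pfd.fd      = fd;
    pfd.events  = POLLIN | POLLOUT;
    pfd.revents = 0;

    int n = poll(&pfd, 1, 5000);    /* timeout in milliseconds */
    if (n <= 0)
        return n;                   /* 0 = timeout, -1 = error (check errno) */

    if (pfd.revents & POLLIN)  printf("fd %d is readable\n", fd);
    if (pfd.revents & POLLOUT) printf("fd %d is writable\n", fd);
    if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
        printf("fd %d has an error/hangup condition\n", fd);
    return n;
}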
Select
int select(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
int pselect(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, const struct timespec *timeout, const sigset_t *sigmask);
FD_CLR(int fd, fd_set *set);
FD_ISSET(int fd, fd_set *set);
FD_SET(int fd, fd_set *set);
FD_ZERO(fd_set *set);
DESCRIPTION
The functions select and pselect wait for a number of file descriptors to change status.
Their function is identical, with three differences:
(i) The select function uses a timeout that is a struct timeval (with seconds and microseconds), while pselect uses a struct timespec (with seconds and nanoseconds).
(ii) The select function may update the timeout parameter to indicate how much time was left. The pselect function does not change this parameter.
(iii) The select function has no sigmask parameter, and behaves as pselect called with NULL sigmask.
Three independent sets of descriptors are watched. Those listed in readfds will be watched to see if characters become available for reading (more precisely, to see if a read will not block – in particular, a file descriptor is also ready on end-of-file), those in writefds will be watched to see if a write will not block, and those in exceptfds will be watched for exceptions. On exit, the sets are modified in place to indicate which descriptors actually changed status.
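A sketch of the same wait written against select, again assuming fd is an already-open descriptor:

#include <sys/select.h>
#include <stdio.h>

/* Watch one descriptor for both read and write readiness. */
int wait_for_io_select(int fd)
{
    fd_set readfds, writefds;
    struct timeval tv;

    FD_ZERO(&readfds);
    FD_ZERO(&writefds);
    FD_SET(fd, &readfds);
    FD_SET(fd, &writefds);

    tv.tv_sec  = 5;                 /* five-second timeout */
    tv.tv_usec = 0;

    int n = select(fd + 1, &readfds, &writefds, NULL, &tv);
    if (n <= 0)
        return n;                   /* 0 = timeout, -1 = error */

    /* On return the sets contain only the descriptors that are ready. */
    if (FD_ISSET(fd, &readfds))  printf("fd %d is readable\n", fd);
    if (FD_ISSET(fd, &writefds)) printf("fd %d is writable\n", fd);
    return n;
}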
Dev Poll
The Solaris[tm] 7 Operating Environment (11/99 version) introduced a new mechanism for polling file descriptors. The /dev/poll driver is a special pseudo driver that allows a process to monitor multiple sets of polled file descriptors. Access to the /dev/poll driver is provided through the open(2), write(2) and ioctl(2) system calls.
The /dev/poll driver returns the number of the polled file descriptors that have data in them. One can then read only those file descriptors using a pollfd type array. Thus, one does not have to continuously poll a large number of file descriptors including those that may not have any data in them. This makes for a more efficient use of the system resources.
The /dev/poll interface is recommended for polling a large number of file descriptors, of which only a few may have data at any given time. The devpoll interface works best with the newer set of Ultrasparc IIi and Ultrasparc III processors. Applications best suited to use the devpoll driver include:
- Applications that repeatedly poll a large number of file descriptors.
- Applications where the polled file descriptors are relatively stable; that is, they are not constantly closed and reopened.
- Applications where the set of file descriptors which actually have polled events pending is small, compared to the total number of file descriptors being polled.
For a detailed description of the merits of using the new /dev/poll interface versus the poll(2) system call, please refer to the technical article “Polling Made Efficient”, by Bruce Chapman at http://soldc.sun.com/articles/polling_efficient.html.
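A rough sketch of the /dev/poll usage pattern described above (Solaris only; the dvpoll structure and the DP_POLL ioctl come from <sys/devpoll.h>). Error handling is omitted:

#include <sys/devpoll.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <poll.h>

void devpoll_example(const int *fds, int nfds)
{
    int dpfd = open("/dev/poll", O_RDWR);
    int i;

    /* Register the descriptors of interest by writing pollfd records
       to the driver. */
    for (i = 0; i < nfds; i++) {
        struct pollfd pfd = { fds[i], POLLIN, 0 };
        write(dpfd, &pfd, sizeof(pfd));
    }

    /* Ask the driver only for the descriptors that actually have events. */
    struct pollfd results[64];
    struct dvpoll dvp;
    dvp.dp_fds     = results;
    dvp.dp_nfds    = 64;
    dvp.dp_timeout = 5000;          /* milliseconds */

    int ready = ioctl(dpfd, DP_POLL, &dvp);
    for (i = 0; i < ready; i++) {
        /* results[i].fd / results[i].revents describe one ready descriptor. */
    }
    close(dpfd);
}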
Dev EPoll
What is /dev/epoll?
/dev/epoll is a patch to the Linux operating system that will enable applications specifically tailored to detect and use it to operate more quickly. It will not accelerate general Linux applications, and it will not make the kernel run more quickly. It only works on the Linux 2.4 series and above (including 2.5).
The patch creates a new device in the kernel, /dev/epoll, which allows programmers to efficiently enumerate pending events on a number of sockets or pipes. It works in a manner somewhat similar to poll() and is used in a very similar fashion to Solaris 8’s /dev/poll device.
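Going from memory of the patch's documentation, the usage pattern looked roughly like the sketch below; the EP_ALLOC and EP_POLL ioctls, struct evpoll and the EP_MAP_SIZE macro are specific to the patch's <linux/eventpoll.h> and should be treated as assumptions rather than gospel:

/* Sketch of the /dev/epoll patch's usage pattern (patched 2.4/2.5
   kernels only); names are as I recall them from the patch and may
   not match your headers. Error handling is omitted. */
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <poll.h>
#include <linux/eventpoll.h>        /* provided by the patch */

void dev_epoll_example(int fd, int maxfds)
{
    int epfd = open("/dev/epoll", O_RDWR);
    ioctl(epfd, EP_ALLOC, maxfds);             /* size the interest set */

    /* The result set is exposed through a read-only mapping. */
    char *map = mmap(NULL, EP_MAP_SIZE(maxfds), PROT_READ, MAP_PRIVATE, epfd, 0);

    /* Register a descriptor by writing a pollfd record, much as with
       Solaris /dev/poll. */
    struct pollfd pfd = { fd, POLLIN | POLLOUT, 0 };
    write(epfd, &pfd, sizeof(pfd));

    /* Collect pending events. */
    struct evpoll evp;
    evp.ep_timeout = 5000;
    evp.ep_resoff  = 0;
    int nready = ioctl(epfd, EP_POLL, &evp);
    struct pollfd *ready = (struct pollfd *)(map + evp.ep_resoff);
    int i;
    for (i = 0; i < nready; i++, ready++) {
        /* ready->fd / ready->revents describe one ready descriptor. */
    }
    close(epfd);
}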
In practice I always have a separate writing thread and reading thread, so I only ever wait on one type of operation. This reduces the complexity of my requirement for CNI Poll. However, some people may want to do everything on one thread; under this design they would be forced to run multiple threads, but I think I can live with that.
As long as I can get each of the IO types (Read, Write, Error, Disconnect) to map its state into a single semaphore, I can provide an all-in-one implementation. I may also have a different execution branch for pure READ or pure WRITE, which will give me the optimal speed I want when I run each type on a separate thread.
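A minimal sketch of that mapping, assuming POSIX counting semaphores and a hypothetical fixed-size event queue: a poll loop translates each ready descriptor into an event record and posts one semaphore per event, so a waiter only ever blocks on that single semaphore regardless of which IO type fired.

#include <poll.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical event record: which descriptor fired and which
   condition (read / write / error / hangup) it was. */
struct io_event {
    int   fd;
    short revents;
};

#define QUEUE_SIZE 256

static struct io_event queue[QUEUE_SIZE];       /* overflow not handled in this sketch */
static int             q_head, q_tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t           events_ready;            /* one counting semaphore for all IO types;
                                                   sem_init(&events_ready, 0, 0) at startup */

/* Poll loop body: translate every ready descriptor into an event record
   and post the single semaphore once per event. */
void poll_once(struct pollfd *pfds, int nfds)
{
    int i;
    if (poll(pfds, nfds, -1) <= 0)
        return;
    for (i = 0; i < nfds; i++) {
        if (pfds[i].revents == 0)
            continue;
        pthread_mutex_lock(&q_lock);
        queue[q_tail % QUEUE_SIZE].fd      = pfds[i].fd;
        queue[q_tail % QUEUE_SIZE].revents = pfds[i].revents;
        q_tail++;
        pthread_mutex_unlock(&q_lock);
        sem_post(&events_ready);                /* wake exactly one waiter */
    }
}

/* Waiter: blocks on the one semaphore, then discovers what happened. */
struct io_event next_event(void)
{
    struct io_event ev;
    sem_wait(&events_ready);
    pthread_mutex_lock(&q_lock);
    ev = queue[q_head % QUEUE_SIZE];
    q_head++;
    pthread_mutex_unlock(&q_lock);
    return ev;
}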