
Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

The I/O subsystem is the part of an operating system that interfaces with input/output devices. Since these devices come in a wide variety of flavors, it is usually the most complex part of the kernel. To handle this complexity, the I/O subsystem is typically split up among device drivers that all present their underlying devices to the kernel using one of a small number of standard interfaces—essentially an object-oriented approach.

Usual disclaimer about sketchy lecture notes applies. For more details see SGG Chapter 13 or TanenbaumBook Chapter 5.

1. I/O devices (user view)

Many flavors. These can be broadly classified into:

Block devices
Disk drives, tape drives, USB memory keys, etc. At an abstract level, these look like a big pile of bytes organized into blocks of some fixed size. (Technically, the use of blocks is an optimization, to amortize control costs over many bytes of data.) Typical operations: read (blocks), write (blocks), seek (next block to read/write).
Character devices

Also called stream devices. Keyboards, modems, printers, audio cards, robots, USB Nerf cannons. These look like a stream of bytes with no particular structure imposed by the device. Typical operations: read, write. Some may provide very sophisticated functionality on top of what is provided by the raw device; Unix terminal drivers are a good (?) example.

Network devices
Somewhat similar to stream devices, but distinguished by having built-in various high-level network protocols that may add more structure (e.g. UDP datagrams, TCP connections).

There are also some specialized devices that are less visible to the user but provide essential services to the kernel, like the interrupt controller (which typically includes a clock for scheduling regular interrupts) or a DMA controller (see below).

There are also several issues that come up quickly in designing the user-level I/O interface:

Asynchronous vs synchronous I/O

Asynchronous I/O is non-blocking: the program (or CPU) keeps going after starting an I/O operation, and is notified by an interrupt when the operation completes. Most I/O operations at the hardware level are asynchronous—this is good, because I/O is slow. But programming for asynchronous I/O at the user level is difficult. So the kernel will typically present a synchronous or blocking interface where low-level asynchronous operations appear to be synchronous by the simple expedient of blocking processes that are waiting for them.

Data from I/O devices often arrives at inconvenient times or in inconvenient units. The use of buffers in devices and in the kernel can hide some of this from the user. However, storing data in buffers can slow down I/O by requiring excessive copying, so there are trade-offs involved.
Can an I/O device be shared or multiplexed between multiple processes? If so, the kernel needs to take responsibility for handling multiple requests. Examples of sharable devices include disk drives (usually via a high-level filesystem interface), displays (via window systems), or audio devices (unless you are running an old version of Linux). Other devices like tape drives or modems may be non-sharable, and the kernel has to enforce that only a single process can use each at a time.

How does a process specify which device it wants? The Unix approach is to map devices into the filesystem (/dev/tty, /dev/hda1), which translates device names to a lower-level namespace consisting of major and minor device numbers. One can also structure a system to use a low-level namespace directly, but this loses the proverbial extra level of indirection that solves all programming problems.

2. I/O devices (hardware view)

At the hardware level, devices look very different from the user's view—so the operating system has to do a lot of work to hide this!

2.1. Controllers

We can think of each device as consisting of some physical object (the actual device) and a bit of hardware that presents an abstract interface to the object to the bus. This latter bit of hardware is called the device controller, and this is what the CPU interacts with. The question that remains is how to structure the interface between the CPU and the device controller.

Usually, a controller interacts with the CPU in up to three ways: through control registers that the CPU reads and writes (using either special I/O instructions or memory-mapped addresses, as described below); through data buffers that the CPU or a DMA controller reads and writes in bulk; and through interrupts that the controller raises to signal that an operation has completed.

2.2. Input and output instructions

Many CPUs provide IN and OUT instructions for doing I/O. These look like the MOV instructions used to access memory and take addresses (ports) as arguments; the port addresses may even travel over the same address bus as memory addresses, but a pin on the CPU signals that they are to be interpreted as I/O operations rather than memory operations, so the supporting hardware routes them to the appropriate device.



2.3. Memory-mapped I/O

Here instead of using a separate address space for I/O, we use the same address space as for memory. The address translation hardware takes responsibility for diverting I/O addresses to I/O devices.



2.4. Hybrid approach

One common approach is to use both I/O instructions and memory-mapped I/O. The idea is that I/O instructions can be used for small values that change quickly (e.g. control registers) and memory mapping can be used for large chunks of data that are read or written in bulk (e.g., display memory). So, for example, the PC architecture allocates the memory range 0xA0000–0xFFFF0 (640K–1M) to memory-mapped I/O, and it is here that one finds VGA display memory etc.

This approach still requires a bus controller to divert memory-mapped I/O addresses to the I/O bus.

2.5. Direct Memory Access (DMA)

For block devices, it often makes sense to bypass the CPU and give them direct access to memory. This is usually done by providing a bridge between the I/O bus and the memory bus in the form of a DMA controller. The DMA controller looks to I/O devices exactly like the CPU: it sends the same control instructions and reads data across the same bus. But it is much more limited in what it does to the data; all it can do is stuff it in memory at a physical memory address specified by the CPU (through control registers on the DMA controller accessed by the CPU through the I/O bus), and issue an interrupt when the operation is done.

Note that because the CPU and DMA controller share the same buses, some concurrency control mechanism is needed to prevent them from stepping on each other. (A similar issue arises with multiple CPUs). There are some options here for how aggressive the DMA controller is about grabbing the bus. A polite DMA controller might limit itself to cycle stealing—grabbing the bus for a few cycles from time to time, so that the CPU never loses control of the bus for more than a few cycles. A less polite but more efficient DMA controller might use burst mode where it seizes the memory and I/O buses long enough to complete a full block read or write operation, or possibly even several such operations. The downside of this approach is that the CPU might stall because it doesn't have enough data in its cache to proceed until the I/O operation is finished.

Using a DMA controller adds complexity to both the hardware and the OS. The payoff is that it takes load off the CPU, which might allow it to proceed with its own CPU-intensive tasks.

3. A typical I/O operation from start to finish

Suppose a user process asks to read a block off of the disk (we will ignore the filesystem interface for the moment, and imagine it is reading directly from a non-shared disk device). What happens?

  1. Process executes read system call with appropriate arguments.

  2. Kernel translates arguments to low-level device controller instructions and writes them to the disk driver controller registers (or possibly hands this task off to the DMA controller). It also blocks the process and proceeds with some other process.
  3. Disk drive controller does whatever magic it needs to do to get the relevant bits off the physical disk hardware.
  4. Bits are read into a buffer on the disk drive controller. These are transmitted serially across a cable between the disk drive controller and the disk.
  5. Disk drive controller verifies bytes read by checking checksums, looking for drive error codes, etc. It may retry the read operation if it failed.
  6. If everything worked, disk controller signals CPU and/or DMA controller that data is ready.
  7. CPU or DMA controller reads data from disk controller buffer and copies it to main memory.
  8. Kernel maps or copies data into process's address space and unblocks it.

There are several options for how the various stages of this process work. One big choice is which device is responsible for waiting for the I/O operation to complete. Choices are:

Programmed I/O

CPU runs everything itself. Device controller is a pathetic stub whose only purpose is to translate CPU bus commands directly to the I/O hardware. Examples are WinModems, some printer controllers, some very early primitive graphics controllers. Advantage: Allows for cheaper hardware; may allow for very sophisticated I/O programming (handy with modems). Disadvantage: Like hiring a Formula 1 mechanic to change your oil; CPU has to poll I/O device to see if it needs to do anything; in the worst case, I/O device eats up the entire CPU, locking up everything else while the I/O operation proceeds.

Interrupt-driven I/O
Device controller does most of the work, notifies CPU when done. CPU still has to extract data itself.
DMA-based I/O
DMA controller takes over CPU's role, interrupts CPU when data is available in memory. Advantage: minimum load on CPU, especially in the form of fewer interrupts since DMA controller can do more buffering. Disadvantage: maximum hardware complexity.

4. I/O system architecture

From the ground up, we have

Interrupt handlers

These respond immediately to I/O interrupts and handle incoming data, usually by calling an interrupt service procedure supplied by the device driver (see below). Core kernel component.

Device drivers

Higher-level interface that translates user-visible device abstractions (e.g. block devices) to low-level incantations sent to device controllers. Usually in kernel because of the need for privileged I/O instructions. On portable operating systems, are often separated into an upper half that is machine-independent and a lower half that varies depending on the CPU architecture. The lower half executes machine-specific I/O instructions and usually provides the guts of the interrupt service procedure.

Device-independent I/O software
Anything that doesn't depend on the specific details of a device beyond its abstract description as e.g. a character or block device. This may include higher-level mechanisms like network stacks or filesystems as well as lower-level mechanisms like virtual-memory management or I/O scheduling systems.
User-level I/O libraries

Think stdio. Adds another layer of buffering and abstraction on system call interface.

User programs
What the game is ultimately all about.

4.1. Device driver architecture

Having gazillions of bizarre devices, each with its own specialized control mechanism, leads to pressure to move device drivers out of the core kernel. This gives a progression of increasingly decoupled device driver designs: drivers compiled directly into the kernel, loadable kernel modules, and (with microkernel trickery) driver processes pushed out of the kernel entirely.

4.1.1. What a device driver module looks like

First we need to settle on a standard interface. For a Unix-style block device, we might have something like:

struct dev_hdr {
    /* initialize the device; called once at boot time or module load time */
    int (*init)(void);

    /* shut down the device; called once at shutdown */
    int (*halt)(void);

    /* read or write the given number of bytes to or from buffer */
    int (*read)(unsigned long bytes, void *buffer);
    int (*write)(unsigned long bytes, void *buffer);

    /* next read or write works on this position */
    int (*seek)(unsigned long position);

    /* loophole operation for everything else */
    /* uses variable arguments */
    int (*ioctl)();
};

A character device would be similar, except it wouldn't provide seek.

This is essentially an implementation in C of a C++ or Java method table. The ioctl loophole is there to provide extra methods for devices that need them (e.g. for setting parameters on a disk drive, turning autofire off on your Nerf cannon, or toggling the LEDs on your keyboard). In a real OO programming language this would be handled by subclassing.

As kernel development continues and certain operations start appearing across multiple devices, they are likely to move out of the ioctl bin and into the main structure. The final result might be something like the file_operations data structure from the Linux 2.6 kernel (see /usr/src/linux/include/linux/fs.h):

struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *, int datasync);
        int (*aio_fsync) (struct kiocb *, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
        ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
        ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*dir_notify)(struct file *filp, unsigned long arg);
        int (*flock) (struct file *, int, struct file_lock *);
        int (*open_exec) (struct inode *);
};

Here the struct file * arguments are pointers to the kernel's internal data structure associated with the device (which is pretending to be a file from the point of view of the file system), the equivalent of self or this in an OO language. This allows using the same driver procedures for multiple similar devices.

4.1.2. Loading a device driver

Device drivers for core resources like the display or keyboard are likely to be built directly into the kernel. Device drivers for hardware that may or may not be present are more likely to show up as loadable modules.

Loading a device driver is much like loading a program. The kernel (or possibly a userspace process with especially good access to kernel memory) reads the driver from disk (typically including a linking and relocation step), allocates space for it, then calls its setup routine to initialize it. This initialization may require further calls to internal kernel routines to allocate more space or new kernel threads, may involve setting up the device, and so forth.

Subsequent calls to the device driver happen as internal procedure calls via the driver's jump table. The driver generally has full access to the kernel's address space and is usually compiled against a copy of the kernel so that it can be linked to kernel procedures.

4.1.3. Risks

Loadable kernel modules are the biggest risk to kernel integrity, since they are often written by monkeys but run with full access to kernel internals. A rule of thumb is that device drivers have 5-10 more bugs per thousand lines of code than core kernel code, and unlike a user process a broken device driver can bring down the entire system. In the worst case kernel modules can be used maliciously to build rootkits and similar bits of nastiness. At minimum this means that we can't let just anybody load new modules into the kernel. In the long run, it creates pressure to provide better containment for device drivers, either by using strongly-typed languages and/or signed code to constrain or at least vet their behavior, or by pushing them out into a less destructive part of the system using microkernel trickery.

4.2. Device-independent I/O components

These include things like buffering and scheduling mechanisms. The idea is that once we get above a standard device abstraction, we can implement common techniques like buffering and I/O scheduling in one place instead of scattered throughout the device drivers. It is also here that we expect to see things like terminal drivers, filesystems, and network protocol implementations.

5. Example: console/keyboard driver

Let's look at how we might write a driver for the console (built-in VGA display) and keyboard on a PC. We might or might not want to separate these out into separate devices, but for the moment let's imagine that we've globbed them together as a single character device, where writing to the device prints to the screen and reading from the device reads characters from the keyboard.

5.1. Output

Writing to the screen is the easier task, since the screen is memory-mapped and we don't have to deal with issues like interrupts or the screen not being ready for output yet. We will probably still maintain some auxiliary information like cursor position in some kernel-internal data structure, so that we can present users with a clean stream-of-characters interface rather than force them to keep track of this for themselves.

So for output we will provide a write routine that:

  1. Takes a buffer of characters and copies them to screen memory at the location of the current cursor.
  2. Updates the cursor position.
  3. If necessary, wraps or scrolls the screen (which may involve quite a bit of copying!)

This routine will typically be called inside some process's kernel thread after being dispatched from some higher-level write system call.

Some questions we might want to ask: Can more than one process write to the screen? If so, what happens? At minimum we probably want to protect the screen data structures with a mutex so that write operations appear to be atomic.

We may also want to allow the user to control other features of the screen like the starting cursor position, character color, etc. There are two obvious ways to go about this:

  1. Use terminal escape sequences. These are codes embedded in the stream of characters being written that change the terminal's behavior. There is an ANSI standard for terminal escape sequences (see ANSI_escape_code), so this has the advantage of compatibility with vast piles of software that expect to be able to use them. The downside is that it adds complexity to the terminal driver and may require quoting output bytes that aren't intended to control the terminal functions.

  2. Use the ioctl mechanism or some similar expansion mechanism to provide system calls that allow access to cursor position, character attributes, etc. This is in many ways a less desirable solution, even though it provides more consistency with the standard device driver approach. One reason why this is less desirable is that a program (say a text editor) may wish to update the cursor position and change character attributes many times within a short sequence of output. With an ioctl approach this would require many system calls—and their associated overhead both in time and programmer effort—instead of just one.

In practice the cost of doing multiple system calls means all but the laziest OS's provide some mechanism for processing escape sequences. This choice is analogous to the "little language" that appears in printf format specifiers (or FORTRAN FORMAT statements before them), where again a pure procedural interface is replaced by string processing.

5.2. Input

Keyboard input is more interesting, since we can't predict when the user will hit a key. On the IBM PC architecture the main keyboard is assigned IRQ 1, so we can use an interrupt handler to snarf keyboard input as it arrives. This is done by executing an inb operation on port 0x60, which returns a scan code describing the position of the key that was pressed or released. The scan code is not likely to be very useful to a user process, so we will probably want to do some translation. This gives us three steps in processing a keyboard event:

  1. Handling the interrupt. At minimum we need to save the incoming scan code somewhere.
  2. Translating the scan code to a more useful form. Many keyboard events we may be able to safely ignore (like releasing the E key). Some will require updating internal state that affects further translation (shift down/shift up). Some might even require doing some output operations to update the keyboard status (caps lock down—requires turning on or off the caps lock light). We may also want to give the user control over how this translation is done (a video game might want to know when the user releases the keep-jumping key).

  3. Delivering the result to any interested process or processes.

The interesting part here is that we have a lot of options over where to do these operations. If our goal is to get out of the interrupt handler as fast as possible, we might want to just do step 1 there, sticking the incoming scan code into a ring buffer somewhere and returning immediately. The translation and delivery steps could then be handled by any of (a) a kernel thread that wakes up from time to time when the ring buffer is nonempty (semaphore!) and does translation (monolithic kernel approach); (b) a userspace daemon that does the same (microkernel approach); or (c) a standard library routine that uses an address space mapping that provides direct access to the raw scan code buffer (exokernel approach). The more distant we make the translation and delivery process from the core of the kernel, the more control we give the user (e.g., being able to substitute different translation schemes, being able to stuff fake keyboard events into the input stream, being able to build a fancy keyboard-dispatching daemon that sends all the vowels to process 1 and all the consonants to process 2—who knows?), but the more costs we impose. For keyboard input, the costs are likely to be pretty low no matter what we do, so a simple approach is probably best. This probably either involves providing raw untranslated input or doing translation at the kernel level, perhaps even in the interrupt handler (since it's not very expensive).

We still have to decide how to get keystrokes to the user process. The choices here are eerily similar to the choices between programmed I/O (polling), interrupt-driven I/O, and DMA-based I/O at the kernel level:

  1. Provide a getchar system call that lets a user process wait for a key event. To implement this, we build a ring buffer in the kernel with an attached semaphore and have the user process (really its corresponding kernel thread) down (proberen) the semaphore when it wants a character. The process will wake up again (and extract the character from the ring buffer) when a character is available, because the interrupt handler or device driver upper half will up (verhogen) the semaphore when it adds a character to the buffer. We can also make a non-blocking version by allowing the process to peek at the buffer (presumably protected by a mutex) to see if there is anything there. The downside of this approach is that with the blocking version, the process can't do anything else while it is waiting (unless it has multiple threads) and with the non-blocking approach it wastes a lot of time checking for keypresses that haven't arrived yet.

  2. Signal the process asynchronously when a character arrives, e.g. using the POSIX SIGPOLL mechanism. Annoying for processes that really do want to wait for keypresses.
  3. Deliver the message using some high-level InterProcessCommunication mechanism. The nice thing about this is that now we can make anything pretend to be a keyboard. The disadvantage is that there may be a lot of overhead.

We also have an independent choice about when to deliver characters. Many processes might only care about receiving characters when the user hits enter, and might like the OS to handle details like buffering incoming characters and processing simple editing commands. There is a good chance that a process also would like the OS to echo typed characters to the screen (at least some of the time). The logical response is to provide a sophisticated terminal driver that handles all of these tasks, delivering pre-masticated lines of text to the process via read system calls (if the terminal driver lives in the kernel) or an IPC mechanism (if it doesn't; in this case we may be able to give the IPC mechanism a file-like interface so the receiving process doesn't notice). The more sophisticated this terminal driver is, the more incentive there is to move it out of the kernel so that bugs won't trash anything else, but also the more incentive there is to keep it in the kernel so that it has fast access to in-kernel data. It's not clear, in the limit when the terminal driver grows to include a window system, which approach is better; internal window systems (Windows) and external window systems (X11, as used in Linux and OS X) are both widely used.

6. Example: disk drivers

For block devices like disks the situation is more complicated, because the data rate is much higher (~500 Mb/s vs ~10 bytes/s for the keyboard) and latency is much more of a problem since so much might be waiting on the next disk access. So we want a disk driver that can buffer and multiplex disk accesses so that processes (and internal kernel threads like the virtual memory pager) think they happen quickly even if they don't. We also have to deal with the fact that disk operations are long, multi-stage processes that involve both input and output (even if we are only moving data in one direction), so we can't hope to just bury everything down in an interrupt handler.

We can simplify the design by assigning a kernel thread (or userspace daemon) to manage each disk drive. The input to the thread is a queue of disk operation requests that are inserted by higher-level parts of the I/O subsystem, and the thread runs in a loop where it selects an operation to perform (possibly out of order to optimize disk scheduling—we'll talk more about this when we talk about FileSystems), sends the appropriate commands out to the disk controller or DMA controller, and then sleeps until the disk operation is finished (which will be signaled by an interrupt handler whose job it is to wake up the driver thread). The thread can then deliver the results to the requesting process or kernel thread via internal kernel buffers or a high-level IPC mechanism (the former is likely to be faster since it may involve less copying).

The downside to this approach is that it may involve a lot of overhead if we have a lot of disk drives, or if we need to do more coordinating between disk operations (e.g. by scheduling access to the bus). So a more sophisticated system might have a single kernel thread that polls all the drives (with the state of the former multiple threads moved into one big table somewhere) or might do disk operations as part of some regular polling procedure that runs as part of the periodic timer interrupt. There are many options here.


2014-06-17 11:58