Block devices are I/O devices—principally disk drives and similar mass-storage devices—that are typically accessed by slow operations whose cost is amortized over large blocks of data.
1. The problem with mass storage
Mass storage devices are slow. Typical consumer-grade disk drives have seek times on the order of 10 ms, during which a reasonably fast CPU of 2007 can execute about 10^8 instructions. This alone means that disks must be given special attention to ameliorate their huge costs. There are several additional properties of disks that the OS must work hard to mitigate:
- There is a long delay between initiating a read or write operation and its completion. This means that the disk I/O subsystem can't block anything else while waiting, and there is an incentive to buffer and cache data as much as possible to hide the delay.
- Disk delays can vary by a wide margin depending on the physical placement of the data on the disk. This encourages careful scheduling of disk operations.
Disks have the role of providing stable storage—storage that survives reboots or power failures—but are nonetheless themselves mechanical devices prone to breakdown. This requires redundancy mechanisms like RAID systems and backup to tertiary storage devices like optical disks or magnetic tape.
2. Hard disk structure
A hard disk consists of one or more platters spinning at 5400–7200 RPM. Each platter has one or more heads that ride above it on a thin cushion of air dragged along by the disk platter (disk drive heads are "ground-effect vehicles" like hovercraft). The heads contain electromagnets that can be used to sense or rewrite patterns of magnetic orientation in the disk platter. The entire mechanism is sealed to prevent dust from entering.
A typical arrangement is to mount all heads on a single arm, so that they move toward or away from the center of the disk in unison. This makes it natural to divide the disk up into cylinders, each consisting of a stack of circular regions, one per platter, which are individually called tracks. The tracks are further divided into sectors, each of which stores one block of data. The interface that the disk presents to the outside typically allows the OS to access data by specifying a cylinder (distance from the center of the disk), track (which platter), and sector (position around the track), giving a three-dimensional coordinate system. These coordinates are usually lies, for two reasons. First, the disk usually has some extra space to make up for bad spots on the medium, and the disk's internal controller tracks the bad spots and allocates spare sectors to replace them. Second, most modern disks allocate more sectors to tracks toward the rim of the disk (which have a larger circumference) than to tracks toward the middle, so there is some translation between constant-sized logical tracks and variable-sized physical tracks. However, as OS writers we can probably count on some correlation between track numbers and physical locations, and it is useful to exploit this correlation to speed up disk access.
The time to read data off the disk is dominated by the seek time (the time to move the disk head to the right track) and the rotational latency (the time to wait for the right sector to come around underneath the head).
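A back-of-the-envelope calculation makes the point concrete. The numbers below (9 ms average seek, 7200 RPM, 512 KB per track) are illustrative assumptions, not specifications of any particular drive:

```python
# Rough access time for one random 4 KB read:
# total = seek + rotational latency + transfer (all values assumed).
RPM = 7200
SEEK_MS = 9.0                         # assumed average seek time
rotation_ms = 60_000 / RPM            # one full revolution: ~8.33 ms
rotational_latency_ms = rotation_ms / 2   # on average, wait half a turn

BYTES_PER_TRACK = 512 * 1024          # assumed track capacity
transfer_ms = rotation_ms * (4096 / BYTES_PER_TRACK)  # one 4 KB block

total_ms = SEEK_MS + rotational_latency_ms + transfer_ms
print(round(total_ms, 2))             # prints 13.23
```

Note that the transfer time itself is negligible; nearly all of the cost is mechanical positioning, which is why scheduling and caching pay off so well.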
3. Bad blocks and error correction
It's hard (or at least more expensive) to build a perfect disk. So we can expect our disk platters to have some junky regions that don't hold data very well.
To detect this, we need to use some sort of checksum or error-correcting code. If we notice that some sector is consistently bad, we can put it on a bad blocks list that is stored elsewhere on the disk. Historically, this was often the job of the filesystem, under the assumption that bad blocks were (a) rare and (b) stable. Current high-density disks have enough bad blocks that this task is often handed off to the disk controller. With an error-correcting code, we can tolerate a degraded block (since the ECC fixes the missing bits), and if it consistently misbehaves, we'll cross it off. This is particularly important for devices like Flash drives, where individual sectors can only be written a limited number of times before they become unreliable.
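A minimal sketch of checksum-based detection, using Python's `zlib.crc32` as the checksum. A CRC can only detect corruption, not correct it as a true ECC would, and the sector layout here (checksum stored alongside the data) is hypothetical:

```python
import zlib

def write_sector(data: bytes) -> tuple[bytes, int]:
    """Store the sector together with a CRC32 checksum."""
    return data, zlib.crc32(data)

def read_sector(data: bytes, stored_crc: int) -> bytes:
    """Verify the checksum; a mismatch means the block has gone bad."""
    if zlib.crc32(data) != stored_crc:
        raise IOError("bad block: checksum mismatch")
    return data

sector, crc = write_sector(b"\x42" * 512)
assert read_sector(sector, crc) == sector    # clean read succeeds
corrupted = b"\x00" + sector[1:]             # simulate a flipped byte
try:
    read_sector(corrupted, crc)
except IOError:
    print("detected")   # a real driver would add this sector to the bad-block list
```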
4. Disk scheduling
Most disk drivers reorder disk operations to reduce head movement. This involves the use of a disk scheduling algorithm. Some options:
- First-come first-served
- Exactly what it says. For randomly-arranged requests, requires seeking across roughly a third of the disk on average. Guarantees no starvation.
- Shortest-seek-time first
- Greedy approach; perform the operation that requires minimum head movement. Easily allows starvation.
- Elevator algorithm
- Head moves in one direction only until it reaches the edge of the disk, then reverses direction. Operations are performed as the head reaches the appropriate track. This theoretically still allows starvation (imagine a continuous stream of operations on a single track that pins the head there), but avoids some of the problems with SSTF. There are several variants depending on whether the head processes requests in both directions, whether it is dumb enough to roll all the way to the edge of the disk even if there is nothing to do there, etc.
- Priority methods
- Like priority scheduling in the CPU: some disk operations are given higher priority than others. Examples would be giving priority to the VM system (especially page writes) or to user processes over low-priority background processes (e.g. antivirus scanners—this sort of disk prioritization is one of the features added to Windows with the Vista release). These can also be used to prevent starvation by assigning high priorities to particularly old requests (e.g., Linux mostly uses SSTF but prioritizes disk requests over 5 seconds old).
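The elevator algorithm is simple enough to sketch in a few lines. This version (a minimal simulation, not driver code) services every request in the current direction of travel, then sweeps back for the rest; the request numbers in the example are arbitrary track numbers:

```python
def elevator_order(requests, head, direction=1):
    """Return the order in which pending track requests are serviced.

    requests: pending track numbers; head: current track;
    direction: +1 (toward higher tracks) or -1 (toward lower ones).
    """
    # Requests in the direction of travel, serviced in sweep order...
    ahead = sorted(t for t in requests if (t - head) * direction >= 0)
    # ...then the rest, serviced on the return sweep.
    behind = sorted((t for t in requests if (t - head) * direction < 0),
                    reverse=True)
    if direction < 0:
        ahead, behind = ahead[::-1], behind[::-1]
    return ahead + behind

print(elevator_order([98, 183, 37, 122, 14, 124, 65, 67], head=53))
# prints [65, 67, 98, 122, 124, 183, 37, 14]
```

Note that a new request for the head's current track can be inserted at the front of `ahead` on every call, which is exactly how the starvation scenario described above arises.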
5. Block device driver implementation
Because we want to be able to reorder requests, the simple approach to building a device driver where some process locks the device, executes its request, then unlocks the device doesn't work very well. So instead the usual approach is to set up a queue for operations and have the device driver draw a new operation from the queue whenever the previous operation finishes. Users of the driver interact only through the request queue, inserting new operations into the queue (and then possibly blocking waiting for the result). The driver itself is a thread or process that in its simplest form executes a loop that chooses an operation from the queue, signals the disk controller and/or DMA controller to begin executing it, then blocks until the operation completes.
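The queue-plus-driver-thread structure can be sketched as below. This is a toy simulation (the "disk I/O" is faked by appending to a list, and the queue is FIFO rather than scheduled), but it shows the shape: clients enqueue a request and block on a per-request event, while the driver thread loops pulling requests and completing them:

```python
import queue
import threading

request_q = queue.Queue()

def driver_loop():
    """Driver thread: draw the next operation, perform it, signal completion."""
    while True:
        req = request_q.get()
        if req is None:                        # shutdown sentinel
            break
        op, block, done, result = req
        result.append(f"{op} block {block}")   # stand-in for real disk I/O
        done.set()                             # wake the waiting client

def submit(op, block):
    """Client side: insert a request into the queue, block until it completes."""
    done, result = threading.Event(), []
    request_q.put((op, block, done, result))
    done.wait()
    return result[0]

t = threading.Thread(target=driver_loop)
t.start()
x = submit("read", 7)
print(x)                 # prints "read block 7"
request_q.put(None)      # shut the driver down
t.join()
```

In a real driver, the loop body would program the disk controller or DMA engine and then block on an interrupt rather than completing the request inline, and the queue would be reordered by one of the scheduling algorithms above.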
6. Buffering, caching, and prefetching
Because the disk is so slow, a sensible driver will buffer disk blocks in main memory. This allows write operations to appear to finish immediately; if some other process asks to read the block just written, the disk driver returns the buffered block without waiting to write it to disk (or read it back). The disk driver can also maintain in the same space a pool of recently read blocks, so that it is not necessary to go to the disk if some process asks for one again. Most modern kernels will use any spare available memory for this purpose—it's cheap, because we have to read incoming disk blocks into memory somewhere, and since the data is backed by the disk we can always throw these cached blocks away.
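A tiny write-back buffer cache might look like the sketch below. The capacity, LRU eviction policy, and dict-as-disk are all illustrative choices; real kernels track dirty blocks and flush them periodically as well as on eviction:

```python
from collections import OrderedDict

class BufferCache:
    def __init__(self, disk, capacity=4):
        self.disk, self.capacity = disk, capacity
        self.cache = OrderedDict()            # block number -> (data, dirty)

    def write(self, block, data):
        self.cache[block] = (data, True)      # write "finishes" immediately
        self.cache.move_to_end(block)
        self._evict_if_full()

    def read(self, block):
        if block in self.cache:               # hit: no disk access at all
            self.cache.move_to_end(block)
            return self.cache[block][0]
        data = self.disk[block]               # miss: go to the disk
        self.cache[block] = (data, False)
        self._evict_if_full()
        return data

    def _evict_if_full(self):
        while len(self.cache) > self.capacity:
            victim, (data, dirty) = self.cache.popitem(last=False)  # LRU
            if dirty:
                self.disk[victim] = data      # write back only on eviction

disk = {n: f"old{n}" for n in range(10)}
bc = BufferCache(disk)
bc.write(3, "new3")
assert bc.read(3) == "new3" and disk[3] == "old3"   # served from cache; disk untouched
```

The last line shows the trade-off: the write has "completed" from the user's point of view while the disk still holds stale data, which is why an unplanned power loss can lose recent writes.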
A further extension of this approach is prefetching. If the disk driver (or higher-level filesystem) has a good model of what blocks will be needed, it can fetch and cache these blocks off of the disk when it is cheap to do so or the disk is otherwise idle. This can vary from simple approaches like reading the next N sectors after the one that is requested (under the assumption that many file reads grab contiguous blocks of data) to recording detailed statistical data about what blocks are needed when and prefetching them into memory (e.g., Windows Vista's SuperFetch). Even a small reduction in the number of unpredicted disk fetches can noticeably reduce latency.
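The simple read-ahead case can be sketched in a few lines. The read-ahead window of 4 blocks is an arbitrary assumption, and a real implementation would issue the extra reads as one cheap sequential transfer rather than a loop:

```python
READ_AHEAD = 4   # assumed window size

def read_with_prefetch(disk, cache, block):
    """On a miss, fetch the requested block plus the next few,
    betting that the access pattern is sequential."""
    if block not in cache:
        for b in range(block, min(block + READ_AHEAD, len(disk))):
            cache[b] = disk[b]
    return cache[block]

disk = [f"block{n}" for n in range(16)]
cache = {}
read_with_prefetch(disk, cache, 5)
assert sorted(cache) == [5, 6, 7, 8]   # 6, 7, 8 came along almost for free
```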
7. Network-attached storage
- Move disks to the other side of a network.
- Disk access now goes through a network protocol instead of through the system bus.
- Disks can be run by another general-purpose computer or by a dedicated NAS device.
The most common approach is to do this at the filesystem level, e.g., with NFS.
8. Removable storage
- E.g., USB keys, Flash/SD/MMC cards, tapes.
- Mounting and unmounting: need to flush buffers before removing the device.
Historically mostly used as slow but cheap tertiary storage; also used for data portability.
Nothing beats the bandwidth of a station wagon full of tapes—still true?
9. Multiple disks
- Striping: split sectors across N different disks.
- Now we can read them N times as fast!
- Except failures happen N times more often.
- Solution: RAID (Redundant Array of Inexpensive Disks)
Elaborate catalog of RAID architectures or levels (see SGG §12.7.3).
- Basic idea: duplicate data so we can reconstruct it after one or two disks fail.
- Naive approach: make extra copies
- Sophisticated approach: use error-correcting codes
- Real-world example: the Google File System uses the naive approach.
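The error-correcting-code approach, in its simplest form (single-parity RAID as in RAID 4/5), is just XOR: the parity strip is the XOR of the data strips, so any one lost strip can be rebuilt from the survivors. A minimal sketch with three hypothetical data disks:

```python
def xor_strips(strips):
    """XOR same-sized byte strips together."""
    out = bytes(len(strips[0]))
    for s in strips:
        out = bytes(a ^ b for a, b in zip(out, s))
    return out

data = [b"AAAA", b"BBBB", b"CCCC"]   # strips on three data disks
parity = xor_strips(data)            # stored on the parity disk

# Disk 1 fails; reconstruct its strip from the survivors plus parity:
# A ^ C ^ (A ^ B ^ C) == B.
rebuilt = xor_strips([data[0], data[2], parity])
assert rebuilt == b"BBBB"
```

This is why one extra disk's worth of space protects N data disks against a single failure, at the cost of reading every surviving disk to rebuild.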
10. Other block devices
Things that look like disks but aren't:
- RAM disks.
- Loopback devices.