A log-structured file system is a method for organizing blocks in a filesystem so that writes are always appended to the end of the filesystem. This has several advantages over traditional approaches:
- We get journaling automatically: all file operations are effectively atomic, reducing the need for consistency checking.
- We may get versioning as a bonus: old versions of files are still present in the log, so we can undo mistakes or recover old versions of files.
- For write-intensive applications, the fact that writes all go to the same location may improve write performance substantially (by an order of magnitude in some cases).
- There is no free space fragmentation.
On the other hand, we do pay a price:
- There is quite a bit of data fragmentation, since updating a few blocks in a file places the new copies at the end of the log, wherever it happens to be.
- Some sort of garbage-collection process is necessary to recover free space from the beginning of the log.
Log-structured filesystems were proposed by Rosenblum and Ousterhout in a 1991 SOSP paper (see lfsSOSP91.ps). To date they have not been widely adopted, although some of the ideas have shown up elsewhere.
1. The first step: journaling
Suppose we want to guarantee that file operations maintain consistency even in the presence of power failures, sudden media removal, etc. One approach, derived from techniques used for databases, is to maintain a journal of pending actions. So now if we want to update a file (say by adding a block to the end of it), we
- Write an entry in the journal describing the change,
- Implement the change in the main filesystem, and
- Mark the journal entry as completed, so that the space it occupies can eventually be reused.
Suppose now we pull the plug on the computer somewhere in the middle of this process. At recovery time, instead of scanning the entire disk to look for inconsistencies, we can just look at the uncompleted journal entries and carry out whichever ones have not actually taken effect in the main filesystem. (Any partial journal entries are ignored.) This is the approach, for example, taken in the default Linux ext3 filesystem.
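The three-step journaled update and the replay-at-recovery idea can be sketched as follows. This is a minimal toy model, not how ext3 or any real journal is laid out; the names (Journal, update, recover) and the dictionary-as-filesystem are invented for illustration.

```python
# Toy write-ahead journal: log the change, apply it, mark it complete.
class Journal:
    def __init__(self):
        self.entries = []  # journal entries, in write order

    def log(self, op):
        # Step 1: record the intended change before touching the filesystem.
        self.entries.append({"op": op, "completed": False})
        return len(self.entries) - 1

    def complete(self, idx):
        # Step 3: mark the entry done so its space can eventually be reused.
        self.entries[idx]["completed"] = True

def update(fs, journal, key, value):
    idx = journal.log(("set", key, value))
    fs[key] = value                 # Step 2: apply to the main filesystem.
    journal.complete(idx)

def recover(fs, journal):
    # After a crash: replay only the uncompleted entries; completed
    # entries already took effect in the main filesystem.
    for e in journal.entries:
        if not e["completed"]:
            _, key, value = e["op"]
            fs[key] = value
            e["completed"] = True
```

If the plug is pulled after step 1 but before step 2, recovery replays the logged change, so the filesystem ends up in the same state as if the update had finished.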
By adopting this approach, we can guarantee consistency with minimal recovery time, but we pay a high price: since we have to write an update to the journal before writing it to disk, every change we make gets written twice. In practice, this problem is usually dealt with by limiting journaling to metadata: directory structure, free lists, etc., but not the actual contents of files. This is a compromise that greatly reduces the size of the journal (and thus the amount of writing needed to maintain it) while still protecting the consistency of the filesystem's internal data structures. However, it allows for the possibility that user data is invisibly corrupted (consider what happens if we journal adding a block to the end of a file before we actually write the block). Some of this corruption can be avoided by carefully scheduling the order of disk operations, but this may conflict with other disk scheduling goals (like not moving the head too much).
2. Ditching the main filesystem
The key idea of Rosenblum and Ousterhout's paper was to go one step further and get rid of the second step of updating the main filesystem; indeed, they drop the main filesystem completely, by folding it into the journal. In order to avoid having to reimplement too much, they keep the overall structure (superblock, free list, inodes, indirect blocks, blocks) of the Berkeley Fast File System, but most of these blocks are no longer updated in place but instead are appended to the end of the log in large multi-block segments that can be written very quickly in bulk. This means that particular filesystem elements like inodes may appear in several versions in the log, but only the last version counts. It also means that we need a mechanism to track down blocks that previously lived at fixed locations (e.g. the inode table) but that are now scattered throughout the log.
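The "only the last version counts" rule can be sketched with an append-only log of blocks. This is a deliberately simplified model with invented structures; real segments carry much more bookkeeping.

```python
# Toy append-only log: updated copies of an inode are appended, never
# overwritten in place; the live version is the most recent one.
log = []

def append_segment(blocks):
    # A segment is a batch of blocks written to the end of the log in
    # one bulk operation.
    log.extend(blocks)

def latest_inode(inum):
    # Scanning backwards finds the most recent (and therefore live) copy.
    for block in reversed(log):
        if block.get("type") == "inode" and block.get("inum") == inum:
            return block
    return None
```

A real implementation would never scan the whole log to find an inode; that is exactly the problem the inode map described below solves.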
Below we will describe the specific mechanisms used in SpriteFS, the filesystem described in the paper. Other approaches are possible.
2.1. The inode map
As always, these problems are solved by adding more indirection. An inode map gives the current location of all the inodes. This inode map is itself stored in the log, and updates to the inode map require writing new versions of inode map blocks.
So how do we find the inode map? SpriteFS uses a checkpoint region at a fixed location on the disk (specified in the superblock, which is also fixed). This checkpoint region contains pointers to the most recent blocks in the inode map and to the front of the checkpointed portion of the log. It does not need to be updated after every log operation; it is possible during recovery to replay any part of the log that extends beyond the last checkpointed entry. So the checkpoint region acts primarily as a backup copy of the real data kept in memory.
In the worst case, if the checkpoint region is lost or damaged, it is possible to recover it from a full scan of the log.
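The chain of indirection from checkpoint region to inode map to inode can be sketched as below. The block layout and field names are invented for illustration; SpriteFS's actual on-disk format differs.

```python
# Toy disk: block address -> block contents.
disk = {}
CHECKPOINT_ADDR = 0   # fixed location, named in the superblock

def write_checkpoint(imap_addr):
    # The checkpoint region points at the current inode map block.
    disk[CHECKPOINT_ADDR] = {"imap_addr": imap_addr}

def lookup_inode(inum):
    # Two levels of indirection: checkpoint -> inode map -> inode.
    checkpoint = disk[CHECKPOINT_ADDR]
    imap = disk[checkpoint["imap_addr"]]      # inum -> inode address
    return disk[imap[inum]]
```

When an inode is rewritten at the end of the log, a new inode map block (pointing at the new copy) is appended too, and eventually a checkpoint records the new map location.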
3. Space recovery
With an infinitely large disk, we don't have to worry about recovering space. But we don't have infinitely large disks. So a log-structured filesystem requires a mechanism for recovering free space from old segments at the start of the log. An approach that was considered but discarded by Rosenblum and Ousterhout was threading the log through free space within early segments: turning the log into a linked list that overlaps itself in physical block locations. The disadvantage is that the log becomes fragmented and it is no longer possible to use bulk writes. Instead, SpriteFS adopts the approach of cleaning old segments by copying live data to the end of the log and then reclaiming the entire segment.
3.1. Segment summary data
To facilitate this process, each segment contains a summary header that describes which blocks belong to which versions of which inodes. This allows the cleaner to quickly detect out-of-date blocks (since it can check to see if the corresponding inodes are current by looking at the in-memory inode map).
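The liveness check can be sketched as follows: a block is live only if the current inode still points at that particular copy. The summary and inode map structures here are invented for illustration and are much simpler than the real ones.

```python
# Toy segment cleaner liveness check.
def live_blocks(segment, inode_map):
    # segment["summary"] records, for each block in the segment, which
    # inode and file offset it was written for, and its log address.
    live = []
    for entry, block in zip(segment["summary"], segment["blocks"]):
        inode = inode_map.get(entry["inum"])
        # Live only if the current inode still maps this offset to
        # this exact copy; otherwise a newer copy exists further down
        # the log and this one is junk.
        if inode is not None and \
                inode["blocks"].get(entry["offset"]) == entry["addr"]:
            live.append(block)
    return live
```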
3.2. Data compaction
Live blocks in old segments are copied to the front of the log. There is an opportunity here for data compaction: blocks from the same file can be sorted together to increase locality for later read access. This works especially well if the segment cleaner can process many old segments at once, since it increases the opportunity to sort blocks together.
3.3. Which segments to clean?
The segment cleaner can be selective about which segments it attempts to clean. Cleaning a segment that consists of mostly live data won't gain us much space. So SpriteFS adopts a policy of cleaning segments that are mostly junk first (if there aren't any, the disk is probably full). This means that segments are not necessarily cleaned in the order they are generated, which slightly complicates the process of choosing segments to clean. But it also means that static data doesn't add much to the cost of cleaning, making cleaning costs close to a linear function of update costs. The effect is similar to generational garbage collection strategies in programming language runtimes, where static data is considered less and less often by the GC system.
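The selection policy can be sketched as a greedy choice by segment utilization. This is a simplification: the paper's actual policy also weighs segment age (cost-benefit cleaning), which this sketch omits.

```python
# Toy cleaning policy: clean the mostly-junk segments first.
def pick_segments_to_clean(segments, count=1):
    # utilization = fraction of the segment still occupied by live data.
    # Low-utilization segments yield the most free space per byte of
    # live data copied, so the cleaner takes them first.
    return sorted(segments, key=lambda s: s["utilization"])[:count]
```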
4. Performance
Log-structured filesystems assume that write performance is more of a constraint than read performance. The justification for this assumption is that read performance can be improved by increasing cache sizes. For write-heavy loads on disks that aren't too full, a log-structured filesystem performs much better than a traditional filesystem because it avoids seeks between block writes; this is enough even to pay for the overhead of segment cleaning and maintaining duplicate log entries. Read performance as observed by Rosenblum and Ousterhout was comparable to that of a traditional filesystem. However, for very full disks (requiring frequent cleaning) or for read-heavy loads that could otherwise take advantage of locality, a traditional filesystem would give lower overhead.
For media (like flash drives) that don't have to worry about moving around physical heads, the advantages of doing bulk writes largely evaporate. However, there may still be an incentive to adopt a log-structured approach here since (a) it evenly distributes writes across the medium, which can be an issue for many solid-state devices, (b) it makes it easier to pack variable-sized blocks together, and (c) it allows for the possibility of on-the-fly compression (which otherwise would produce lots of internally fragmented partial blocks). So paradoxically log-structured filesystems like JFFS are currently more likely to be found on flash drives than on hard drives.
5. Why aren't all filesystems log-structured?
Short version: Journaling is good enough for most purposes, and it maintains backward compatibility: throw away the journal and you still have a familiar filesystem. E.g. the Linux ext2 -> ext3 transition.