[FrontPage] [TitleIndex] [WordIndex

Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

A filesystem presents an abstract interface to one or more underlying BlockDevices. It may also provide a namespace for other OS mechanisms.

It is possible to have an OS that doesn't provide a filesystem. We let user processes access disks directly, possibly with some very minimal access control like assigning different ranges on the disk to different processes. But essentially all usable operating systems provide a filesystem as a basic resource, and most use it enough internally that they couldn't run without it.

1. Interface

We'll start with the user's view and then move on to implementation issues later.

1.1. Files

1.1.1. Structure

1.1.2. Naming

1.1.3. Metadata

1.1.4. Operations

1.1.5. Access methods

1.2. Directories

Also called folders.

1.3. Mount points

1.4. Special files

In addition to symlinks and mount points, various other special files may appear in a filesystem.

Mostly these allow using the standard file interface to access devices or other mechanisms directly, e.g. echo "now you're in trouble" > /dev/hda,1 wc /dev/mem.

1.5. Special file systems

Nothing in the file system interface says that the files have to actually exist. Assuming the kernel's internal structure is flexible enough, this allows for many kinds of virtual file systems that extend the basic files-on-disk model. Some examples:

1.5.1. /proc

Originally developed for the experimental OS Plan_9_from_Bell_Labs, but now present in most Unix-like OSes.

The /proc filesystem contains a subdirectory for each running process and virtual files representing system information. Process subdirectories allow reading command lines, detecting the executing program, looking at open files (all subject to access control). System files allow reading system information, e.g. here is /proc/cpuinfo for pine.cs.yale.edu:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :               Intel(R) Pentium(R) 4 CPU 3.80GHz
stepping        : 3
cpu MHz         : 3790.778
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl est tm2 cid cx16 xtpr
bogomips        : 7588.59
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :               Intel(R) Pentium(R) 4 CPU 3.80GHz
stepping        : 3
cpu MHz         : 3790.778
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl est tm2 cid cx16 xtpr
bogomips        : 7581.31
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

None of these files actually exist as permanent stored objects; instead, when a process opens and reads them the data is generated on the fly by the /proc filesystem driver.

1.5.2. Archival file systems

E.g. Plan 9's Fossil, allowing cat /snapshot/2004/02/01/123000/home/user/important-document.txt. Here the first few directory names are actually timestamps, with the interpretation of finding the most recent copy of /home/user/important-document.txt available as of 12:30:00 on 2004-02-01.

1.5.3. Distributed file systems

E.g. NFS, SMB/CFIS, AFS. Here files are stored on a remote server (possibly several remote servers). Local requests for file operations are redirected across the network. We'll talk more about these when we talk about DistributedSystems.

1.5.4. Unions

Some filesystems allow several directories (or several filesystems) to be mounted in the same place; this presents to the user the union of the contents of these directories, with typically a priority scheme to determine what to do in the case of a conflict.

1.5.5. Encrypted filesystems

(Exactly what they sound like.)

1.6. Access control

It's natural to limit users' access to particular files or directories in the filesystem, both for privacy on shared systems and to keep users from accidentally or maliciously damaging system components that happen to be stored in the filesystem. Any modern filesystem will provide at minimum some approximation to POSIX access control bits (the classic rwxrwxrwx inherited from Unix), where each bit controls whether the owner of the file, the members of a particular group of users associated with the file, or an arbitrary user can read/write/execute the file. Users are identified by user ids, which are set by the system at login time.

Additional bits are provided to support other features, like the setuid bit that allows a user to execute a file with the effective permissions of the owner or the sticky bit that allows only the owner of a file to remove it from an otherwise world-writable directory.

POSIX-style permissions are a compromise between generality and simplicity. One disadvantage is that it is difficult to grant fine-grained access to a file to specific users. If I want George and Martha to be able to edit one of my files, I can give them group access provided there is a group already defined in the system that contains just George and Martha (and possibly me). But defining groups typically requires special privileges not available to normal users.

Many systems thus extend the basic POSIX framework by adding access control lists or ACLs, specific lists of per-user or per-group permissions associated with individual files or directories that override the basic access control rules. Thus I can specifically grant George the ability to read or write my file while granting Martha only the ability to read it. Such features are required for certain security certifications, but are often left out in Unix-like systems. This has led to a lack of standardization of ACL interfaces; attempts to include access control lists in POSIX failed for precisely this reason. It also means that users don't expect to be able to use ACLs, which lets implementers off the hook for building them, an example of the chicken-and-egg problem that prevents new operating system features from being widely deployed unless they provide functionality much more useful or efficient than the default.

When the filesystem provides access control, this gives a natural way to provide access control to hardware devices that are themselves accessed through the filesystem. So for example we can turn off all the speakers in the Zoo machines by the simple expedient of turning off write access to /dev/audio and /dev/dsp; or, in a context where speaker use is less annoying, we might restrict access only to members of group audio.

2. Implementation

2.1. Disk layout

2.1.1. Block size

2.1.2. Tracking blocks in a file

Contiguous allocation
  • Each file is stored as a sequence of consecutive blocks
  • No need for indexing: just track first and last blocks
  • Very good for fast sequential reads
  • Very bad for fragmentation
Linked lists
  • Each block stores a pointer to the next block.
  • Doesn't work well for random access.
  • Blocks end up with weird sizes, e.g. 212-4 = 4092 bytes, forcing filesystem to do division instead of bit operations.

File-allocation table
  • Separate linked lists from files
  • Still not very good for random access
    • But maybe OK if we store the whole FAT in memory
  • FAT can be very big
Index nodes:
  • Give each file an inode, an array of pointers to blocks

  • Supports random access
  • Good for memory: don't need to load inodes for closed files
  • Not so good for short files (inodes will be mostly empty)
  • Maybe not so good for big files: run out of space in the inode!
Multilayer Unix approach:
  • Inode stores file attributes (but not the filename, which goes in a directory)
  • Inode also stores pointers to first few blocks of file.
  • 3rd to last pointer points to single indirect block, which points to blocks

  • 2nd to last pointer points to double indirect block, which points to blocks that point to blocks

  • Last pointer points to triple indirect block, which points to blocks that point to blocks that point to blocks. (After this we are out of address bits.)

  • Still not too good for very short files
    • Inodes are in fixed location: can run out of inodes before running out of disk space
Fancier mechanisms:
Log-structured filesystems

2.1.3. Tracking free blocks

The basic choice here is a linked list vs a bit vector. Bit vectors have the advantage of allowing smarter allocation (since the block allocator can see all the blocks that are free), but the disadvantage of requiring more space when there aren't many free blocks. Linked lists can be stored in any already-existing file-allocation table if we have one, or can be stored in the free blocks themselves.

A tempting solution that doesn't work so well is to allocate all free blocks to one giant dummy file; the problem is that we are likely to have to go through several layers of indirection (and thus multiple slow disk accesses) to find a free block.

2.1.4. Directories

2.1.5. Physical layout

2.2. Pathname translation

2.3. Caching

3. Consistency checking

3.1. How inconsistencies arise

3.2. Recovering from inconsistencies

3.3. Preventing inconsistencies

4. More info

There is a detailed on-line book on filesystem implementation at http://www.letterp.com/~dbg/practical-file-system-design.pdf. (Thanks to Philip Taff for pointing this out.)


CategoryOperatingSystemsNotes

  1. Don't do this. (1)


2014-06-17 11:58