Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

A filesystem presents an abstract interface to one or more underlying BlockDevices. It may also provide a namespace for other OS mechanisms.

It is possible to have an OS that doesn't provide a filesystem. We let user processes access disks directly, possibly with some very minimal access control like assigning different ranges on the disk to different processes. But essentially all usable operating systems provide a filesystem as a basic resource, and most use it enough internally that they couldn't run without it.

1. Interface

We'll start with the user's view and then move on to implementation issues later.

1.1. Files

1.1.1. Structure

Big bag of bytes
- No OS-enforced internal structure to files
- Modern default
- Popularized by Unix
Records
- File is a sequence of records
- Records have fixed length (ancient mainframe OSes) or variable length
  - Fixed length allows fast random access
- Survives in vestigial form even in modern OSes
  - gets, puts
  - Standard record separators: "\n" (Unix), "\r\n" (MS-DOS/Windows), "\r" (old MacOS).
Tree structure
- File consists of named components, each of which in turn may consist of named components
- Replaced by directory trees in modern OSes
- Possible return via XML?
Text vs binary files
- Some OSes distinguish between text and binary files
- Allows for special handling of text files, e.g. record separator translation.

1.1.2. Naming

Structured names
- MS-DOS style 8+3: MZFIRFOX.EXE, WDDOMPLN.TXT
  - Case-insensitive
  - Limited alphabet: alphanumerics plus '-' and '_'
  - Three-character extension indicates type
    - Can be use to choose which program can work on the file
  - Selling point: fit nicely in 16-byte directory table slots
- VMS: KREMVAX::BIGDISK:DOSSIERS:GORBACHEV.DOC;37
  - Starts with node name and disk name (allows references to remote files).
  - Extension (e.g. DOC) indicates type.
  - Number at the end is a version number, automatically maintained by the filesystem.
  - Directories can be tucked in the middle.
Less structured names
- Unix filenames: up to 255 arbitrary characters
  - Case-sensitive: file, FILE, and FilE are all different files
  - Extensions if any are just more characters, interpreted by programs at their option
  - foo.c, archive.tar.bz2.jpg
  - Name may give no hint about contents or purpose of a file!
- NTFS
  - Filenames in Unicode: φιλε
  - Some characters reserved
See Comparison_of_file_systems for many more exotic examples

1.1.3. Metadata

Data associated with a file but not part of a file
Core system stuff
- Permissions
- Access times
- Location
- Size
- File content type (text/binary)
Type information
- E.g. Windows extensions
- MacOS forks/NTFS Alternate Data Streams
  - Original purpose: tell the GUI how to depict the file
- Unix kludge: magic numbers
  - Fixed header bytes at the start of a file
  - Not provided by the file system
  - Not centrally administered
  - file program can make guesses about file types based on well-known magic numbers
User-defined metadata

1.1.4. Operations

creat, open, close, read, write, lseek, remove, truncate
Locking?

1.1.5. Access methods

Sequential: read/write the next block
Direct: read/write block number N
Distinction is usually weak in practice
- Can fake sequential access given direct access by tracking position ourselves
- Most sequential access implementations provide seek operation that can be used to simulate direct access
ISAM

1.2. Directories

Also called folders.

Basic model: maps names to files
- Usual hash-table operations: insert, delete, lookup, list
Multiple directories: flat vs tree vs graph structure
Can a file appear in more than one directory (or twice in the same directory)?
- If so, who gets the metadata?
  - Associate most metadata with the file.
  - Name information goes in directory.
- May require garbage collection
Directories and naming
- Root directories
- Pathnames interpreted by the kernel
- Relative vs absolute pathnames
Symbolic links
- Directory entry that refers to a pathname rather than a physical file
- Used in Unix as work-around for limitations on hard links
  - Can symlink to a directory
  - Can symlink across devices
- Problem: no guarantee that target exists
- Intepretation is not necessarily fixed
  - E.g., Windows Vista allows remote symlinks to refer to local files
  - On Linux: ln -s http://www.yale.edu/ yale-www
    - This works, but filesystem doesn't know how to follow the link.
    - Nobody actually does this.
- Many early OSes didn't have them and actively resisted including them

1.3. Mount points

Idea is to combine multiple filesystems in a single directory tree.
A filesystem is mounted on a mount point
- E.g. mount /dev/hda2 /overflow
- Mount point is an existing directory in an already-mounted filesystem
- Mount operation initializes new filesystem, sets up internal kernel data structures
- Kernel redirects operations on this directory to new filesystem
- Root file system must be handled specially
Unmounting
- Typically requires closing any open files on the filesystem
- Flushes buffers and clears internal kernel data structures

1.4. Special files

In addition to symlinks and mount points, various other special files may appear in a filesystem.

Device files e.g. /dev/hda, /dev/console: map pathnames to (major,minor) device number pairs.
Named pipes (mkfifo in Unix, pipefs filesystem in Windows)
Unix-domain sockets

Mostly these allow using the standard file interface to access devices or other mechanisms directly, e.g. echo "now you're in trouble" > /dev/hda,¹ wc /dev/mem.

1.5. Special file systems

Nothing in the file system interface says that the files have to actually exist. Assuming the kernel's internal structure is flexible enough, this allows for many kinds of virtual file systems that extend the basic files-on-disk model. Some examples:

1.5.1. /proc

Originally developed for the experimental OS Plan_9_from_Bell_Labs, but now present in most Unix-like OSes.

The /proc filesystem contains a subdirectory for each running process and virtual files representing system information. Process subdirectories allow reading command lines, detecting the executing program, looking at open files (all subject to access control). System files allow reading system information, e.g. here is /proc/cpuinfo for pine.cs.yale.edu:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :               Intel(R) Pentium(R) 4 CPU 3.80GHz
stepping        : 3
cpu MHz         : 3790.778
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl est tm2 cid cx16 xtpr
bogomips        : 7588.59
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :               Intel(R) Pentium(R) 4 CPU 3.80GHz
stepping        : 3
cpu MHz         : 3790.778
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl est tm2 cid cx16 xtpr
bogomips        : 7581.31
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

None of these files actually exist as permanent stored objects; instead, when a process opens and reads them the data is generated on the fly by the /proc filesystem driver.

1.5.2. Archival file systems

E.g. Plan 9's Fossil, allowing cat /snapshot/2004/02/01/123000/home/user/important-document.txt. Here the first few directory names are actually timestamps, with the interpretation of finding the most recent copy of /home/user/important-document.txt available as of 12:30:00 on 2004-02-01.

1.5.3. Distributed file systems

E.g. NFS, SMB/CFIS, AFS. Here files are stored on a remote server (possibly several remote servers). Local requests for file operations are redirected across the network. We'll talk more about these when we talk about DistributedSystems.

1.5.4. Unions

Some filesystems allow several directories (or several filesystems) to be mounted in the same place; this presents to the user the union of the contents of these directories, with typically a priority scheme to determine what to do in the case of a conflict.

1.5.5. Encrypted filesystems

(Exactly what they sound like.)

1.6. Access control

It's natural to limit users' access to particular files or directories in the filesystem, both for privacy on shared systems and to keep users from accidentally or maliciously damaging system components that happen to be stored in the filesystem. Any modern filesystem will provide at minimum some approximation to POSIX access control bits (the classic rwxrwxrwx inherited from Unix), where each bit controls whether the owner of the file, the members of a particular group of users associated with the file, or an arbitrary user can read/write/execute the file. Users are identified by user ids, which are set by the system at login time.

Additional bits are provided to support other features, like the setuid bit that allows a user to execute a file with the effective permissions of the owner or the sticky bit that allows only the owner of a file to remove it from an otherwise world-writable directory.

POSIX-style permissions are a compromise between generality and simplicity. One disadvantage is that it is difficult to grant fine-grained access to a file to specific users. If I want George and Martha to be able to edit one of my files, I can give them group access provided there is a group already defined in the system that contains just George and Martha (and possibly me). But defining groups typically requires special privileges not available to normal users.

Many systems thus extend the basic POSIX framework by adding access control lists or ACLs, specific lists of per-user or per-group permissions associated with individual files or directories that override the basic access control rules. Thus I can specifically grant George the ability to read or write my file while granting Martha only the ability to read it. Such features are required for certain security certifications, but are often left out in Unix-like systems. This has led to a lack of standardization of ACL interfaces; attempts to include access control lists in POSIX failed for precisely this reason. It also means that users don't expect to be able to use ACLs, which lets implementers off the hook for building them, an example of the chicken-and-egg problem that prevents new operating system features from being widely deployed unless they provide functionality much more useful or efficient than the default.

When the filesystem provides access control, this gives a natural way to provide access control to hardware devices that are themselves accessed through the filesystem. So for example we can turn off all the speakers in the Zoo machines by the simple expedient of turning off write access to /dev/audio and /dev/dsp; or, in a context where speaker use is less annoying, we might restrict access only to members of group audio.

2. Implementation

2.1. Disk layout

Master Boot Record (MBR) sits at standard location, lists partitions.
Each partition acts like a miniature disk with its own filesystem.
- Boot block for bootable partitions
- Superblock with key filesystem info
- Remaining blocks allocated to directories, inodes, free lists, files.

2.1.1. Block size

Want to amortize cost of reading over many bytes
Fixed-size blocks make life easier
Trade-off on block size
- Big blocks: lots of empty space
- Small blocks: lots of pointers
Frequently-used 4K block size fits nicely with typical file size.

2.1.2. Tracking blocks in a file

Contiguous allocation

Each file is stored as a sequence of consecutive blocks
No need for indexing: just track first and last blocks
Very good for fast sequential reads
Very bad for fragmentation

Linked lists

Each block stores a pointer to the next block.
Doesn't work well for random access.
Blocks end up with weird sizes, e.g. 2¹²-4 = 4092 bytes, forcing filesystem to do division instead of bit operations.

File-allocation table

Separate linked lists from files
Still not very good for random access
- But maybe OK if we store the whole FAT in memory
FAT can be very big

Index nodes:

Give each file an inode, an array of pointers to blocks
Supports random access
Good for memory: don't need to load inodes for closed files
Not so good for short files (inodes will be mostly empty)
Maybe not so good for big files: run out of space in the inode!

Multilayer Unix approach:

Inode stores file attributes (but not the filename, which goes in a directory)
Inode also stores pointers to first few blocks of file.
3rd to last pointer points to single indirect block, which points to blocks
2nd to last pointer points to double indirect block, which points to blocks that point to blocks
Last pointer points to triple indirect block, which points to blocks that point to blocks that point to blocks. (After this we are out of address bits.)
Still not too good for very short files
- Inodes are in fixed location: can run out of inodes before running out of disk space

Fancier mechanisms:

E.g. B+-tree in ReiserFS

Log-structured filesystems

Will get their own page, see LogStructuredFilesystem

2.1.3. Tracking free blocks

The basic choice here is a linked list vs a bit vector. Bit vectors have the advantage of allowing smarter allocation (since the block allocator can see all the blocks that are free), but the disadvantage of requiring more space when there aren't many free blocks. Linked lists can be stored in any already-existing file-allocation table if we have one, or can be stored in the free blocks themselves.

A tempting solution that doesn't work so well is to allocate all free blocks to one giant dummy file; the problem is that we are likely to have to go through several layers of indirection (and thus multiple slow disk accesses) to find a free block.

2.1.4. Directories

Table mapping names to inodes/FAT entries/other directories.
May also contain file attributes.
Implementation issues
- Filename length
  - Fixed length: makes for simple structure, but wastes space
  - Variable length: often implemented as ISAM structure with pointers into packed table of names
  - Hybrid approach (e.g. as used in Windows FAT): store short names in main table, but allow pointers to extensions stored elsewhere if appropriate bit is set.
    - Largely a retrofit in FAT, but has led to entertaining patent battles.
- Data structure
  - Classic approach: unsorted array
    - Low overhead
    - Slow for big directories
  - Better approaches: hashing, B-trees

2.1.5. Physical layout

To be fast, must respect disk geometry
Berkeley Fast File System
- Cylinder groups
  - Contain backup superblock, inodes, file blocks
  - Inodes and related file blocks are contained to a cylinder group if possible
Windows
- Move frequently-accessed files to middle of the disk

2.2. Pathname translation

Root directory (typically a mount point stored in mount table)
Each pathname component traverses one more directory
Mount points are intercepted before examining corresponding directory
Virtual file systems may take over pathname translation and do their own nastiness

2.3. Caching

Maintain pool of cached disk blocks
- Typically done below filesystem level
Consistency issues
- Flush important blocks first (e.g. superblock, inodes)
- Write-through: flush all dirty blocks immediately
  - MS-DOS approach
  - Maintains consistency with removable media
- Sync: flush all dirty blocks from time to time
  - Unix approach
  - sync system call called by user, or periodically by update daemon
  - Guarantees all data is updated after at most k seconds for some small k.
  - Doesn't guarantee consistency if user can remove the disk or if the power fails.
- Log-structure filesystems (we'll come back to these)

3. Consistency checking

Filesystems are subject to bit rot.
Consistency checker (fsck, chkdisk) fixes inconsistencies)
- Examples of inconsistencies:
  - Same block in multiple files
  - Blocks not in free list
  - Directory pointers to nowhere
- Often run at boot time (interacts badly with mounted disks)
- Similar to garbage-collection
- Note: only guarantees consistency, doesn't guarantee you get your data back

3.1. How inconsistencies arise

Rare: media failure
More common: disk operations are not atomic
- E.g. consider deleting a file:
  - Remove directory entry
  - Add blocks to free list
  - Update statistics in superblock
- Failure during this process can leave inconsistent state

3.2. Recovering from inconsistencies

Run fsck
- Essentially a garbage collector
- Detects inaccessible blocks that aren't in free list
- Detects accessible blocks that are in free list
- Puts orphaned files in some standard location (e.g. /lost+found).
- Checks other file attributes
  - E.g. bizarre modification times

3.3. Preventing inconsistencies

Be careful about order of disk operations
- E.g. in deleting a file:
  - Update directory entry first
  - If we get a failure, worst that happens is a storage leak
- E.g. expanding a file:
  - Update free list
  - Then write new blocks
  - Then update inode/indirect blocks
- May conflict with disk scheduling optimization
- Soft updates
  - Allow disk scheduler freedom to order writes for operations on different files, but enforce consistency for individual files
- Not perfect if disk hardware reorders writes
Enforce atomicity:
- Journaling
- LogStructuredFilesystem

4. More info

There is a detailed on-line book on filesystem implementation at http://www.letterp.com/~dbg/practical-file-system-design.pdf. (Thanks to Philip Taff for pointing this out.)

CategoryOperatingSystemsNotes

Don't do this. (1)