A filesystem presents an abstract interface to one or more underlying BlockDevices. It may also provide a namespace for other OS mechanisms.
It is possible to have an OS that doesn't provide a filesystem. We let user processes access disks directly, possibly with some very minimal access control like assigning different ranges on the disk to different processes. But essentially all usable operating systems provide a filesystem as a basic resource, and most use it enough internally that they couldn't run without it.
1. Interface
We'll start with the user's view and then move on to implementation issues later.
1.1. Files
1.1.1. Structure
- Big bag of bytes
- No OS-enforced internal structure to files
- Modern default
- Popularized by Unix
- Records
- File is a sequence of records
- Records have fixed length (ancient mainframe OSes) or variable length
- Fixed length allows fast random access
- Survives in vestigial form even in modern OSes
gets, puts
- Standard record separators: "\n" (Unix), "\r\n" (MS-DOS/Windows), "\r" (old MacOS).
- Tree structure
- File consists of named components, each of which in turn may consist of named components
- Replaced by directory trees in modern OSes
- Possible return via XML?
- Text vs binary files
- Some OSes distinguish between text and binary files
- Allows for special handling of text files, e.g. record separator translation.
1.1.2. Naming
- Structured names
MS-DOS style 8+3: MZFIRFOX.EXE, WDDOMPLN.TXT
- Case-insensitive
- Limited alphabet: alphanumerics plus '-' and '_'
- Three-character extension indicates type
- Can be use to choose which program can work on the file
- Selling point: fit nicely in 16-byte directory table slots
VMS: KREMVAX::BIGDISK:DOSSIERS:GORBACHEV.DOC;37
- Starts with node name and disk name (allows references to remote files).
Extension (e.g. DOC) indicates type.
- Number at the end is a version number, automatically maintained by the filesystem.
- Directories can be tucked in the middle.
- Less structured names
- Unix filenames: up to 255 arbitrary characters
Case-sensitive: file, FILE, and FilE are all different files
- Extensions if any are just more characters, interpreted by programs at their option
foo.c, archive.tar.bz2.jpg
- Name may give no hint about contents or purpose of a file!
- NTFS
Filenames in Unicode: φιλε
- Some characters reserved
- Unix filenames: up to 255 arbitrary characters
See Comparison_of_file_systems for many more exotic examples
1.1.3. Metadata
- Data associated with a file but not part of a file
- Core system stuff
- Permissions
- Access times
- Location
- Size
- File content type (text/binary)
- Type information
- E.g. Windows extensions
- MacOS forks/NTFS Alternate Data Streams
- Original purpose: tell the GUI how to depict the file
- Unix kludge: magic numbers
- Fixed header bytes at the start of a file
- Not provided by the file system
- Not centrally administered
file program can make guesses about file types based on well-known magic numbers
- User-defined metadata
1.1.4. Operations
- creat, open, close, read, write, lseek, remove, truncate
- Locking?
1.1.5. Access methods
- Sequential: read/write the next block
- Direct: read/write block number N
- Distinction is usually weak in practice
- Can fake sequential access given direct access by tracking position ourselves
- Most sequential access implementations provide seek operation that can be used to simulate direct access
1.2. Directories
Also called folders.
- Basic model: maps names to files
- Usual hash-table operations: insert, delete, lookup, list
- Multiple directories: flat vs tree vs graph structure
- Can a file appear in more than one directory (or twice in the same directory)?
- If so, who gets the metadata?
- Associate most metadata with the file.
- Name information goes in directory.
- May require garbage collection
- If so, who gets the metadata?
- Directories and naming
- Root directories
- Pathnames interpreted by the kernel
- Relative vs absolute pathnames
- Symbolic links
- Directory entry that refers to a pathname rather than a physical file
Used in Unix as work-around for limitations on hard links
- Can symlink to a directory
- Can symlink across devices
- Problem: no guarantee that target exists
- Intepretation is not necessarily fixed
- E.g., Windows Vista allows remote symlinks to refer to local files
On Linux: ln -s http://www.yale.edu/ yale-www
- This works, but filesystem doesn't know how to follow the link.
- Nobody actually does this.
- Many early OSes didn't have them and actively resisted including them
1.3. Mount points
- Idea is to combine multiple filesystems in a single directory tree.
A filesystem is mounted on a mount point
E.g. mount /dev/hda2 /overflow
- Mount point is an existing directory in an already-mounted filesystem
- Mount operation initializes new filesystem, sets up internal kernel data structures
- Kernel redirects operations on this directory to new filesystem
- Root file system must be handled specially
- Unmounting
- Typically requires closing any open files on the filesystem
- Flushes buffers and clears internal kernel data structures
1.4. Special files
In addition to symlinks and mount points, various other special files may appear in a filesystem.
Device files e.g. /dev/hda, /dev/console: map pathnames to (major,minor) device number pairs.
Named pipes (mkfifo in Unix, pipefs filesystem in Windows)
- Unix-domain sockets
Mostly these allow using the standard file interface to access devices or other mechanisms directly, e.g. echo "now you're in trouble" > /dev/hda,1 wc /dev/mem.
1.5. Special file systems
Nothing in the file system interface says that the files have to actually exist. Assuming the kernel's internal structure is flexible enough, this allows for many kinds of virtual file systems that extend the basic files-on-disk model. Some examples:
1.5.1. /proc
Originally developed for the experimental OS Plan_9_from_Bell_Labs, but now present in most Unix-like OSes.
The /proc filesystem contains a subdirectory for each running process and virtual files representing system information. Process subdirectories allow reading command lines, detecting the executing program, looking at open files (all subject to access control). System files allow reading system information, e.g. here is /proc/cpuinfo for pine.cs.yale.edu:
processor : 0 vendor_id : GenuineIntel cpu family : 15 model : 4 model name : Intel(R) Pentium(R) 4 CPU 3.80GHz stepping : 3 cpu MHz : 3790.778 cache size : 2048 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl est tm2 cid cx16 xtpr bogomips : 7588.59 clflush size : 64 cache_alignment : 128 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 15 model : 4 model name : Intel(R) Pentium(R) 4 CPU 3.80GHz stepping : 3 cpu MHz : 3790.778 cache size : 2048 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 1 fpu : yes fpu_exception : yes cpuid level : 5 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl est tm2 cid cx16 xtpr bogomips : 7581.31 clflush size : 64 cache_alignment : 128 address sizes : 36 bits physical, 48 bits virtual power management:
None of these files actually exist as permanent stored objects; instead, when a process opens and reads them the data is generated on the fly by the /proc filesystem driver.
1.5.2. Archival file systems
E.g. Plan 9's Fossil, allowing cat /snapshot/2004/02/01/123000/home/user/important-document.txt. Here the first few directory names are actually timestamps, with the interpretation of finding the most recent copy of /home/user/important-document.txt available as of 12:30:00 on 2004-02-01.
1.5.3. Distributed file systems
E.g. NFS, SMB/CFIS, AFS. Here files are stored on a remote server (possibly several remote servers). Local requests for file operations are redirected across the network. We'll talk more about these when we talk about DistributedSystems.
1.5.4. Unions
Some filesystems allow several directories (or several filesystems) to be mounted in the same place; this presents to the user the union of the contents of these directories, with typically a priority scheme to determine what to do in the case of a conflict.
1.5.5. Encrypted filesystems
(Exactly what they sound like.)
1.6. Access control
It's natural to limit users' access to particular files or directories in the filesystem, both for privacy on shared systems and to keep users from accidentally or maliciously damaging system components that happen to be stored in the filesystem. Any modern filesystem will provide at minimum some approximation to POSIX access control bits (the classic rwxrwxrwx inherited from Unix), where each bit controls whether the owner of the file, the members of a particular group of users associated with the file, or an arbitrary user can read/write/execute the file. Users are identified by user ids, which are set by the system at login time.
Additional bits are provided to support other features, like the setuid bit that allows a user to execute a file with the effective permissions of the owner or the sticky bit that allows only the owner of a file to remove it from an otherwise world-writable directory.
POSIX-style permissions are a compromise between generality and simplicity. One disadvantage is that it is difficult to grant fine-grained access to a file to specific users. If I want George and Martha to be able to edit one of my files, I can give them group access provided there is a group already defined in the system that contains just George and Martha (and possibly me). But defining groups typically requires special privileges not available to normal users.
Many systems thus extend the basic POSIX framework by adding access control lists or ACLs, specific lists of per-user or per-group permissions associated with individual files or directories that override the basic access control rules. Thus I can specifically grant George the ability to read or write my file while granting Martha only the ability to read it. Such features are required for certain security certifications, but are often left out in Unix-like systems. This has led to a lack of standardization of ACL interfaces; attempts to include access control lists in POSIX failed for precisely this reason. It also means that users don't expect to be able to use ACLs, which lets implementers off the hook for building them, an example of the chicken-and-egg problem that prevents new operating system features from being widely deployed unless they provide functionality much more useful or efficient than the default.
When the filesystem provides access control, this gives a natural way to provide access control to hardware devices that are themselves accessed through the filesystem. So for example we can turn off all the speakers in the Zoo machines by the simple expedient of turning off write access to /dev/audio and /dev/dsp; or, in a context where speaker use is less annoying, we might restrict access only to members of group audio.
2. Implementation
2.1. Disk layout
Master Boot Record (MBR) sits at standard location, lists partitions.
- Each partition acts like a miniature disk with its own filesystem.
Boot block for bootable partitions
Superblock with key filesystem info
- Remaining blocks allocated to directories, inodes, free lists, files.
2.1.1. Block size
- Want to amortize cost of reading over many bytes
- Fixed-size blocks make life easier
- Trade-off on block size
- Big blocks: lots of empty space
- Small blocks: lots of pointers
- Frequently-used 4K block size fits nicely with typical file size.
2.1.2. Tracking blocks in a file
- Contiguous allocation
- Each file is stored as a sequence of consecutive blocks
- No need for indexing: just track first and last blocks
- Very good for fast sequential reads
- Very bad for fragmentation
- Linked lists
- Each block stores a pointer to the next block.
- Doesn't work well for random access.
Blocks end up with weird sizes, e.g. 212-4 = 4092 bytes, forcing filesystem to do division instead of bit operations.
- File-allocation table
- Separate linked lists from files
- Still not very good for random access
- But maybe OK if we store the whole FAT in memory
- FAT can be very big
Give each file an inode, an array of pointers to blocks
- Supports random access
- Good for memory: don't need to load inodes for closed files
- Not so good for short files (inodes will be mostly empty)
- Maybe not so good for big files: run out of space in the inode!
- Inode stores file attributes (but not the filename, which goes in a directory)
- Inode also stores pointers to first few blocks of file.
3rd to last pointer points to single indirect block, which points to blocks
2nd to last pointer points to double indirect block, which points to blocks that point to blocks
Last pointer points to triple indirect block, which points to blocks that point to blocks that point to blocks. (After this we are out of address bits.)
- Still not too good for very short files
- Inodes are in fixed location: can run out of inodes before running out of disk space
E.g. B+-tree in ReiserFS
- Log-structured filesystems
Will get their own page, see LogStructuredFilesystem
2.1.3. Tracking free blocks
The basic choice here is a linked list vs a bit vector. Bit vectors have the advantage of allowing smarter allocation (since the block allocator can see all the blocks that are free), but the disadvantage of requiring more space when there aren't many free blocks. Linked lists can be stored in any already-existing file-allocation table if we have one, or can be stored in the free blocks themselves.
A tempting solution that doesn't work so well is to allocate all free blocks to one giant dummy file; the problem is that we are likely to have to go through several layers of indirection (and thus multiple slow disk accesses) to find a free block.
2.1.4. Directories
- Table mapping names to inodes/FAT entries/other directories.
- May also contain file attributes.
- Implementation issues
- Filename length
- Fixed length: makes for simple structure, but wastes space
- Variable length: often implemented as ISAM structure with pointers into packed table of names
- Hybrid approach (e.g. as used in Windows FAT): store short names in main table, but allow pointers to extensions stored elsewhere if appropriate bit is set.
- Largely a retrofit in FAT, but has led to entertaining patent battles.
- Data structure
- Classic approach: unsorted array
- Low overhead
- Slow for big directories
- Better approaches: hashing, B-trees
- Classic approach: unsorted array
- Filename length
2.1.5. Physical layout
- To be fast, must respect disk geometry
- Berkeley Fast File System
- Cylinder groups
- Contain backup superblock, inodes, file blocks
- Inodes and related file blocks are contained to a cylinder group if possible
- Cylinder groups
- Windows
- Move frequently-accessed files to middle of the disk
2.2. Pathname translation
- Root directory (typically a mount point stored in mount table)
- Each pathname component traverses one more directory
- Mount points are intercepted before examining corresponding directory
- Virtual file systems may take over pathname translation and do their own nastiness
2.3. Caching
- Maintain pool of cached disk blocks
- Typically done below filesystem level
- Consistency issues
- Flush important blocks first (e.g. superblock, inodes)
- Write-through: flush all dirty blocks immediately
- MS-DOS approach
- Maintains consistency with removable media
- Sync: flush all dirty blocks from time to time
- Unix approach
sync system call called by user, or periodically by update daemon
- Guarantees all data is updated after at most k seconds for some small k.
- Doesn't guarantee consistency if user can remove the disk or if the power fails.
- Log-structure filesystems (we'll come back to these)
3. Consistency checking
- Filesystems are subject to bit rot.
Consistency checker (fsck, chkdisk) fixes inconsistencies)
- Examples of inconsistencies:
- Same block in multiple files
- Blocks not in free list
- Directory pointers to nowhere
- Often run at boot time (interacts badly with mounted disks)
- Similar to garbage-collection
- Note: only guarantees consistency, doesn't guarantee you get your data back
- Examples of inconsistencies:
3.1. How inconsistencies arise
- Rare: media failure
- More common: disk operations are not atomic
- E.g. consider deleting a file:
- Remove directory entry
- Add blocks to free list
- Update statistics in superblock
- Failure during this process can leave inconsistent state
- E.g. consider deleting a file:
3.2. Recovering from inconsistencies
Run fsck
- Essentially a garbage collector
- Detects inaccessible blocks that aren't in free list
- Detects accessible blocks that are in free list
Puts orphaned files in some standard location (e.g. /lost+found).
- Checks other file attributes
- E.g. bizarre modification times
3.3. Preventing inconsistencies
- Be careful about order of disk operations
- E.g. in deleting a file:
- Update directory entry first
- If we get a failure, worst that happens is a storage leak
- E.g. expanding a file:
- Update free list
- Then write new blocks
- Then update inode/indirect blocks
- May conflict with disk scheduling optimization
Soft updates
- Allow disk scheduler freedom to order writes for operations on different files, but enforce consistency for individual files
- Not perfect if disk hardware reorders writes
- E.g. in deleting a file:
- Enforce atomicity:
- Journaling
4. More info
There is a detailed on-line book on filesystem implementation at http://www.letterp.com/~dbg/practical-file-system-design.pdf. (Thanks to Philip Taff for pointing this out.)
Don't do this. (1)