A network file system, distribute file system, or remote file system allows files stored on a remote server machine to be accessed as part of the filesystem of a local client machine. In order to do this, the local filesystem must translate what would otherwise appear to be local file operations into remote procedure calls working on the remote filesystem. There are two network file systems in widespread use today: NFS, originally developed by Sun, which is the dominant system in Unix and Linux systems, and CIFS or SMB for Windows systems. We'll mostly talk about the development and features of NFS, and then talk about some less dominant systems that provide other interesting features like detached operation.
For more information on distributed file systems in general see SGG Chapter 17; this also includes a very nice discussion of AFS, a particularly sophisticated distributed file system that never really caught on because it wasn't free. The original design considerations for NFS in particular are documented in nsf-design.pdf.
1. Naming issues
Given an existing hierarchical file system, the simplest approach to naming remote files is to mount the remote filesystem as a subdirectory of the local filesystem. So, for example, in the Zoo machines /home is a mount point for artemis.cs.yale.edu:/home, an NFS file system exported by artemis.cs.yale.edu, the Zoo fileserver. A goal in adopting this approach (as discussed, for example, in the NFS design paper) is to obtain location transparency: a local process shouldn't have to know what machine actually stores the files it is using, so it can freely refer to /home/some-user/.bash_profile or some such without making any adjustments for remote access.
The price for this location transparency is that remote file systems must be explicitly mounted (e.g. at boot time based on being listed in /etc/fstab), so that the local machine can translate addresses as needed. This limits use to servers that are known in advance, and is very different from the distributed-file-system-like functionality of HTTP, where a user can talk to any server that they like. An alternative approach is to encode the remote server address directly in the pathname, e.g. /@remote.server.com/usr/local/bin/hello-world or /http/pine.cs.yale.edu/pinewiki/422/Schedule, and rely on symbolic links or similar mechanisms to hide the location dependence. It is not clear why the NFS approach of requiring explicit mounting came to dominate in the Unix world; perhaps it was because of a combination of the security dangers of eroding the separation between local resources (or at least resources on a server under the control of the local system administrators) and remote resources, and the problems that arise when programs that expect to be able to read files without errors encounter the misbehavior of resources accessed across a WAN.
2. Caching and consistency
Caching is less of an issue for distributed file systems running across a LAN than one might think: the bottleneck in the filesystem is likely to be the disk rather than the intervening network, so assuming the network stays up there is not much of an incentive to cache files on a local disk to improve performance. However, consistency of in-memory caches may be an issue: since the client machine can't necessarily see what changes are being made to a remote file, it can't easily guarantee that any data it has cached in memory will be up to date.
As with other problems in operating systems, there are a range of solutions to this problem. The simplest approach is to either avoid caching data locally (beyond minimal buffering inside stdio or similar libraries) or accept that local data may be out of date. A more sophisticated approach requires running some sort of cache-consistency protocol, either by having the client query the server for updates on each access or by having the server send callbacks to clients when data changes. The client-initiated model puts a lot of load on the server if reads are common but writes are few, and it's not clear that it is much faster than simply passing read operations through to the server unless network bandwidth is severely limited. The server-initiated approach will be less expensive when writes are rare, but requires that the server maintain state about what files each client has cached.
Changes in the semantics of the file system can affect cache costs. For example, the write-on-close policy of the Andrew File System (AFS) ensures that all updates to a file are consolidated into a single giant write operation—this means that the server only needs to notify interested clients once when the modified file is closed instead of after every write to a part of the file. The cost here is that it is no longer possible to have interleaved write operations on the same file by different clients.
The ultimate solution to consistency problems is typically locking. The history of NFS is illustrative here. Early versions of NFS did not provide any locking at all, punting the issue to separate lock servers (that were ultimately not widely deployed). Practical solutions involved the use of lockfiles based on the fact that creating a file under NFS always involved talking to the server and that Unix file semantics provided an O_EXCL operation to the creat call that would refuse to create a file that already existed. The problems with such ad-hoc solutions (mostly the need to rewrite any program that used locks to use lockfiles instead) eventually forced NFS to incorporate explicit support for Posix-style advisory fcntl locks.
3. Stateful vs stateless servers
The original distributed version of NFS (NFS version 2) used a stateless protocol in which the server didn't keep track of any information about clients or what files they were working on. This has a number of advantages:
- Because the server knows nothing about clients, adding more clients consumes no resources on the server (although satisfying their increased requests may).
There is no possibility of inconsistency between client and server state, because there is no server state. This means that problems like TwoGenerals don't come up with a stateless server, and there is no need for a special recovery mechanism after a client or server crashes.
The problem with a stateless server is that it requires careful design of the protocol so that the clients can send all necessary information along with a request. So for example a Unix-style file descriptor that tracks a position in the file must be implemented at the client (since the server has no state with which to track this position), and a client write(file_descriptor, data, count) operation on the client translates to a write(file_handle, offset, count, data) on the wire to the server. The inclusion of an explicit offset, and the translation of the local file descriptor to a file handle that contains enough information to uniquely identify the target file without a server-side mapping table, means that the server can satisfy this request without remembering it.
A second feature we want with a stateless server is idempotence: performing the same operation twice should have the same effect as performing it once. This allows a client to deal with lost messages, lost acknowledgments, or a crashed server in the same way: retransmit the original request and hope that it works this time. It is not hard to see that including offsets and explicit file handles gives us this property.
4. Data representation
An issue that arises for any network service but that is particularly tricky for filesystems is the issue of machine-independent data representation. Many of the values that will be sent across the network (e.g. file offsets or buffer size counts) are binary values that might (a) be stored in different widths by default on different machines, and (b) might be stored with different byte order or Endianness. So an x86-based client talking to a PowerPC-based server will need to agree on the number of bytes (called octets in IETF RFC documents, to emphasize 8-bit bytes as opposed to the now-bizarre-seeming non-8-bit bytes that haunted the early history of computing) in each field of a data structure as well as the order in which they arrive.
The convention used in most older network services is to use a standard network byte order which is defined to be big-endian or most significant byte first. This means that are hypothetical x86 client will need to byte swap all of its integer values before sending them out. Such byte-swapping is usually done in a library routine, so that client or server writers don't need to know about the endianness of the target machine when writing code. If the target machine is already big-endian, the byte-swapping routine can be a no-op. (See for example man ntohs on any Unix or Linux machine.)
More recent services have tended to move to text-based representations. Byte order doesn't come up in HTTP because a content-length field is the ASCII string Content-length: 23770 instead of a 4-byte integer value packed at position 17 in the some hypothetical binary-format HTTP response header. This approach has the advantage of making programmer error harder, at the cost of consuming more network bandwidth, since a decimal digit packs only about 3⅓ bits of information in an 8-bit byte. For things like HTTP headers that are attached to large documents, the additional cost in bandwidth is trivial. An extreme example of this is self-documenting XML-based encodings like XML-RPC and its successor SOAP.
5. Case study: NFS version 2