
Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.

There is currently no generally accepted definition that distinguishes a peer-to-peer system from other distributed systems, but there are some consistent themes that recur in the peer-to-peer literature. The overall idea is that, unlike traditional client-server computing where most machines are clients that talk to one or a small number of centralized servers, in a peer-to-peer system most communication is between the clients (or peers), and servers, if they are present at all, have only a limited role. However, over time peer-to-peer has come to mean something more specific, involving very large distributed systems for indexing and exchanging information.

The reason for this is that peer-to-peer systems as an active area of research historically arose from the combination of the popular success and technical naïveté of early file-sharing systems like Napster and Gnutella: distributed computing researchers the world over looked at the enormous numbers of copyright-infringing teenagers adopting these systems and imagined that building technically more sophisticated systems that accomplished similar (but legal) tasks would be even more popular. So a first approximation to defining a peer-to-peer system is an overlay network (a network-like structure with a routing mechanism whose edges are paths through some underlying network; this is the Gnutella influence) that supports some sort of distributed lookup or distributed search operation for resources (valuable published information that is placed in the system with the permission of the copyright owner, if any) that are resident on nodes in the overlay network. The system should also support storing new resources in the network, either by letting each node simply announce that it has a new resource or, more efficiently, by having a copy of or a pointer to the new resource stored in some special location chosen by the system.

From a traditional theoretical distributed computing perspective, this looks a lot like simulating a shared memory in a message-passing system. What distinguishes peer-to-peer systems currently in use from such traditional systems are the following characteristics:

Symmetry
Ideally, all nodes have a similar role in the system: there are no centralized servers to act as bottlenecks or (in the systems actually deployed) obvious targets for legal attack.
Scale
Peer-to-peer systems are expected to scale to millions or even billions of nodes. This excludes for most applications any protocol that depends on talking to a majority of the nodes. It also places a high value on load balancing.
Openness
Any machine on the network can join the system as a new node. This makes it easy to grow the system, but has horrific security implications.
Churn
New nodes come and go, and the lifespan of typical nodes in the system is short. Think of vast armies of the aforementioned teenagers coming home from high school, booting up their computers, and then shutting them down half an hour later to go do something else. Now imagine this happening across 24 time zones. A good peer-to-peer system should exploit even transient members. However, this means that traditional notions of faulty vs non-faulty processes become a problem: we have to assume that all processes are faulty, and yet somehow the system needs to continue to do whatever it does even if no individual process survives throughout the system's lifetime.
Weak guarantees
Because the system can't control its members, it makes only the feeblest of guarantees about whether it will actually find the resource you are looking for or whether the copy it finds is complete, correct, or even approximately current.

1. Early ad-hoc systems

These were deployed in the late 1990s to allow users to share MP3 files ripped from CDs under an expansive folk interpretation of the US Audio Home Recording Act (AHRA), which legalized in 1992 the then-widespread non-commercial practice of copying music onto cassette tapes and exchanging them with friends.

1.1. Napster

A commercial dot-com venture, Napster featured a centralized server that cataloged music collections stored on subscribers' computers, allowing subscribers' client programs to contact each other directly to exchange files. The peer-to-peer aspect was the direct exchange between clients: this placed the bulk of the communication load on the clients (or peers). Napster was shut down by a legal attack on the central server when US courts ruled that the AHRA didn't apply to it; in traditional distributed computing terms, this was a direct demonstration of Napster's vulnerability to crash failures.

1.2. Gnutella

A successor to Napster, still in use in various mutated forms. Distinguished by the absence of a central server. Instead, each node in the system creates links (TCP connections) to a small number of other nodes drawn at random from a large list of known nodes. This creates an overlay graph that is approximately a random graph on the nodes, which has various desirable properties including low diameter and high expansion (the ratio of the number of vertices adjacent to any subset of the nodes to the number of nodes in the subset, which is important for fault tolerance). Updating the lists of known nodes and searching for resources is done by flooding, a horribly inefficient O(N)-message (but O(log N)-time, if the graph is really random) operation.
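
The flooding search itself is only a few lines of logic. Here is a minimal sketch in Python, assuming an in-memory overlay represented as a dict from node ids to neighbor lists; the names (flood_search, has_resource, ttl) are illustrative rather than taken from any real Gnutella client:

    # Minimal sketch of Gnutella-style flooding search over an in-memory overlay
    # graph. Real clients send TTL-limited query messages over their TCP links
    # and suppress duplicates by message id; the `seen` set plays that role here.
    from collections import deque

    def flood_search(overlay, start, has_resource, ttl=7):
        """Breadth-first flood from `start`, asking each node whether it has the
        resource; returns ids of nodes that answered yes within `ttl` hops."""
        seen = {start}
        frontier = deque([(start, ttl)])
        hits = []
        while frontier:
            node, ttl_left = frontier.popleft()
            if has_resource(node):
                hits.append(node)
            if ttl_left == 0:
                continue
            for neighbor in overlay.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, ttl_left - 1))
        return hits

Even with the TTL cutoff, essentially every reachable node sees every query, which is the O(N)-message cost mentioned above.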

Because Gnutella does not use any central servers, it is much less vulnerable to crash failures: shutting down individual clients has little effect on the rest of the network and merely increases churn.

1.3. FreeNet

An attempt to build a file-sharing system with strong anonymity guarantees and lots of encryption that would in principle be invulnerable to attack by even the most nefarious totalitarian governments, under the assumption that such an entity would never throw users in jail merely because they seemed to be trading around a lot of illicit stuff without being able to actually prove beyond a reasonable doubt precisely which user was guilty of precisely which offense. In its initial version, FreeNet had a complicated self-organizing search protocol that didn't really self-organize and effectively turned into Gnutella-style flooding in practice, and in general one could say that many of the technical assumptions made in its design were thought through about as well as many of the political assumptions. Still under development.

1.4. BitTorrent

Everybody's favorite content distribution protocol; see http://www.bittorrent.com or BitTorrent (protocol) for more details.

BitTorrent is a mechanism for shipping data quickly via a form of structured gossip; the idea is that the original data is divided into fixed-size blocks, and participants can obtain blocks they need from any other participant that already has them. The overlay network itself is an unstructured graph as in Gnutella. But each node engages in a tit-for-tat strategy of preferentially shipping blocks to neighbors who have themselves sent blocks the original node is missing; this tends to concentrate traffic on shorter, faster, or more reliable links when possible.
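
The tit-for-tat preference can be sketched as a simple ranking rule. The following is a toy version, assuming we track how many useful blocks each neighbor has recently sent us; the names and the max_uploads parameter are illustrative, and the real choking algorithm is more elaborate (for example, real clients also reserve one slot for a randomly chosen peer, the optimistic unchoke, so that newcomers with nothing yet to trade can get started):

    import random

    def choose_upload_targets(blocks_received_from, max_uploads=4):
        """blocks_received_from: dict mapping peer id -> number of useful blocks
        that peer has sent us recently. Returns peers to upload to next round."""
        # reciprocate: rank peers by how much they have recently sent us
        ranked = sorted(blocks_received_from, key=blocks_received_from.get, reverse=True)
        targets = ranked[:max_uploads - 1]
        # optimistic unchoke: give one other peer a chance to start trading
        others = ranked[max_uploads - 1:]
        if others:
            targets.append(random.choice(others))
        return targets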

BitTorrent by itself does not provide indexing services, and in the original design each cloud of nodes downloading a particular object is coordinated by a single tracker node that introduces nodes to each other and provides checksums used to verify blocks and prevent the introduction of bogus data. This can produce the same single-point-of-failure problems as in Napster, so some recent work has been done on constructing distributed trackers, often by integrating some sort of Distributed Hash Table as described below. On the other hand, having a small number of trackers allows more explicit mapping of the cloud structure to the underlying network topology (by having the tracker match up nearby nodes when possible); see http://www.dcia.info/documents/P4P_Overview.pdf for some very active recent work on this.

2. The amazing effectiveness of Napster and Gnutella

The astonishing thing about Napster and Gnutella is that neither system collapsed under its load. Normally, we'd expect a system with a million users pounding on a small number of servers or repeatedly flooding the entire network to fail within seconds. However, the specific application of finding 500-megabyte files indexed by 30-byte CD or movie titles mitigated this problem to a large extent: spamming a network of a million nodes with a 30-byte request for a particular CD or a 6-byte announcement of one's IP address and port number only generates a few dozen megabytes of network traffic (ignoring packet overhead), which is chicken feed compared to the data being swapped directly between peers. But for applications where the search targets are small (which includes most of the hypothetical legal ones), such huge search costs are impractical.
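
The arithmetic behind that claim is easy to check. A back-of-envelope version, counting payload bytes only and ignoring both packet overhead and the duplicate copies of a query that flooding sends along multiple links:

    # Rough check: one 30-byte query delivered to each of a million nodes is
    # about 30 MB of search traffic, versus hundreds of megabytes transferred
    # for every file actually downloaded.
    nodes = 10**6
    query_bytes = 30
    search_traffic = nodes * query_bytes      # 30,000,000 bytes, roughly 30 MB
    file_bytes = 500 * 10**6                  # one 500-megabyte file
    print(search_traffic / 1e6, "MB per search vs", file_bytes / 1e6, "MB per file")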

3. Distributed hash tables

The second part of the peer-to-peer story is the collision with academia. From the perspective of traditional distributed computing, the interesting problem in Napster and its successors was indexing: how do we index the contents of a gazillion hard disks without a Google-style central server farm? The resulting systems came to be called distributed hash tables or DHTs, based on the practice in many such systems of hashing data across a large number of servers (which might or might not be recruited from a churning population of clients).

3.1. Chord

Chord (see http://www.pdos.lcs.mit.edu/chord/ or http://www.pdos.lcs.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf) is for many researchers the prototypical DHT. Its history predates Napster and goes back to earlier work at MIT on distributing load for web sites using distributed hashing. The basic technical idea is to plant a collection of servers around a ring-shaped virtual address space, e.g. the integers mod 2^m for some large m or (theoretically) the reals mod 1. A server at position x is provided with a finger table that points to the first server with position greater than x+1, x+2, x+4, x+8, ..., x+2^(m-1). To search for the highest-address server less than a target address, take the longest hop that doesn't go past the target; it is not hard to show that a search thus takes O(log N) time. To store data, store an item with key k at the first server whose id is less than or equal to h(k), where h is some hash function; the hash function is necessary for load balancing.
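
To make the finger-table search concrete, here is a toy sketch that follows the description above: find the last server before a target by repeatedly taking the longest finger that does not pass it. For simplicity the fingers are recomputed from a global sorted list of server positions, whereas a real Chord node maintains its own finger table by messaging; the names here are illustrative, not the Chord paper's pseudocode.

    from bisect import bisect_left

    M = 16                      # identifier bits; the ring has size 2^M
    RING = 1 << M

    def successor(servers, point):
        """First server at or after `point`, wrapping around the ring.
        `servers` is a sorted list of at least two distinct positions in [0, RING)."""
        i = bisect_left(servers, point % RING)
        return servers[i % len(servers)]

    def clockwise(a, b):
        """Clockwise distance from a to b, in 1..RING (a to itself is a full lap)."""
        return (b - a - 1) % RING + 1

    def find_predecessor(servers, start, key):
        """Greedy search from server `start` for the last server strictly before
        `key`: each hop follows the longest finger that does not pass the key."""
        x, hops = start, 0
        # stop once the key falls between x and x's immediate successor
        while clockwise(x, key) > clockwise(x, successor(servers, x + 1)):
            fingers = [successor(servers, x + (1 << i)) for i in range(M)]
            # closest preceding finger: farthest around without passing the key
            x = max((f for f in fingers if f != x and clockwise(x, f) < clockwise(x, key)),
                    key=lambda f: clockwise(x, f))
            hops += 1
        return x, hops

With, say, a thousand random server positions, a call like find_predecessor(servers, servers[0], some_key) returns the last server strictly before some_key after a number of hops logarithmic in the number of servers.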

Node joins may require updating the finger tables of O(log n) predecessor nodes, which cost O(log n) time each to find (this can be optimized somewhat). Node leaves have similar costs. Insertion and deletion of items requires doing a search (O(log n) time) and then negotiating with the resulting server (O(1) time). To avoid problems with nodes vanishing suddenly without politely informing their predecessors, each node keeps track of all of its near neighbors (say, the next 5 nodes before and after it in the circle) so that it can run a repair mechanism to reconnect if an immediate neighbor fails.
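
The repair mechanism can be as simple as falling back to the next live entry in the cached neighbor list. A minimal sketch, where is_alive is a hypothetical stand-in for whatever ping/timeout machinery an implementation actually uses:

    def repair_successor(successor_list, is_alive):
        """Return the first live node from our cached list of successors,
        discarding dead entries, or None if every cached successor has failed
        (in which case the node must rejoin through some other contact)."""
        while successor_list:
            candidate = successor_list[0]
            if is_alive(candidate):
                return candidate
            successor_list.pop(0)      # drop the failed node, try the next one
        return None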

3.2. Tapestry

Tapestry (http://www.cs.ucsb.edu/~ravenben/tapestry/) provides a similar interface to Chord, but works harder at locality (nearby nodes in the overlay network are nearby in the underlying network). The underlying routing scheme looks like a truncated hypercube: each node has a vector identifier, and connects to all other nodes whose vectors differ in exactly one place in the first k locations, where k is adjusted up or down to avoid too many or too few neighbors.
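
A small sketch of that neighbor rule, treating node identifiers as digit vectors and linking two nodes when they differ in exactly one of the first k positions. This follows the simplified description in these notes rather than Tapestry's actual prefix-routing tables, and the names are illustrative:

    def is_neighbor(u, v, k):
        """u, v: equal-length digit vectors; True if they differ in exactly one
        of their first k positions (and hence agree on the rest of that prefix)."""
        return sum(1 for a, b in zip(u[:k], v[:k]) if a != b) == 1

    def neighbors(node, all_nodes, k):
        """All nodes linked to `node` under the rule above."""
        return [v for v in all_nodes if v != node and is_neighbor(node, v, k)]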

3.3. Kademlia

Kademlia (http://kademlia.scs.cs.nyu.edu/) uses a DHT based on greedy routing on an XOR metric. What this means is that each node maintains a list of nodes whose IDs are nearby in the XOR metric (where the distance between two IDs is their bitwise exclusive-or, read as an integer), and searching for a target involves moving greedily through nodes that are closer to the target in this metric. This appears to be the most popular DHT design in current widespread use; see Kademlia for a list of systems that use it.
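
A toy sketch of the XOR metric and the greedy step, assuming node IDs are integers and routing_tables maps each node's ID to whatever contact IDs it happens to know. Real Kademlia keeps contacts in per-distance-range k-buckets and queries several of the closest contacts in parallel, none of which is shown here:

    def xor_distance(a, b):
        """Kademlia's distance between two IDs: their bitwise XOR, read as an integer."""
        return a ^ b

    def greedy_lookup(routing_tables, start, target):
        """Hop to the known contact closest to `target` (in XOR distance) until the
        current node is at least as close as every contact it knows about."""
        node = start
        while True:
            contacts = routing_tables.get(node, [])
            best = min(contacts, key=lambda c: xor_distance(c, target), default=node)
            if xor_distance(best, target) >= xor_distance(node, target):
                return node
            node = best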

3.4. Many others

Skip graphs/SkipNets (indexing structures based on skip lists that allow range queries); Koorde (low-degree Chord using de Bruijn graphs); Pastry; Kelips; many other variants.

Skip graphs and SkipNets are interesting because they don't do hashing. This allows searches for e.g. the smallest key in some range or a random key in some range but doesn't provide the built-in load balancing of distributed hash tables. Though such structures have interesting theoretical properties, they have not been used much in practical systems.

4. Issues in peer-to-peer systems

Scalability
The search problem for DHTs has essentially been solved, at least if one ignores locality issues. The variation between the O(log n) and O(log n / log log n) search costs of different DHTs is interesting theoretically but irrelevant in practice. The remaining scalability issues are load balancing and surviving churn. How load balancing is handled depends on the nature of the underlying DHT.
Security

Because of their openness, peer-to-peer systems are vulnerable to takeover by malicious nodes. This is particularly troubling because of the possibility of Sybil attacks (Douceur, 2002), where a single machine masquerades as many virtual machines by adopting many IP addresses simultaneously. Such attacks limit the ability of the system to rely on most of the real nodes being non-malicious. Defeating Sybil attacks is an active area of research: typical methods involve exploiting the network structure to exclude suspiciously large groups of processes that are reached through the same path (as in Haifeng Yu's SybilGuard) or that appear to be at the same physical location (as in Bazzi and Konjevod's geometric certificates).

Fault-tolerance
Here the goal is survival against random failures and churn. Systems like Kademlia seem to be good enough for general use. Some systems have stronger guarantees, e.g. overlay graphs that provably contain expanders.
Rapid deployment and/or repair

Here the goal is to build an overlay network quickly from scratch (specifically, from a weakly-connected pointer graph). See e.g. Angluin et al. for an O(log² N)-time protocol for constructing a skip graph or Chord ring from scratch, or Montresor et al.'s Chord on demand paper for what appears to be an O(log N)-time protocol (as verified by simulation) based on gossip. The tricky part here seems to be reducing the diameter of the overlay network quickly. One plausible approach is to apply local random transformations to try to get a random graph, as suggested by Mahlmann and Schindelhauer. The open question here is how to prove that this approach actually converges in the O(log N) time it appears to converge in, as opposed to the best currently known bound of O(n^big) (http://www.cc.gatech.edu/~mihail/switch.pdf). Another alternative is to place restrictions on the initial graph, e.g. by assuming that each node starts with out-degree one (http://www.cs.yale.edu/homes/aspnes/out-degree-1-abstract.html). Many more papers on this subject can be found here.

Efficient unstructured systems

Is it possible to perform search efficiently without actually building a distributed data structure? The goal here would be to have some sort of distributed protocol that would tend to push data toward the same places it was needed. What this actually means depends on defining unstructured—does Kademlia count?

Topology awareness

First-generation peer-to-peer systems effectively assumed that all network links are equally fast, which is not realistic. Quite a bit of recent work has looked at how to structure a peer-to-peer system to respect the underlying limitations of the network, ranging from simple hacks like the split between ordinary peers and high-bandwidth superpeers in some recent systems based on Gnutella, to sophisticated network management mechanisms like the P4P extensions to BitTorrent. This continues to be a very active area of research.

5. Current work

Much of the work in peer-to-peer systems appears in IPTPS; see http://www.iptps.org.


CategoryDistributedComputingNotes

