Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

For more up-to-date notes see http://www.cs.yale.edu/homes/aspnes/classes/465/notes.pdf.

1. The Sybil attack

Idea is that one bad machine can masquerade as many different machines using routing tricks. This defeats any distributed algorithm based on assuming a fixed fraction of the processes are bad.

Problem for any system with "open admissions," where any machine on the the network can join it. Examples of such systems are most PeerToPeer networks, the SMTP-based mail system, and the World-Wide Web.

Popularized by a paper by John Douceur from IPTPS 2001 (sybil-attack.pdf), which analyzes several methods for attempting to defeat Sybil attacks. The term Sybil attack itself is credited to Brian Zill in Douceur's paper, and is based on the book (and later movie) Sybil about a psychiatric patient with multiple-personality disorder.

Note that Sybil attacks do not include attacks using botnets, where the problem is that we really do have 10,000,000 bad nodes overwhelming our system; instead, we are worrying about the case where a small number of bad routers claim to have 10,000,000 bad nodes behind them but these nodes are only simulated by a small number of machines.

Sybil attack has a surprisingly good list of references (the article itself is a little weak as of 2010-02-26).

2. Resource-based defenses

Douceur considers several resource-based defenses that had been proposed for solving similar problems at the time of writing of his paper. A typical approach is to ask each participant in the system to carry out some computationally-intensive task, the idea being that Sybil nodes will be overwhelmed by carrying out this task for their 10,000,000 virtual clones. This method goes back to a paper by Dwork and Naor from CRYPTO 1992 (see http://www.wisdom.weizmann.ac.il/~naor/PAPERS/spam.html for this and some related work) and is frequently reinvented. The problem is that processing is cheap: if you have to waste 30 seconds to verify your identity, then I can generate new identities at the cost of 30 seconds of computation, or about $0.0002 per id at current http://aws.amazon.com/ec2/ Amazon EC2 spot prices.¹ The price to the bad guys can be increased by raising the computational bar, but only at the cost of inconveniencing legitimate users. The problem is particularly acute if the processing can be done ahead of time, allowing the bad guys to generate an unbounded number of identities in advance.

CAPTCHAs are a version of this method that is currently popular, using tasks that are hard to automate. This raises the cost of a fake identity to $0.002 (see for example http://decaptcher.com/client/), 10 times our hypothetical pricing-via-processing approach but still pretty cheap.

The other category of defenses considered by Douceur are reputation systems based on accepting only users who are vouched for by sufficiently many known-good users. Here the problem is that as soon as Sybil gets a quorum of bad identities accepted, she can send in as many additional identities as she likes. So a naive reputation-based approach won't work either.

3. Location-based approach

The first really credible approach to preventing Sybil attacks may have been Bazzi and Konjevod's location-based approach Distributed Computing 19(4):267–287. Here a process that wants to authenticate itself submits a geometric certificate consisting of verified ping times to a collection of standardized beacon nodes. Multiple virtual machines located at the same physical location will end up with essentially the same certificate, and can be treated as one (possibly corrupted) node. So will multiple users at large institutions with a single outgoing pipe to the rest of the network.

4. Social network approach

Pioneered by SybilGuard sybilguard.pdf and subsequently SybilLimit sybillimit.pdf, with further improvements in the Lesniewski-Laas paper below. We will concentrate on the basic model and the solution given by SybilGuard.

Model is that we organize our nodes into a social network graph where each vertex is a participant and an edge means two participants know and trust each other (in class we considered that maybe if I am your follower on Twitter we don't put an edge in, because I might be a spammer, but if we follow each other, maybe that counts as an edge). The observation is if the Sybil nodes are being simulated by a very small number of real nodes, then the graph decomposes into two large clusters (real nodes and Sybil nodes) with a small number of attack edges between them, corresponding to the small number of real nodes that the bad nodes have convinced to friend them. We assume that the good cluster is rapid mixing; the bad cluster (since it is an illusian created by a handful of bad nodes) can have any structure the bad nodes choose. But the important part is that the bottleneck between the is small.

Unfortunately finding that bottleneck is computationally expensive even if we can collect the entire graph centrally. So instead we use various heuristics to exclude most Sybil nodes while keeping most good nodes.

In SybilGuard, assume each node has degree exactly d and create a random routing table at each node that routes each incoming edge to a distinct outgoing edge (essentially this is just a permutation of the edges). Now we have each node generate d outgoing paths of length Θ(√n log n) by starting with their outgoing edges and then just following the routing tables. I consider you to be good if one of my paths intersects one of yours.

First we argue that this works for recognizing good processes. Any two good processes whose paths stay in the good region are likely to cross paths because of the Birthday paradox—with rapid mixing each path reaches Θ(√n) almost-uniform random nodes, and with Θ(√n)⋅Θ(√n) = Θ(n) pairs that might collide with probability 1/n the expected number that do collide is at least a constant. But the number of good processes whose paths don't stay in the good region is small, since there are few attack edges and most paths won't hit them.

Next we have to argue that not too many Sybil nodes can pass themselves off as real. Because routing is deterministic (once we fix the routing tables), and because routing is reversible (since we can run the routing tables backwards), for each attack edge leaving the bad component there is only one path that follows it. So for each node on the path, there are at most O(√n log n) nodes (virtual or otherwise) that could have reached that node legitimately, and we can exclude any extra nodes that claim to get there by some other path that happens to have the same tail. For each of the g attack edges, we have O(log n) or thereabouts expected points of intersection with each verifier's path, and each point of intersection only adds Õ(√n) accepted nodes, so we get Õ(√n) bad nodes accepted per attack edge.

SybilGuard improves this to O(log n) by replacing d paths of length Θ(√n log n) with Θ(√m) paths (m = # of edges) of length Θ(log n) each; this requires having Θ(√m) independent random routing tables at each node and a much messier rule for deciding who is good or not.

Even better is this result in the same model from the speaker from the day before class: Chris Lesniewski-Laas, A Sybil-proof one-hop DHT, Proceedings of the 1st Workshop on Social Network Systems, 2008. We didn't talk about this at all.

Another paper mentioned in class while discussing why the social network approach still might not help: Bilge et al., All your contacts are belong to us: automated identity theft attacks on social networks, International World Wide Web Conference, 2009.

CategoryDistributedComputingNotes

If I am particularly amoral, rumor has it that one can rent 100,000-node botnets from Russian mafia guys for about $400/hour, reducing the price further to about $0.00003 per id. (1)