Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

The Johnson-Lindenstrauss theorem says that it is possible to project a set of n vectors in a space of arbitrarily high dimension onto an O(log n)-dimensional subspace, such that the distances between the vectors are approximately preserved. There are several versions of the theorem, depending on how the projection is done, but in each case we are looking at showing that for sufficiently large k as a function of n there is some k×d matrix A such that for any two d-dimensional vectors u and v in our set, (1-ε)‖u-v‖² ≤ ‖Au-Av‖² ≤ (1+ε)‖u-v‖². Typical choices for A are:

A projection matrix for a uniform random k-dimensional subspace. (Original Johnson-Lindenstrauss paper, Dasgupta-Gupta proof described below.)
A matrix whose elements are i.i.d. normal variables. (Indyk-Motwani, probably others.)
A matrix whose elements are i.i.d. ±1 variables. (Achlioptas. An appropriate distribution on {-1,0,+1} also works.)

In each case, we multiply the matrix by a fixed scale factor so that E[‖Au‖²] = ‖u‖².

1. Reduction to single-vector case

Most proofs of the theorem reduce to the case of a single vector, by finding a value of k such that Pr[(1-ε)‖u‖² ≤ ‖Au‖² ≤ (1+ε)‖u‖²] ≥ 1-2/n², and then using the union bound to show that this same property holds for all (n choose 2) vectors u-v with nonzero probability. This shows the existence of a good matrix, and we can generate matrices and test them until we find one that actually works.

2. A relatively simple proof of the theorem

This proof is due to Dasgupta and Gupta (see this paper), and is somewhat simpler than many proofs of the theorem. The basic idea is that if A projects onto an uniform random k-dimensional subspace, then its effect on the length of an arbitrary nonzero vector u is the same as its effect on the length of a unit vector u/‖u‖, which is the same as the effect of some specific fixed matrix B applied to a random unit vector Y drawn uniformly from the surface of the d-dimensional sphere S^d. An easy choice for B is just the matrix that extracts the first k coordinates from Y.

So how do we get a random unit vector Y? The usual trick is to use the fact that the multivariate normal distribution is radially symmetric: if we generate d independent normally-distributed random variables X_i, the vector Y = X/‖X‖ is uniformly distributed over S^d. Then Z = BY = (X₁...X_k)/‖X‖, and ‖Z‖² = (X₁² + X₂² + ... + X_k²) / (X₁² + ... X_d²). If each value X_i² were exactly equal to its expectation 1 (the variance of a standard N(0,1) normal random variable), then we would get ‖Z‖² exactly equal to k/d, and we could make ‖Z‖² exactly equal to 1 by multiplying Z by a correction factor of √(d/k). But we can't expect each X_i² to equal 1; instead, we will use a Chernoff bound argument to show that it lies between (1-ε)(k/d) and (1+ε)(k/d) with high probability.

To avoid writing a lot of (1-ε)'s and (1+ε)'s we'll use β instead. For the lower bound, we want to show that it is unlikely that

‖Z‖² ≤ β(k/d)

(X₁² + X₂² + ... + X_k²) / (X₁² + ... X_d²) ≤ β(k/d).

Having a ratio between two sums is a pain, but we can multiply out the denominators to turn it into something we can Chernoff-ify.

$\begin{align*} \Pr[(X_1^2+\dots+X_k^2)/(X_1^2+\dots+X_d^2) \le \beta(k/d)] &= \Pr[d(X_1^2+\dots+X_k^2) \le \beta k (X_1^2+\dots+X_d^2)] \\ &= \Pr[\beta k (X_1^2+\dots+X_d^2) - d(X_1^2+\dots X_k^2) \ge 0] \\ &= \Pr[\exp(t(\beta k (X_1^2+\dots+X_d^2) - d(X_1^2+\dots X_k^2))) \ge 1] \\ &\le \E[\exp(t(\beta k (X_1^2+\dots+X_d^2) - d(X_1^2+\dots X_k^2)))] \\ &= \E[\exp(t \beta k X^2)]^{d-k} \E[\exp(t(\beta k - d)X^2)]^k. \end{align*}$

where t can be any value greater than 0, the shift from Pr to E uses Markov's inequality, and in the last step X is a standard normal random variable (since the X_i are i.i.d.).

Now we just need to be able to compute the moment generating function E[exp(sX²)] for X². The quick way to do this is to notice that X² has a chi-squared distribution with one degree of freedom, and look up the m.g.f. (1-2s)^-1/2 (for s < 1/2). If we are less happy about looking things up, we can use the fact that X²+Y² has an exponential distribution when X and Y are both unit normals to show that E[exp(sX²)]² = 1/(1-2s), and then take the square root of both sides. In either case, we can substitute in the m.g.f. in the formula above to get

$\begin{align*} \Pr[(X_1^2+\dots+X_k^2)/(X_1^2+\dots+X_d^2) \le \beta(k/d)] &\le \left(\frac{1}{1-2\beta k t}\right)^{(d-k)/2} \left(\frac{1}{2t(\beta k - d)}\right)^{k/2}. \end{align*}$

The rest is just the usual trick of finding the minimum value of this expression over all t (that don't produce negative denominators in the m.g.f.!) by differentiating and setting to 0. This is easiest if we maximize 1/g(t)² where g(t) is the above expression; see the Dasgupta-Gupta paper for details. The result is that for β<1:

$\begin{align*} \Pr[\|Z^2\| \le \beta(k/d)] &\le \beta^{k/2}\left(1+\frac{(1-\beta)k}{d-k}\right)^{(d-k)/2} \\ &\le \beta^{k/2} \exp\left(\frac{(1-\beta)k}{d-k}\right)^{(d-k)/2} \\ &= \beta^{k/2} \exp{(1-\beta)(k/2)} \\ &= \exp((k/2)(1-\beta+\ln \beta)). \end{align*}$

The same argument applied to g(-t) gives essentially the same bound for β>1:

$\begin{align*} \Pr[\|Z^2\| \ge \beta(k/d)] &\le \beta^{k/2}\left(1+\frac{(1-\beta)k}{d-k}\right)^{(d-k)/2} \\ &\le \exp((k/2)(1-\beta+\ln \beta)). \end{align*}$

After further manipulations (see paper), this gives the existence of a good map at k ≥ 4(ε²/2 - ε³/3)^-1 ln n, and in general we can get a polynomially-small probability of having distortion >ε for polynomially-many points for some k = O(ε^-2 log n). Note that this depends on n but not d.

Constructing a projection matrix for a random k-dimensional space is mildly painful; the easiest way to do it may be to generate a random k×d matrix of independent N(0,1) variables and then apply Gram-Schmidt orthogonalization, which takes O(k² d) time. A faster approach is to use a matrix of independent ±1 random variables, which doesn't produce an orthogonal projection but can be shown (with much more effort) to still work pretty well: see this paper of Achlioptas.

CategoryRandomizedAlgorithmsNotes