Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

Contents

Random variables
The distribution of a random variable
The expectation of a random variable
The variance of a random variable
Probability generating functions
1. Sums
2. Expectation and variance
Summary: effects of operations on expectation and variance of random variables
The general case

1. Random variables

A random variable X is a variable that takes on particular values randomly. This means that for each possible value x, there is an event [X=x] with some probability of occurring that corresponds to X (the random variable, usually written as an upper-case letter) taking on the value x (some fixed value).

Examples of random variables:

Indicator variables: The indicator variable for an event A is a variable X that is 1 if A occurs and 0 if it doesn't (i.e., X(ω) = 1 if ω∈A and 0 otherwise). Indicator variables are often written using the Greek letter chi (e.g. χ_A) or using bracket notation (e.g., [A], [Y² > 3], [all six coins come up heads]).
Functions of random variables: Any function you are likely to run across of a random variable or random variables is a random variable, e.g. X+Y, XY, log X, etc.
Counts: Flip a fair coin n times and let X be the number of times it comes up heads. Then X is an integer-valued random variable.
Random sets and structures: Suppose that we have a set T of n elements, and we pick out a subset U by flipping an independent fair coin for each element to decide whether to include it. Then U is a set-valued random variable. Or we could consider the infinite sequence X₀, X₁, X₂, ..., where X₀ = 0 and X_n+1 is either X_n+1 or X_n-1, which being chosen by an independent fair coin flip. Then we can think of the entire sequence X as a sequence-valued random variable.

2. The distribution of a random variable

The distribution of a random variable describes the probability that it takes on various values. For discrete random variables (variables that take on only countably many values, each with nonzero probability), the distribution is most easily described by just giving the probability Pr[X=x] for each value x.¹

Often if we know the distribution of a random variable, we don't bother worrying about what the underlying probability space is. The reason for this is we can just take Ω to be the range of the random variable, and define Pr[ω] for each ω∈Ω to be Pr[X=ω]. For example, a six-sided die corresponds to taking Ω = {1,2,3,4,5,6}, assigning Pr[ω] = 1/6 for all ω∈Ω, and letting X(ω) = ω.

The same thing works if we have multiple random variables, but now we let each point in the space be a tuple that gives the values of all of the variables. Specifying the probability in this case is done using a joint distribution (see below).

2.1. Some standard distributions

Here are some common distributions for a random variable X:

Bernoulli distribution: Pr[X = 1] = p, Pr[X = 0] = q, where p is a parameter of the distribution and q = 1-p. This corresponds to a single biased coin-flip.
Binomial distribution: Pr[X = k] = (n choose k) p^k q^(n-k), where n and p are parameters of the distribution and q = 1-p. This corresponds to the sum of n biased coin-flips.
Geometric distribution: Pr[X = k] = q^kp, where p is a parameter of the distribution and q is again equal to 1-p. This corresponds to number of tails we flip before we get the first head in a sequence of biased coin-flips.
Poisson distribution: Pr[X = k] = e^-λλ^k/k!. This is what happens to a binomial distribution when we fix p = λ/n and then take the limit as n goes to infinity; we can think of it as counting the number of events that occur in one time unit if the events occur at a constant continuous rate that averages λ events per time unit. The canonical example is radioactive decay.
Uniform distribution: The distribution function F of X is given by F(x) = 0 when x ≤ a, (x-a)/(b-a) when a ≤ x ≤ b, and 1 when b ≤ x, where a and b are parameters of the distribution. This is a continuous random variable that has equal probability of landing anywhere in the [a,b] interval.

Another very important distribution is the normal distribution, but it's not discrete so we would need more machinery to talk about it.

2.2. Joint distributions

Two or more random variables can be described using a joint distribution. For discrete random variables, this just gives Pr[X=x ∧ Y=y] for all fixed values x and y, or in general gives Pr[∀i X_i=x_i].

Given a joint distribution on X and Y, we can recover the distribution on X or Y individually by summing up cases: Pr[X=x] = ∑_y Pr[X=x ∧ Y=y]. The distribution of X obtained in this way is called a marginal distribution of the original joint distribution. In general we can't go in the other direction, because just knowing the marginal distributions doesn't tell us how the random variables might be related to each other.

Examples:

Let X and Y be six-sided dice. Then Pr[X=x ∧ Y=y] = 1/36 for all values of x and y in the right range. The underlying probability space consists of all pairs (x,y) in {1,2,3,4,6}×{1,2,3,4,5,6}.
Let X be a six-sided die and let Y = 7 - X. Then Pr[X=x ∧ Y=y] = 1/6 if 1 ≤ x ≤ 6 and y = 7-x, and 0 otherwise, with the underlying probability space most easily described by including just six points for the X values (although we could also do {1,2,3,4,5,6}×{1,2,3,4,5,6} as in the previous case, just assigning probability 0 to most of the points). However, even though the joint distribution is very different from the previous case, the marginal distributions of X and Y are exactly the same as before: each of X and Y takes on all values in {1,2,3,4,5,6} with equal probability.

2.3. Independence of random variables

The difference between the two preceding examples is that in the first case, X and Y are independent, and in the second case, they aren't.

Two random variables X and Y are independent if any pair of events of the form X∈A, Y∈B are independent. For discrete random variables, it is enough to show that Pr[X=x ∧ Y=y] = Pr[X=x]⋅Pr[Y=y], or in other words that the events [X=x] and [Y=y] are independent for all values x and y. In practice, we will typically either be told that two random variables are independent or deduce it from how they are generated.

Examples:

Roll two six-sided dice, and let X and Y be the values of the dice. By convention we assume that these values are independent. This means for example that Pr[X∈{1,2,3} ∧ Y∈{1,2,3}] = [Pr[X∈{1,2,3}]⋅Pr[Y∈{1,2,3}] = (1/2)(1/2) = 1/4, which is a slightly easier computation than counting up the 9 cases (and then argue that each occurs with probability (1/6)², which requires knowing that X and Y are independent).
Take the same X and Y, and let Z = X+Y. Now Z and X are not independent, because Pr[X=1 ∧ Z=12] = 0, which is not equal to Pr[X=1]⋅Pr[Z=12] = (1/6)(1/36) = 1/216.
Place two radioactive sources on opposite sides of the Earth, and let X and Y be the number of radioactive decay events in each source during some 10ms interval. Since the sources are 42ms away from each other at the speed of light, we can assert that either X and Y are independent, or the world doesn't behave the way the physicists think it does. This is an example of variables being independent because they are physically independent.
- Roll one six-sided die X, and let Y = ⌈X/2⌉ and Z = X mod 2. Then Y and Z are independent, even though they are generated using the same physical process.

In general, if we have a collection of random variables X_i, we say that they are all independent if the joint distribution is the product of the marginal distributions, i.e. if Pr[∀i X_i=x_i] = ∏_i Pr[X_i=x_i]. It may be that a collection of random variables is not independent even though all subcollections are.

Let X and Y be fair coin-flips, and let Z = X⊕Y. Then any two of X, Y, and Z are independent, but the three variables X, Y, and Z are not independent, because (for example) Pr[X = 0 ∧ Y = 0 ∧ Z = 0] = 1/4 instead of 1/8 as one would get by taking the product of the marginal distributions.

Since we can compute the joint distribution from the marginal distributions for independent variables, we will often just specify the marginal distributions and declare that a collection of random variables are independent. This implicitly gives us an underlying probability space consisting of all sequences of values for the variables.

3. The expectation of a random variable

For a real-valued random variable X, its expectation E[X] (sometimes just EX) is its average value, weighted by probability.²> For discrete random variables, the expectation is defined by

$\E[X] = \sum_{x} x \Pr[X=x].$

Example (discrete variable): Let X be the number rolled with a fair six-sided die. Then E[X] = (1/6) (1+2+3+4+5+6) = 3½.
Example (unbounded discrete variable): Let X be a geometric random variable with parameter p, i.e., let Pr[X = k] = q^kp, where q = 1-p. Then E[X] = $\sum_{k=0}^{\infty} k q^k p = p \sum_{k=0}^{\infty} k q^k = p \cdot \frac{q}{(1-q)^2} = \frac{pq}{p^2} = \frac{q}{p} = \frac{1-p}{p} = \frac{1}{p}-1$ .

Expectation is a way to summarize the distribution of a random variable without giving all the details. If you take the average of many independent copies of a random variable, you will be likely to get a value close to the expectation. Expectations are also used in decision theory to compare different choices. For example, given a choice between a 50% chance of winning $100 (expected value: $50) and a 20% chance of winning $1000 (expected value: $200), a rational decision maker would take the second option. Whether ordinary human beings correspond to an economist's notion of a rational decision maker often depends on other details of the situation.

Terminology note: If you hear somebody say that some random variable X takes on the value z on average, this usually means that E[X] = z.

3.1. Variables without expectations

If a random variable has a particularly annoying distribution, it may not have a finite expectation, even thought the variable itself takes on only finite values. This happens if the sum for the expectation diverges.

For example, suppose I start with a dollar, and double my money every time a fair coin-flip comes up heads. If the coin comes up tails, I keep whatever I have at that point. What is my expected wealth at the end of this process?

Let X be the number of times I get heads. Then X is just a geometric random variable with p = 1/2, so Pr[X=k] = (1-(1/2))^k(1/2)^k = 2^-k-1. My wealth is also a random variable: 2^X. If we try to compute E[2^X], we get

$\begin{align*} \E[2^X] &= \sum_{k=0}^{\infty} 2^k \Pr[X = k] \\ &= \sum_{k=0}^{\infty} 2^k \cdot 2^{-k-1} \\ &= \sum_{k=0}^{\infty} \frac{1}{2}, \end{align*}$

which diverges. Typically we say that a random variable like this has no expected value, although sometimes you will see people writing E[2^X] = ∞.

(For an even nastier case, consider what happens with E[(-2)^X].)

3.2. Expectation of a sum

The expectation operator is linear: this means that E(X+Y) = E(X)+E(Y) and E(aX) = aE(X) when a is a constant. This fact holds for all random variables X and Y, whether they are independent or not, and is not hard to prove for discrete probability spaces:

$\begin{align*} \E[aX+Y] &= \sum_{x,y} (ax+y) \Pr[X = x \wedge Y = x] \\ &= a \sum_{x,y} x \Pr[X = x \wedge Y = x] + \sum_{x,y} y \Pr[X = x \wedge Y = x] \\ &= a \sum_x x \sum_y \Pr[X = x \wedge Y = x] + \sum_y y \sum_x \Pr[X = x \wedge Y = x] \\ &= a \sum_x x \Pr[X=x] + \sum_y y \Pr[Y=y] \\ &= a\E[X]+\E[Y]. \end{align*}$

Linearity of expectation makes computing many expectations easy. Example: Flip a fair coin n times, and let X be the number of heads. What is E[X]? We can solve this problem by letting X_i be the indicator variable for the event "coin i came up heads". Then X = ∑_i X_i and EX = E[∑_i X_i] = ∑_i EX_i = n(½) = n/2. In principle it is possible to calculate the same value from the distribution of X (which involves a lot of binomial coefficients), but linearity of expectation is much easier.

Example: Choose a random permutation f, i.e. a random bijection from 1..n → 1..n. What is the expected number of values i for which f(i) = i?
Solution: Let X_i be the indicator variable for the event that f(i) = i. Then we are looking for E[X₁ + X₂ + ... X_n] = E[X₁] + E[X₂] + ... E[X_n]. But E[X_i] is just 1/n for each i, so the sum is n(1/n) = 1. (Calculating this by computing Pr[∑X_i] = x first would be very painful.)

3.3. Expectation of a product

For products of random variables, the situation is more complicated. Here the rule is that E[XY] = E[X]⋅E[Y] if X and Y are independent. But if X and Y are not independent, their expectation can be arbitrary.

Example: Roll 2 dice and take their product. What value do we get on average? The product formula gives E[XY] = E[X]E[Y] = (7/2)² = (49/4) = 12¼. We could also calculate this directly by summing over all 36 cases, but it would take a while.
Example: Roll 1 die and multiply it by itself. What value do we get on average? Here we are no longer dealing with independent random variables, so we have to do it the hard way: E[X²] = (1²+2²+3²+4²+5²+6²)/6 = 91/6 = 15⅙. This is substantially higher than when the dice are uncorrelated. (Exercise: How can you rig the second die so it still comes up with each value ⅙ of the time but minimizes E[XY]?)

We can prove the product rule without too much trouble for discrete random variables. The easiest way is to start from the right-hand side.

$\begin{eqnarray*} \E[X] \cdot \E[Y] &=& \left(\sum_{x} x \Pr[X=x]\right)\left(\sum_{y} y \Pr[Y=y]\right) \\ &=& \sum_{x,y} xy \Pr[X=x] \Pr[Y=y] \\ &=& \sum_{z} z \left(\sum_{x,y, xy=z} \Pr[X=x] \Pr[Y=y]\right) \\ &=& \sum_{z} z \left(\sum_{x,y, xy=z} \Pr[X=x \wedge Y=y]\right) \\ &=& \sum_{z} z \Pr[XY=z] \\ &=& \E[XY]. \end{eqnarray*}$

Here we use independence in going from Pr[X=x] Pr[Y=y] to Pr[X=x ∧ Y=y] and use the union rule to convert the x,y sum into Pr[XY=z].

3.4. Conditional expectation

Like conditional probability, there is also a notion of conditional expectation. The simplest version of conditional expectation conditions on a single event A, is written E[X|A], and is defined for discrete random variables by

$E[X|A] = \sum_{x} x \Pr[X=x|A].$

This is exactly the same as unconditioned expectation except that the probabilities are now conditioned on A.

Example: What is the expected value of a six-sided die conditioned on not rolling a 1? The conditional probability of getting 1 is now 0, and the conditional probability of each of the remaining 5 values is 1/5, so we get (1/5)(2+3+4+5+6) = 4.

Conditional expectation acts very much like regular expectation, so for example we have E[aX+bY|A] = aE[X|A] + bE[Y|A].

One of the most useful applications of conditional expectation is that it allows computing (unconditional) expectations by case analysis, using the fact that

E[X] = E[X|A]Pr[A] + E[X|¬A]Pr[¬A].

3.4.1. Examples

3.4.1.1. The Treasure on Mt Everest

I have a 50% chance of reaching the top of Mt Everest, where Sir Edmund Hilary and Tenzing Norgay hid somewhere between 0 and 10 kilograms of gold (a random variable with uniform distribution). How much gold do I expect to bring home? Compute E[X] = E[X|reached the top]Pr[reached the top] + E[X|didn't make it]Pr[didn't make it] = 5⋅0.5 + 0⋅0.5 = 2.5.

3.4.1.2. Waiting times

Suppose I flip a coin that comes up heads with probability p until I get heads. How many times on average do I flip the coin?

We'll let X be the number of coin flips. Conditioning on whether the coin comes up heads on the first flip gives E[X] = 1⋅p + (1+E[X'])⋅(1-p), where X' is random variable counting the number of coin-flips needed to get heads ignoring the first coin-flip. But since X' has the same distribution as X, we get EX = p + (1-p) (1+EX) or EX = (p+(1-p))/p = 1/p. So a fair coin must be flipped twice on average to get a head, which is about what we'd expect if we hadn't thought about it much.

3.4.1.3. Rewarding success and punishing failure

Suppose I have my experimental test subjects complete a task that gets scored on a scale of 0 to 100. I decide to test whether rewarding success is a better strategy for improving outcomes than punishing failure. So for any subject that scores high than 50, I give them a chocolate bar. For any subject that scores lower than 50, I give them an electric shock. (Students who score exactly 50 get nothing.) I then have them each perform the task a second time and measure the average change in their scores. What happens?

Let's suppose that there is no effect whatsoever of my rewards and punishments, and that each test subject obtains each possible score with equal probability 1/101. Now let's calculate the average improvement for test subjects who initially score less than 50 or greater than 50. Call the outcome on the first test X and the outcome on the second test Y.

In the first case, we are computing E[Y-X | X < 50]. This is the same as E[Y | X < 50] - E[X | X<50] = E[Y] - E[X|X<50] = 50 - 24.5 = +25.5. So punishing failure produces a 25.5 point improvement on average.

In the second case, we are computing E[Y-X | X > 50]. This is the same as E[Y | X > 50] - E[X | X>50] = E[Y] - E[X|X>50] = 50 - 75.5 = -25.5. So rewarding success produces a 25.5 point decline on average.

Clearly this suggests that we punish failure if we want improvements and reward success if we want backsliding. This is intuitively correct: punishing failure encourages our slacker test subjects to do better next time, while rewarding success just makes them lazy and complacent. But since the test outcomes don't depend on anything we are doing, we get exactly the same answer if we reward failure and punish success: in the former case, a +25.5 point average change, in the later a -25.5 point average change. This is also intuitively correct: rewarding failure makes our subjects like the test so that they will try to do better next time, while punishing success makes them feel that it isn't worth it. From this we learn that our intuitions³ provide powerful tools for rationalizing almost any outcome in terms of the good or bad behavior of our test subjects. A more careful analysis shows that we performed the wrong comparison, and we are the victim of regression to the mean.

For a real-world example of how similar problems can arise in processing data, the United States Bureau of Labor Statistics defines a small business as any company with 500 or fewer employees. So if a company has 400 employees in 2007, 600 in 2008, and 400 in 2009, then we just saw a net creation of 200 new jobs by a small business in 2007, followed by the destruction of 200 jobs by a large business in 2008. This effect alone accounts for most of the observation that small businesses generate more new jobs than large ones.

3.4.2. Conditioning on a random variable

There is a more general notion of conditional expectation for random variables, where the conditioning is done on some other random variable Y. Unlike E[X|A], which is a constant, the expected value of X conditioned on Y, written E[X|Y], is itself a random variable: when Y = y, it takes on the value E[X|Y=y].

Example: Let us compute E[X+Y|X] where X and Y are the values of independent six-sided dice. When X = x, E[X+Y|X] = E[X+Y|X=x] = x+E[Y] = x + 7/2. For the full random variable we can write E[X+Y|X] = X + 7/2.

Another way to get the result in the preceding example is to use some general facts about conditional expectation:

E[aX+bY|Z] = aE[X|Z] + bE[Y|Z]. This is the conditional-expectation version of linearity of expectation.
E[X|X] = X. This is immediate from the definition, since E[X|X=x] = x.
If X and Y are independent, then E[Y|X] = E[Y]. The intuition is that knowing the value of X gives no information about Y, so E[Y|X=x] = E[Y] for any x in the range of X. (To do this formally requires using the fact that Pr[Y=y|X=x] = Pr[Y=y /\ X=x] / Pr[X=x] = Pr[Y=y]Pr[X=x]/Pr[X=x] = Pr[Y=y], provided X and Y are independent.)
Also useful: E[E[X|Y]] = E[X]. Averaging a second time removes all dependence on Y.

These in principle allow us to do very complicated calculations involving conditional expectation. For example:

Example: Let X and Y be the values of independent six-sided dice. What is E[X | X+Y]? Here we observe that X+Y = E[X+Y | X+Y] = E[X | X+Y] + E[Y | X+Y] = 2E[X | X+Y] by symmetry. So E[X | X+Y] = (X+Y)/2. This is pretty much what we'd expect: on average, half the total value is supplied by one of the dice. (It also works well for extreme cases like X+Y = 12 or X+Y = 2, giving a quick check on the formula.)
Example: What is E[(X+Y)² | X] when X and Y are independent? Compute E[(X+Y)² | X] = E[X² | X] + 2E[XY | X] + E[Y² | X] = X² + 2X E[Y] + E[Y²]. For example, if X and Y are independent six-sided dice we have E[(X+Y)² | X] = X² + 7X + 91/6, so if you are rolling the dice one at a time and the first one comes up 5, you can expect on average to get a squared total of 25 + 35 + 91/6 = 75 1/6. But if the first one comes up 1, you only get 1 + 7 + 91/6 = 23 1/6 on average.

3.5. Markov's inequality

Knowing the expectation of a random variable gives you some information about it, but different random variables may have the same expectation but very different behavior: consider, for example, the random variable X that is 0 with probability 1/2 and 1 with probability 1/2 and the random variable Y that is 1/2 with probability 1. In some cases we don't care about the average value of a variable so much as its likelihood of reaching some extreme value: for example, if my feet are encased in cement blocks at the beach, knowing that the average high tide is only 1 meter is not as important as knowing whether it ever gets above 2 meters. Markov's inequality lets us bound the probability of unusually high values of non-negative random variables as a function of their expectation. It says that, for any a > 0,

Pr[X > aEX] < 1/a.

This can be proved easily using conditional expectations. We have:

EX = E[X|X > aEX] Pr[X > aEX] + E[X|X ≤ aEX] Pr[X > aEX].

Since X is non-negative, E[X|X ≤ aEX] ≥ 0, so subtracting out the last term on the right-hand side can only make it smaller. This gives:

EX ≥ E[X|X > aEX] Pr[X > aEX] > aEX Pr[X > aEX]

and dividing both side by aEX gives the desired result.

Another version of Markov's inequality replaces > with ≥:

E[X ≥ aEX] ≤ 1/a.

The proof is essentially the same.

Example: Suppose that that all you know about high tide is that EX = 1 meter and X ≥ 0. What can we say about the probability that X > 2 meters? Using Markov's inequality, we get Pr[X > 2 meters] = Pr[X > 2EX] < 1/2.

There is, of course, a conditional version of Markov's inequality:

Pr[X > aE[X|A] | A] < 1/a.

This version doesn't get anywhere near as much use as the unconditioned version, but it may be worth remembering that it exists.

4. The variance of a random variable

Expectation tells you the average value of a random variable but it doesn't tell you how far from the average the random variable typically gets: the random variable X = 0 and Y = ±1,000,000,000,000 with equal probability both have expectation 0, though their distributions are very different. Though it is impossible to summarize everything about the spread of a distribution in a single number, a useful approximation for many purposes is the variance Var(X) of a random variable X, which is defined as the expected square of the deviation from the expectation, or E((X-EX)²).

Example: Let X = 0 or 1 with equal probability. Then EX = 1/2, and (X-EX)² is always 1/4. So Var(X) = 1/4.
Example: Let X be the value of a fair six-sided die. Then EX = 7/2, and E((E-EX)²) = (1/6)((1-7/2)² + (2-7/2)² + (3-7/2)² + ... + (6 - 7/2)²) = 35/12.

Computing variance directly from the definition can be tedious. Often it is easier to compute it from E(X²) and EX:

Var[X] = E[(X-EX)²] = E[X² - 2 X EX + (EX)²] = E[X²] - 2 EX EX + (EX)² = E[X²] = (EX)².

(The second-to-last step uses linearity of expectation and the fact that EX is a constant.)

Example: For X = 0 or 1 with equal probability, we have E[X²] = 1/2 and (EX)² = 1/4, so Var[X] = 1/4.
Example: Let's try the six-sided die again, except this time we'll use an n-sided die. We have

$\begin{eqnarray*} Var[X] &=& E[X^2] - (E X)^2 \\ &=& \frac{1}{n} \sum_{i=1}^{n} i^2 - \left(\frac{n+1}{2}\right)^2 \\ &=& \frac{1}{n} \cdot \frac{n(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} \\ &=& \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4}. \end{eqnarray*}$

When n = 6, this gives 7⋅13/6 - 49/4 = 35/12. (Ok, maybe it isn't always easier).

4.1. Multiplication by constants

Suppose we are asked to compute the variance of aX, where a is a constant. We have

Var[aX] = E[(aX)²] - E[aX]² = a²E[X²] - (aE[X])² = a² Var[X].

So, for example, if X = 0 or 2 with equal probability, Var[X] = 4⋅(1/4) = 1. This is exactly what we expect given that X-EX is always ±1.

Another consequence is that Var[-X] = (-1)²Var[X] = Var[X]. So variance is not affected by negation.

4.2. The variance of a sum

What is Var[X+Y]? Write

$\begin{eqnarray*} \mbox{Var}[X+Y] &=& E[(X+Y)^2] - \left(E[X+Y]\right)^2 \\ &=& E[X^2] + 2E[XY] + E[Y^2] - (EX)^2 - 2 EX \cdot EY - (EY)^2 \\ &=& (E[X^2] - (EX)^2) + (E[Y^2] - (EY)^2) + 2(E[XY] - EX \cdot EY) \\ &=& \mbox{Var}[X] + \mbox{Var}[Y] + 2(E[XY] - EX \cdot EY). \end{eqnarray*}$

The quantity E[XY] - EX EY is called the covariance of X and Y and is written Cov(X,Y). So we have just shown that

Var[X+Y] = Var[X] + Var[Y] + 2 Cov(X,Y).

When Cov(X,Y) = 0, or equivalently when E[XY] = EX EY, X and Y are said to be uncorrelated and their variances add. This occurs when X and Y are independent, but may also occur without X and Y being independent.

For larger sums the corresponding formula is

$\mbox{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mbox{Var}[X_i] + \sum_{i\neq j} \mbox{Cov}(X,Y).$

This simplifies to Var(∑X_i) = ∑Var(X_i) when the X_i are pairwise independent, so that each pair of distinct X_i and X_j are independent. (This is implied by independence but is not equivalent to it.)

Example: What is the variance of the number of heads in n independent fair coin-flips? Let X_i be the indicator variable for the event that the i-th flip comes up heads and let X be the sum of the X_i. We have already seen that Var[X_i] = 1/4, so Var[X] = n Var[X_i] = n/4.
Example: Let a be a constant. What is the variance of X+a? We have Var[X+a] = Var[X] + Var[a] = Var[X], since (1) E[aX] = aE[X] = E[a]E[X] means that a (considered as a random variable) and X are uncorrelated, and (2) Var[a] = E[a-Ea] = E[a-a] = 0. So shifting a random variable up or down doesn't change its variance.

4.3. Chebyshev's inequality

Variance is an expectation, so we can use Markov's inequality on it. The result is Chebyshev's inequality:

Pr[|X - EX| > r) < Var[X] / r².

Proof: The event |X - EX| > r is the same as the event (X-EX)² > r². By Markov's inequality, the probability that this occurs is at most E[(E-EX)²]/r² = Var[X]/r².

4.3.1. Application: showing that a random variable is close to its expectation

This is the usual statistical application.

Example: Flip a fair coin n times, and let X be the number of heads. What is the probability that |X-n/2| > r? Recall that Var[X] = n/4, so Pr[|X-n/2| > r] < (n/4)/r² = n/(4r²). So, for example, the chances of deviating from the average by more than 1000 after 1000000 coin-flips is less than 1/4.
Example: Out of n voters in Saskaloosa County, m plan to vote for Smith for County Dogcatcher. A polling firm samples k voters (with replacement) and asks them who they plan to vote for. Suppose that m < n/2; compute a bound on the probability that the polling firm incorrectly polls a majority for Smith.

Solution: Let X_i be the indicator variable for a Smith vote when the i-th voter is polled and let X = ∑X_i be the total number of pollees who say they will vote for Smith. Let p = EX_i = m/n. Then Var[X_i] = p - p², EX = kp, and Var[X] = k(p-p²). To get a majority in the poll, we need X > k/2 or X-EX > k/2 - kp. Using Chebyshev's inequality, this event occurs with probability at most

$\begin{eqnarray*} \frac{\mbox{Var}[X]}{(k/2-kp)^2} &=& \frac{k(p - p^2)}{(k/2-kp)^2} \\ &=& \frac{1}{k} \cdot \frac{p-p^2}{(1/2 - p)^2}. \end{eqnarray*}$

Note that the bound decreases as k grows and (for fixed p) does not depend on n.

In practice, statisticians will use a stronger result called the Central limit theorem, which describes the shape of the distribution of the sum of many independent random variables much more accurately than the bound from Chebyshev's inequality. Designers of randomized algorithms are more likely to use Chernoff bounds.

4.3.2. Application: lower bounds on random variables

Unlike Markov's inequality, which can only show that a random variable can't be too big too often, Chebyshev's inequality can be used to show that a random variable can't be too small, by showing first that its expectation is high and then that its variance is low. For example, suppose that each of the 10³⁰ oxygen molecules in the room is close enough to your mouth to inhale with pairwise independent probability 10^-4 (it's a big room). Then the expected number of oxygen molecules near your mouth is a healthy 10³⁰ * 10^-4 = 10²⁶. What is the probability that all 10²⁶ of them escape your grasp?

Let X_i be the indicator variable for the event that the i-th molecule is close enough to inhale. We've effectively already used the fact that E[X_i] = 10^-4. To use Chebyshev's inequality, we also need Var[X_i] = E[X_i²] - E[X_i]² = 10^-4 - 10^-8 ~= 10^-4. So the total variance is about 10³⁰*10^-4 = 10²⁶ and Chebyshev's inequality says we have Pr[|X - EX| ≥ 10²⁶] ≤ 10²⁶/(10²⁶)² = 10^-26. So death by failure of statistical mechanics is unlikely (and the real probability is much much smaller).

But wait! Even a mere 90% drop in O₂ levels is going to be enough to cause problems. What is the probability that this happens? Again we can calculate Pr[90% drop] ≤ Pr[|X-EX| ≥ 0.9*10²⁶] ≤ 10²⁶/(0.9*10²⁶)² ~= 1.23*10^-26. So even temporary asphyxiation by statistical mechanics is not something to worry about.

5. Probability generating functions

For a discrete random variable X taking on only values in ℕ, we can express its distribution using a probability generating function or pgf:

F(z) = ∑ Pr[X=n] zⁿ.

These are essentially standard-issue GeneratingFunctions with the additional requirement that all coefficients are non-negative and F(1) = 1.

A trivial example is the pgf for a Bernoulli random variable (1 with probability p, 0 with probability q=1-p). Here the pgf is just q+pz.

A more complicated example is the pgf for a geometric random variable. Now we have ∑ q^k p z^k = p ∑ (qz)^k = p/(1-qz).

5.1. Sums

A very useful property of pgf's is that the pgf of a sum of independent random variables is just the product of the pgf's of the individual random variables. The reason for this is essentially the same as for ordinary GeneratingFunctions: when we multiply together two terms (Pr[X=n] zⁿ)(Pr[Y=m] z^m), we get Pr[X=n ∧ Y=m] z^n+m, and the sum over all the different ways of decomposing n+m gives all the different ways to get this sum.

So, for example, the pgf of a binomial random variable equal to the sum of n independent Bernoulli random variables is (q+pz)ⁿ (hence the name "binomial").

5.2. Expectation and variance

One nice thing about pgf's is that the can be used to quickly compute expectation and variance. For expectation, we have

F'(z) = ∑ n Pr[X=n] z^n-1.

F'(1) = ∑ n Pr[X=n] = E[X].

If we take the second derivative, we get

F''(z) = ∑ n(n-1) Pr[X=n] z^n-1

F''(1) = ∑ n(n-1) Pr[X=n] = E[X(X-1)] = E[X²] - E[X].

So we can recover E[X²] as F''(1) + F'(1) and get Var[X] as F''(1) + F'(1) - (F'(1))².

Example: If X is a Bernoulli random variable with pgf F = (q+pz), then F' = p and F'' = 0, giving E[X] = F'(1) = p and Var[X] = F''(1) + F'(1) - (F'(1))² = 0 + p - p² = p(1-p) = pq.
Example: If X is a binomial random variable with pgf F = (q+pz)ⁿ, then F' = n(q+pz)^n-1p and F'' = n(n-1)(q+pz)^n-2p², giving E[X] = F'(1) = np and Var[X] = F''(1) + F'(1) - (F'(1))² = n(n-1)p² + np - n²p² = np - np² = npq. (These values would, of course, be a lot faster to compute using the formulas for sums of independent random variables, but it's nice to see that they work.)
Example: If X is a geometric random variable with pgf p/(1-qz), then F' = pq/(1-qz)² and F'' = 2pq²/(1-qz)³. E[X] = F'(1) = pq/(1-q)² = pq/p² = q/p. Var[X] = F''(1) + F'(1) - (F'(1))² = 2pq²/(1-q)³ + q/p - q²/p² = 2q²/p² + q/p - q²/p² = q²/p² + q/p. This would probably be a pain to calculate by hand.
Example: Let X be a Poisson random variable with rate λ. We claimed earlier that a Poisson random variable is the limit of a sequence of binomial random variables where p = λ/n and n goes to infinity, so (cheating quite a bit) we expect that X's pgf F = lim[n⇒∞] ((1-λ/n) + (λ/n)z)ⁿ = (1+(-λ+λz)/n)ⁿ = exp(-λ+λz) = exp(-λ) ∑ λⁿzⁿ/n!. We can check that the total probability F(1) = exp(-λ+λ) = exp(0) = 1, that the expectation F'(1) = λ exp(-λ+λ) = λ, and that the variance F''(1) + F'(1) - (F'(1))² = λ²exp(-λ+λ) + λ - λ² = λ. These last two quantities are what we'd expect if we calculated the expectation and the variance directly as the limit of taking n Bernoulli random variables with expectation λ/n and variance (λ/n)(1-λ/n) each.

6. Summary: effects of operations on expectation and variance of random variables

Operation	Effect on expectation	Effect on variance
Multiplication by a constant	E[aX] = aE[X]	Var[aX] = a² Var[X]
Addition	E[X+Y] = E[X] + E[Y]. Does not require independence.	Var[X+Y] = Var[X] + Var[Y] + 2 Cov(X,Y). Note that Cov(X,Y)=0 if X and Y are independent.
Product	E[XY] = E[X]E[Y] + Cov(X,Y). (Or just E[X]E[Y] if X and Y are independent.)	(No simple formula.)

7. The general case

So far we have only considered discrete random variables, which avoids a lot of nasty technical issues. In general, a random variable on a probability space (Ω,F,P) is a function whose domain is Ω, which satisfies some extra conditions on its values that make interesting events involving the random variable elements of F. Typically the codomain will be the reals or the integers, although any set is possible. Random variables are generally written as capital letters with their arguments suppressed: rather than writing X(ω), where ω∈Ω, we write just X.

A technical condition on random variables is that the inverse image of any measurable subset of the codomain must be in F—in simple terms, if you can't nail down ω exactly, being able to tell which element of F you land in should be enough to determine the value of X(ω). For a discrete random variables, this just means that X^-1(x)∈F for each possible value x. For real-valued random variables the requirement is that the event [X ≤ x] is in F for any fixed x. In each case we say that X is measurable with respect to F (or just "measurable F").⁴ Usually we will not worry about this issue too much, but it may come up if we are varying F to represent different amounts of information available to different observers (e.g., if X and Y are the values of two dice, X is measurable to somebody who can see both dice but not to somebody who can only see the sum of the dice).

The distribution function of a real-valued random variable describes the probability that it takes on each of its possible values; it is specified by giving a function F(x) = Pr[X ≤ x]. The reason for using Pr[X ≤ x] instead of Pr[X = x] is that it allows specifying continuous random variables such as a random variable that is uniform in the range [0,1]; this random variable has a distribution function given by F(x) = x when 0 ≤ x ≤ 1, F(x) = 0 for x < 0, and F(x) = 1 for x > 1. For discrete random variables the distribution function will have discontinuous jumps at each possible value of the variable; for example, the distribution function of a variable X that is 0 or 1 with equal probability is F(x) = 0 for x < 0, 1/2 for 0 ≤ x < 1, and 1 for x ≥ 1.

Knowing the distribution of a random variable tells you what that variable might do by itself, but doesn't tell you how it interacts with other random variables: for example, if X is 0 or 1 with equal probability then X and 1-X both have the same distribution, but they are connected in a way that is not true for X and some independent variable Y with the same distribution. For multiple variables, a joint distribution gives the probability that each variable takes on a particular value; for example, if X and Y are two independent uniform samples from the range [0,1], their distribution function F(x,y) = Pr[X ≤ x ∧ Y ≤ y] = xy (when 0 ≤ x,y ≤ 1). If instead Y = 1-X, we get the distribution function F(x,y) = Pr[X ≤ x ∧ Y ≤ y] equal to x when y ≥ 1-x and 0 when y < 1-x (assuming 0 ≤ x,y ≤ 1).

We've seen that for discrete random variables, it is more useful to look at the function f(x) = Pr[X=x]. We will often treat such a function as specifying the distribution of X even if it is not technically a distribution function.

7.1. Densities

If a real-valued random variable is continuous in the sense of having a distribution function with no jumps (which means that it has probability 0 of landing on any particular value), we may be able to describe its distribution by giving a density instead. The density is the derivative of the distribution function. We can also think of it as a probability at each point defined in the limit, by taking smaller and smaller regions around the point and dividing the probability of landing in the region by the size of the region.

For example, the density of a uniform [0,1] random variable is f(x) = 1 for x∈[0,1], f(x) = 0 otherwise. For a uniform [0,2] random variable, we get a density of ½ throughout the [0,2] interval. The density always integrates to 1.

Some distributions are easier to describe using densities than using distribution functions. The normal distribution, which is of central importance in statistics, has density

$\frac{1}{\sqrt{2 \pi}} e^{-x^2/2}.$

Its distribution function is the integral of this quantity, which has no closed-form expression.

Joint densities also exist; the joint density of a pair of random variables with joint distribution function F(x,y) is given by the partial derivative f(x,y) = ∂²/∂x∂y F(x,y). The intuition here again is that we are approximating the (zero) probability at a point by taking the probability of a small region around the point and dividing by the area of the region.

7.2. Independence

Independence is the same as for discrete random variables: Two random variables X and Y are independent if any pair of events of the form X∈A, Y∈B are independent. For real-valued random variables it is enough to show that their joint distribution F(x,y) is equal to the product of their individual distributions F_X(x)F_Y(y). For real-valued random variables with densities, showing the densities multiply also works. Both methods generalize in the obvious way to sets of three or more random variables.

7.3. Expectation

If a continuous random variable has a density f(x), the formula is

$\E[X] = \int x f(x) dx.$

For continuous random variables without densities we land in a rather swampy end of integration theory. We will not talk about this case if we can help it. But in each case the expectation depends only on the distribution of X and not on its relationship to other random variables.

Example (continuous variable with density): Let X be a uniform random variable in the range [a,b]. Then E[X] = ∫[a,b] x 1/(b-a) dx = (1/(b-a))(x²/2)|_x=a..b = (1/(b-a))(b²/2 - a²/2) = (b+a)/2.

CategoryMathNotes

Technically, the function f(x) = Pr[X=x] is the mass function for X while the distribution function is F(x) = Pr[X≤x], but we can easily convert from one to the other. (1)
Technically, this will work for any values we can add and multiply by probabilities. So if X is actually a vector in ℝ³ (for example), we can talk about the expectation of X, which in some sense will be the average position of the location given by X. (2)
OK, my intuitions. (3)
The detail we are sweeping under the rug here is what makes a subset of the codomain measurable. The essential idea is that we also have a σ-algebra F' on the codomain, and elements of this codomain σ-algebra are the measurable subsets. The rules for simple random variables and real-valued random variables come from default choices of σ-algebra. (4)