Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

# 1. History and interpretation

• Gambling: I throw two six-sided dice, what are my chances of seeing a 7?
• Insurance: I insure a typical resident of Smurfington-Upon-Tyne against premature baldness. How likely is it that I have to pay a claim?

Answers to these questions are summarized by a probability, a number in the range 0 to 1 that represents the likelihood that some event occurs. There are two dominant interpretations of this likelihood:

Frequentist interpretation
If an event occurs with probability p, then in the limit as I accumulate many examples of similar events, I will see the number of occurrences divided by the number of samples converging to p. For example, if I flip a fair coin over and over again many times, I expect that heads will come up roughly half of the times I flip it, because the probability of coming up heads is 1/2.
Bayesian interpretation
When I say that an event occurs with probability p, that means my subjective beliefs about the event would lead me to take a bet that would be profitable on average if this were the real probability. So a Bayesian would take a double-or-nothing bet on a coin coming up heads if they believed that the probability it came up heads was at least 1/2.

Frequentists and Bayesians have historically spent a lot of time arguing with each other over which interpretation makes sense. The usual argument against frequentist probability is that it only works for repeatable experiments, and doesn't allow for statements like "the probability that it will rain tomorrow is 50%" or the even more problematic "based on what I know, there is a 50% probability that it rained yesterday". The usual argument against Bayesian probability is that it's hopelessly subjective—it's possible (even likely) that my subjective guesses about the probability that it will rain tomorrow are not the same as yours. As mathematicians, we can ignore such arguments, and treat probability axiomatically as just another form of counting, where we normalize everything so that we always end up counting to exactly 1.

(Note: This caricature of the debate over interpreting probability is thoroughly incomplete. For a thoroughly complete discussion, see Probability, Interpretations Of at the Stanford Encyclopedia of Philosophy.)

# 2. Probability axioms

Coming up with axioms for probabilities that work in all the cases we want to consider took much longer than anybody expected, and the set in common use today goes back only to the 1930s. Before presenting these, let's talk a bit about the basic ideas of probability.

An event A is something that might happen, or might not; it acts like a predicate over possible outcomes. The probability Pr[A] of an event A is a real number in the range 0 to 1, that must satisfy certain consistency rules like Pr[¬A] = 1-Pr[A].

In discrete probability, there is a finite set of atoms, each with an assigned probability, and every event is a union of atoms. The probability assigned to an event is the sum of the probabilities assigned to the atoms it contains. For example, we could consider rolling two six-sided dice. The atoms are the pairs (i,j) that give the value on the first and second die, and we assign a probability of 1/36 to each pair. The probability that we roll a 7 is the sum of the cases (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1), or 6/36 = 1/6.
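This enumeration is easy to check mechanically. A small sketch in Python, using exact fractions to avoid floating-point noise:

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely atoms (i, j) for two fair six-sided dice.
atoms = list(product(range(1, 7), repeat=2))
p_atom = Fraction(1, len(atoms))  # each atom gets probability 1/36

# The event "the total is 7" is the union of the atoms that sum to 7;
# its probability is the sum of the probabilities of those atoms.
p_seven = sum(p_atom for (i, j) in atoms if i + j == 7)
print(p_seven)  # 1/6
```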

Discrete probability doesn't work if we have infinitely many atoms. Suppose we roll a pair of dice infinitely many times (e.g., because we want to know the probability that we never accumulate more 6's than 7's in this infinite sequence). Now there are infinitely many possible outcomes: all the sequences of pairs (i,j). If we make all these outcomes equally likely, we have to assign each a probability of zero. But then how do we get back to a probability of 1/6 that the first roll comes up 7?

## 2.1. The Kolmogorov axioms

A triple (S,F,P) is a probability space if S is a set of outcomes (where each outcome specifies everything that ever happens, in complete detail); F is a sigma-algebra, which is a family of subsets of S, called measurable sets, that is closed under complement (i.e., if A is in F then S-A is in F) and countable union (union of A1, A2, ... is in F if each set is); and P is a probability measure that assigns a number in [0,1] to each set in F. The measure P must satisfy three axioms:

1. P(S) = 1.
2. P(Ø) = 0.
3. For any sequence of pairwise disjoint events A1, A2, A3, ..., P(∪Ai) = ∑P(Ai).

From these one can derive rules like P(S-A) = 1-P(A) etc.

Most of the time, S is finite, and we can just make F include all subsets of S, and define Pr[A] to be the sum of Pr[{x}] over all x in A; this gets us back to the discrete probability model we had before.

## 2.2. Examples of probability spaces

• S = {H,T}, F = ℘S, Pr[A] = |A|/2. This represents a fair coin with two outcomes H and T that each occur with probability 1/2.
• S = {H,T}, F = ℘S, Pr[{H}] = p, Pr[{T}] = 1-p. This represents a biased coin, where H comes up with probability p.
• S = { (i,j): i,j∈{1,2,3,4,5,6} }, F = ℘S, Pr[A] = |A|/36. Roll of two fair dice. A typical event might be "the total roll is 4", which is the set { (1,3), (2,2), (3,1) } with probability 3/36 = 1/12.
• S = ℕ, F = ℘S, Pr[A] = ∑_{n∈A} 2^-(n+1). This is an infinite probability space; a real-world process that might generate it is to flip a fair coin repeatedly and count how many times it comes up tails before the first time it comes up heads. Note that even though it is infinite, we can still define all probabilities by summing over atoms: Pr[0] = 1/2, Pr[1] = 1/4, Pr[2] = 1/8, etc.
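The last space can be checked numerically; a quick Python sketch (the helper `p` is just illustrative):

```python
from fractions import Fraction

# Atom n (number of tails before the first head) has probability 2^-(n+1).
def p(n):
    return Fraction(1, 2 ** (n + 1))

print(p(0), p(1), p(2))  # 1/2 1/4 1/8

# The atom probabilities sum to 1 in the limit:
# the first N terms sum to 1 - 2^-N.
N = 20
partial = sum(p(n) for n in range(N))
print(1 - partial)  # 1/1048576, i.e. 2^-20
```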

It's unusual for anybody doing probability to actually write out the details of the probability space like this. Much more often, a writer will just assert the probabilities of a few basic events (e.g. Pr[H] = 1/2), and claim that any other probability that can be deduced from these initial probabilities using the axioms also holds (e.g. Pr[T] = 1-Pr[H] = 1/2). The main reason Kolmogorov gets his name attached to the axioms is that he was responsible for Kolmogorov's existence theorem, which says (speaking very informally) that as long as your initial assertions are consistent, there exists a probability space that makes them and all their consequences true.

# 3. Probability as counting

The easiest probability space to work with is a uniform discrete probability space, which has N outcomes each of which occurs with probability 1/N. If someone announces that some quantity is "random" without specifying probabilities (especially if that someone is a computer scientist), the odds are that what they mean is that each possible value of the quantity is equally likely. If that someone is being more careful, they would say that the quantity is "drawn uniformly at random" from a particular set.

Such spaces are among the oldest studied in probability, and go back to the very early days of probability theory, when randomness was almost always expressed in terms of pulling tokens out of well-mixed urns: such urn models were one of the few situations where everybody agreed on what the probabilities should be.

## 3.1. Examples

• A random bit has two outcomes, 0 and 1, and each occurs with probability 1/2.

• A die roll has six outcomes, 1 through 6, and each occurs with probability 1/6.

• A roll of two dice has 36 outcomes (order of the dice matters); each occurs with probability 1/36.
• A random bit string of length n has 2^n outcomes; each occurs with probability 2^-n. The probability that exactly one bit is a 1 is obtained by counting all strings with a single 1 and dividing by 2^n, giving n·2^-n.

• A poker hand consists of a subset of 5 cards drawn uniformly at random from a deck of 52 cards. Depending on whether the order of the 5 cards is considered important (usually it isn't), there are either (52 choose 5) or (52)_5 = 52·51·50·49·48 possible hands. The probability of getting a flush (all 5 cards in the hand drawn from the same suit of 13 cards) is 4·(13 choose 5)/(52 choose 5); there are 4 choices of suits, and (13 choose 5) ways to draw 5 cards from each suit.

• A random permutation on n items has n! outcomes, one for each possible permutation. A typical event might be that the first element of a random permutation of 1..n is 1; this occurs with probability (n-1)!/n! = 1/n. Another example of a random permutation might be a uniform shuffling of a 52-card deck (difficult to achieve in practice!): here the probability that we get a particular set of 5 cards as the first 5 in the deck is obtained by counting all the permutations that have those 5 cards in the first 5 positions (there are 5!·47! of them) and dividing by 52!, giving 5!·47!/52! = 1/(52 choose 5). This is the same answer we got in the uniform poker-hand model.
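The poker-hand and shuffling calculations above can be verified with exact arithmetic; a short Python sketch:

```python
from fractions import Fraction
from math import comb, factorial

# Probability of a flush: 4 suits, C(13,5) hands per suit, C(52,5) hands total.
p_flush = Fraction(4 * comb(13, 5), comb(52, 5))
print(p_flush)  # 33/16660, about 0.00198

# Cross-check the permutation argument: the chance that a fixed 5-card set
# occupies the first 5 positions of a shuffled deck is 5!*47!/52! = 1/C(52,5).
assert Fraction(factorial(5) * factorial(47), factorial(52)) == Fraction(1, comb(52, 5))
```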

# 4. Independence and the intersection of two events

Events A and B are independent if Pr[A∩B] = Pr[A]·Pr[B]. In general, a group of events {Ai} is independent if each Ai is independent of any event defined only in terms of the other events.

It can be dangerous to assume that events are independent when they aren't, but quite often when describing a probability space we will explicitly state that certain events are independent. For example, one typically describes the space of random n-bit strings (or n coin flips) by saying that one has n independent random bits and then deriving that each particular sequence occurs with probability 2^-n, rather than starting with each sequence occurring with probability 2^-n and then calculating that each particular bit is 1 with independent probability 1/2; the first description makes much more of the structure of the probability space explicit, and so is more directly useful in calculation.
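To see the first style of description in action, here is a small Python sketch that builds the space of n independent fair bits explicitly and checks an independence claim (the helper `pr` is illustrative):

```python
from fractions import Fraction
from itertools import product

# Build the space of n independent fair bits: each string gets the product
# of its per-bit probabilities, which here is 2^-n for every string.
n = 3
space = {s: Fraction(1, 2 ** n) for s in product((0, 1), repeat=n)}

def pr(event):
    # Probability of an event given as a predicate over outcomes.
    return sum(p for s, p in space.items() if event(s))

# Check independence of "bit 0 is 1" and "bit 1 is 1":
a = lambda s: s[0] == 1
b = lambda s: s[1] == 1
print(pr(lambda s: a(s) and b(s)) == pr(a) * pr(b))  # True
```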

## 4.1. Examples

• What is the probability of getting two heads on independent coin flips? Pr[H1∩H2] = (1/2)(1/2) = 1/4.

• Suppose the coin-flips are not independent (maybe the two coins are glued together). What is the probability of getting two heads? This can range anywhere from zero (coin 2 always comes up the opposite of coin 1) to 1/2 (if coin 1 comes up heads, so does coin 2).

• What is the probability that both you and I draw a flush (all 5 cards the same suit) from the same poker deck? Since we are fighting over the same collection of same-suit subsets, we'd expect Pr[A∩B] < Pr[A] Pr[B]—the event that you get a flush (A) is not independent of the event that I get a flush (B), and we'd have to calculate the probability of both by counting all ways to draw two hands that are both flushes. But if we put your cards back and then shuffle the deck again, the events in this new case are independent, and we can just square the Pr[flush] we calculated before.

• Suppose the Red Sox play the Yankees. What is the probability that the final score is exactly 4-4? Amazingly, it appears that it is equal to Pr[Red Sox score 4 runs against the Yankees]*Pr[Yankees score 4 runs against the Red Sox] (see http://arXiv.org/abs/math/0509698 for a paper discussing this issue)—to the extent we can measure the underlying probability distribution, the score of each team in a professional baseball game appears to be independent of the score of the other team.

# 5. Union of two events

What is the probability of A∪B? If A and B are disjoint, then the axioms give Pr[A∪B] = Pr[A] + Pr[B]. But what if A and B are not disjoint?

By analogy to inclusion-exclusion in counting we would expect that

• Pr[A∪B] = Pr[A] + Pr[B] - Pr[A∩B].

Intuitively, when we sum the probabilities of A and B we double-count the event that both occur, and must subtract it off to compensate. To prove this formally, consider the events A∩B, A∩¬B, and ¬A∩B. These are disjoint, so the probability of the union of any subset of this set of events is equal to the sum of its components. So in particular we have

• Pr[A] + Pr[B] - Pr[A∩B] = (Pr[A∩B] + Pr[A∩¬B]) + (Pr[A∩B] + Pr[¬A∩B]) - Pr[A∩B] = Pr[A∩B] + Pr[A∩¬B] + Pr[¬A∩B] = Pr[A∪B].
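The identity is easy to sanity-check by enumeration on the two-dice space; a quick Python sketch with two overlapping events:

```python
from fractions import Fraction
from itertools import product

# Check inclusion-exclusion on the two-dice space with overlapping events:
# A = "first die shows an even number", B = "the total is 7".
atoms = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

def pr(event):
    return sum(p for x in atoms if event(x))

A = lambda x: x[0] % 2 == 0
B = lambda x: x[0] + x[1] == 7
lhs = pr(lambda x: A(x) or B(x))
rhs = pr(A) + pr(B) - pr(lambda x: A(x) and B(x))
print(lhs == rhs)  # True
```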

## 5.1. Examples

• What is the probability of getting at least one head out of two independent coin-flips? Pr[H1∪H2] = 1/2 + 1/2 - (1/2)(1/2) = 3/4.

• What is the probability of getting at least one head out of two coin-flips, when the coin-flips are not independent? If each coin is still fair on its own, we can get any probability from 1/2 to 1, depending on what Pr[H1∩H2] is: Pr[H1∪H2] = 1/2 + 1/2 - Pr[H1∩H2], and the glued-coins example shows that Pr[H1∩H2] can be anywhere from 0 to 1/2.

# 6. Conditional probability

Suppose I want to answer the question "What is the probability that my dice add up to 6 if I know that the first one is an odd number?" This question involves conditional probability, where we calculate a probability subject to some conditions. The probability of an event A conditioned on an event B, written Pr[A|B], is defined by the formula

Pr[A|B] = Pr[A∩B] / Pr[B].

One way to think about this is that when we assert that B occurs we are in effect replacing the entire probability space with just the part that sits in B. So we have to divide all of our probabilities by Pr[B] in order to make Pr[B|B] = 1, and we have to replace A with A∩B to exclude the part of A that can't happen any more.

Note also that conditioning on B only makes sense if Pr[B] > 0. If Pr[B] = 0, Pr[A|B] is undefined.
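The question that opened this section can now be answered by enumerating the two-dice space; a quick Python sketch (the helper `pr` is illustrative):

```python
from fractions import Fraction
from itertools import product

# Compute Pr[total is 6 | first die is odd] directly from the definition.
atoms = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

def pr(event):
    return sum(p for x in atoms if event(x))

b = pr(lambda x: x[0] % 2 == 1)                        # Pr[B] = 1/2
ab = pr(lambda x: x[0] % 2 == 1 and x[0] + x[1] == 6)  # Pr[A∩B] = 3/36
print(ab / b)  # 1/6
```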

## 6.1. Conditional probabilities and intersections of non-independent events

Simple algebraic manipulation gives

Pr[A∩B] = Pr[A|B] Pr[B].

So one of the ways to compute the probability of two events occurring is to compute the probability of one of them, and then multiply by the probability that the second occurs conditioned on the first. For example, if my attempt to reach the summit of Mount Everest requires that I first learn how to climb mountains (Pr[B] = 10%) and then make it to the top safely (Pr[A|B] = 90%), then my chances of getting to the top are Pr[A∩B] = Pr[A|B] Pr[B] = (90%)(10%) = 9%.

We can do this for sequences of events as well. Suppose that I have an urn that starts with k black balls and 1 red ball. In each of n trials I draw one ball uniformly at random from the urn. If it is red, I give up. If it is black, I put the ball back and add another black ball, thus increasing the number of balls by 1. What is the probability that on every trial I get a black ball?

Let Ai be the event that I get a black ball on the first i trials. Then Pr[A0] = 1, and for larger i we have Pr[Ai] = Pr[Ai|Ai-1] Pr[Ai-1]. If Ai-1 holds, then at the time of the i-th trial we have k+i total balls in the urn, of which one is red. So the probability that we draw a black ball is 1 - 1/(k+i) = (k+i-1)/(k+i). By induction we can then show that

Pr[Ai] = k/(k+1) × (k+1)/(k+2) × ⋯ × (k+i-1)/(k+i).

This is an example of a collapsing product, where the denominator of each fraction cancels out the numerator of the next; we are left only with the denominator k+i of the last term and the numerator k of the first, giving Pr[Ai] = k/(k+i). It follows that we make it through all n trials with probability Pr[An] = k/(k+n).
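The collapsing product is easy to verify for concrete values; a small Python sketch (the function name is illustrative):

```python
from fractions import Fraction

# Check the collapsing product: Pr[A_n] = prod_{i=1}^{n} (k+i-1)/(k+i) = k/(k+n).
def p_all_black(k, n):
    p = Fraction(1)
    for i in range(1, n + 1):
        p *= Fraction(k + i - 1, k + i)
    return p

k, n = 3, 10
print(p_all_black(k, n), Fraction(k, k + n))  # both are 3/13
```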

## 6.2. The law of total probability

We can use the fact that A is the disjoint union of A∩B and A∩¬B to get Pr[A] by case analysis:

Pr[A] = Pr[A∩B] + Pr[A∩¬B] = Pr[A|B] Pr[B] + Pr[A|¬B] Pr[¬B].

So if, for example, there is a 20% chance that I can make it to the top of Mt Everest safely without learning how to climb first, my chances of getting there go up to (90%)(10%) + (20%)(90%) = 27%.

This method is sometimes given the rather grandiose name of the law of total probability. The most general version is that if B1, ..., Bn are disjoint events whose union contains A (for example, if they partition S), then

Pr[A] = ∑i Pr[A∩Bi] = ∑i Pr[A|Bi] Pr[Bi].
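The Everest example, redone as an explicit total-probability calculation in Python (exact fractions, variable names illustrative):

```python
from fractions import Fraction

# B = "I learn to climb", with Pr[B] = 10%;
# Pr[A|B] = 90% (summit after training), Pr[A|not B] = 20% (summit without).
p_b = Fraction(1, 10)
p_a_given_b = Fraction(9, 10)
p_a_given_not_b = Fraction(2, 10)

# Law of total probability over the two cases B and not-B.
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
print(p_a)  # 27/100
```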

## 6.3. Bayes's formula

If one knows Pr[A|B], Pr[A|¬B], and Pr[B], it's possible to compute Pr[B|A]:

Pr[B|A] = Pr[A∩B] / Pr[A] = Pr[A|B] Pr[B] / (Pr[A|B] Pr[B] + Pr[A|¬B] Pr[¬B]).

This formula is used heavily in statistics, where it goes by the name of Bayes's formula. Say that you have an Airport Terrorist Detector that lights up 75% of the time when inserted into the nostrils of a Terrorist, but lights up 0.1% of the time when inserted into the nostrils of a non-Terrorist. Suppose that for other reasons you know that Granny has only a 0.01% chance of being a Terrorist. What is the probability that Granny is a Terrorist if the detector lights up?

Let B be the event "Granny is a terrorist" and A the event "Detector lights up". Then Pr[B|A] = (75% × 0.01%)/(75% × 0.01% + 0.1% × 99.99%) ≈ 0.0698, or about a 7% chance. This example shows how even a small false-positive rate can make it difficult to interpret the results of tests for rare conditions.
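The same computation in Python, with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

# Bayes's formula on the detector example:
# B = "Granny is a Terrorist", A = "detector lights up".
p_b = Fraction(1, 10000)             # Pr[B] = 0.01%
p_a_given_b = Fraction(75, 100)      # Pr[A|B] = 75%
p_a_given_not_b = Fraction(1, 1000)  # Pr[A|¬B] = 0.1%

num = p_a_given_b * p_b
p_b_given_a = num / (num + p_a_given_not_b * (1 - p_b))
print(float(p_b_given_a))  # about 0.0698
```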

# 7. Random variables

A random variable X is a function from a probability space to some codomain; for more details see RandomVariables.

2014-06-17 11:58