
Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

1. History and interpretation

Questions about how likely something is to happen are answered by a probability, a number in the range 0 to 1 that represents the likelihood that some event occurs. There are two dominant interpretations of this likelihood:

Frequentist interpretation
If an event occurs with probability p, then in the limit as I accumulate many examples of similar events, I will see the number of occurrences divided by the number of samples converging to p. For example, if I flip a fair coin over and over again many times, I expect that heads will come up roughly half of the times I flip it, because the probability of coming up heads is 1/2.
Bayesian interpretation
When I say that an event occurs with probability p, that means my subjective beliefs about the event would lead me to take a bet that would be profitable on average if this were the real probability. So a Bayesian would take a double-or-nothing bet on a coin coming up heads if they believed that the probability it came up heads was at least 1/2.
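The frequentist reading above can be illustrated with a small simulation sketch (the flip count and random seed are my own choices, not from the notes): flip a fair coin many times and watch the fraction of heads approach the probability 1/2.

```python
import random

# Simulate many flips of a fair coin; under the frequentist
# interpretation the empirical frequency of heads should converge
# to the probability 1/2 as the number of flips grows.
random.seed(1)
flips = [random.random() < 0.5 for _ in range(100_000)]
freq = sum(flips) / len(flips)
# freq lands close to 0.5 for this many flips
```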

Frequentists and Bayesians have historically spent a lot of time arguing with each other over which interpretation makes sense. The usual argument against frequentist probability is that it only works for repeatable experiments, and doesn't allow for statements like "the probability that it will rain tomorrow is 50%" or the even more problematic "based on what I know, there is a 50% probability that it rained yesterday". The usual argument against Bayesian probability is that it's hopelessly subjective—it's possible (even likely) that my subjective guesses about the probability that it will rain tomorrow are not the same as yours. As mathematicians, we can ignore such arguments, and treat probability axiomatically as just another form of counting, where we normalize everything so that we always end up counting to exactly 1.

(Note: This caricature of the debate over interpreting probability is thoroughly incomplete. For a thoroughly complete discussion, see Probability, Interpretations Of at the Stanford Encyclopedia of Philosophy.)

2. Probability axioms

Coming up with axioms for probability that work in all the cases we want to consider took much longer than anybody expected, and the set in common use today only goes back to the 1930s. Before presenting these, let's talk a bit about the basic ideas of probability.

An event A is something that might happen, or might not; it acts like a predicate over possible outcomes. The probability Pr[A] of an event A is a real number in the range 0 to 1, that must satisfy certain consistency rules like Pr[¬A] = 1-Pr[A].

In discrete probability, there is a finite set of atoms, each with an assigned probability, and every event is a union of atoms. The probability assigned to an event is the sum of the probabilities assigned to the atoms it contains. For example, we could consider rolling two six-sided dice. The atoms are the pairs (i,j) that give the value on the first and second die, and we assign a probability of 1/36 to each pair. The probability that we roll a 7 is the sum of the cases (1,6), (2,5), (3,4), (4,3), (5,2), and (6,1), or 6/36 = 1/6.
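The two-dice example can be written out directly as a discrete probability space, a short sketch where events are predicates over atoms:

```python
from fractions import Fraction

# Discrete probability space for two six-sided dice: each of the
# 36 atoms (i, j) gets probability 1/36, and an event's probability
# is the sum over the atoms it contains.
atoms = {(i, j): Fraction(1, 36) for i in range(1, 7) for j in range(1, 7)}

def pr(event):
    """Probability of an event, given as a predicate over atoms."""
    return sum(p for atom, p in atoms.items() if event(atom))

prob_seven = pr(lambda a: a[0] + a[1] == 7)
# six atoms (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) give 6/36 = 1/6
```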

Discrete probability doesn't work if we have infinitely many atoms. Suppose we roll a pair of dice infinitely many times (e.g., because we want to know the probability that we never accumulate more 6's than 7's in this infinite sequence). Now there are infinitely many possible outcomes: all the sequences of pairs (i,j). If we make all these outcomes equally likely, we have to assign each a probability of zero. But then how do we get back to a probability of 1/6 that the first roll comes up 7?

2.1. The Kolmogorov axioms

A triple (S,F,P) is a probability space if S is a set of outcomes (where each outcome specifies everything that ever happens, in complete detail); F is a sigma-algebra, a family of subsets of S called measurable sets that is closed under complement (if A is in F, then S-A is in F) and countable union (if each of A1, A2, ... is in F, then so is their union); and P is a probability measure that assigns a number in [0,1] to each set in F. The measure P must satisfy three axioms:

  1. P(S) = 1.
  2. P(Ø) = 0.
  3. For any sequence of pairwise disjoint events A1, A2, A3, ..., P(∪Ai) = ∑P(Ai).

From these one can derive rules like P(S-A) = 1-P(A) etc.

Most of the time, S is finite, and we can just make F include all subsets of S, and define Pr[A] to be the sum of Pr[{x}] over all x in A; this gets us back to the discrete probability model we had before.

2.2. Examples of probability spaces

It's unusual for anybody doing probability to actually write out the details of the probability space like this. Much more often, a writer will just assert the probabilities of a few basic events (e.g. Pr[H] = 1/2), and claim that any other probability that can be deduced from these initial probabilities from the axioms also holds (e.g. Pr[T] = 1-Pr[H] = 1/2). The main reason Kolmogorov gets his name attached to the axioms is that he was responsible for the Kolmogorov extension theorem (also known as Kolmogorov's existence theorem), which says (speaking very informally) that as long as your initial assertions are consistent, there exists a probability space that makes them and all their consequences true.

3. Probability as counting

The easiest probability space to work with is a uniform discrete probability space, which has N outcomes each of which occurs with probability 1/N. If someone announces that some quantity is "random" without specifying probabilities (especially if that someone is a computer scientist), the odds are that what they mean is that each possible value of the quantity is equally likely. If that someone is being more careful, they would say that the quantity is "drawn uniformly at random" from a particular set.

Such spaces are among the oldest studied in probability, and go back to the very early days of probability theory, when randomness was almost always expressed in terms of pulling tokens out of well-mixed urns; such urn models were one of the few situations where everybody agreed on what the probabilities should be.

3.1. Examples
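As a small illustration of probability as counting (my own example, not from the notes): in a uniform discrete space over N outcomes, Pr[A] = |A|/N, so computing a probability is just counting favorable outcomes.

```python
from fractions import Fraction

# Uniform space over the 36 outcomes of two fair dice; the probability
# of rolling doubles is the number of doubles divided by the number of
# outcomes.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
doubles = [(i, j) for (i, j) in outcomes if i == j]
pr_doubles = Fraction(len(doubles), len(outcomes))
# 6 doubles out of 36 outcomes gives 1/6
```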

4. Independence and the intersection of two events

Events A and B are independent if Pr[A∩B] = Pr[A]·Pr[B]. In general, a group of events {Ai} is independent if each Ai is independent of any event defined only in terms of the other events.

It can be dangerous to assume that events are independent when they aren't, but quite often when describing a probability space we will explicitly state that certain events are independent. For example, one typically describes the space of random n-bit strings (or n coin flips) by saying that one has n independent random bits and then deriving that each particular sequence occurs with probability 2^-n, rather than starting with each sequence occurring with probability 2^-n and then calculating that each particular bit is 1 with independent probability 1/2; the first description makes much more of the structure of the probability space explicit, and so is more directly useful in calculation.

4.1. Examples
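The n-bit-string construction above can be sketched in a few lines (the choice n = 3 is mine): starting from independent fair bits, multiply per-bit probabilities to get the probability of each particular string.

```python
from fractions import Fraction
from itertools import product

# n independent fair bits, each 0 or 1 with probability 1/2.
n = 3
pr_bit = Fraction(1, 2)

def pr_string(s):
    # Independence lets us multiply the per-bit probabilities.
    p = Fraction(1)
    for _bit in s:
        p *= pr_bit
    return p

all_strings = list(product([0, 1], repeat=n))
probs = [pr_string(s) for s in all_strings]
# every string gets probability 1/2**n, and the probabilities sum to 1
```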

5. Union of two events

What is the probability of A∪B? If A and B are disjoint, then the axioms give Pr[A∪B] = Pr[A] + Pr[B]. But what if A and B are not disjoint?

By analogy to inclusion-exclusion in counting we would expect that

\[\Pr[A\cup B] = \Pr[A] + \Pr[B] - \Pr[A\cap B].\]

Intuitively, when we sum the probabilities of A and B we double-count the event that both occur, and must subtract it off to compensate. To prove this formally, consider the events A∩B, A∩¬B, and ¬A∩B. These are disjoint, so the probability of the union of any subset of this set of events is equal to the sum of its components. So in particular we have

\[\Pr[A\cup B] = \Pr[A\cap B] + \Pr[A\cap \overline{B}] + \Pr[\overline{A}\cap B],\]

while

\[\Pr[A] + \Pr[B] = \left(\Pr[A\cap B] + \Pr[A\cap \overline{B}]\right) + \left(\Pr[A\cap B] + \Pr[\overline{A}\cap B]\right).\]

Subtracting the first equation from the second leaves exactly the extra Pr[A∩B], giving Pr[A∪B] = Pr[A] + Pr[B] - Pr[A∩B].

5.1. Examples
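The inclusion-exclusion formula can be checked by counting on the two-dice space (the particular events A and B here are my own choices):

```python
from fractions import Fraction

# Uniform space of two fair dice.
atoms = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def pr(event):
    return Fraction(sum(1 for a in atoms if event(a)), len(atoms))

A = lambda a: a[0] + a[1] == 7   # the dice add up to 7
B = lambda a: a[0] == 1          # the first die shows 1
lhs = pr(lambda a: A(a) or B(a))                    # Pr[A ∪ B]
rhs = pr(A) + pr(B) - pr(lambda a: A(a) and B(a))   # Pr[A] + Pr[B] - Pr[A ∩ B]
# both sides come out to 11/36
```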

6. Conditional probability

Suppose I want to answer the question "What is the probability that my dice add up to 6 if I know that the first one is an odd number?" This question involves conditional probability, where we calculate a probability subject to some conditions. The probability of an event A conditioned on an event B, written Pr(A|B), is defined by the formula

\[\Pr[A|B] = \frac{\Pr[A \cap B]}{\Pr[B]}.\]

One way to think about this is that when we assert that B occurs we are in effect replacing the entire probability space with just the part that sits in B. So we have to divide all of our probabilities by Pr[B] in order to make Pr[B|B] = 1, and we have to replace A with A∩B to exclude the part of A that can't happen any more.

Note also that conditioning on B only makes sense if Pr[B] > 0. If Pr[B] = 0, Pr[A|B] is undefined.
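The dice question that opened this section can be answered directly from the definition, a short sketch computing Pr[A∩B]/Pr[B] by counting:

```python
from fractions import Fraction

# Uniform space of two fair dice.
atoms = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def pr(event):
    return Fraction(sum(1 for a in atoms if event(a)), len(atoms))

sum_is_6 = lambda a: a[0] + a[1] == 6   # A: the dice add up to 6
first_odd = lambda a: a[0] % 2 == 1     # B: the first die is odd

# Pr[A|B] = Pr[A ∩ B] / Pr[B]
pr_cond = pr(lambda a: sum_is_6(a) and first_odd(a)) / pr(first_odd)
# the three atoms (1,5), (3,3), (5,1) out of the 18 with an odd first
# die give 3/18 = 1/6
```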

6.1. Conditional probabilities and intersections of non-independent events

Simple algebraic manipulation gives

\[\Pr[A\cap B] = \Pr[A|B] \Pr[B].\]

So one way to compute the probability of two events both occurring is to compute the probability of one of them, and then multiply by the probability that the second occurs conditioned on the first. For example, if my attempt to reach the summit of Mount Everest requires that I first learn how to climb mountains (Pr[B] = 10%) and then make it to the top safely (Pr[A|B] = 90%), then my chances of getting to the top are Pr[A∩B] = (90%)(10%) = 9%.

We can do this for sequences of events as well. Suppose that I have an urn that starts with k black balls and 1 red ball. In each of n trials I draw one ball uniformly at random from the urn. If it is red, I give up. If it is black, I put the ball back and add another black ball, thus increasing the number of balls by 1. What is the probability that on every trial I get a black ball?

Let Ai be the event that I get a black ball on the first i trials. Then Pr[A0] = 1, and for larger i we have Pr[Ai] = Pr[Ai|Ai-1] Pr[Ai-1]. If Ai-1 holds, then at the time of the i-th trial we have k+i total balls in the urn of which one is red. So the probability that we draw a black ball is 1-1/(k+i) = (k+i-1)/(k+i). By induction we can then show that

\[\Pr[A_i] = \prod_{j=1}^{i} \frac{k+j-1}{k+j}.\]

This is an example of a collapsing product, where the denominator of each fraction cancels out the numerator of the next; we are left only with the denominator k+i of the last term and the numerator k of the first, giving Pr[Ai] = k/(k+i). It follows that we make it through all n trials with probability Pr[An] = k/(k+n).
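The collapsing product can be evaluated exactly to confirm the telescoping, a short sketch using exact rational arithmetic:

```python
from fractions import Fraction

def pr_all_black(k, n):
    """Probability of drawing black on all n trials: the product
    of (k+j-1)/(k+j) for j = 1..n, which telescopes to k/(k+n)."""
    p = Fraction(1)
    for j in range(1, n + 1):
        p *= Fraction(k + j - 1, k + j)
    return p

# e.g. with k = 3 black balls and n = 10 trials, the product
# collapses to 3/13 = k/(k+n)
```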

6.2. The law of total probability

We can use the fact that A is the disjoint union of A∩B and A∩¬B to get Pr[A] by case analysis:

\[\Pr[A] = \Pr[A\cap B] + \Pr[A \cap \overline{B}] = \Pr[A|B]\Pr[B] + \Pr[A|\overline{B}]\Pr[\overline{B}].\]

So, for example, if there is a 20% chance I can make it to the top of Mount Everest safely without learning how to climb first, my overall chances of getting there go up to (90%)(10%) + (20%)(90%) = 27%.

This method is sometimes given the rather grandiose name of the law of total probability. The most general version is that if B1, ..., Bn are pairwise disjoint events whose union contains A, then

\[\Pr[A] = \sum_{i=1}^{n} \Pr[A|B_i] \Pr[B_i].\]
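The general form can be checked on the two-dice space by conditioning on the value of the first die (my own example): summing Pr[sum = 7 | first = i] Pr[first = i] over i recovers Pr[sum = 7].

```python
from fractions import Fraction

# Uniform space of two fair dice.
atoms = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def pr(event):
    return Fraction(sum(1 for a in atoms if event(a)), len(atoms))

def pr_cond(event, given):
    return pr(lambda a: event(a) and given(a)) / pr(given)

sum7 = lambda a: a[0] + a[1] == 7
# Law of total probability: condition on the six disjoint events
# "first die shows i", whose union is the whole space.
total = sum(pr_cond(sum7, lambda a, i=i: a[0] == i)
            * pr(lambda a, i=i: a[0] == i)
            for i in range(1, 7))
# total agrees with the direct computation of Pr[sum = 7] = 1/6
```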

6.3. Bayes's formula

If one knows Pr[A|B], Pr[A|¬B], and Pr[B], it's possible to compute Pr[B|A]:

\[\Pr[B|A] = \frac{\Pr[A\cap B]}{\Pr[A]} = \frac{\Pr[A|B] \Pr[B]}{\Pr[A]}
= \frac{\Pr[A|B] \Pr[B]}{\Pr[A|B]\Pr[B] + \Pr[A|\overline{B}]\Pr[\overline{B}]}.\]

This formula is used heavily in statistics, where it goes by the name of Bayes's formula. Say that you have an Airport Terrorist Detector that lights up 75% of the time when inserted into the nostrils of a Terrorist, but lights up 0.1% of the time when inserted into the nostrils of a non-Terrorist. Suppose that for other reasons you know that Granny has only a 0.01% chance of being a Terrorist. What is the probability that Granny is a Terrorist if the detector lights up?

Let B be the event "Granny is a terrorist" and A the event "Detector lights up". Then Pr[B|A] = (75% × 0.01%)/(75% × 0.01% + 0.1% × 99.99%) ≈ 6.98%. This example shows how even a small "false positive" rate can make it difficult to interpret the results of tests for rare conditions.
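Plugging the detector example into Bayes's formula with exact fractions makes the arithmetic easy to check:

```python
from fractions import Fraction

# The percentages from the detector example, as exact fractions.
pr_light_given_T = Fraction(75, 100)       # lights up on a Terrorist: 75%
pr_light_given_not_T = Fraction(1, 1000)   # lights up on a non-Terrorist: 0.1%
pr_T = Fraction(1, 10000)                  # prior that Granny is a Terrorist: 0.01%

# Bayes's formula: Pr[B|A] = Pr[A|B] Pr[B] / (Pr[A|B] Pr[B] + Pr[A|¬B] Pr[¬B])
pr_T_given_light = (pr_light_given_T * pr_T) / (
    pr_light_given_T * pr_T + pr_light_given_not_T * (1 - pr_T))
# comes out to 250/3583, about 6.98%
```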

7. Random variables

A random variable X is a function from a probability space to some codomain; for more details see RandomVariables.


2014-06-17 11:58