
Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

Linear algebra is the branch of algebra that studies vector spaces (see AlgebraicStructures for the definition of a vector space). The key concepts in linear algebra are independence, dimension, subspaces, and linear transformations (which for finite-dimensional spaces can be represented by matrices). There are two ways to look at linear algebra: as a set of rules for manipulating coordinates, vectors, and matrices (concrete linear algebra—see LinearAlgebra) or as the properties of vector spaces considered in the abstract (abstract linear algebra). Both approaches get to essentially the same place, just in a different order. Here we start by assuming we have a vector space and derive coordinates, vectors, and matrices from it.

Linear combinations and independence

Let x1,...xn be vectors in some vector space V over a field F. A linear combination of the xi is a sum a1x1 + a2x2 + ... + anxn, where the ai are coefficients from F. The span of the xi is the set of all vectors that can be constructed by taking linear combinations of the xi. It is not hard to see that this is the smallest subspace of V that contains the xi, and we can think of it as the subspace generated by the xi just as a subgroup may be generated by some set of elements of a group. A vector y is independent of the vectors xi if it is not in their span.

A set of vectors xi is independent if each vector is independent of all the others; an equivalent, and more symmetric, definition is that a1x1 + a2x2 + ... + anxn does not equal 0 unless all the ai are zero. We'll prove that these definitions are equivalent by showing that if any sequence xi has a nonzero linear combination that sums to 0, then there is some xi that can be expressed as a linear combination of its predecessors. (The other direction is easier, since if some xi can be expressed as a linear combination of the others, then subtracting xi from that linear combination gives 0.)

Theorem 1

Let ∑aixi=0, where not all ai are 0. Then for some k there are coefficients b1...bk-1 such that xk = ∑i<kbixi. Proof: Let k be the largest index for which ak≠0. Then solving ∑aixi=0 for xk gives xk = -∑i≠k(ai/ak)xi = -∑i<k(ai/ak)xi, so we can take bi = -ai/ak.

Technical note: If the set of vectors xi is infinite, then only linear combinations with a finite number of nonzero coefficients are permitted. We will generally not consider vector spaces big enough for this to be an issue.
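Theorem 1's contrapositive gives a practical test: a finite set of rational vectors is independent exactly when row reduction finds a pivot for every vector. The following sketch (my own illustration, not part of the original notes) checks this with exact `Fraction` arithmetic to avoid round-off:

```python
# Sketch: test whether a list of vectors over the rationals is independent,
# by row-reducing with exact Fractions and counting pivots.
from fractions import Fraction

def independent(vectors):
    """Return True if the given vectors (lists of numbers) are independent."""
    rows = [[Fraction(v) for v in vec] for vec in vectors]
    rank = 0
    ncols = len(rows[0])
    for col in range(ncols):
        # find a row at or below the current rank with a nonzero entry here
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # eliminate this column from every other row
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank == len(rows)

print(independent([[1, 1], [0, 1]]))   # True: a basis of R^2
print(independent([[1, 2], [2, 4]]))   # False: (2,4) = 2*(1,2)
```

The rank computed here is exactly the number of vectors that are not linear combinations of their predecessors.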


Linear independence is useful for establishing a coordinate system for a vector space. A basis of a vector space V is an independent set of vectors {xi} such that every vector y in V can be expressed as a linear combination y = a1x1 + a2x2 + ... + anxn. Given a particular basis, we can write any vector as the sequence (a1, a2, ... an). In fact, we can do so in only one way:

Theorem 2

If {xi} is a basis for V, then every vector y has a unique representation y = a1x1 + a2x2 + ... + anxn.


Suppose there is some y with more than one representation, i.e. there are sequences of coefficients ai and bi such that y = a1x1 + a2x2 + ... + anxn = b1x1 + b2x2 + ... + bnxn. Then 0 = y-y = (a1x1 + a2x2 + ... + anxn) - (b1x1 + b2x2 + ... + bnxn) = (a1-b1)x1 + (a2-b2)x2 + ... + (an-bn)xn. But since the xi are independent, the only way a linear combination of the xi can equal 0 is if all coefficients are 0, i.e. if ai = bi for all i.

Even better, we can do all of our usual vector space arithmetic in terms of the coefficients ai. For example, if a = ∑aixi and b = ∑bixi, then it can easily be verified that a+b = ∑(ai+bi)xi and ca = ∑(cai)xi.

However, it may be the case that the same vector will have different representations in different bases. For example, in the vector space ℝ2 consisting of pairs of real numbers, we could have a basis B1 = { (1,0), (0,1) } and a basis B2 = { (1,0), (1,-2) }. The vector (2,3) would be represented as (2,3) using basis B1 but would be represented as (7/2,-3/2) in basis B2. Usually we will pick a "standard basis" that will look something like B1. Generally it won't matter too much which basis we pick. Part of the reason for this is that for vector spaces with finite bases, all the bases look pretty much the same:
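Finding a vector's coordinates in a given basis just means solving a small linear system. Here is a sketch (my own, not from the notes) that does this for a 2×2 basis by Cramer's rule, with exact rational arithmetic:

```python
# Sketch: coordinates (a, b) of v in a two-element basis of R^2, i.e. the
# unique a, b with a*basis[0] + b*basis[1] == v, found by Cramer's rule.
from fractions import Fraction

def coords_2d(basis, v):
    (x1, y1), (x2, y2) = basis
    det = Fraction(x1 * y2 - x2 * y1)          # nonzero iff basis is independent
    a = Fraction(v[0] * y2 - x2 * v[1]) / det
    b = Fraction(x1 * v[1] - v[0] * y1) / det
    return a, b

print(coords_2d([(1, 0), (0, 1)], (2, 3)))    # coordinates in the standard basis
print(coords_2d([(1, 0), (1, -2)], (2, 3)))   # coordinates in a skewed basis
```

Note that (2,3) comes back unchanged in the standard basis but gets coefficients 7/2 and -3/2 in the basis { (1,0), (1,-2) }: same vector, different representation.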

Theorem 3

Let x1...xn and y1...ym be two finite bases of the same vector space V. Then n=m.


Assume without loss of generality that n ≤ m. We will show how to replace elements of the xi basis with elements of the yi basis to produce a new basis consisting only of y1...yn. Start by considering the sequence y1,x1...xn. This sequence is not independent, since y1 can be expressed as a linear combination of the xi (they're a basis). So from Theorem 1 there is some xi that can be expressed as a linear combination of y1,x1...xi-1. Swap this xi out to get a new sequence y1,x1...xi-1,xi+1,...xn. This new sequence is also a basis, because (a) any z can be expressed as a linear combination of these vectors by substituting the expansion of xi into the expansion of z in the original basis, and (b) it's independent, because if there is some nonzero linear combination that produces 0 we can substitute the expansion of xi to get a nonzero linear combination of the original basis that produces 0 as well. Now continue by constructing the sequence y2,y1,x1...xi-1,xi+1,...xn, and arguing that some xi' in this sequence must be expressible as a combination of earlier terms by Theorem 1 (it can't be y1, because then y1 and y2 would not be independent), and drop this xi'. By repeating this process we can eventually eliminate all the xi, leaving the basis yn,...,y1. But then any yk for k > n would be a linear combination of this basis, contradicting the independence of the yi; so we must have m = n.

The size of any basis of a vector space is called the dimension of the vector space; by Theorem 3, this is well-defined.

Linear transformations

Recall that a linear transformation from a vector space U to a vector space V, both defined over the same field F, is a function A:U->V such that

\[A(x+y) = A(x) + A(y)\]

\[A(cx) = cA(x)\]

Often these two equations are combined into the single equation

\[A(cx+dy) = cA(x) + dA(y).\]

It is easy to show that A satisfies the first two equations if and only if it satisfies the combined one. Note that to avoid writing parentheses all the time, we often write the application of a linear transformation like multiplication: Ax = A(x).
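The combined equation is easy to check numerically for a candidate map. The following sketch (my own illustration; the particular map A(x,y) = (2x+y, 3y) is an assumption, not from the notes) verifies it for one choice of scalars and vectors:

```python
# Sketch: check the combined linearity equation A(c*x + d*y) = c*Ax + d*Ay
# for the map A(x, y) = (2x + y, 3y) on R^2.
def A(v):
    x, y = v
    return (2 * x + y, 3 * y)

def lincomb(c, v, d, w):
    """The linear combination c*v + d*w of two vectors in R^2."""
    return (c * v[0] + d * w[0], c * v[1] + d * w[1])

x, y = (1, 2), (3, -1)
c, d = 5, -2
left = A(lincomb(c, x, d, y))
right = lincomb(c, A(x), d, A(y))
print(left == right)   # True
```

A non-linear map such as A(x, y) = (x + 1, y) would fail this check, since it does not even send 0 to 0.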

Some examples of linear transformations: the identity map Ix = x; scaling maps x ↦ cx for a fixed scalar c; the map from ℝ2 to ℝ given by A(x,y) = x+y; and the derivative operator on the vector space of polynomials.

Most of the linear transformations you will encounter are between finite-dimensional vector spaces. These have a particularly simple description in terms of matrices that we will discuss shortly. First, however, let's look at some properties of linear transformations in general.

Linear transformations form a vector space

Given linear transformations A and B and a scalar c, define (A+B)x = Ax + Bx and (cA)x = c(Ax). It is straightforward to check that A+B and cA are themselves linear, and that these operations satisfy all the vector space axioms, with the zero transformation 0x = 0 as the zero vector and (-A)x = -(Ax) as the additive inverse of A.

So another source of vector spaces is transformations between existing vector spaces.

Exercise: go through each of these proofs and figure out just which property (e.g. commutativity of scalar multiplication, definition of cA, linearity of A) is used at each step.

Linear transformations form a ring

More interesting is the case of linear transformations from a vector space V to itself. For these we can not only define addition of transformations as above, but we can also define multiplication as composition: (AB)x = A(B(x)). This multiplication operation is associative and has an identity (I), and it distributes over addition of linear transformations: (A(B+C))x = A(Bx+Cx) = ABx + ACx (the last step uses the fact that A is linear), while ((A+B)C)x = ACx + BCx follows directly from the definition of A+B. So the linear transformations from a vector space to itself form a ring. In general, this ring will not be commutative unless the vector space is very small.
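Non-commutativity shows up already in two dimensions. Representing transformations of a 2-dimensional space as 2×2 matrices (matrix multiplication is composition, as derived later in these notes), this sketch of mine exhibits a pair with AB ≠ BA:

```python
# Sketch: composition of linear maps on a 2-dimensional space, written as
# 2x2 matrices (lists of rows), is generally not commutative.
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[0, 1], [0, 0]]   # "shift": sends (x, y) to (y, 0)
B = [[0, 0], [1, 0]]   # "shift the other way": sends (x, y) to (0, x)
print(matmul(A, B))    # [[1, 0], [0, 0]]
print(matmul(B, A))    # [[0, 0], [0, 1]]
```

Here AB projects onto the first coordinate while BA projects onto the second, so the two compositions are genuinely different transformations.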

Inverses of linear transformations

As in any ring, some linear transformations on a single vector space V will have multiplicative inverses: a transformation A-1 such that AA-1 = A-1A = I. Since multiplication is really composition, a necessary and sufficient condition for A to have an inverse is that it be a bijection.

Necessity is easy here: for sufficiency we have to prove that the inverse of A as a function is also linear; i.e. that if A(cx+dy) = cAx + dAy for all scalars c and d and vectors x and y, then the same holds for A-1. Consider some c, d, r, and s where c and d are scalars and r and s are vectors in V. Let z = A-1(cr+ds) - (cA-1r + dA-1s). Then Az = A(A-1(cr+ds) - (cA-1r + dA-1s)) = (cr+ds) - A(cA-1r) - A(dA-1s) = cr + ds - cAA-1r - dAA-1s = cr + ds - cr - ds = 0. So Az = 0. We already know that A0 = 0, so if A is indeed a bijection we must have z = 0 implying A-1(cr+ds) = cA-1r + dA-1s. Since this holds for all c, d, r, and s, we have A-1 is a linear transformation.

From the proof we can see that an equivalent criterion for A to be invertible is that Az = 0 if and only if z = 0. (This is also what we'd expect based on the fact that A is a group homomorphism if we forget about scalars).


Matrices

Suppose that U and V are both finite-dimensional, and we are provided with a basis x1...xn for U and y1...ym for V. In this case we can express any linear transformation from U to V using nm coefficients, which can conveniently be organized as a matrix (see Relations for the definition of a matrix).

Here's the idea: Let A:U->V be linear, and suppose we are given Axj for each basis vector xj. Then given any vector z in U, we can compute Az by applying A to the unique expansion z = ∑jzjxj like this:

\[Az = A\left(\sum_{j} z_j x_j\right) = \sum_{j} z_j (A x_j).\]

But now we can further expand the vectors Axj in V using V's basis vectors. For each j, let Axj = ∑aijyi. Then

\[Az = \sum_{j} z_j A x_j = \sum_{j} z_j \sum_{i} a_{ij} y_i = \sum_{i} \left(\sum_{j} a_{ij} z_j\right) y_i.\]

In other words, the i-th coordinate of Az is a sum of the coefficients aij times zj over all values of j. Each coefficient aij is in effect specifying how much of yi is contributed by each copy of xj.

Since the aij coefficients have two indices, we usually think of them as organized into an m-by-n matrix:

\[A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}\]
where as usual the first index of each coefficient specifies the row and the second specifies the column. (The rows-before-columns convention is also why we call this an m-by-n matrix instead of an n-by-m matrix when m is the number of rows and n is the number of columns.)

Where it will not cause confusion, it is customary to use the same letter (and capitalization) for both a linear transformation and the matrix that represents it. So we would write (for example):

\[A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix}\]
Note that a particular matrix representation is only meaningful if we have already fixed a basis for both vector spaces; if for some reason the same transformation A is represented using two different sets of bases, it won't make sense to call the two different matrices that will arise in each by the same letter.

Vectors of dimension m are conventionally represented as m-by-1 matrices, i.e. as a single column, e.g. for m = 3:

\[\vec{z} = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix}\]

so that applying A to z looks like this:

\[A\vec{z} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{pmatrix} \begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix} = \begin{pmatrix} a_{11}z_1 + a_{12}z_2 + a_{13}z_3 \\ a_{21}z_1 + a_{22}z_2 + a_{23}z_3 \end{pmatrix}\]

where the ugly-looking 2-by-1 matrix on the far right consists of two rows with only one entry in each row.
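In code, the coordinate formula (Az)i = ∑j Aijzj is a nested loop. The following sketch (my own, not from the notes) applies a 2-by-3 matrix to a 3-dimensional vector:

```python
# Sketch: apply an m-by-n matrix (list of rows) to an n-dimensional vector
# via the coordinate formula (Az)_i = sum_j A_ij * z_j.
def apply(A, z):
    return [sum(a_ij * z_j for a_ij, z_j in zip(row, z)) for row in A]

A = [[1, 0, 2],
     [0, 3, 1]]        # a 2-by-3 matrix: a map from R^3 to R^2
z = [1, 1, 1]
print(apply(A, z))     # [3, 4]
```

Each output coordinate is the dot product of one row of A with z, which is exactly the row-times-column picture developed below.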

Matrix multiplication

Each operation on linear transformations (addition, scalar multiplication, multiplication) can be expressed in terms of the corresponding matrices.

For addition, we have (A+B)ij = Aij+Bij for each pair of indices i,j. The proof is that the yi coefficient of Az+Bz is given by ∑jAijzj + ∑jBijzj = ∑j(Aij+Bij)zj.

For scalar multiplication, we have (cA)ij = cAij. Here the yi coefficient of (cA)z is c∑jAijzj = ∑j(cAij)zj.

Matrix multiplication, which corresponds to linear transformation multiplication (i.e. composition), is more complicated. Here we assume that yi = xi for all i, or in other words that we have a single basis for our single vector space. Given a vector z, we let zi be the coefficient on xi in the unique expansion z = ∑ zi xi. Now given transformations A and B, we wish to express the composite transformation AB as a matrix, by computing for each i,j the coefficient (AB)ij that expresses how much zj contributes to (ABz)i.

We can compute these coefficients as follows. Start with the fact that

\[(B\vec{z})_k = \sum_{j=1}^{n} B_{kj} z_j.\]

Then

\begin{align*}
(A(B\vec{z}))_i &= \sum_{k=1}^{n} A_{ik} (B\vec{z})_k \\
&= \sum_{k=1}^{n} A_{ik} \sum_{j=1}^{n} B_{kj} z_j \\
&= \sum_{j=1}^{n} \left(\sum_{k=1}^{n} A_{ik} B_{kj} \right) z_j.
\end{align*}

From this we can immediately read off that

\[(AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.\]

In visual terms, to get the element in the i-th row and j-th column of AB, we take the i-th row of A and the j-th column of B, line them up, multiply matching elements, and sum the result. This operation of taking the sum of pairwise products of vector elements is known as a dot product and is of central importance in linear algebra.
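The derivation above can be checked numerically: multiplying the matrices first and then applying the product to z must give the same vector as applying B, then A. Here is a small sketch of mine confirming this:

```python
# Sketch: the matrix product AB represents the composition of the maps,
# i.e. (AB)z == A(Bz). matmul uses the row-times-column dot-product rule.
def matvec(A, z):
    return [sum(a * x for a, x in zip(row, z)) for row in A]

def matmul(A, B):
    cols = list(zip(*B))   # columns of B
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
z = [5, -1]
print(matvec(matmul(A, B), z))    # [7, 13]
print(matvec(A, matvec(B, z)))    # [7, 13] -- same answer
```

The inner `sum(a * b ...)` is the dot product of a row of A with a column of B, exactly as described in the text.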

Vectors as matrices

We previously defined for a linear transformation A from an n-dimensional to an m-dimensional space and a vector z = ∑jzjxj the product Az as ∑i (∑j Aijzj) yi, where the xj and yi are the basis vectors of the domain and codomain of A. If we drop the basis vectors and represent z as a matrix in the form of an n×1 column vector z with zj1 = zj, then Az is an m×1 column vector with coordinates (Az)i1 = ∑j Aijzj, which are precisely the coordinates of Az. So by representing vectors as matrices in column-vector form we can represent the application of a linear transformation as just another special case of matrix multiplication.

The transpose of a matrix

Given an n×m matrix A, define its transpose AT (sometimes written as A') as the m×n matrix with coefficients (AT)ij = Aji. In other words, in AT we reverse the indices of A, effectively flipping the matrix diagonally.

This is an easy operation to define, but what does it mean? Given an n×1 column vector x, xT is a 1×n row vector. Since it has n columns we can multiply it using ordinary matrix multiplication by an n×1 column vector y, say, giving (xTy)11 = ∑i xi1yi1. This is the dot product operation mentioned above, and is often abbreviated as x⋅y. Its interpretation is that taking the transpose of x converts it from a vector to a scalar-valued linear function on vectors. The set of all such scalar-valued linear functions on vectors is itself a vector space, called the dual space V* of the original ("primal") vector space V. For finite-dimensional spaces the dual space is isomorphic to its primal space, since (xT)T = x for all x.

What about bigger matrices? If we take some n×m matrix A representing a linear transformation A:U→V and compute its m×n transpose AT, we get a linear transformation from an m-dimensional space to an n-dimensional space. To understand what this matrix does, it is helpful to first look at the transpose of a product: ((AB)T)ik = (AB)ki = ∑j AkjBji = ∑j (BT)ij (AT)jk = (BTAT)ik. Or stated more concisely: (AB)T = BTAT. So now what does AT do to some vector y? We have ATy = (yTA)T. If we think of yT as an element of a dual space, yTA is also an element of the same dual space (since whenever we can multiply yTx we can also multiply (yTA)x = yT(Ax)). So AT acts like a linear transformation from V→U defined by taking a detour through the dual space via ATy = (yTA)T.
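The identity (AB)T = BTAT is easy to confirm on a small example. A quick sketch of mine, using non-square matrices so the dimensions are forced to reverse:

```python
# Sketch: verify (AB)^T == B^T A^T for a 2x3 matrix A and a 3x2 matrix B.
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2, 0], [1, 1, 3]]      # 2x3
B = [[1, 0], [2, 1], [0, 4]]    # 3x2
print(transpose(matmul(A, B)) == matmul(transpose(B), transpose(A)))  # True
```

Note that BTAT is the only order that even type-checks here: AT is 3×2 and BT is 2×3, so ATBT would be a 3×3-times-2×3 mismatch.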

The inverse of a matrix

Some matrices are invertible; some (including all the ones that aren't square) are not. To try to invert a matrix, we start with the pair of matrices A, I (where I is the identity matrix defined by Iii = 1 and Iij = 0 when i≠j), and multiply both matrices of the pair from the left by a sequence of transformation matrices B1, B2, ... Bk until BkBk-1⋯B1A = I. At this point the right-hand matrix will be BkBk-1⋯B1 = A-1. (We could just keep track of all the Bi, but it's easier to keep track of their product.)

How do we pick the Bi? These will be matrices that (a) multiply some row by a scalar, (b) add a multiple of one row to another row, or (c) swap two rows. We'll use the first kind to make all the diagonal entries equal one, and the second kind to get zeroes in all the off-diagonal entries. The third kind will be saved for emergencies, like getting a zero on the diagonal.

That the operations (a), (b), and (c) correspond to multiplying by a matrix is provable but tedious.1 Given these operations, we can turn any invertible matrix A into I by working from the top down, rescaling each row i using a type (a) operation to make Aii = 1, then using a type (b) operation to subtract Aji times row i from each row j > i to zero out Aji, then finally repeating the same process starting at the bottom to zero out all the entries above the diagonal. The only way this can fail is if we hit some Aii = 0, which we can swap with a nonzero Aji if one exists (using a type (c) operation). If all the rows from i on down have a zero in the i column, then the original matrix A is not invertible. This entire process is known as Gauss-Jordan elimination.

This procedure can be used to solve systems of linear equations: if Ax = b, we can compute x by first computing A-1 and then multiplying x = A-1Ax = A-1b. If we are not interested in A-1 for its own sake, we can simplify things by substituting b for I during the Gauss-Jordan elimination procedure; at the end, it will be transformed to x.


As an example, let's invert a 3-by-3 matrix. Original A is on the left, I on the right. We'll work in ℤ5 just to make things more entertaining, and to avoid having to write down a lot of fractions. Recall that in ℤ5, 2-1 = 3, 3-1 = 2, and 4-1 = 4.

Initial matrices:

2 0 1    1 0 0
1 0 1    0 1 0
3 1 2    0 0 1

Multiply top row by 3:

1 0 3    3 0 0
1 0 1    0 1 0
3 1 2    0 0 1

Subtract top row from middle row and 3*top row from bottom row:

1 0 3    3 0 0
0 0 3    2 1 0
0 1 3    1 0 1

Swap middle and bottom rows:

1 0 3    3 0 0
0 1 3    1 0 1
0 0 3    2 1 0

Multiply bottom row by 2:

1 0 3    3 0 0
0 1 3    1 0 1
0 0 1    4 2 0

Subtract 3*bottom row from top and middle rows:

1 0 0    1 4 0
0 1 0    4 4 1
0 0 1    4 2 0

and we're done. (It's probably worth multiplying the original A by the alleged A-1 to make sure that we didn't make a mistake.)
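The hand computation above can be reproduced mechanically. Here is a sketch (my own code, not the notes') of Gauss-Jordan elimination over ℤp, applying every row operation to a copy of I and recovering the same inverse:

```python
# Sketch: invert an n x n matrix over Z_p (p prime) by Gauss-Jordan
# elimination, mirroring each row operation on a copy of the identity.
# Assumes A is invertible mod p (otherwise the pivot search fails).
def invert_mod_p(A, p):
    n = len(A)
    M = [row[:] for row in A]
    I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i in range(n):
        # type (c): swap up a row with a nonzero entry in column i
        pivot = next(r for r in range(i, n) if M[r][i] % p != 0)
        M[i], M[pivot] = M[pivot], M[i]
        I[i], I[pivot] = I[pivot], I[i]
        # type (a): rescale so the pivot becomes 1 (Fermat: a^(p-2) = a^-1 mod p)
        inv = pow(M[i][i], p - 2, p)
        M[i] = [x * inv % p for x in M[i]]
        I[i] = [x * inv % p for x in I[i]]
        # type (b): clear column i from every other row
        for r in range(n):
            if r != i and M[r][i] % p != 0:
                f = M[r][i]
                M[r] = [(x - f * y) % p for x, y in zip(M[r], M[i])]
                I[r] = [(x - f * y) % p for x, y in zip(I[r], I[i])]
    return I

A = [[2, 0, 1], [1, 0, 1], [3, 1, 2]]
print(invert_mod_p(A, 5))   # [[1, 4, 0], [4, 4, 1], [4, 2, 0]]
```

This agrees with the worked example: the right-hand matrix at the end of the hand computation is exactly the returned inverse.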


Orthogonality

Two vectors x and y in a real vector space are called orthogonal if x⋅y = xTy = 0. This is a symmetric relation, since y⋅x = yTx = (xTy)T = (x⋅y)T = x⋅y, where y⋅x and x⋅y are both scalars (disguised as 1×1 matrices). In Euclidean spaces like our familiar ℝ3, two vectors are orthogonal precisely if they are perpendicular: e.g. (0,0,1)⋅(1,1,0) = 0⋅1+0⋅1+1⋅0 = 0, so a vector sticking directly out of the page (if that's what our third coordinate represents) is orthogonal to a vector running diagonally across the page, just as we would suspect from the orthogonal = perpendicular rule. In other spaces defining orthogonality in this way is more confusing; for example, in (ℤ2)3 the vectors (1,1,1) and (0,1,1) are orthogonal since 1⋅0+1⋅1+1⋅1 = 0+1+1 = 0, but it's not clear how we would draw these to make them look like they have a right angle between them, and the vectors with similar-looking coordinates in ℝ3 are not orthogonal since their dot-product is 2.

An important property of orthogonality is that it is preserved by scalar multiplication and vector addition. This follows from the facts that (ax)⋅y = a(x⋅y) and x⋅(y+z) = x⋅y + x⋅z, which are simply restatements of the fact that xT is a linear transformation from the vector space to its underlying field. A consequence of this fact is that the set of vectors that are orthogonal to a particular vector x form a subspace.

Projections onto lines

Suppose that we can write a vector z = ax + y, where y is orthogonal to x. Then ax is called the projection of z onto the line (1-dimensional subspace) generated by x. The question arises: does such a projection always exist, and if it does, how do we find a?

The answer is that we take the dot-product of z with x. This gives z⋅x = a(x⋅x) + y⋅x = a(x⋅x) (since y⋅x = 0). Solving for a gives a = (z⋅x)/(x⋅x), and it is immediate from this solution that a is in fact the unique scalar such that z = ax + y with y⋅x = 0.

Note that this only works if x⋅x is nonzero. For vector spaces over the rationals or the reals this is always the case when x itself is nonzero. Here x⋅x is just the square of the length of x as obtained by the Pythagorean theorem. But it may fail to be the case in spaces over fields with finite characteristic; for example, in (ℤ2)3, (0,1,1)⋅(0,1,1) = 0, and there is no projection of z = (1,0,1) onto the line generated by (0,1,1).2 It can also fail in vector spaces over the complex numbers if we do nothing special about the dot product: when x = (i, 1) ∈ ℂ2 then x⋅x = i2+12 = -1 + 1 = 0. For this reason the dot-product x⋅y in complex spaces is usually defined by first taking the complex conjugate of x: the vector obtained by substituting -i for each occurrence of i in x. This gives a more reasonable measure of length (now (i,1)⋅(i,1) = -i⋅i + 1 = 2) and in fact allows projections to be taken in such spaces. But the easier solution—which we will adopt—is simply to stick with real-valued vector spaces when we talk about dot products.

Here's a picture of projection in magnificent ASCII art:

y      z = ax + y
|     /|
|    / |
|   /  |
|  /   |
| /    |
|/     |
0     ax      x

One interpretation of the projection of z onto x is that a is the least-squares solution to the problem ax = z; that is, it is the solution that minimizes the sum of the squares of the errors in each coordinate (which is just yy, the square of the length of y = z - ax). This is useful for approximating the solution to inconsistent equations. For example, given the equations 2a = 3 and 3a = 4, we can rewrite them as a(2,3) = (3,4) and obtain a = (2,3)⋅(3,4)/(2,3)⋅(2,3) = (6+12)/(4+9) = 18/13. This is not actually a solution, since 2(18/13) = 2.76... is not 3 and 3(18/13) = 4.14... is not 4, but it's pretty close and it is impossible to do better without increasing the sum of the squares of the errors.
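The least-squares computation above is a one-liner once we have a dot product. This sketch (my own, not from the notes) redoes it with exact rational arithmetic so 18/13 comes out exactly rather than as a rounded float:

```python
# Sketch: least-squares solution a = (z.x)/(x.x) to the inconsistent
# equations 2a = 3 and 3a = 4, i.e. projecting z onto the line through x.
from fractions import Fraction

def dot(x, y):
    return sum(u * v for u, v in zip(x, y))

x = [2, 3]   # left-hand coefficients of the two equations
z = [3, 4]   # right-hand sides
a = Fraction(dot(z, x), dot(x, x))
print(a)     # 18/13
```

The residual y = z - ax is orthogonal to x by construction, which is exactly the sense in which no other choice of a can do better.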

Projections onto higher-dimensional subspaces

What if we want to project onto a subspace that's not a line? We can still do so, but it's helpful to first generate an orthogonal basis for the subspace. This is a basis (maximal set of independent vectors) that is orthogonal in the sense that any two basis elements xi and xj are orthogonal.

Suppose we have an orthogonal basis, and we want to write z = y + ∑i aixi, where y is orthogonal to all of the xi. To find a particular ai, observe that y + ∑j≠i ajxj is orthogonal to xi. So in fact z = aixi + (y + ∑j≠i ajxj) is just an instance of the problem of projecting onto a line. We've already solved that problem: ai = (z⋅xi)/(xi⋅xi).

What if we don't have an orthogonal basis? Presumably we have some sort of basis, say x1...xk. If it's not orthogonal, we can make it orthogonal via projection: let y1 = x1, and for each i > 1 let yi be the solution to xi = yi + ∑j < i aj yj where yi is orthogonal to all the yj for j < i. (This process is known as Gram-Schmidt orthogonalization.) Now use the yi as our orthogonal basis and project away.
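The orthogonalization step described above is short in code. Here is a sketch of mine (classical Gram-Schmidt with exact arithmetic): each new vector has the projections onto all earlier orthogonal vectors subtracted off, leaving only the orthogonal remainder.

```python
# Sketch: orthogonalize a basis by repeatedly subtracting projections
# (classical Gram-Schmidt), using Fractions for exact arithmetic.
from fractions import Fraction

def dot(x, y):
    return sum(u * v for u, v in zip(x, y))

def gram_schmidt(basis):
    ortho = []
    for x in basis:
        y = [Fraction(v) for v in x]
        for u in ortho:
            c = dot(y, u) / dot(u, u)                 # projection coefficient a_j
            y = [yi - c * ui for yi, ui in zip(y, u)]  # subtract the projection
        ortho.append(y)
    return ortho

ys = gram_schmidt([[1, 1, 0], [1, 0, 1]])
print(dot(ys[0], ys[1]))   # 0: the new basis vectors are orthogonal
```

Note this assumes dot(u, u) ≠ 0 at every step, which, as discussed above, always holds for independent vectors over the rationals or reals.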

(In real life, projection problems like this are solved by constructing a matrix A whose columns are the xi vectors, and solving for a = (ATA)-1ATz. The proof that this works is similar to the 1-dimensional case, and the resulting a gives the least-squares solution to the equation Aa = z.)

Further reading

Linear algebra is vitally important in ComputerScience: it is a key tool in graphics, scientific computing, robotics, neural networks, and many other areas. If you do further work in these areas, you will quickly find that we have not covered anywhere near enough linear algebra in this course. Your best strategy for remedying this deficiency may be to take an actual linear algebra course; failing that, a very approachable introductory text is Linear Algebra and Its Applications, by Gilbert Strang.


Matlab (which is available on the Zoo machines: type matlab at a shell prompt) is useful for playing around with operations on matrices. There are also various non-commercial knockoffs like Scilab or Octave that are not as comprehensive as Matlab but are adequate for most purposes. Note that with any of these tools, if you find yourselves doing lots of numerical computation, it is a good idea to talk to a numerical analyst about round-off error: the floating-point numbers inside computers are not the same as real numbers, and if you aren't careful about how you use them you can get very strange answers.


  1. The tedious details: to multiply row r by a, use a matrix B with Bii = 1 when i≠r, Brr = a, and Bij = 0 for i≠j; to add a times row r to row s, use a matrix B with Bii = 1 for all i, Bsr = a, and Bij = 0 for all other pairs ij; to swap rows r and s, use a matrix B with Bii = 1 for i∉{r,s}, Brs = Bsr = 1, and Bij = 0 for all other pairs ij. (1)

  2. To check this, observe that the line generated by x = (0,1,1) contains only (0,0,0) and (0,1,1). But z-(0,0,0) = z is not orthogonal to x, and neither is z - (0,1,1) = (1,1,0). (2)

2014-06-17 11:57