
Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

Linear algebra is the branch of algebra that studies vector spaces (see AlgebraicStructures for the definition of a vector space). The key concepts in linear algebra are independence, dimension, subspaces, and linear transformations (which for finite-dimensional spaces can be represented by matrices). There are two ways to look at linear algebra: as a set of rules for manipulating coordinates, vectors, and matrices (concrete linear algebra; see LinearAlgebra), or as the properties of vector spaces considered in the abstract (abstract linear algebra). Both approaches get to essentially the same place, just in a different order. Here we start by assuming we have a vector space and derive coordinates, vectors, and matrices from it.

Linear combinations and independence

Let x1,...xn be vectors in some vector space V over a field F. A linear combination of the xi is a sum a1x1 + a2x2 + ... + anxn, where the ai are coefficients from F. The span of the xi is the set of all vectors that can be constructed by taking linear combinations of the xi. It is not hard to see that this is the smallest subspace of V that contains the xi, and we can think of it as the subspace generated by the xi just as a subgroup may be generated by some set of elements of a group. A vector y is independent of the vectors xi if it is not in their span.

A set of vectors xi is independent if each vector is independent of all the others; an equivalent, and more symmetric, definition is that a1x1 + a2x2 + ... + anxn does not equal 0 unless all the ai are zero. We'll prove that these definitions are equivalent by showing that if some sequence xi has a linear combination with coefficients not all zero that sums to 0, then there is some xi that can be expressed as a linear combination of its predecessors. (The other direction is easier, since if some xi can be expressed as a linear combination of the others, then subtracting xi from that combination gives a linear combination with coefficients not all zero that sums to 0.)

Theorem 1

Let ∑aixi = 0, where not all ai are 0. Then there is some k and coefficients b1...bk-1 such that xk = ∑i<kbixi.

Proof

Let k be the largest index for which ak ≠ 0. Then solving ∑aixi = 0 for xk gives xk = ∑i≠k(-ai/ak)xi = ∑i<k(-ai/ak)xi, where the second equality holds because ai = 0 for every i > k; take bi = -ai/ak.

Technical note: If the set of vectors xi is infinite, then only linear combinations with a finite number of nonzero coefficients are permitted. We will generally not consider vector spaces big enough for this to be an issue.

Bases

Linear independence is useful for establishing a coordinate system for a vector space. A basis of a vector space V is an independent set of vectors {xi} such that every vector y in V can be expressed as a linear combination y = a1x1 + a2x2 + ... + anxn. Given a particular basis, we can write any vector as the sequence (a1, a2, ... an). In fact, we can do so in only one way:

Theorem 2

If {xi} is a basis for V, then every vector y has a unique representation y = a1x1 + a2x2 + ... + anxn.

Proof

Suppose there is some y with more than one representation, i.e. there are sequences of coefficients ai and bi such that y = a1x1 + a2x2 + ... + anxn = b1x1 + b2x2 + ... + bnxn. Then 0 = y-y = (a1x1 + a2x2 + ... + anxn) - (b1x1 + b2x2 + ... + bnxn) = (a1-b1)x1 + (a2-b2)x2 + ... + (an-bn)xn. But since the xi are independent, the only way a linear combination of the xi can equal 0 is if all coefficients are 0, i.e. if ai = bi for all i.

Even better, we can do all of our usual vector space arithmetic in terms of the coefficients ai. For example, if a = ∑aixi and b = ∑bixi, then it can easily be verified that a+b = ∑(ai+bi)xi and ca = ∑(cai)xi.

However, the same vector will generally have different representations in different bases. For example, in the vector space ℝ2 consisting of pairs of real numbers, we could have a basis B1 = { (1,0), (0,1) } and a basis B2 = { (1,0), (1,-2) }. The vector (2,3) would be represented as (2,3) using basis B1 but would be represented as (7/2,-3/2) in basis B2, since (7/2)(1,0) + (-3/2)(1,-2) = (2,3). Usually we will pick a "standard basis" like B1, whose vectors each have a single 1 coordinate and 0s everywhere else. Generally it won't matter too much which basis we pick. Part of the reason for this is that for vector spaces with finite bases, all the bases look pretty much the same:
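
Finding the coefficients of a vector in a given basis amounts to solving a small system of linear equations. Here is a minimal sketch in Python (assuming NumPy is available), using the two bases from the example above; the columns of each matrix are the basis vectors:

import numpy as np

# Columns of each matrix are the basis vectors.
B1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])
B2 = np.array([[1.0, 1.0],
               [0.0, -2.0]])
z = np.array([2.0, 3.0])

# Solve B a = z for the coordinates a of z in each basis.
print(np.linalg.solve(B1, z))   # [2. 3.]
print(np.linalg.solve(B2, z))   # [ 3.5 -1.5], i.e. (7/2, -3/2)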

Theorem 3

Let x1...xn and y1...ym be two finite bases of the same vector space V. Then n=m.

Proof

Assume without loss of generality that n ≤ m. We will show how to replace elements of the xi basis with elements of the yi basis to produce a new basis consisting only of y1...yn. Start by considering the sequence y1,x1...xn. This sequence is not independent, since y1 can be expressed as a linear combination of the xi (they're a basis). So from Theorem 1 there is some xi that can be expressed as a linear combination of y1,x1...xi-1 (the element Theorem 1 produces can't be y1 itself, since y1, being a basis element, is not 0). Swap this xi out to get a new sequence y1,x1...xi-1,xi+1,...xn. This new sequence is still a basis, because (a) any z can be expressed as a linear combination of these vectors, by substituting the expansion of xi into the expansion of z in the original basis, and (b) it's independent, because if some nontrivial linear combination of these vectors produced 0, then substituting the expansion of y1 in terms of the original basis would give a nontrivial linear combination of the original basis that produces 0 as well (and if the combination doesn't involve y1 at all, it already contradicts the independence of the original basis). Now continue by constructing the sequence y2,y1,x1...xi-1,xi+1,...xn, and arguing by Theorem 1 that some xi' in this sequence must be expressible as a combination of earlier terms (it can't be y1, because then y1 would be a multiple of y2, contradicting the independence of the yi), and drop this xi'. By repeating this process we can eventually eliminate all the xi, leaving the basis yn,...,y1. But then any yk for k > n would be a linear combination of this basis, contradicting the independence of the yi, so we must have m = n.

The size of any basis of a vector space is called the dimension of the space; Theorem 3 is what makes this well-defined for spaces with finite bases.

Linear transformations

Recall that a linear transformation from a vector space U to a vector space V, both defined over the same field F, is a function A:U->V such that

A(x+y) = Ax + Ay for all vectors x and y in U, and
A(cx) = c(Ax) for all vectors x in U and scalars c in F.

Often these two equations are combined into the single equation

A(cx+dy) = cAx + dAy for all vectors x and y and scalars c and d.

It is easy to show that A satisfies the first two equations if and only if it satisfies the combined one. Note that to avoid writing parentheses all the time, we often write the application of a linear transformation like multiplication: Ax = A(x).
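
The combined equation is easy to check numerically for a concrete transformation, such as multiplication of vectors in ℝ2 by a fixed matrix. A minimal sketch in Python, assuming NumPy is available; the matrix and random inputs below are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
              [0.0, 3.0]])       # an arbitrary matrix acting on R^2

def apply_A(v):
    return A @ v                 # Av

for _ in range(5):
    c, d = rng.standard_normal(2)
    x, y = rng.standard_normal(2), rng.standard_normal(2)
    lhs = apply_A(c * x + d * y)
    rhs = c * apply_A(x) + d * apply_A(y)
    assert np.allclose(lhs, rhs)   # A(cx + dy) = cAx + dAy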

Some examples of linear transformations: scaling every vector in a space by a fixed constant c; rotating the plane ℝ2 about the origin; differentiating polynomials (a map from the vector space of polynomials over ℝ to itself); and multiplying column vectors by a fixed matrix, as described below.

Most of the linear transformations you will encounter are between finite-dimensional vector spaces. These have a particularly simple description in terms of matrices that we will discuss shortly. First, however, let's look at some properties of linear transformations in general.

Linear transformations form a vector space

Given linear transformations A and B and a scalar c, define (A+B)x = Ax + Bx and (cA)x = c(Ax). With these definitions, A+B and cA are again linear transformations, and the set of all linear transformations from U to V satisfies the vector space axioms under these operations.

So another source of vector spaces is transformations between existing vector spaces.

Exercise: verify each of these claims, and figure out just which property (e.g. commutativity of scalar multiplication, definition of cA, linearity of A) is used at each step.

Linear transformations form a ring

More interesting is the case of linear transformations from a vector space V to itself. For these we can not only define addition of transformations as above, but we can also define multiplication as composition: (AB)x = A(B(x)). This multiplication operation is associative and has an identity (I), and it distributes over addition of linear transformations: A(B+C)(x) = A(Bx+Cx) = ABx + ACx (the last step uses the fact that A is linear). So the linear transformations from a vector space to itself form a ring. In general, this ring will not be commutative unless the vector space is very small.
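
Composition really does behave like multiplication of the corresponding matrices, and it is easy to exhibit pairs that don't commute. A minimal sketch assuming NumPy, with two made-up 2-by-2 matrices:

import numpy as np

A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0, 0.0],
              [1.0, 0.0]])
x = np.array([1.0, 2.0])

# (AB)x = A(Bx): applying B then A agrees with applying the product matrix.
assert np.allclose((A @ B) @ x, A @ (B @ x))

# The ring is not commutative: AB and BA differ.
print(A @ B)   # [[1. 0.], [0. 0.]]
print(B @ A)   # [[0. 0.], [0. 1.]]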

Inverses of linear transformations

As in any ring, some linear transformations on a single vector space V will have multiplicative inverses: a transformation A-1 such that AA-1 = A-1A = I. Since multiplication is really composition, a necessary and sufficient condition for A to have an inverse is that it be a bijection.

Necessity is easy here: for sufficiency we have to prove that the inverse of A as a function is also linear; i.e. that if A(cx+dy) = cAx + dAy for all scalars c and d and vectors x and y, then the same holds for A-1. Consider some c, d, r, and s where c and d are scalars and r and s are vectors in V. Let z = A-1(cr+ds) - (cA-1r + dA-1s). Then Az = A(A-1(cr+ds) - (cA-1r + dA-1s)) = (cr+ds) - A(cA-1r) - A(dA-1s) = cr + ds - cAA-1r - dAA-1s = cr + ds - cr - ds = 0. So Az = 0. We already know that A0 = 0, so if A is indeed a bijection we must have z = 0 implying A-1(cr+ds) = cA-1r + dA-1s. Since this holds for all c, d, r, and s, we have A-1 is a linear transformation.

From the proof we can see that a necessary condition for A to be invertible is that Az = 0 only when z = 0, i.e. that A is injective; for linear transformations of a finite-dimensional vector space to itself, this condition is also sufficient. (This is also what we'd expect based on the fact that A is a group homomorphism if we forget about scalars.)

Matrices

Suppose that U and V are both finite-dimensional, and we are provided with a basis x1...xn for U and y1...ym for V. In this case we can express any linear transformation from U to V using nm coefficients, which can conveniently be organized as a matrix (see Relations for the definition of a matrix).

Here's the idea: Let A:U->V be linear, and suppose we are given Axj for each basis vector xj. Then given any vector z in U, we can compute Az by applying A to the unique expansion z = ∑zjxj like this:

\[
A\vec{z} = A\left(\sum_{j=1}^{n} z_j x_j\right) = \sum_{j=1}^{n} z_j (A x_j).
\]

But now we can further expand the vectors Axj in V using V's basis vectors. For each j, let Axj = ∑aijyi. Then

\[
A\vec{z} = \sum_{j=1}^{n} z_j (A x_j)
 = \sum_{j=1}^{n} z_j \sum_{i=1}^{m} a_{ij} y_i
 = \sum_{i=1}^{m} \left(\sum_{j=1}^{n} a_{ij} z_j\right) y_i.
\]

In other words, the i-th coordinate of Az is a sum of the coefficients aij times zj over all values of j. Each coefficient aij is in effect specifying how much of yi is contributed by each copy of xj.

Since the aij coefficients have two indices, we usually think of them as organized into an m-by-n matrix:

\[
\left(\begin{array}{ccc}
a_{11}&a_{12}&a_{13}\\
a_{21}&a_{22}&a_{23}
\end{array}\right)
\]

where as usual the first index of each coefficient specifies the row and the second specifies the column. (The rows-before-columns convention is also why we call this an m-by-n matrix instead of an n-by-m matrix when m is the number of rows and n is the number of columns.)

Where it will not cause confusion, it is customary to use the same letter (and capitalization) for both a linear transformation and the matrix that represents it. So we would write (for example):

\[A=
\left(\begin{array}{ccc}
A_{11}&A_{12}&A_{13}\\
A_{21}&A_{22}&A_{23}
\end{array}\right)
\]

Note that a particular matrix representation is only meaningful if we have already fixed a basis for both vector spaces; if for some reason the same transformation A is represented using two different sets of bases, it won't make sense to call the two different matrices that will arise in each by the same letter.

Vectors of dimension m are conventionally represented as m-by-1 matrices, i.e. as a single column:

\[\vec{z} = 
\left(\begin{array}{c}z_1\\z_2\\z_3\end{array}\right)
\]

so that applying A to z looks like this:

\[
A\vec{z} =
\left(\begin{array}{ccc}
A_{11}&A_{12}&A_{13}\\
A_{21}&A_{22}&A_{23}
\end{array}\right)
\left(\begin{array}{c}z_1\\z_2\\z_3\end{array}\right)
=
\left(\begin{array}{c}
(A_{11}z_1+A_{12}z_2+A_{13}z_3)\\
(A_{21}z_1+A_{22}z_2+A_{23}z_3)
\end{array}\right)
\]

where the ugly-looking 2-by-1 matrix on the far right consists of two rows with only one entry in each row.
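
The same computation is easy to do in code. A minimal sketch in plain Python (0-based indices, made-up entries), computing (Az)_i = ∑_j A[i][j]·z[j]:

A = [[1, 2, 3],
     [4, 5, 6]]        # a 2-by-3 matrix: maps 3-dimensional vectors to 2-dimensional ones
z = [1, 0, 2]

# (Az)_i = sum over j of A[i][j] * z[j]
Az = [sum(A[i][j] * z[j] for j in range(len(z))) for i in range(len(A))]
print(Az)              # [7, 16]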

Matrix multiplication

Each operation on linear transformations (addition, scalar multiplication, multiplication) can be expressed in terms of the corresponding matrices.

For addition, we have (A+B)ij = Aij+Bij for each pair of indices i,j. The proof is that the yi coefficient of Az+Bz is given by ∑jAijzj + ∑jBijzj = ∑j(Aij+Bij)zj.

For scalar multiplication, we have (cA)ij = cAij. Here the yi coefficient of (cA)z is c∑jAijzj = ∑j(cAij)zj.

Matrix multiplication, which corresponds to linear transformation multiplication (i.e. composition), is more complicated. Here we assume that yi = xi for all i, or in other words that we have a single basis for our single vector space. Given a vector z, we let zi be the coefficient on xi in the unique expansion z = ∑ zi xi. Now given transformations A and B, we wish to express the composite transformation AB as a matrix, by computing for each i,j the coefficient (AB)ij that expresses how much zj contributes to (ABz)i.

We can compute these coefficients as follows. Start with the fact that

\[
(B\vec{z})_k = \sum_{j=1}^{n} B_{kj} z_j.
\]

Then

\begin{eqnarray*}
(A(B\vec{z}))_i &=& \sum_{k=1}^{n} A_{ik} (B\vec{z})_k \\
&=& \sum_{k=1}^{n} A_{ik} \sum_{j=1}^{n} B_{kj} z_j \\
&=& \sum_{j=1}^{n} \left(\sum_{k=1}^{n} A_{ik} B_{kj} \right) z_j.
\end{eqnarray*}

From this we can immediately read off that

\[(AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.\]

In visual terms, to get the element in the i-th row and j-th column of AB, we take the i-th row of A and the j-th column of B, line them up, multiply matching elements, and sum the result. This operation of taking the sum of pairwise products of vector elements is known as a dot product and is of central importance in linear algebra.
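
The formula translates directly into code. The derivation above is for a single vector space, but the same sum works whenever the dimensions are compatible; here is a minimal sketch in plain Python (0-based indices), using a made-up helper mat_mul:

def mat_mul(A, B):
    """Compute AB where A is m-by-n and B is n-by-p, using
    (AB)[i][j] = sum over k of A[i][k] * B[k][j]."""
    m, n, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

# An arbitrary 2-by-3 times 3-by-2 product.
print(mat_mul([[1, 2, 3],
               [4, 5, 6]],
              [[1, 0],
               [0, 1],
               [1, 1]]))   # [[4, 5], [10, 11]]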

Vectors as matrices

We previously defined, for a linear transformation A from an n-dimensional space to an m-dimensional space and a vector z = ∑j zjxj, the product Az as ∑i (∑j Aijzj) yi, where the xj and yi are the basis vectors of the domain and codomain of A. If we drop the basis vectors and represent z as a matrix in the form of an n×1 column vector with entries zj1 = zj, then Az is an m×1 column vector with entries (Az)i1 = ∑j Aijzj, which are precisely the coordinates of Az. So by representing vectors as matrices in column-vector form we can represent the application of a linear transformation as just another special case of matrix multiplication.

The transpose of a matrix

Given an n×m matrix A, define its transpose AT (sometimes written as A') as the m×n matrix with coefficients (AT)ij = Aji. In other words, in AT we reverse the indices of A, effectively flipping the matrix diagonally.

This is an easy operation to define, but what does it mean? Given an n×1 column vector x, xT is a 1×n row vector. Since it has n columns we can multiply it using ordinary matrix multiplication by an n×1 column vector y, say, giving (xTy)11 = ∑i (xT)1iyi1 = ∑i xiyi. This is the dot product operation mentioned above, and is often abbreviated as x⋅y. Its interpretation is that taking the transpose of x converts it from a vector to a scalar-valued linear function on vectors. The set of all such scalar-valued linear functions on vectors is itself a vector space, called the dual space V* of the original ("primal") vector space V. For finite-dimensional spaces the dual space is isomorphic to its primal space, since (xT)T = x for all x.

What about bigger matrices? If we take some n×m matrix A representing a linear transformation A:U→V and compute its m×n transpose AT, we get a linear transformation from an n-dimensional space to an m-dimensional space. To understand what this matrix does, it is helpful to first look at the transpose of a product: ((AB)T)ik = (AB)ki = ∑j AkjBji = ∑j (BT)ij (AT)jk = (BTAT)ik. Or stated more concisely: (AB)T = BTAT. So now what does AT do to some vector y? We have ATy = (yTA)T. If we think of yT as an element of the dual space V*, then yTA is an element of the dual space U*, since whenever we can apply yT to a vector Ax in V we can apply yTA to the vector x in U: (yTA)x = yT(Ax). So AT acts like a linear transformation from V to U, defined by taking a detour through the dual spaces via ATy = (yTA)T.
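
A quick numerical check of the identity (AB)T = BTAT, and of xTy as the dot product, is easy to run. A minimal sketch assuming NumPy, with randomly generated matrices and made-up vectors:

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))    # a 3-by-4 matrix
B = rng.standard_normal((4, 2))    # a 4-by-2 matrix

assert np.allclose((A @ B).T, B.T @ A.T)   # (AB)^T = B^T A^T

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(x.T @ y, np.dot(x, y))       # both 32.0: the dot product x.y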

The inverse of a matrix

Some matrices are invertible; some (including all the ones that aren't square) are not. To try to invert a matrix, we start with the pair of matrices A, I (where I is the identity matrix defined by Iii = 1 and Iij = 0 when i≠j), and multiply both matrices of the pair from the left by a sequence of transformation matrices B1, B2, ... Bk until BkBk-1⋯B1A = I. At this point the right-hand matrix will be BkBk-1⋯B1 = A-1. (We could just keep track of all the Bi, but it's easier to keep track of their product.)

How do we pick the Bi? These will be matrices that (a) multiply some row by a scalar, (b) add a multiple of one row to another row, or (c) swap two rows. We'll use the first kind to make all the diagonal entries equal one, and the second kind to get zeroes in all the off-diagonal entries. The third kind will be saved for emergencies, like getting a zero on the diagonal.

That the operations (a), (b), and (c) correspond to multiplying by a matrix is provable but tedious.1 Given these operations, we can turn any invertible matrix A into I by working from the top down, rescaling each row i using a type (a) operation to make Aii = 1, then using a type (b) operation to subtract Aji times row i from each row j > i to zero out Aji, then finally repeating the same process starting at the bottom to zero out all the entries above the diagonal. The only way this can fail is if we hit some Aii = 0, which we can swap with a nonzero Aji if one exists (using a type (c) operation). If all the rows from i on down have a zero in the i column, then the original matrix A is not invertible. This entire process is known as Gauss-Jordan elimination.

This procedure can be used to solve systems of linear equations: if Ax = b, we can compute x by first computing A-1 and then multiplying x = A-1Ax = A-1b. If we are not interested in A-1 for its own sake, we can simplify things by substituting b for I during the Gauss-Jordan elimination procedure; at the end, it will be transformed to x.

Example

Original A is on the left, I on the right. We'll work in ℤ5 just to make things more entertaining, and to avoid having to write down a lot of fractions. Recall that in ℤ5 the multiplicative inverses are 2-1 = 3, 3-1 = 2, and 4-1 = 4.

Initial matrices:

2 0 1    1 0 0
1 0 1    0 1 0
3 1 2    0 0 1

Multiply top row by 3:

1 0 3    3 0 0
1 0 1    0 1 0
3 1 2    0 0 1

Subtract top row from middle row and 3*top row from bottom row:

1 0 3    3 0 0
0 0 3    2 1 0
0 1 3    1 0 1

Swap middle and bottom rows:

1 0 3    3 0 0
0 1 3    1 0 1
0 0 3    2 1 0

Multiply bottom row by 2:

1 0 3    3 0 0
0 1 3    1 0 1
0 0 1    4 2 0

Subtract 3*bottom row from top and middle rows:

1 0 0    1 4 0
0 1 0    4 4 1
0 0 1    4 2 0

and we're done. (It's probably worth multiplying the original A by the alleged A-1 to make sure that we didn't make a mistake.)
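
Both the elimination and the final check are easy to script. Here is a minimal sketch in Python (3.8 or later, for the modular inverse via pow), using a hypothetical helper inverse_mod_p; any correct variant of the procedure above would give the same answer:

def inverse_mod_p(A, p):
    """Invert a square matrix over Z_p by Gauss-Jordan elimination."""
    n = len(A)
    # Augment A with the identity matrix; all arithmetic is mod p.
    M = [[A[i][j] % p for j in range(n)] + [int(i == j) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        # Type (c): swap in a row with a nonzero pivot if necessary.
        # (If no such row exists, A is not invertible and next() raises.)
        pivot = next(r for r in range(i, n) if M[r][i] != 0)
        M[i], M[pivot] = M[pivot], M[i]
        # Type (a): scale row i so the pivot becomes 1.
        inv = pow(M[i][i], -1, p)
        M[i] = [x * inv % p for x in M[i]]
        # Type (b): clear column i in every other row.
        for r in range(n):
            if r != i and M[r][i]:
                factor = M[r][i]
                M[r] = [(x - factor * y) % p for x, y in zip(M[r], M[i])]
    return [row[n:] for row in M]

A = [[2, 0, 1], [1, 0, 1], [3, 1, 2]]
print(inverse_mod_p(A, 5))   # [[1, 4, 0], [4, 4, 1], [4, 2, 0]], as above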

Orthogonality

Two vectors x and y in a real vector space are called orthogonal if x⋅y = xTy = 0. This is a symmetric relation: y⋅x = yTx = (xTy)T = (x⋅y)T = x⋅y, since y⋅x and x⋅y are both scalars (disguised as 1×1 matrices) and any scalar equals its own transpose. In Euclidean spaces like our familiar ℝ3, two vectors are orthogonal precisely if they are perpendicular: e.g. (0,0,1)⋅(1,1,0) = 0⋅1+0⋅1+1⋅0 = 0, so a vector sticking directly out of the page (if that's what our third coordinate represents) is orthogonal to a vector running diagonally across the page, just as we would suspect from the orthogonal = perpendicular rule. In other spaces defining orthogonality in this way is more confusing; for example, in (ℤ2)3 the vectors (1,1,1) and (0,1,1) are orthogonal since 1⋅0+1⋅1+1⋅1 = 0+1+1 = 0, but it's not clear how we would draw these to make them look like they have a right angle between them, and the vectors with similar-looking coordinates in ℝ3 are not orthogonal since their dot-product is 2.

An important property of orthogonality is that it is preserved by scalar multiplication and vector addition. This follows from the facts that x⋅(ay) = a(x⋅y) and x⋅(y+z) = x⋅y + x⋅z, which are simply restatements of the fact that xT is a linear transformation from the vector space to its underlying field. A consequence of this fact is that the set of vectors that are orthogonal to a particular vector x form a subspace.

Projections onto lines

Suppose that we can write a vector z = ax + y, where y is orthogonal to x. Then ax is called the projection of z onto the line (1-dimensional subspace) generated by x. The question arises: does such a projection always exist, and if it does, how do we find a?

The answer is that we take the dot-product of z with x. This gives z⋅x = a(x⋅x) + y⋅x = a(x⋅x) (since y⋅x = 0). Solving for a gives a = (z⋅x)/(x⋅x), and it is immediate from this solution that a is in fact the unique scalar such that z = ax + y with y⋅x = 0.

Note that this only works if x⋅x is nonzero. For vector spaces over the rationals or the reals this is always the case when x itself is nonzero; here x⋅x is just the square of the length of x as obtained by the Pythagorean theorem. But it may fail to be the case in spaces over fields with finite characteristic; for example, in (ℤ2)3, (0,1,1)⋅(0,1,1) = 0, and there is no projection of z = (1,0,1) onto the line generated by (0,1,1).2 It can also fail in vector spaces over the complex numbers if we do nothing special about the dot product: when x = (i, 1) ∈ ℂ2, x⋅x = i⋅i + 1⋅1 = -1 + 1 = 0. For this reason the dot-product x⋅y in complex spaces is usually defined by first taking the complex conjugate of x: the vector obtained by substituting -i for each occurrence of i in x. This gives a more reasonable measure of length (now (i,1)⋅(i,1) = -i⋅i + 1⋅1 = 2) and in fact allows projections to be taken in such spaces. But the easier solution, which we will adopt, is simply to stick with real-valued vector spaces when we talk about dot products.

Here's a picture of projection in magnificent ASCII_art:

y      z = ax + y
⋅------⋅
|     /|
|    / |
|   /  |
|  /   |
| /    |
|/     |
⋅------⋅-------⋅
0     ax      x

One interpretation of the projection of z onto x is that a is the least-squares solution to the problem ax = z; that is, it is the solution that minimizes the sum of the squares of the errors in each coordinate (which is just y⋅y, the square of the length of y = z - ax). This is useful for approximating the solution to inconsistent equations. For example, given the equations 2a = 3 and 3a = 4, we can rewrite them as a(2,3) = (3,4) and obtain a = (2,3)⋅(3,4)/(2,3)⋅(2,3) = (6+12)/(4+9) = 18/13. This is not actually a solution, since 2(18/13) = 2.76... is not 3 and 3(18/13) = 4.15... is not 4, but it's pretty close, and it is impossible to do better without increasing the sum of the squares of the errors.
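
The arithmetic in this example is easy to reproduce. A minimal sketch assuming NumPy, which also compares the projection formula with NumPy's own least-squares routine:

import numpy as np

x = np.array([2.0, 3.0])   # coefficients of a in the two equations
z = np.array([3.0, 4.0])   # right-hand sides

a = np.dot(z, x) / np.dot(x, x)
print(a)                   # 18/13 = 1.3846...

# The residual y = z - a*x is orthogonal to x, as the derivation requires.
y = z - a * x
print(np.dot(y, x))        # 0 (up to rounding)

# NumPy's least-squares solver gives the same coefficient.
print(np.linalg.lstsq(x.reshape(-1, 1), z, rcond=None)[0])   # [1.3846...]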

Projections onto higher-dimensional subspaces

What if we want to project onto a subspace that's not a line? We can still do so, but it's helpful to first generate an orthogonal basis for the subspace. This is a basis (a maximal set of independent vectors) that is orthogonal in the sense that any two distinct basis elements xi and xj are orthogonal.

Suppose we have an orthogonal basis, and we want to write z = y + ∑i aixi, where y is orthogonal to all of the xi. To find a particular ai, observe that y + ∑j≠i ajxj is orthogonal to xi. So writing z = aixi + (y + ∑j≠i ajxj) is just an instance of the problem of projecting onto a line, and we've already solved that problem: ai = (z⋅xi)/(xi⋅xi).

What if we don't have an orthogonal basis? Presumably we have some sort of basis, say x1...xk. If it's not orthogonal, we can make it orthogonal via projection: let y1 = x1, and for each i > 1 let yi be the solution to xi = yi + ∑j < i aj yj where yi is orthogonal to all the yj for j < i. Now use the yi as our orthogonal basis and project away.

(In real life, projection problems like this are solved by constructing a matrix A whose columns are the xi vectors, and solving for a = (ATA)-1ATz. The proof that this works is similar to the 1-dimensional case, and the resulting a gives the least-squares solution to the equation Aa = z.)
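
Both routes (orthogonalize the basis first, or use the matrix formula a = (ATA)-1ATz) are easy to try out. A minimal sketch assuming NumPy, with made-up basis vectors x1, x2 and target vector z:

import numpy as np

x1 = np.array([1.0, 1.0, 0.0])
x2 = np.array([1.0, 0.0, 1.0])
z  = np.array([1.0, 2.0, 3.0])

# Gram-Schmidt: make the basis orthogonal, then project onto each line.
y1 = x1
y2 = x2 - (np.dot(x2, y1) / np.dot(y1, y1)) * y1
proj = (np.dot(z, y1) / np.dot(y1, y1)) * y1 + (np.dot(z, y2) / np.dot(y2, y2)) * y2

# Matrix route: the columns of A are the original basis vectors.
A = np.column_stack([x1, x2])
a = np.linalg.inv(A.T @ A) @ A.T @ z
assert np.allclose(A @ a, proj)                      # same projection either way
print(np.dot(z - proj, x1), np.dot(z - proj, x2))    # both 0: residual is orthogonal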

Further reading

Linear algebra is vitally important in ComputerScience: it is a key tool in graphics, scientific computing, robotics, neural networks, and many other areas. If you do further work in these areas, you will quickly find that we have not covered anywhere near enough linear algebra in this course. Your best strategy for remedying this deficiency may be to take an actual linear algebra course; failing that, a very approachable introductory text is Linear Algebra and Its Applications, by Gilbert Strang.

Some other useful books on linear algebra:

Matlab (which is available on the Zoo machines: type matlab at a shell prompt) is useful for playing around with operations on matrices. There are also various non-commercial knockoffs like Scilab or Octave that are not as comprehensive as Matlab but are adequate for most purposes. Note that with any of these tools, if you find yourselves doing lots of numerical computation, it is a good idea to talk to a numerical analyst about round-off error: the floating-point numbers inside computers are not the same as real numbers, and if you aren't careful about how you use them you can get very strange answers.


CategoryMathNotes

  1. The tedious details: to multiply row r by a, use a matrix B with Bii = 1 when i≠r, Brr = a, and Bij = 0 for i≠j; to add a times row r to row s, use a matrix B with Bii = 1 for all i, Bsr = a, and Bij = 0 for all other pairs ij; to swap rows r and s, use a matrix B with Bii = 1 for i∉{r,s}, Brs = Bsr = 1, and Bij = 0 for all other pairs ij. (1)

  2. To check this, observe that the line generated by x = (0,1,1) contains only (0,0,0) and (0,1,1). But z-(0,0,0) = z is not orthogonal to x, and neither is z - (0,1,1) = (1,1,0). (2)

