Hacker News 1h ago

Deriving SVD without even aiming at it

Connections in Math: deriving the SVD from scratch

Disclaimer: no AI was used to write this. Any errors, awkward sentences, and weird tangents are 100% organic, free-range, and human-made.

How to actually learn anything?

For a long time I have always been asking myself what it really means to learn or grasp something in math. I think this can be extended to other areas, but math is very often particularly distant from daily ordinary life when it comes to many of its concepts, and it's hard for me (at least) to have an intuitive and natural understanding of it - the type of understanding that gives you some sort of foundational structure of the concepts and allows you to just reconstruct its details by yourself later in the future, with none or limited consultation.

I have experienced a few times this feeling of having really captured some sort of general structure in some areas, that allows me to express myself or deduce facts only consulting what I currently understand about it. One good example is my first experience with Real Analysis. At first, it looked very confusing and difficult, but after a while you capture some sort of structure of the main concepts and you can mostly run the concepts yourself.

But one thing took me too long to understand: traditional math books very often fail to give you a path that allows you to capture structure, or to grasp concepts in a more intuitive or natural way (that would allow you to reconstruct it later). This is because most math books present math in its final formalized version. But they hide the path that was taken to get there: often messy, experimental, trial-and-error and tentative.

This blog assumes you already have basic knowledge in math. It's just not practical to teach everything from the ground up, but my goal is to write here some of the connections that helped me grasp concepts.

Starting Somewhere

I've chosen to start from Linear Algebra (L.A). It's one of the most applicable and accessible areas in math. When I was introduced to a central Linear Algebra concept called Singular Value Decomposition (SVD), it took me time to really understand what it is, and I think most books fail to explain it in a simple way simply because they usually start from the final conclusion formally stated, which makes no sense to the reader.

I think it's critical to introduce motivation on why this was created in the first place. Most things in applied math came from a specific problem someone was trying to solve. Linear Algebra is a great place to start not only because it's somehow simpler than more advanced non-linear areas, but also because it's connected to so many other areas, such as calculus, information theory, image processing, machine learning, and many others. It's some sort of basic or central area, although I think no matter where you start, if you continue to dig enough, you will reach connections to other areas - and in L.A this is even more evident.

Let's arrive naturally at SVD without even aiming at it.

The matrix of a Linear Transformation can be written differently for the same L.T depending on the chosen input and output bases. To see that, take a deliberately simple linear map - stretch the -axis by 3, leave alone:

In the standard basis, its matrix just reads off where the basis vectors land:

Nothing could be clearer: "stretch one axis."

Now describe the same map in a skewed basis: choose a slanted, non-orthonormal basis:

The matrix of the same transformation in this basis is :

Same transformation. Same map on every actual vector. But now it looks like an arbitrary tangle of scaling, shearing, and sign flips - you'd never guess it was just "stretch by 3." The complexity was never in the map. It was in the basis.

Sometimes we can diagonalize a matrix, by just choosing a new basis that makes the matrix of diagonal (this can only be done if admits an Eigen Vector basis, but we will leave this fact unproved for now). Of course, not all L.T allow for a basis of Eigen Vectors. In fact, some L.T are not even square (e.g. input dimension is not the same as output dimension). We cannot even talk about Eigen Vectors for L.Ts that don't output to the same dimensional space, because an Eigen Vector stays fixed up to a scale, so it obviously restricts input and output to the same dimension.

When we can diagonalize a matrix, something interesting happens: we can clearly see that the Linear Transformation itself is basically a simple non-uniform scale, disguised as something more complicated due to the basis chosen to represent the matrix (the opposite of the operation we did before, where we showed a simple scale can be represented in a more complicated matrix form). So we take the matrix of a diagonalizable L.T , and write:

and then is similar to a diagonal matrix. So the potentially complicated is just a few numbers on the matrix's diagonal and zeros everywhere else, up to a special change of basis matrix . This is perfect for storing in a computer and to perform matrix multiplication efficiently. We could convert our data to the reference frame of once, make millions of matrix multiplications using a diagonal matrix (easy), then convert back to the original frame using . This simplification is possible because we could extract structure from the L.T .

When do we fail to diagonalize a matrix?

Well, if a diagonalizable matrix is just a disguised scale in some special coordinate frame, then we fail to diagonalize when the L.T is anything more than a disguised scale (actually, this is equivalent to saying that cannot admit a basis of Eigen Vectors - try to prove that yourself). For example, if is a 2-d rotation, it does not have two independent Eigen Vectors (it has no real eigenvectors at all; a rotation moves every direction in the plane!). If is a 2-d shear, it surely cannot be diagonalized either, for a similar reason.

So can we still extract structure from just any L.T in the same way we did with diagonalizable L.Ts? If we could, then we might get some advantages, as we did with diagonal matrices…

Intuitively, we can just look at what happens to an -dimensional unit sphere through a generic L.T: it can be non-injective, that is, some directions might be annihilated (collapsed to 0 - so a sphere would flatten into a disk); it can scale other directions by an arbitrary value (so a sphere or a disk would become an ellipsoid or ellipse); it might rotate other directions by some arbitrary angle through arbitrary axes; it might also shear the sphere into other directions (a sheared sphere is still a rotated ellipsoid).

This simple fact can be written as below:

where is a scalar - the stretch factor - and is a unit output direction.

That's it. We have detached and , and we don't demand that they are Eigen Vectors anymore; they are just arbitrary vectors. The expression above introduces no new fact, it's just what happens to any L.T, but it indeed captures our feeling that unit spheres are mapped to rotated ellipsoids in some way, if you see as scale factors and as the new rotated directions.

Our feeling now is that a L.T might perform several types of operations (rotation, shear, scale) that, similarly to our diagonalization case, can possibly be decomposed and isolated, revealing some simpler structure in what initially seemed a complicated transformation.

As we saw earlier, skewed bases are not the best to represent a L.T matrix, because they mix the coordinate basis vectors: each basis vector has a non-zero projection onto the others. The obvious next step here is to see what happens when the bases and are orthonormal.

Can we always ask and to be orthonormal? We will see we can, and that's an interesting and neat fact.

Look at what we are asking. For the output vectors to be orthonormal:

and at the same time we want the input vectors to be orthonormal:

Now, something very interesting happens. Rewrite the inner product of the outputs by moving to the other side (using ):

So far, nothing new - just rewriting the exact same expression. But now shows up explicitly, and it's a symmetric matrix! Take any (potentially non-square) matrix and form , and you will end up with a symmetric matrix.

Now, a very important fact here is that a symmetric matrix always admits an orthonormal basis of Eigen Vectors (this is the spectral theorem). So let us choose the to be exactly the orthonormal Eigen Vectors of . This immediately gives us (2), the input orthonormality, for free.

But here is the key point: this choice also gives us (1), the output orthogonality, automatically. Watch what happens to the off-diagonal case . Since is an eigenvector of , we have , so:

The outputs are orthogonal precisely because the inputs were orthogonal eigenvectors of . We did not have to demand the output orthogonality separately - it fell out of the input choice. And the lengths of those outputs are exactly the scale factors: , so , and normalizing gives the orthonormal output directions .

So we could not, in general, find an orthonormal basis that itself diagonalizes - but we can always find one for the symmetric matrix , and that is exactly the input basis we were looking for.

The rank of A and the rank of AᵀA

Another remarkable fact is that the rank of is the same as the rank of , and we will need it soon, so let's prove it first.

Consider . By the rank–nullity theorem,

where is the input dimension (the kernel lives in the input space ).

Now notice that is (since is and is ), so it also maps , and the same rank–nullity theorem gives

Both maps have the same domain dimension , so to prove the ranks are equal it is enough to prove the kernels are equal:

Proof.

(1) First, - the easy direction:

(2) Now - the less trivial direction. At first it seems nothing stops a vector that is not in the kernel of from having land in the kernel of , in which case even though . But it turns out this cannot happen.

Suppose . Then , so . Two cases:

either , so and we are done;
or , which would put a nonzero vector in both (since is, by definition, an output of ) and .

But this second case is impossible, because (a subspace and its orthogonal complement share only the zero vector). So , i.e. .

A simpler, self-contained way to prove this second direction is to use the positive-definiteness of the inner product directly:

The last step is exactly positive-definiteness: only the zero vector has zero length. Either route works; the first connects to the four fundamental subspaces, the second goes straight to the inner product.

Taking stock: what we managed to extract

Before assembling anything, let's pause and list what we now know about any linear map (an matrix). We set out to do for a general map what diagonalization did for the nice ones - extract structure - and it turns out we can extract quite a lot:

Orthonormal frames on both sides. Because is symmetric, the spectral theorem hands us an orthonormal basis of input directions (its eigenvectors). Pushing them through and normalizing gives an orthonormal basis of output directions . Both frames clean, no skew.
A rank that lines up. We proved . So exactly of the eigenvalues are nonzero, giving exactly nonzero scale factors . The other input directions are the kernel - sends them to zero.
A clean bijection in the middle. Those surviving input directions are linearly independent (they are part of an orthonormal basis), and they map to linearly independent output directions, one for one: . On the directions that matter, is just a clean, invertible scaling between two orthonormal frames - everything else it annihilates.

That last point is the heart of it. Stripped of the kernel, does nothing exotic: it takes perpendicular input directions to perpendicular output directions, stretching each by its own factor . The "complicated transformation" was, once again, a simple scaling in disguise - we just needed two orthonormal frames instead of one to see it.

Collecting it into a factorization

Now we collect these relations into a single matrix equation. Notice this is exactly the same relation we wrote down at the very start - only now we know the and can be chosen orthonormal, and the scale is .

We have, one per surviving direction:

with orthonormal in the input space and orthonormal in the output space.

Put the as the columns of () and the as the columns of (). Applying to each column of is just , and the right-hand side - each scaled by - is with its columns scaled, which is exactly where the diagonal comes from:

So .

We are not collecting two matrices and then separately decomposing a third - the diagonal appears on its own, simply because scaling each column of by its factor is the same as multiplying on the right by a diagonal matrix. The per-direction scales collapse onto the diagonal for free.

To isolate , multiply on the right by :

Here we must be careful: has orthonormal columns, so , but is not the full identity - it is the orthogonal projection onto the span of the . That span is the row space of , and its complement is the kernel, which kills anyway. So projecting it away changes nothing, , and we arrive at three width- pieces: an orthonormal output frame , a diagonal of stretches , and an orthonormal input frame .

The same "extract the structure" move we made for diagonalizable maps - now done for any linear map at all.

The geometry behind it: the four subspaces

We leaned on one fact in the rank proof - that - without really looking at it. It is worth pausing on, because it carries a lot of geometric insight, and it explains why the kernel directions could be dropped so cleanly when we built the thin SVD.

A linear map splits each of its two spaces into an alive part and a dead part. In the input space , the alive part is the row space (the directions actually acts on) and the dead part is the kernel (the directions sends to zero). In the output space , the alive part is the image (the directions can reach) and the dead part is (the directions no input ever maps onto).

And

Read on Hacker News ↗ ← Back to News

Deriving SVD without even aiming at it