Principal Components Analysis
Department of Statistics and Applied Probability; UCSB
Summer Session A, 2025
\[ \newcommand\R{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\F}{\mathcal{F}} \newcommand{\1}{1\!\!1} \newcommand{\comp}[1]{#1^{\complement}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\SD}{\mathrm{SD}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\tmat}[1]{\mathbf{#1}^{\mathsf{T}}} \newcommand{\vect}[1]{\vec{\boldsymbol{#1}}} \newcommand{\tvect}[1]{\vec{\boldsymbol{#1}}^{\mathsf{T}}} \DeclareMathOperator*{\argmax}{\mathrm{arg} \ \max} \]
Data may be high dimensional in the sense of containing many variables.
It is possible, though, to reduce the dimensionality of a dataset while not losing too much information.
More formally, we considered the idea of projecting our data matrix onto a lower-dimensional subspace so as to preserve as much variance as possible.
Given a mean-centered data matrix \(\mat{X}\), we saw that the unit vector \(\vect{v}\) such that the projected data \(\mat{X} \vect{v}\) has maximal variance is the eigenvector of \((\tmat{X} \mat{X})\) with the largest associated eigenvalue.
For example, suppose we have a 6-dimensional data matrix \(\mat{X}\) (i.e. a matrix with 6 columns), assumed to be mean-centered.
A two-dimensional representation of \(\mat{X}\) would be found by: computing the two eigenvectors of \((\tmat{X} \mat{X})\) with the largest associated eigenvalues, collecting them as the columns of a (6 × 2) matrix \(\mat{V}_2\), and forming the projected data \(\mat{X} \mat{V}_2\).
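Here is a rough sketch of those steps in R; the synthetic data and object names are purely illustrative assumptions (we'll see the standard prcomp() workflow shortly).

## a small mean-centered data matrix with 6 columns, for illustration only
X  <- scale(matrix(rnorm(10 * 6), nrow = 10), center = TRUE, scale = FALSE)
E  <- eigen(t(X) %*% X)      ## eigendecomposition of X^T X (eigenvalues in decreasing order)
V2 <- E$vectors[, 1:2]       ## eigenvectors with the two largest eigenvalues
Z2 <- X %*% V2               ## the two-dimensional representation of X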
Admittedly, the terminology surrounding PCA is a bit varied.
Here is the terminology we’ll adopt for this class: the kth Principal Component (PC) of \(\mat{X}\) is the unit eigenvector \(\vect{v}_k\) of \((\tmat{X} \mat{X})\) associated with the kth largest eigenvalue \(\lambda_k\), and \(\mat{X} \vect{v}_k\) is the data projected along the kth PC.
Again, some people use the term “Principal Component” to mean other things - we’re adopting the above terminology for this class.
Over the next few slides, we’re going to establish some linear algebra results that connect various properties of matrices.
A question we will often have to contend with is: is our data mean-centered or not?
I’ll do my best to be explicit about whether or not we are assuming mean-centered data.
As a general rule-of-thumb: most of the linear algebra results we will derive hold for any arbitrary data matrix, mean-centered or not.
Let’s start by assuming an arbitrary (not necessarily mean-centered) data matrix \(\mat{X}\). How can we go about finding the eigenvalues and eigenvectors of \((\tmat{X} \mat{X})\)?
Recall (from Math 4A and Homework 1) that there are two main ways to decompose a matrix: eigendecomposition (EVD) and singular value decomposition (SVD).
Also recall that the SVD can be thought of as a “generalization” of the EVD.
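For instance, the output below can be reproduced by taking the SVD of the 3 × 3 matrix that appears in the decomposition on the next slide; here is a sketch of the call that presumably generated it:

X <- matrix(c(1, 2, 3,
              0, 1, 1,
              2, 1, 2),
            nrow = 3, byrow = TRUE)
svd(X)   ## returns a list with components $d, $u, and $v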
$d
[1] 4.8348990 1.2637105 0.1636685
$u
[,1] [,2] [,3]
[1,] -0.7659032 -0.4194393 -0.4873018
[2,] -0.2615772 -0.4890773 0.8320942
[3,] -0.5873412 0.7647706 0.2648701
$v
[,1] [,2] [,3]
[1,] -0.4013705 0.8784465 0.2592944
[2,] -0.4924042 -0.4456599 0.7476131
[3,] -0.7722952 -0.1723921 -0.6114255
\[ \begin{align*} \begin{pmatrix} 1 & 2 & 3 \\ 0 & 1 & 1 \\ 2 & 1 & 2 \\ \end{pmatrix} & = \left(\begin{array}{rrr} -0.766 & -0.419 & -0.487 \\ -0.262 & -0.489 & 0.832 \\ -0.587 & 0.765 & 0.265 \\ \end{array} \right) \begin{pmatrix} 4.835 & 0 & 0 \\ 0 & 1.264 & 0 \\ 0 & 0 & 0.164 \end{pmatrix} \left( \begin{array}{rrr} -0.401 & -0.492 & -0.772 \\ 0.878 & -0.446 & -0.172 \\ 0.259 & 0.748 & -0.611 \end{array} \right) \\[3mm] \mat{X} \hspace{12mm} & = \hspace{35mm} \mat{U} \hspace{68mm} \mat{\Sigma} \hspace{69mm} \tmat{V} \\ \end{align*} \]
Moral
If the SVD of \(\mat{X}\) is given by \(\mat{X} = \mat{U} \mat{\Sigma} \tmat{V}\), then \(\mat{\Sigma}^2\) is the diagonal matrix of eigenvalues of \((\tmat{X} \mat{X})\), and the columns of \(\mat{V}\) are the corresponding eigenvectors of \((\tmat{X} \mat{X})\).
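We can sanity-check this moral numerically; a quick sketch, assuming X holds the 3 × 3 example matrix from the previous slide:

SVD_X <- svd(X)
SVD_X$d^2                    ## squared singular values ...
eigen(t(X) %*% X)$values     ## ... equal the eigenvalues of X^T X
SVD_X$v                      ## right singular vectors ...
eigen(t(X) %*% X)$vectors    ## ... equal the eigenvectors of X^T X (up to sign)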
In R, PCA is carried out using the prcomp() function; we'll put prcomp() to work in the worked example and demos below.

Before that, note what happens when we project onto all p principal components at once. Writing \(\mat{Z} := \mat{X} \mat{V}\) for the projected data, the orthogonality of \(\mat{V}\) (i.e. \(\mat{V} \tmat{V} = \mat{I}\)) lets us recover \(\mat{X}\) exactly: \[\begin{align*} \mat{X} \mat{V} & := \mat{Z} \\ \mat{X} \mat{V} \tmat{V} & = \mat{Z} \tmat{V} \\ \mat{X} & = \mat{Z} \tmat{V} \end{align*}\]
Perhaps I should explain a bit further what I mean by a “reconstruction” of \(\mat{X}\). Let \(\mat{V}_d\) denote the (p × d) matrix whose columns are the first d principal components (i.e. the first d columns of \(\mat{V}\)); the d-dimensional reconstruction of \(\mat{X}\) is then \(\mat{X}_d := (\mat{X} \mat{V}_d) \tmat{V}_d\), obtained by projecting the data down onto d dimensions and then backwards-projecting into the original p-dimensional space.
By properties of matrix multiplication, \(\mat{X}_d\) will be of size (n × p), i.e. the same size as the original \(\mat{X}\).
However, since \(\mat{V}_d\) has rank d, \(\mat{X}_d\) will have rank at most d (and typically exactly d).
So, this is what I mean by a “low-rank reconstruction” of \(\mat{X}\): a matrix whose dimensions are the same as \(\mat{X}\), but whose rank is smaller than that of \(\mat{X}\).
We can define the reconstruction error to be \(\mathrm{R}(\mat{X}_d, \mat{X}) := \| \mat{X}_d - \mat{X} \|^2\), where \(\| \cdot \|\) denotes an appropriately-defined matrix norm (e.g. the Frobenius norm, the square root of the sum of squared entries).
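Here is a minimal sketch of forming \(\mat{X}_d\) and its reconstruction error in R; the synthetic matrix, the names X_c, V_d, X_d, and the choice of the Frobenius norm are all illustrative assumptions.

set.seed(1)                                     ## small synthetic example
X   <- matrix(rnorm(10 * 5), nrow = 10)         ## a 10 x 5 data matrix
X_c <- scale(X, center = TRUE, scale = FALSE)   ## mean-center the columns
V_d <- prcomp(X)$rotation[, 1:2]                ## first d = 2 principal components
X_d <- (X_c %*% V_d) %*% t(V_d)                 ## rank-2 reconstruction
sum((X_d - X_c)^2)                              ## squared Frobenius reconstruction error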
Whew, that’s a lot of math!
Let’s work through an example together. We’ll start off with \[ \mat{X} = \begin{pmatrix} 1 & 2 & 3 & 1 \\ 2 & 0 & 1 & 2 \\ 3 & 0 & 0 & 1 \\ 2 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 2 & 1 & 1 & 1 \\ \end{pmatrix} \]
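The centered matrix displayed below can be reproduced in R along the following lines (a sketch: the attr(,"scaled:center") line in the output suggests scale() was used for the mean-centering):

X <- matrix(c(1, 2, 3, 1,
              2, 0, 1, 2,
              3, 0, 0, 1,
              2, 1, 1, 0,
              0, 1, 0, 0,
              2, 1, 1, 1),
            nrow = 6, byrow = TRUE)
X_c <- scale(X, center = TRUE, scale = FALSE)   ## subtract each column's mean
X_c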
[,1] [,2] [,3] [,4]
[1,] -0.6666667 1.1666667 2 0.1666667
[2,] 0.3333333 -0.8333333 0 1.1666667
[3,] 1.3333333 -0.8333333 -1 0.1666667
[4,] 0.3333333 0.1666667 0 -0.8333333
[5,] -1.6666667 0.1666667 -1 -0.8333333
[6,] 0.3333333 0.1666667 0 0.1666667
attr(,"scaled:center")
[1] 1.6666667 0.8333333 1.0000000 0.8333333
Multiply \(\mat{X}\) by the matrix containing the first two principal components:
Backwards-project using the first two principal components:
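A sketch of these two steps in R, assuming X and X_c from the previous slide; prcomp() (which mean-centers internally) is used here to obtain the principal components, and the object names are illustrative:

PCA_X <- prcomp(X)                ## principal components of X
V_2   <- PCA_X$rotation[, 1:2]    ## first two principal components
Z_2   <- X_c %*% V_2              ## project onto the first two PCs
X_2   <- Z_2 %*% t(V_2)           ## backwards-project: the rank-2 reconstruction
X_2

Swapping 1:2 for 1:3 or 1:4 gives the three- and four-component versions shown on the next slides.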
[,1] [,2] [,3] [,4]
[1,] -0.6824534 1.15684238 1.99860514 0.1849387
[2,] 0.9013719 -0.43233075 0.01912272 0.5477820
[3,] 1.1362868 -0.97030623 -1.00802682 0.3830813
[4,] -0.1577576 0.01711215 -0.14544752 -0.1382030
[5,] -1.4188787 0.29648176 -0.96215851 -1.1399328
[6,] 0.2214310 -0.06779931 0.09790498 0.1623338
[,1] [,2] [,3] [,4]
[1,] -0.6666667 1.1666667 2 0.1666667
[2,] 0.3333333 -0.8333333 0 1.1666667
[3,] 1.3333333 -0.8333333 -1 0.1666667
[4,] 0.3333333 0.1666667 0 -0.8333333
[5,] -1.6666667 0.1666667 -1 -0.8333333
[6,] 0.3333333 0.1666667 0 0.1666667
attr(,"scaled:center")
[1] 1.6666667 0.8333333 1.0000000 0.8333333
Multiply \(\mat{X}\) by the matrix containing the first three principal components:
Backwards-project using the first three principal components:
[,1] [,2] [,3] [,4]
[1,] -0.6668552 1.16553247 2.00064836 0.16605904
[2,] 0.3488162 -0.74016994 -0.05325683 1.21657701
[3,] 1.3283527 -0.86330271 -0.98286801 0.15061120
[4,] 0.3560513 0.30336477 -0.07814344 -0.76010018
[5,] -1.6681744 0.15759445 -0.99481387 -0.83819359
[6,] 0.3018094 -0.02301903 0.10843378 0.06504651
[,1] [,2] [,3] [,4]
[1,] -0.6666667 1.1666667 2 0.1666667
[2,] 0.3333333 -0.8333333 0 1.1666667
[3,] 1.3333333 -0.8333333 -1 0.1666667
[4,] 0.3333333 0.1666667 0 -0.8333333
[5,] -1.6666667 0.1666667 -1 -0.8333333
[6,] 0.3333333 0.1666667 0 0.1666667
attr(,"scaled:center")
[1] 1.6666667 0.8333333 1.0000000 0.8333333
Multiply \(\mat{X}\) by the matrix containing all four principal components:
PC1 PC2 PC3 PC4
[1,] -2.22141095 0.9476208 0.02606604 0.001453105
[2,] 0.78993734 0.8220772 -0.92336792 -0.119358571
[3,] 1.82811089 0.2313481 0.32095854 0.038396007
[4,] -0.02366601 -0.2547062 0.85861868 -0.175134144
[5,] -0.49186600 -2.0209854 -0.41659441 0.011623098
[6,] 0.11889473 0.2746455 0.13431906 0.243020505
Backwards-project using all four principal components:
[,1] [,2] [,3] [,4]
[1,] -0.6666667 1.1666667 2.000000e+00 0.1666667
[2,] 0.3333333 -0.8333333 7.535993e-17 1.1666667
[3,] 1.3333333 -0.8333333 -1.000000e+00 0.1666667
[4,] 0.3333333 0.1666667 -3.215917e-17 -0.8333333
[5,] -1.6666667 0.1666667 -1.000000e+00 -0.8333333
[6,] 0.3333333 0.1666667 -3.564466e-17 0.1666667
[,1] [,2] [,3] [,4]
[1,] -0.6666667 1.1666667 2 0.1666667
[2,] 0.3333333 -0.8333333 0 1.1666667
[3,] 1.3333333 -0.8333333 -1 0.1666667
[4,] 0.3333333 0.1666667 0 -0.8333333
[5,] -1.6666667 0.1666667 -1 -0.8333333
[6,] 0.3333333 0.1666667 0 0.1666667
attr(,"scaled:center")
[1] 1.6666667 0.8333333 1.0000000 0.8333333
So, as the last demo illustrated: as we increase the dimension onto which we are projecting, the reconstructed matrix will become more and more similar to the original matrix.
As always, there’s a tradeoff: keeping more principal components yields a more faithful reconstruction, but less dimension reduction. So, how many principal components should we keep?
To answer this question, we’ll go back to a fact from yesterday’s lecture: the variance of the data projected along the kth principal component is proportional to \(\lambda_k\), the associated eigenvalue.
Therefore, the proportion of the total variance captured by the kth principal component (i.e. the proportion of the total variance present in the variance of the data projected along the kth PC) is given by \[ s_k := \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j} = \frac{\sigma_k^2}{\sum_{j=1}^{p} \sigma_j^2} \] where \(\sigma_k\) is the kth singular value of \(\mat{X}\) and p is the number of columns (i.e. the number of PCs).
A plot of \(s_k\) vs. \(k\) is called a screeplot, named after a particular rock formation.
library(MASS)        ## for mvrnorm()
library(tidyverse)   ## for %>% and ggplot()

set.seed(100) ## for reproducibility
S_Mat <- toeplitz(c(10, rep(1, 5)))               ## 6 x 6 covariance matrix
X <- mvrnorm(n = 10, mu = rep(0, 6), Sigma = S_Mat)
PCA_X <- prcomp(X, scale. = TRUE)
s_k <- PCA_X$sdev^2 / sum(PCA_X$sdev^2)           ## proportion of variance per PC

data.frame(k = 1:ncol(X), s_k = s_k) %>%
  ggplot(aes(x = k, y = s_k)) +
  geom_point(size = 5) + geom_line(linewidth = 1) +
  theme_minimal(base_size = 24) + ylab("prop. var") +
  ggtitle("Example Screeplot")
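To see how much total variance the leading PCs capture cumulatively (e.g. the figure quoted on the next slide), one could additionally inspect:

cumsum(s_k)   ## cumulative proportion of variance captured by the first k PCs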
The first five PCs capture nearly 94% of the total variance!
So, for this matrix, around 5 dimensions is sufficient; the sixth PC contributes very little toward the total variance.
Live Demo
Time for another live demo! Feel free to boot up your laptops and follow along. In this demo we’ll take a look at the notion of reconstruction error, which essentially is a measure of how poorly our reconstructed matrix is doing at approximating the original matrix.
Background: The MNIST (Modified National Institute of Standards and Technology) database contains around 70,000 handwritten digits, collected from a combination of high school students and US Census Bureau employees. Each digit is stored as a 28px by 28px image, along with a class label indicating which digit the image is supposed to be.
PSTAT 100 - Data Science: Concepts and Analysis, Summer 2025 with Ethan P. Marzban