The Geometry of Data
Department of Statistics and Applied Probability; UCSB
Summer Session A, 2025
\[ \newcommand\R{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\F}{\mathcal{F}} \newcommand{\1}{1\!\!1} \newcommand{\comp}[1]{#1^{\complement}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\SD}{\mathrm{SD}} \newcommand{\vect}[1]{\vec{\boldsymbol{#1}}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\tmat}[1]{\mathbf{#1}^{\mathsf{T}}} \newcommand{\tvect}[1]{\vec{\boldsymbol{#1}}^{\mathsf{T}}} \DeclareMathOperator*{\argmax}{\mathrm{arg} \ \max} \]
How many of you have heard the term big data?
At the most basic level, big data is data that is big. But what do we really mean by “big,” especially since a dataset’s size is determined by both the number of observations and the number of variables?
Essentially there are three situations to consider:
With the continued improvement of computers and computing, the first two cases are not as great a concern as they were, say, 10 years ago.
However, the third case is still an active area of research.
So, a natural question arises: given a dataset, are all dimensions (i.e. variables) really necessary to convey all of the relevant information?
The answer, in some cases, turns out to be “no.”
We’ll spend the better part of two lectures addressing the details behind this answer.
There are a few things we’ll need to discuss first.
The very first thing we’ll do today is to establish a more mathematical framework for discussing data.
First, let’s take a brief interlude to talk about some numerical summaries of data.
Given a list x
= (x
1, …, x
n) of n numbers, there are a series of numerical summaries we can provide.
First question: what is the “center” (or “most typical value”) of \(\vect{x}\)? (This relates to measures of central tendency.)
Two main answers: median and mean.
To find the median, line up the data in ascending order and then tick off the first and last elements; repeat this process until you are either left with one value (the median), or two values (in which case you add these two values and divide by 2 to obtain the median).
For example, the median of the set \(\{-1, 4, 5, 6, 10\}\) is 5 and the median of the set \(\{-1, 4, 5, 6, 10, 12\}\) is 5.5 (I encourage you to do this computation on your own, as practice).
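As a quick check in R (which we have been using in labs), the two medians stated above can be verified directly:

```r
median(c(-1, 4, 5, 6, 10))       # odd count: the middle value, 5
median(c(-1, 4, 5, 6, 10, 12))   # even count: average of the two middle values, 5.5
```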
The definition of the sample mean is \[ \overline{x}_n := \frac{1}{n} \sum_{i=1}^{n} x_i \]
So, for example, the mean of \(\{-1, 4, 5, 6, 10\}\) is 4.8, which we calculated by summing up all five elements and then dividing by 5.
In R, we use mean() and median() to compute the mean and median (respectively) of a set of numbers, as we saw in Lab01.

To express how “spread out,” or “variable,” \(\vect{x}\) is, there are three common measures: the range, the variance (along with its square root, the standard deviation), and the interquartile range (IQR).
The variance is perhaps the most commonly used, as it has a simple interpretation as the “average squared distance from the center.”
In R, these are computed using diff(range()), var(), and IQR(), respectively; a short example is given below.
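A minimal sketch, using the five numbers from the earlier example (the name x below is just a placeholder):

```r
x <- c(-1, 4, 5, 6, 10)

# Measures of center
mean(x)            # 4.8
median(x)          # 5

# Measures of spread
diff(range(x))     # range: max minus min, here 11
var(x)             # sample variance (divides by n - 1)
sqrt(var(x))       # standard deviation; equivalently sd(x)
IQR(x)             # interquartile range
```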
Your Turn!
Consider the list of numbers \(\vect{x} = (1.1, \ 2.4, \ 5.6, \ 7.8, \ 100.1)\).

Calculate both the mean and median of \(\vect{x}\). Which do you think is a “better” description of the central value of \(\vect{x}\)?

Calculate the range, standard deviation (square root of variance), and IQR of \(\vect{x}\). Which do you think is a “better” description of the spread of \(\vect{x}\)?
Name | Height (in) | Weight (lbs) |
---|---|---|
Alex | 61.5 | 130.3 |
Biyonka | 72.4 | 180.6 |
Catherine | 58.4 | 86.7 |
If we remove the Name column and any extraneous formatting information (cell borders, column titles, etc.), what mathematical object are we left with?
More generally, consider a dataset with \(n\) observations and \(p\) variables.

The data matrix is the \((n \times p)\) matrix \(\mat{X} = \{x_{ij}\}\) such that \(x_{ij}\) is the \(i\)th observation on the \(j\)th variable.
There are two ways to think of a data matrix: we can call these the “row-wise” and “column-wise” viewpoints.
For illustrative purposes, consider again our mock height-weight dataset from before: \[ \mat{X} = \begin{pmatrix} 61.5 & 130.3 \\ 72.4 & 180.6 \\ 58.4 & 86.7 \\ \end{pmatrix} \]
The row-wise viewpoint says: our dataset consists of three (transposed) vectors in \(\R^2\): \[ \left\{ \begin{pmatrix} 61.5 \\ 130.3 \\ \end{pmatrix}, \ \begin{pmatrix} 72.4 \\ 180.6 \\ \end{pmatrix} , \ \begin{pmatrix} 58.4 \\ 86.7 \\ \end{pmatrix} \right\} \]
Cloud of Individuals: each point represents an individual in the dataset.
Cloud of Variables: each point represents the direction of a variable in the dataset (“how can the variable be described based on the individuals?”).
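To make the two viewpoints concrete, here is a small R sketch using the mock height-weight data (object and dimension names are just placeholders):

```r
# 3 individuals (rows) measured on 2 variables (columns)
X <- matrix(
  c(61.5, 130.3,
    72.4, 180.6,
    58.4,  86.7),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Alex", "Biyonka", "Catherine"),
                  c("height_in", "weight_lbs"))
)

X[1, ]   # row-wise viewpoint: one individual, a point in R^2
X[, 1]   # column-wise viewpoint: one variable, a point in R^3
```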
There isn’t necessarily one viewpoint (row-wise or column-wise) that is always “better” than the other.
When dealing with numerical considerations, however, the column-wise viewpoint is often preferred.
Firstly, all values in a column of a data matrix will be of the same units; this is not the case for the values across a row of the data matrix.
Secondly, the column-wise viewpoint allows us to compare variables as opposed to comparing individuals (observational units).
So, let’s stick with the column-wise viewpoint for now.
It turns out that many of the summary statistics we talked about earlier today have very nice correspondences with quantities from linear algebra.
As an example, consider a vector \(\vect{x} = (x_1, \cdots, x_n)^{\mathsf{T}}\) that represents a column from a particular data matrix (i.e. a variable in the cloud of variables). Further suppose that \(\overline{x}_n = 0\) (i.e. that the data is mean-centered).

We then have that \[ \| \vect{x} \|^2 = \sum_{i=1}^{n} x_{i}^2 \ \stackrel{(\overline{x}_n = 0)}{=} \ \sum_{i=1}^{n} (x_i - \overline{x}_n)^2 \] so the squared norm of \(\vect{x}\) is, up to a factor of \(n - 1\), the sample variance of the data.
For two vectors \(\vect{x} = (x_1, \cdots, x_n)^{\mathsf{T}}\) and \(\vect{y} = (y_1, \cdots, y_n)^{\mathsf{T}}\), their dot product \[ \langle \vect{x}, \vect{y} \rangle := \vect{x} \cdot \vect{y} = \sum_{i=1}^{n} x_i y_i \] can be interpreted in terms of the sample covariance between two sets of (mean-centered) observations.
The mean is related to the inner product between \(\vect{x}\) and the all-ones vector \(\vect{1}\): \[ \langle \vect{x} , \vect{1} \rangle = \begin{pmatrix} x_1 & \cdots & x_n \\ \end{pmatrix} \begin{pmatrix} 1 \\ \vdots \\ 1 \\ \end{pmatrix} = \sum_{i=1}^{n} x_i\] so that \(\overline{x}_n = \tfrac{1}{n} \langle \vect{x}, \vect{1} \rangle\).
So, again, we see that many of our familiar “statistical” quantities have direct correspondences with Linear Algebra quantities - this is one of the reasons Linear Algebra is so important in Statistics!
Furthermore, this connection allows us to (in the column-wise viewpoint) obtain summaries of our variables by performing familiar geometric operations.
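As a quick numerical check of these correspondences, here is an R sketch with two made-up, mean-centered vectors:

```r
n <- 5
x <- c(-1, 4, 5, 6, 10); x <- x - mean(x)   # mean-center x
y <- c( 2, 0, 3, 1,  4); y <- y - mean(y)   # mean-center y

sum(x^2)               # squared norm of x ...
(n - 1) * var(x)       # ... equals (n - 1) times the sample variance

sum(x * y)             # dot product <x, y> ...
(n - 1) * cov(x, y)    # ... equals (n - 1) times the sample covariance

sum(x * rep(1, n))     # <x, 1>: the sum of the entries (0 here, since x is centered)
```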
Speaking of Linear Algebra (and slowly turning our attention back to the initial question from the start of today’s lecture), let’s discuss the “dimensionality” of a dataset.
Height (in) | Weight (lbs) |
---|---|
61.5 | 130.3 |
72.4 | 180.6 |
58.4 | 86.7 |
This dataset seems to have 2 dimensions.
Height (in) | Height (cm) |
---|---|
61.5 | 156.21 |
72.4 | 183.90 |
58.4 | 148.34 |
How many dimensions does this data have? (How many variables contribute new information?)

Only 1! Both columns record the same quantity (height) - just because we’ve written it down twice doesn’t mean we’ve gained any new information.
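As a quick check of this in R (the names below are placeholders), the matrix formed by these two columns has only one linearly independent column, since the centimeter column is just 2.54 times the inch column:

```r
height_in <- c(61.5, 72.4, 58.4)
height_cm <- 2.54 * height_in        # exactly proportional to the inch column

X <- cbind(height_in, height_cm)
qr(X)$rank                           # 1: only one linearly independent column
```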
Now suppose we instead record each individual’s height along with their height squared. How many dimensions does this data have?

Linear Algebra: 2

Data Science: 1

Recording height and then height squared doesn’t really give two new variables!

Alright, so let’s return to our “big data problem” from the start of today’s lecture.
Already we can perhaps see some justification for why I said, in certain cases, not all variables are needed to convey the full story of a dataset.
Height (in) | Height (cm) |
---|---|
61.5 | 156.21 |
72.4 | 183.90 |
58.4 | 148.34 |
Goal: Dimension Reduction
Reduce the dimension of a dataset with as little loss of information as possible.
Two questions arise: how, exactly, do we go about reducing the dimension of a dataset, and what do we mean by the “information” we are trying to preserve?
Indeed, one (admittedly crude) form of dimension reduction is to simply remove one or more columns from a dataset!
But, a slightly more creative approach involves leveraging projections.
Seems like we might want to preserve as much variance as possible!
Goal
Identify the directions (in the cloud of variables) along which there is maximal variance. Then, project onto a subspace spanned by these directions to obtain a low-dimensional representation of the data.
We’ll set up some of the math today, and continue our discussion tomorrow.
First, let’s work toward a more specific goal:
Goal
Identify the vector \(\vect{v}\) such that \(\mat{X} \vect{v}\) (the column mean-centered data projected onto \(\vect{v}\)) has maximum variance (when compared to all other such vectors).
In R, we can accomplish this column mean-centering using the scale() function; a minimal sketch is given below.
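A minimal sketch of column mean-centering with scale(), using the mock height-weight matrix from earlier as a placeholder:

```r
X <- matrix(
  c(61.5, 130.3,
    72.4, 180.6,
    58.4,  86.7),
  nrow = 3, byrow = TRUE
)

# Subtract each column's mean; scale = FALSE leaves the spread untouched
X_centered <- scale(X, center = TRUE, scale = FALSE)
colMeans(X_centered)   # both column means are now (numerically) zero
```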
Goal
Identify the unit vector \(\vect{v}\) such that \(\mat{X} \vect{v}\) (the column mean-centered data projected onto \(\vect{v}\)) has maximum variance.
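To see why this goal becomes the optimization problem displayed below: because the columns of \(\mat{X}\) are mean-centered, the projected data \(\mat{X} \vect{v}\) also has mean zero, so its sample variance is proportional to its squared norm,

\[ \| \mat{X} \vect{v} \|^2 = (\mat{X} \vect{v})^{\mathsf{T}} (\mat{X} \vect{v}) = \tvect{v} \tmat{X} \mat{X} \vect{v}. \]

Maximizing this quantity subject to \(\tvect{v} \vect{v} = 1\) gives the constrained problem: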
\[ \argmax_{\vect{v}} \left\{ \tvect{v} \tmat{X} \mat{X} \vect{v} \right\} \quad \text{s.t.} \quad \tvect{v} \vect{v} = 1 \]
Constrained optimization problems like this are most often solved using Lagrange Multipliers, which we will discuss further in a future lab.
We first construct the Lagrangian \(\mathcal{L}(\vect{v}, \lambda) := \tvect{v} \tmat{X} \mat{X} \vect{v} - \lambda \tvect{v} \vect{v}\).
Then we differentiate with respect to \(\vect{v}\) and set the result equal to zero (a common factor of 2 cancels): \[ \tmat{X} \mat{X} \vect{v} - \lambda \vect{v} = 0 \] or, in other words, \((\tmat{X} \mat{X}) \vect{v} = \lambda \vect{v}\).
Goal
Identify the vector \(\vect{v}\) such that \(\mat{X} \vect{v}\) (the column mean-centered data projected onto \(\vect{v}\)) has maximum variance.
Result
The vector \(\vect{v}\) is given by an eigenvector of \(\tmat{X} \mat{X}\) with eigenvalue \(\lambda\).
In other words, the eigenvectors \(\vect{v}\) give the directions of maximal variance, and the eigenvalues \(\lambda\) give the amount of variance the projected data will have (all up to proportionality constants).
So, the eigenvector \(\vect{v}_1\) associated with the largest eigenvalue is the direction with the largest variance; the eigenvector \(\vect{v}_2\) associated with the second-largest eigenvalue is the direction with the second-largest variance; etc.
Pretty neat, huh?
How might we actually carry all of this out in R?
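Here is one possible sketch (not necessarily the approach we’ll take in lab): mean-center the data with scale() and hand \(\tmat{X} \mat{X}\) to eigen(). The matrix X below is the mock height-weight data, used purely as a placeholder.

```r
# Any numeric data matrix would do; this is the mock height-weight data
X <- matrix(
  c(61.5, 130.3,
    72.4, 180.6,
    58.4,  86.7),
  nrow = 3, byrow = TRUE
)

# Step 1: column mean-center
Xc <- scale(X, center = TRUE, scale = FALSE)

# Step 2: eigen-decomposition of X^T X
decomp <- eigen(crossprod(Xc))   # crossprod(Xc) computes t(Xc) %*% Xc
decomp$vectors                   # columns: directions of maximal variance (v_1, v_2, ...)
decomp$values                    # proportional to the variance of the projected data

# Step 3: project the centered data onto the leading direction
scores_1 <- Xc %*% decomp$vectors[, 1]

# For reference: prcomp(X) carries out the same computation
# (up to sign and scaling conventions).
```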
PSTAT 100 - Data Science: Concepts and Analysis, Summer 2025 with Ethan P. Marzban