Clustering; Introduction to Missing Data
Department of Statistics and Applied Probability; UCSB
Summer Session A, 2025
\[ \newcommand\R{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\F}{\mathcal{F}} \newcommand{\1}{1\!\!1} \newcommand{\comp}[1]{#1^{\complement}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\SD}{\mathrm{SD}} \newcommand{\vect}[1]{\vec{\boldsymbol{#1}}} \newcommand{\tvect}[1]{\vec{\boldsymbol{#1}}^{\mathsf{T}}} \newcommand{\hvect}[1]{\widehat{\boldsymbol{#1}}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\tmat}[1]{\mathbf{#1}^{\mathsf{T}}} \newcommand{\Cov}{\mathrm{Cov}} \DeclareMathOperator*{\argmin}{\mathrm{arg} \ \min} \newcommand{\iid}{\stackrel{\mathrm{i.i.d.}}{\sim}} \]
Congrats on finishing the last ICA!
Grades will be released shortly after lecture today - allow me to say a few words.
Also, looking forward: please don’t forget to continue to work on your Final Projects!
Though I haven’t explicitly mentioned this yet, there is a broad division of statistical learning into supervised and unsupervised learning.
Supervised learning is the setting in which we have a response variable whose relationship with one or more covariates we are trying to learn (hence the name statistical “learning”).
However, there are some situations in which we don’t have a response variable, and we are primarily interested in summarising or understanding our data. This is the setting of unsupervised learning.
The first topic of today’s lecture, clustering, is the unsupervised analog of classification.
As an example, consider the following scatterplot of penguins’ bill lengths plotted against their body masses:
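A minimal sketch of code that could generate such a plot (assuming the palmerpenguins data and the ggplot2 package):

library(ggplot2)
library(palmerpenguins)

## scatterplot of bill length against body mass
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
  geom_point(size = 3) +
  theme_minimal(base_size = 18) +
  labs(x = "Body Mass (g)", y = "Bill Length (mm)")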
By eye, it looks like there are potentially two main clusters.
But the boundaries between these clusters are perhaps a bit “fuzzy”.
For example, which group should the circled point belong to?
Question
Given p variables, can we classify observations into two or more groups?
Clustering Techniques seek to address this very question.
Now, to be clear, we are not making any assumptions about whether or not true subpopulations exist.
It is possible that there exist subpopulations, like with the penguins dataset:
In this case, there just so happened to be three subpopulations (the three penguin species) in our data, and this is what was driving the clusters we observed.
But clustering works just as well when there aren’t natural subpopulations in the data.
## keep only the numeric columns of the penguins data
penguins_num <- penguins %>% select(where(is.numeric))
## drop any rows containing missing values
penguins_num <- penguins_num[penguins_num %>% complete.cases(), ]
## preview the first few rows
penguins_num %>% head() %>% pander()
bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
---|---|---|---|
39.1 | 18.7 | 181 | 3750 |
39.5 | 17.4 | 186 | 3800 |
40.3 | 18 | 195 | 3250 |
36.7 | 19.3 | 193 | 3450 |
39.3 | 20.6 | 190 | 3650 |
38.9 | 17.8 | 181 | 3625 |
Even when there exist subpopulations, our clusters may not always reflect them (particularly if two or more subpopulations are very similar to one another).
In this way, we can perhaps think of clustering as identifying “inherent subpopulations” (like PCA uncovers “inherent dimensionality”).
As another example, consider the following dataset of roll-call votes cast by members of the U.S. House of Representatives, in which each representative’s vote on each bill is recorded as a 0 or a 1:
Rep | Party | State | H. R. 788 | S. J. Res. 38 | H. J. Res. 98 | H. R. 6918 | H. R. 6914 | H. R. 5585 | H. R. 6678 | H. R. 6679 | H. R. 6976 | H. R. 485 | H. R. 7176 | H. R. 7511 | H. R. 2799 | H. R. 6276 | H. R. 1121 | H. R. 6009 | H. R. 7023 | H. R. 1023 | H. R. 7888 | H. R. 4639 | H. R. 6046 | H. R. 4691 | H. R. 5947 | H. R. 6323 | H. R. 8038 | H. R. 8036 | H. R. 8035 | H. R. 8034 | H. R. 529 | H. R. 3397 | H. R. 615 | H. R. 764 | H. R. 3195 | H. R. 6090 | H. R. 6285 | H. R. 6192 | H. J. Res. 109 | H. R. 2925 | H. R. 7109 | H. R. 7530 | H. R. 7581 | H. R. 7343 | H. R. 354 | H. R. 8146 | H. R. 8369 | H. R. 4763 | H. R. 5403 | H. R. 192 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Adams | Democratic | North Carolina | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Aderholt | Republican | Alabama | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Aguilar | Democratic | California | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Alford | Republican | Missouri | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Allen | Republican | Georgia | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Allred | Democratic | Texas | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |
One popular clustering technique, K-means clustering, is built on two main ideas:
Identify the cluster centroids by minimizing the variance within each cluster.
Identify the cluster assignments by finding the shortest Euclidean distance to a centroid.
Formally, K-means seeks a partition \(C_1, \ldots, C_K\) of the observations that solves
\[\min_{C_1, \cdots, C_K} \left\{ \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i' j})^2 \right\}\]
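It can be shown that each term in this objective is just twice the sum of squared Euclidean distances from the observations in cluster \(C_k\) to that cluster’s centroid, which is what connects the two ideas above:

\[ \frac{1}{|C_k|} \sum_{i, i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2, \qquad \text{where } \bar{x}_{kj} := \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij} \]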
This optimization problem is, in general, intractable.
Thankfully, a local solution can be obtained using the following iterative procedure, called the K-means clustering algorithm.
K-Means Clustering Algorithm
1. Randomly assign each observation to one of the \(K\) clusters, to obtain an initial cluster assignment.
2. Compute each cluster’s centroid (the vector of variable-wise means of the observations currently assigned to that cluster).
3. Reassign each observation to the cluster whose centroid is closest in Euclidean distance.
4. Repeat steps 2 and 3 until the cluster assignments stop changing.
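To make the algorithm concrete, here is a minimal from-scratch sketch in R. The function my_kmeans and its arguments are made up for illustration, and it ignores edge cases like empty clusters; in practice, we simply use the built-in kmeans() function, as below.

## A rough from-scratch sketch of the K-means algorithm (illustration only)
my_kmeans <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)

  ## Step 1: randomly assign each observation to one of the K clusters
  clusters <- sample(1:K, size = n, replace = TRUE)

  for (iter in 1:max_iter) {
    ## Step 2: compute each cluster's centroid (variable-wise means)
    centroids <- sapply(1:K, function(k) colMeans(X[clusters == k, , drop = FALSE]))

    ## Step 3: reassign each observation to its nearest centroid
    ## (squared Euclidean distances to each of the K centroids)
    dists <- sapply(1:K, function(k) {
      rowSums((X - matrix(centroids[, k], nrow = n, ncol = p, byrow = TRUE))^2)
    })
    new_clusters <- apply(dists, 1, which.min)

    ## Step 4: stop once the assignments no longer change
    if (all(new_clusters == clusters)) break
    clusters <- new_clusters
  }

  list(cluster = clusters, centers = t(centroids))
}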
set.seed(100) ## for reproducibility

## run K-means with K = 2 clusters on the numeric (0/1) vote columns
km_votes <- kmeans(votes_num, centers = 2)

## visualize the cluster assignments on the first two principal components
prcomp(votes_num, scale. = TRUE)$x[, 1:2] %>%
  data.frame() %>%
  mutate(kmeans_clust = factor(km_votes$cluster)) %>%
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point(size = 3, aes(col = kmeans_clust)) +
  theme_minimal(base_size = 18) +
  ggtitle("PCA Plot of Votes Dataset",
          subtitle = "Clustered using K-Means") +
  scale_color_okabe_ito() +
  labs(col = "Cluster")
Cross-tabulating the K-means cluster assignments against each representative’s party affiliation:
cluster | Democratic | No data found | Republican |
---|---|---|---|
1 | 207 | 1 | NA |
2 | 4 | NA | 216 |
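Here, an NA indicates a cluster/party combination containing no representatives. A table like the one above could be produced with something along these lines (assuming the full dataset, including the Party column, is stored in a data frame called votes whose rows line up with votes_num):

## cross-tabulate the K-means cluster assignments against party affiliation
table(cluster = km_votes$cluster, party = votes$Party) %>% pander()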
The four Democrats who were assigned to the second (predominantly Republican) cluster were:
Rep | Party | State |
---|---|---|
Cuellar | Democratic | Texas |
Davis (NC) | Democratic | North Carolina |
Golden (ME) | Democratic | Maine |
Perez | Democratic | Washington |
The representative whose party was recorded as “No data found” falls well within our “Democratic” cluster, meaning they are likely a Democrat.
In this way, we can see that clustering can, in some cases, help us with missing data.
Indeed, perhaps this is a good segue into our next topic for today…
Missing data occurs when one or more variables have observations that are not present.
There are a variety of reasons why data might be missing:
Missing values are often encoded using a special symbol. In R, missing values are by default mapped to the symbol NA.
Admittedly, missing data is the bane of most data scientists’ existences.
Many functions in R break down in the presence of missing data.
Caution
Simply throwing out missing values is, in some cases, ill-advised.
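To see how R functions react to missing values, here is a small made-up illustration:

x <- c(1, 2, NA, 4)

mean(x)                ## NA: the single missing value "poisons" the computation
mean(x, na.rm = TRUE)  ## 2.333...: the missing value is ignored
sum(is.na(x))          ## 1: number of missing values in x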
So how should missing values be handled in practice?
This is a hotly debated question!
The general idea is that we need to consider the mechanisms behind the missingness.
Let \(X = \{x_{ij}\}_{(i, j) = (1, 1)}^{(n, p)}\) denote an \(n \times p\) dataset (i.e. a dataset with n observations on p variables).
Denote by \(q_{ij}\) the probability that element \(x_{ij}\) is missing: \[ q_{ij} := \Prob(x_{ij} \text{ is missing}) \]
There are two main cases to consider: one where data is Missing Completely at Random (MCAR), and one where data is Missing at Random (MAR).
If data is MCAR, the probabilities \(q_{ij}\) do not depend on the data at all (neither on observed nor on unobserved values).
If data is MAR, the probabilities \(q_{ij}\) depend only on observed values: \(q_{ij} = f(z_i)\), where \(z_i\) denotes the observed values associated with observation \(i\).
We can see that MCAR is perhaps the “best-case” scenario.
With that interpretation, the “worst-case” scenario is when data is Missing Not at Random (MNAR).
If data is MNAR, then the probabilities \(q_{ij}\) depend on unobserved values as well as observed ones: \(q_{ij} = f(z_i, x_{ij})\), where \(z_i\) again denotes the observed values and \(x_{ij}\) is itself unobserved.
Unfortunately, there aren’t any formal tests to determine whether data is MCAR, MAR, or MNAR. The best (and only) way to determine the missingness mechanism is to make an informed assumption based on knowledge about the data collection procedure (which is why we started off today by talking about sampling!).
Here is an example to illustrate the difference between some of these mechanisms, adapted from the article What is the difference between missing completely at random and missing at random? by Krishnan Bhaskaran and Liam Smeeth
Suppose we have a dataset containing several patients’ blood pressures.
Some values are missing.
If, for instance, patients with higher blood pressure were also more likely to have their readings go unrecorded (perhaps due to factors like age or preexisting health conditions), then an (imaginary) histogram of the missing blood pressure values would likely be right-shifted when compared to a histogram of the non-missing values.
If the data were MCAR, these two histograms would look (roughly) the same.
However, we can explain any differences between the (hypothetical) missing values and the non-missing values using observable quantities (e.g. preexisting health conditions, age, etc.), which is exactly what makes this an example of MAR rather than MNAR.
Alright, let’s say we have data that is MCAR or MAR. What do we do?
Again, one option is to simply drop the missing values (e.g. using tidyr::drop_na() or complete.cases()).
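For instance, a quick sketch using the penguins data from earlier:

library(tidyr)
library(palmerpenguins)

## drop every row that contains at least one NA
penguins_complete <- penguins %>% drop_na()

## equivalent base-R approach
penguins_complete2 <- penguins[complete.cases(penguins), ]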
Another option is imputation, which, broadly speaking, refers to the act of trying to “fill in” the missing values in some way.
One idea is to replace each missing value with the mean or median of the remaining non-missing values (sometimes called mean imputation).
Another imputation technique is to try and predict the missing value from other recorded values in the dataset (sometimes called model-based imputation).
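As a rough sketch of both ideas, here is one way they might look on the penguins data (the specific variable choices here are just for illustration):

library(dplyr)
library(palmerpenguins)

## Mean imputation: replace missing bill lengths with the mean of the
## observed bill lengths
penguins_mean_imp <- penguins %>%
  mutate(bill_length_mm = ifelse(is.na(bill_length_mm),
                                 mean(bill_length_mm, na.rm = TRUE),
                                 bill_length_mm))

## Model-based imputation: predict missing bill lengths from other
## recorded variables, then fill in the predictions
fit <- lm(bill_length_mm ~ flipper_length_mm + body_mass_g, data = penguins)
preds <- predict(fit, newdata = penguins)  ## NA wherever a predictor is missing

penguins_model_imp <- penguins %>%
  mutate(bill_length_mm = ifelse(is.na(bill_length_mm), preds, bill_length_mm))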
If data is MCAR or MAR, we can try to explicitly model the probability of missingness, and apply bias corrections (like the inverse probability weighting scheme we saw on Lab 4).
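As a rough sketch of the idea (not necessarily the exact scheme from Lab 4), suppose we have a hypothetical data frame df with a possibly-missing outcome y and a fully observed covariate x, and we want to estimate the mean of y:

library(dplyr)

## indicator of whether y is observed
df <- df %>% mutate(observed = !is.na(y))

## model the probability of being observed using the fully observed covariate
miss_mod <- glm(observed ~ x, data = df, family = binomial)
df <- df %>% mutate(p_obs = predict(miss_mod, type = "response"))

## inverse probability weighting: weight each observed y by 1 / P(observed)
with(df[df$observed, ], weighted.mean(y, w = 1 / p_obs))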
Do:
Don’t:
Tomorrow, we’ll start talking a bit about Neural Networks.
In Lab tomorrow, you’ll get some practice with clustering and missing data.
Reminder: keep working on your projects!
PSTAT 100 - Data Science: Concepts and Analysis, Summer 2025 with Ethan P. Marzban