Classification
Department of Statistics and Applied Probability; UCSB
Summer Session A, 2025
\[ \newcommand\R{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\F}{\mathcal{F}} \newcommand{\1}{1\!\!1} \newcommand{\comp}[1]{#1^{\complement}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\SD}{\mathrm{SD}} \newcommand{\vect}[1]{\vec{\boldsymbol{#1}}} \newcommand{\tvect}[1]{\vec{\boldsymbol{#1}}^{\mathsf{T}}} \newcommand{\hvect}[1]{\widehat{\boldsymbol{#1}}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\tmat}[1]{\mathbf{#1}^{\mathsf{T}}} \newcommand{\Cov}{\mathrm{Cov}} \DeclareMathOperator*{\argmin}{\mathrm{arg} \ \min} \newcommand{\iid}{\stackrel{\mathrm{i.i.d.}}{\sim}} \]
Consider the Titanic dataset:

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
Question: given a passenger’s information (e.g. sex, class, etc.), can we predict whether or not they would have survived the sinking?
Firstly, based on domain knowledge available to us, we believe there to be a relationship between survival rates and demographics.
To make things more explicit, let’s suppose we wish to predict survival based solely on a passenger’s age.
This lends itself nicely to a model, with:

- response: \(y_i\) = survival status of the \(i\)th passenger (1 for survived, or 0 for died)
- explanatory variable: \(x_i\) = age of the \(i\)th passenger

Now, note that our response is categorical. Hence, our model is a classification model, as opposed to a regression one.
The (parametric) modeling approach is still the same:
We just have to be a bit more creative about our model proposition.
Let’s see what happens if we try to fit a “linear” model:

\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]

This doesn’t work: \(y_i\) will either be zero or one, whereas \(x_i\) will be a positive number, so \(\beta_0 + \beta_1 x_i\) is not necessarily constrained to be either 0 or 1.

Second Idea: model the survival probability instead. But \(\pi_i = \beta_0 + \beta_1 x_i\) is still not a valid model, since \(\pi_i\) is constrained to be between 0 and 1, whereas \(\beta_0 + \beta_1 x_i\) is unconstrained.

Third Idea: apply a transformation to \(\beta_0 + \beta_1 x_i\). Specifically, if we can find a function \(g\) that maps from the real line to the unit interval, then a valid model would be \(\pi_i = g(\beta_0 + \beta_1 x_i)\).
What class of (probabilistic) functions maps from the real line to the unit interval? Cumulative distribution functions (CDFs)!

Indeed, we can pick any CDF to be our transformation \(g\). There are two popular choices, giving rise to two different models:
Probit Model: \(\pi_i = \Phi(\beta_0 + \beta_1 x_i)\), where
\[ \Phi(x) := \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-z^2 / 2} \ \mathrm{d}z \]
Logit Model: \(\pi_i = \Lambda(\beta_0 + \beta_1 x_i)\), where
\[ \Lambda(x) := \frac{1}{1 + e^{-x}} \]
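As a quick numerical check in `R` that both transformations map the real line into the unit interval, we can use the built-in `pnorm()` and `plogis()` functions for \(\Phi\) and \(\Lambda\), respectively:

```r
# Both CDFs squash any real number into (0, 1)
eta <- c(-5, -1, 0, 1, 5)
pnorm(eta)   # probit: Phi(eta)
plogis(eta)  # logistic: Lambda(eta) = 1 / (1 + exp(-eta))
```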
As an example, let’s return to the Titanic data, where \(\pi_i\) represents the probability that the \(i\)th passenger survived and \(x_i\) denotes the \(i\)th passenger’s age.
A logistic regression model posits \[ \pi_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}} \]
Equivalently, \[ \ln\left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 x_i \]
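To see the equivalence, write \(\eta_i := \beta_0 + \beta_1 x_i\) and solve for \(\eta_i\):

\[ \pi_i = \frac{1}{1 + e^{-\eta_i}} \iff e^{-\eta_i} = \frac{1 - \pi_i}{\pi_i} \iff \eta_i = \ln\left( \frac{\pi_i}{1 - \pi_i} \right) \]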
Now, as with pretty much all statistical models, there are some assumptions that must be met in order for a Logistic Regression Model (LRM) to be appropriate.
The two main assumptions of the LRM are:

- independence: the observations are independent across \(i\); and
- Bernoulli response: \(Y_i \mid X_i = x_i \sim \mathrm{Bernoulli}(\pi_i)\).
Technically, we are also assuming that the true conditional probability of success \(\Prob(Y_i = 1 \mid X_i = x_i)\) is logistically-related to the covariate value xi.
In PSTAT 100, we’ll opt for a “knowledge-based” approach to check if the assumptions are met; that is, we’ll simply use our knowledge/intuition about the true data-generating process (DGP).
Each one-unit increase in \(x_i\) is modeled to be associated with a \(\beta_1\)-unit increase in the log-odds of \(\pi_i\).

In `R`, we fit a logistic regression using the `glm()` function.
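For example, assuming (as in the output below) that the data is stored in a data frame named `titanic`:

```r
# Fit the logistic regression of survival status on age, and summarize
summary(glm(Survived ~ Age, family = "binomial", data = titanic))
```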
```
Call:
glm(formula = Survived ~ Age, family = "binomial", data = titanic)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) -0.05672    0.17358  -0.327   0.7438  
Age         -0.01096    0.00533  -2.057   0.0397 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 964.52  on 713  degrees of freedom
Residual deviance: 960.23  on 712  degrees of freedom
  (177 observations deleted due to missingness)
AIC: 964.23

Number of Fisher Scoring iterations: 4
```
The fitted model is therefore

\[ \ln\left( \frac{\widehat{\pi}_i}{1 - \widehat{\pi}_i} \right) = -0.05672 - 0.01096 x_i \]
So, each additional year of `age` corresponds to an estimated 0.01096-unit decrease in the log-odds of survival.

Why did we need to specify `family = "binomial"` in our call to `glm()`? Because `glm()` fits generalized linear models; the `family` argument specifies the response distribution, and the binomial family (with its default logit link) is what yields logistic regression.
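As a minimal check, omitting `family` leaves `glm()` at its default `gaussian()` family, which just reproduces ordinary least squares:

```r
# Without family = "binomial", glm() fits a linear model, matching lm()
coef(glm(Survived ~ Age, data = titanic))
coef(lm(Survived ~ Age, data = titanic))
```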
```r
glm_age <- glm(Survived ~ Age, data = titanic, family = "binomial")
(p1 <- predict(glm_age, newdata = data.frame(Age = 24)))
```

```
         1 
-0.3198465 
```
Caution

By default, `predict.glm()` gives you the predicted log-odds; to find the predicted survival probability, you need to invert the logit transformation (or pass `type = "response"`).
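For instance, to recover the predicted survival probability corresponding to `p1`, both approaches below give the same number:

```r
# Invert the logit by hand: pi = 1 / (1 + exp(-(log-odds)))
1 / (1 + exp(-p1))

# Or ask predict() for the probability scale directly
predict(glm_age, newdata = data.frame(Age = 24), type = "response")
```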
```r
# Extract the estimated coefficients from a logistic regression of
# survival status on fare
glm1_c <- glm(Survived ~ Fare, data = titanic, family = "binomial") %>%
  coef()

# Inverse-logit: estimated survival probability as a function of fare
pred_surv <- Vectorize(function(x){
  1 / (1 + exp(-glm1_c[1] - glm1_c[2] * x))
})
```
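As a quick sanity check, we can evaluate `pred_surv()` at a few arbitrary fare values; each output should lie strictly between 0 and 1:

```r
pred_surv(c(10, 50, 100))  # estimated survival probabilities at three fares
```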
```r
titanic %>% 
  ggplot(aes(x = Fare, y = Survived)) +
  geom_point(size = 3) +
  theme_minimal(base_size = 18) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              aes(colour = "Linear Regression"), linewidth = 2) +
  stat_function(fun = pred_surv,
                aes(colour = "Logistic Regression"), linewidth = 2) +
  labs(colour = "Survival Probability") +
  ggtitle("Survival Status vs. Fare")
```
More generally, with \(p\) predictors:

\[\begin{align*} \pi_i & = \Lambda\left( \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \right)}} \\ \mathrm{logit}(\pi_i) & = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \end{align*}\]
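In `R`, fitting such a model just means adding terms to the formula; a sketch (this particular set of predictors is illustrative):

```r
# Multiple logistic regression: several predictors on the right-hand side
glm_multi <- glm(Survived ~ Fare + Age + Sex, data = titanic,
                 family = "binomial")
summary(glm_multi)
```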
Your Turn!
Adebimpe has found that a good predictor of whether an email is spam or not is the number of times the word “promotion” appears in its body. To that end, she has fit a logistic regression model to model an email’s spam/ham status as it relates to the number of times the word “promotion” appears. The resulting regression table is displayed below:
```
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.68748    0.04360  15.768  < 2e-16 ***
num_prom     0.10258    0.01844   5.564  1.2e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Interpret the value `0.01844` in the context of this problem.
Now, logistic regression gets us estimated survival probabilities.
It does not, however, give us survival statuses; to get those, we need to build a classifier.
In binary classification (i.e. where our original response takes only two values, `survived` or not), our classifier typically takes the form: assign \(y_i\) a value of `survived` if the estimated survival probability is above some threshold \(c\), and assign \(y_i\) a value of `did not survive` if the estimated survival probability falls below the threshold.
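A sketch of such a classifier in `R`, built from the `Fare`-based model and a cutoff of \(c = 0.5\) (the names `glm_fare`, `pi_hat`, and `y_hat` are ours):

```r
# Estimated survival probabilities from the Fare-based model
glm_fare <- glm(Survived ~ Fare, data = titanic, family = "binomial")
pi_hat   <- predict(glm_fare, type = "response")

# Classify: predict "survived" whenever the estimated probability exceeds c
y_hat <- ifelse(pi_hat > 0.5, "survived", "did not survive")
table(y_hat)
```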
Let’s build classifiers from our logistic regression of survival on `Fare`. For example, in the context of the Titanic dataset:
The True Positive Rate (aka sensitivity) is the proportion of passengers who actually survived that were correctly classified as having survived.
The False Positive Rate (aka one minus the specificity) is the proportion of passengers who actually died that were incorrectly classified as having survived.
Classifier: \(\{Y_i = 1\} \iff \{ \widehat{\pi}_i > 0.5\}\)
 | truth_+ | truth_- |
---|---|---|
class_+ | 82 | 38 |
class_- | 260 | 511 |
TPR: 0.2397661
FPR: 0.06921676
Classifier: \(\{Y_i = 1\} \iff \{ \widehat{\pi}_i > 0.9\}\)
 | truth_+ | truth_- |
---|---|---|
class_+ | 14 | 6 |
class_- | 328 | 543 |
TPR: 0.04093567
FPR: 0.01092896
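These computations can be packaged into a small helper; a sketch (the function name `tpr_fpr()` is ours, and `pi_hat` comes from the earlier classifier sketch):

```r
# TPR and FPR of the classifier {pi_hat > cutoff}, given the true 0/1 labels
tpr_fpr <- function(pi_hat, truth, cutoff) {
  pred <- as.numeric(pi_hat > cutoff)
  c(TPR = sum(pred == 1 & truth == 1) / sum(truth == 1),
    FPR = sum(pred == 1 & truth == 0) / sum(truth == 0))
}

tpr_fpr(pi_hat, titanic$Survived, 0.5)
tpr_fpr(pi_hat, titanic$Survived, 0.9)
```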
So, we can see that our TPR and FPR will change depending on the cutoff value we select for our classifier.
This suggests using quantities like the TPR and FPR to compare across different cutoff values.
Rather than trying to compare confusion matrices, it’s a much nicer idea to try and compare plots.
One such plot is called a Receiver Operating Characteristic (ROC) Curve, which plots the sensitivity (on the vertical axis) against (1 - specificity) (on the horizontal axis).
Allow me to elaborate a bit more on this last point.
The vertical axis of a ROC curve effectively represents the probability of a good thing; ideally, we’d like a classifier that has a 100% TPR!
Simultaneously, an ideal classifier would also have a 0% FPR (which is precisely what is plotted on the horizontal axis of an ROC curve).
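To trace out an ROC curve ourselves, we can sweep the cutoff over a grid (a sketch reusing the `tpr_fpr()` helper from above):

```r
# Compute (FPR, TPR) over a grid of cutoffs, then plot TPR against FPR
cutoffs <- seq(0, 1, by = 0.01)
roc_pts <- t(sapply(cutoffs, function(x) tpr_fpr(pi_hat, titanic$Survived, x)))

plot(roc_pts[, "FPR"], roc_pts[, "TPR"], type = "l",
     xlab = "1 - Specificity (FPR)", ylab = "Sensitivity (TPR)",
     main = "ROC Curve")
abline(0, 1, lty = 2)  # reference: a classifier that guesses at random
```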
For example, we can compare ROC curves for two classifiers: one using `Fare`, `Age`, `Sex`, and `Cabin` as predictors, and one using only `Fare` and `Age` as predictors.