A Very Brief Introduction to Causal Inference
Department of Statistics and Applied Probability; UCSB
Summer Session A, 2025
\[ \newcommand\R{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\F}{\mathcal{F}} \newcommand{\1}{1\!\!1} \newcommand{\comp}[1]{#1^{\complement}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\SD}{\mathrm{SD}} \newcommand{\vect}[1]{\vec{\boldsymbol{#1}}} \newcommand{\tvect}[1]{\vec{\boldsymbol{#1}}^{\mathsf{T}}} \newcommand{\hvect}[1]{\widehat{\boldsymbol{#1}}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\tmat}[1]{\mathbf{#1}^{\mathsf{T}}} \newcommand{\Cov}{\mathrm{Cov}} \DeclareMathOperator*{\argmin}{\mathrm{arg} \ \min} \newcommand{\iid}{\stackrel{\mathrm{i.i.d.}}{\sim}} \]
In the 1970’s, UC Berkeley conducted an observational study to determine whether or not there was gender bias in the graduate student admittance practices at the university. Applicants were recorded as either male or female. (I would like to also acknowledge that we now recognize that there are a great deal many more genders than simply “male” and “female”.) Overall, the study included 8,442 men and 4,321 women.
Of the men 44% were admitted; of the women only 35% were admitted.
So, on the surface, it does appear as though women are being disproportionately denied entry.
| Major | Num. Applicants (Men) | % Admitted (Men) | Num. Applicants (Women) | % Admitted (Women) |
|---|---|---|---|---|
| A | 825 | 62 | 108 | 82 |
| B | 560 | 63 | 25 | 68 |
| C | 325 | 37 | 593 | 34 |
| D | 417 | 33 | 375 | 35 |
| E | 191 | 28 | 393 | 24 |
| F | 373 | 6 | 341 | 7 |
Taken individually, almost none of the majors display this bias against women.
So, what’s going on? How can it be that the individual majors display essentially no discrimination against women, yet in aggregate there appears to be discrimination against women?
The answer lies in how difficult each major was to get into.
For instance, Major A appears to have an overall acceptance rate of roughly 64%, whereas Major E appears to have an overall acceptance rate of only about 25%.
Indeed, if we look at the Num. Applicants column within each gender, we see that, in the aggregate, men were applying to easier majors!
In other words, difficulty of major
was a confounding variable that influenced the acceptance rates.
After controlling for this variable, it was actually found that there was no significant difference in admittance rates between men and women.
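We can reproduce the aggregation effect directly from the table above. This is a sketch in plain Python; the aggregate rates it produces differ slightly from the 44%/35% figures quoted earlier, since the table lists only these six majors:

```python
# Per-major (applicants, % admitted) pairs, transcribed from the table above
majors = ["A", "B", "C", "D", "E", "F"]
men   = {"A": (825, 62), "B": (560, 63), "C": (325, 37),
         "D": (417, 33), "E": (191, 28), "F": (373, 6)}
women = {"A": (108, 82), "B": (25, 68),  "C": (593, 34),
         "D": (375, 35), "E": (393, 24), "F": (341, 7)}

def overall_rate(group):
    """Applicant-weighted overall admission rate, in percent."""
    admitted = sum(n * pct / 100 for n, pct in group.values())
    applied = sum(n for n, _ in group.values())
    return 100 * admitted / applied

print(f"Men overall:   {overall_rate(men):.1f}%")    # noticeably higher...
print(f"Women overall: {overall_rate(women):.1f}%")

# ...even though, within most individual majors, women's rates are at least as high:
favors_women = [m for m in majors if women[m][1] >= men[m][1]]
print("Majors where women's rate >= men's:", favors_women)
```

The reversal happens because the weighting by number of applicants differs across genders: women disproportionately applied to the harder majors (C through F).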
As an aside, this relates to what is known as Simpson’s Paradox, a well-documented statistical phenomenon in which relationships between percentages in subgroups can sometimes be reversed after the subgroups are aggregated.
But, for now, I use this example as a way to re-introduce us to the notion of confounding variables.
Intuitively, we can think of a confounding variable as a variable that affects a relationship of interest, but that is not explicitly modeled or controlled for.
This is a pretty vague definition; we’ll revisit the notion of confounding in a few slides, once we’ve gotten a bit of basics under our belt.
We consider an outcome (or response) variable, which we denote by Y.
We also consider a treatment, whose effect on the response is what we are interested in exploring.
As an example, suppose we let Y denote the pain rating (on a scale from 1 to 10) of a headache.
If we’re interested in the effect taking Aspirin has on this pain rating, our treatment is taking Aspirin or not.
If we’re interested in the effect a pilot program has on AP Calculus AB scores, our treatment is being a part of the program or not.
Now, note one important distinction: we are not merely asking, for example, whether or not taking Aspirin is associated with a decrease in pain levels; we are asking about the effect taking Aspirin has on pain levels.
So, what do we mean by “effect”?
Here’s the general idea. Let Yi(1) denote the response value of the ith individual, assuming they have undergone treatment, and let Yi(0) denote the response value of the ith individual, assuming they have not undergone treatment.
For example, in the context of our headache example, Yi(1) might denote John’s pain level on Aspirin and Yi(0) would denote John’s pain level off of Aspirin.
| w/ Treatment | w/o Treatment |
|---|---|
| Y1(1) | Y1(0) |
| Y2(1) | Y2(0) |
| ⋮ | ⋮ |
| Yn(1) | Yn(0) |
To stress, in the headache example: Yi(1) and Yi(0) represent John’s pain levels on and off Aspirin at the same time, assuming no changes in John’s status other than his Aspirin usage.
To that end, we call Yi(1) and Yi(0) potential outcomes.
The individual treatment effects (ITEs), defined as 𝜏i := Yi(1) - Yi(0), are unknown and unknowable, since (again) we never observe both potential outcomes for the same unit; this is often called the fundamental problem of causal inference.
So, what do we observe?
| Yi(1) | Yi(0) | Zi |
|---|---|---|
| \(\bullet\) | NA | 1 |
| \(\bullet\) | NA | 1 |
| ⋮ | ⋮ | ⋮ |
| NA | \(\bullet\) | 0 |
| NA | \(\bullet\) | 0 |
We can think of the ITEs (𝜏i) as population parameters. However, they are unestimable.
Instead, we can focus on the average causal effect (ACE): \[ \tau := \frac{1}{n} \sum_{i=1}^{n} \tau_i = \frac{1}{n} \sum_{i=1}^{n} \left[Y_{i}^{(1)} - Y_{i}^{(0)} \right] \]
Our goal will be to estimate this; that is, we wish to determine an estimate for the average causal effect of treatment on the response.
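To make these definitions concrete, here is a toy numerical sketch; the outcome values below are made up purely for illustration:

```python
# Hypothetical "science table": BOTH potential outcomes for n = 4 units.
# In practice we never see both columns; this is purely illustrative.
Y1 = [3, 5, 2, 6]   # Y_i^(1): pain rating on treatment
Y0 = [7, 6, 4, 9]   # Y_i^(0): pain rating off treatment

ites = [y1 - y0 for y1, y0 in zip(Y1, Y0)]   # individual treatment effects tau_i
ace = sum(ites) / len(ites)                  # average causal effect tau

print("ITEs:", ites)   # [-4, -1, -2, -3]
print("ACE:", ace)     # -2.5

# What we actually observe: one column per unit, dictated by the assignment Z.
Z = [1, 0, 1, 0]
observed = [y1 if z == 1 else y0 for y1, y0, z in zip(Y1, Y0, Z)]
print("Observed Y:", observed)   # [3, 6, 2, 9]
```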
Let’s establish some assumptions and define some notation.
We make the following assumptions:

1. **No interference:** the potential outcomes of unit i do not depend on the treatment assignments of any other units.
2. **No hidden versions of treatment:** there is only one version of the treatment, so that the observed response is Yi = Zi Yi(1) + (1 - Zi) Yi(0).

These two assumptions are collectively referred to as the Stable Unit Treatment Value Assumption (SUTVA).
Another assumption we will make is that we are in the context of a completely randomized experiment (CRE).
We’ve talked briefly about experiments (as opposed to observational studies) before.
Essentially, we can think of an experiment as a study in which we (the designers) control who gets and doesn’t get treatment.
A CRE is one in which the Zi’s are, in a sense, completely random.
Here’s a more formal definition:
Definition: Completely Randomized Experiment
Let Zi denote the allocation indicator for the ith unit. Let n1 denote the number of units on treatment and let n0 denote the number of units off treatment; define n := n1 + n0. A completely randomized experiment is one for which \[ \mathbb{P}(\vect{Z} = \vect{z}) = \frac{1}{\binom{n}{n_1}}\] where \(\vect{z} = (z_1, \cdots, z_n)\) satisfies \(\sum_{i=1}^{n} z_i = n_1\) and \(\sum_{i=1}^{n} (1 - z_i) = n_0\).
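A quick sketch of drawing one such assignment vector, using Python's `random.sample` to pick the treated subset uniformly at random from all \(\binom{n}{n_1}\) possibilities:

```python
import random
from math import comb

def cre_assign(n, n1, rng=random):
    """Completely randomized assignment: n1 of n units treated, uniformly."""
    treated = set(rng.sample(range(n), n1))
    return [1 if i in treated else 0 for i in range(n)]

n, n1 = 10, 4
Z = cre_assign(n, n1)
print(Z, "-> total treated:", sum(Z))

# Each particular assignment vector z has probability 1 / C(n, n1):
print("P(Z = z) =", 1 / comb(n, n1))
```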
Okay, so that takes care of assumptions: we’ll assume SUTVA, and that we’re in the context of a CRE.
Let’s now establish some notation.
From the population, we define:
\[ \begin{align*} \textbf{Population Means:} & \qquad \overline{Y^{(1)}} := \frac{1}{n} \sum_{i=1}^{n} Y_i^{(1)}; \qquad \overline{Y^{(0)}} := \frac{1}{n} \sum_{i=1}^{n} Y_i^{(0)} \\ \textbf{Population Var's:} & \qquad S^2_{(j)} := \frac{1}{n - 1} \sum_{i=1}^{n} \left[ Y_i^{(j)} - \overline{Y^{(j)}} \right]^2, \ j = 0, 1 \\ \textbf{Population Cov's:} & \qquad S_{(1)(0)} := \frac{1}{n - 1} \sum_{i=1}^{n} \left[ Y_i^{(1)} - \overline{Y^{(1)}} \right] \left[ Y_i^{(0)} - \overline{Y^{(0)}} \right] \end{align*} \]
Lemma 4.1
\[ 2 S_{(1)(0)} = S^2_{(1)} + S^2_{(0)} - S^2_{(\tau)}\]
Now, remember what we actually observe: for any unit i, we only either observe Yi(1) or Yi(0), never both.
So, it seems natural to introduce some sample quantities (in contrast to the population quantities we defined above).
For example, suppose we want to compute the average observed response value among those on the treatment.
\[ \widehat{\overline{Y^{(1)}}} := \frac{1}{n_1} \sum_{i=1}^{n} Z_i Y_i ; \qquad \widehat{\overline{Y^{(0)}}} := \frac{1}{n_0} \sum_{i=1}^{n} (1 - Z_i) Y_i \]
Allow me to expound upon this a bit further.
Recall that Yi denotes the observed response of the ith unit.
Now, our first sum does technically range over all indices i from 1 to n; however, the indicator Zi zeroes out every term corresponding to a unit off treatment, so only the n1 treated units actually contribute.
Though this may seem convoluted at first, it is actually a very neat way to succinctly express our sample averages!
\[\begin{align*} \widehat{S}^2_{(1)} & := \frac{1}{n_1 - 1} \sum_{i=1}^{n} Z_i \left[ Y_i - \widehat{\overline{Y^{(1)}}} \right]^2 \\ \widehat{S}^2_{(0)} & := \frac{1}{n_0 - 1} \sum_{i=1}^{n} (1 - Z_i) \left[ Y_i - \widehat{\overline{Y^{(0)}}} \right]^2 \end{align*}\]
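To see that the indicator-weighted formulas agree with simply subsetting the data first, consider this short sketch (the observed data are hypothetical):

```python
from statistics import mean, variance

# Hypothetical observed data from an experiment
Y = [3, 6, 2, 9, 5, 4]
Z = [1, 0, 1, 0, 1, 0]
n1, n0 = sum(Z), sum(1 - z for z in Z)

# Indicator-weighted sums over ALL i = 1, ..., n:
ybar1_hat = sum(z * y for z, y in zip(Z, Y)) / n1
ybar0_hat = sum((1 - z) * y for z, y in zip(Z, Y)) / n0

S1_hat = sum(z * (y - ybar1_hat) ** 2 for z, y in zip(Z, Y)) / (n1 - 1)
S0_hat = sum((1 - z) * (y - ybar0_hat) ** 2 for z, y in zip(Z, Y)) / (n0 - 1)

# Sanity check: identical to first subsetting by treatment group, then averaging.
treated = [y for z, y in zip(Z, Y) if z == 1]
untreated = [y for z, y in zip(Z, Y) if z == 0]
assert ybar1_hat == mean(treated) and ybar0_hat == mean(untreated)
assert abs(S1_hat - variance(treated)) < 1e-12
assert abs(S0_hat - variance(untreated)) < 1e-12
```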
Check Your Understanding
How might we define the sample covariance (if possible)?
Whew, that’s a lot of setup! Before we proceed, let’s quickly take stock of what we’ve done.
For each unit i, we have associated potential outcomes Yi(1) and Yi(0) indicating response values on- and off-treatment, respectively; crucially, we only ever observe one of the two for any given unit.
This fundamental problem poses challenges for estimating the individual treatment effects (ITEs) 𝜏i := Yi(1) - Yi(0) or the average causal effect (ACE) 𝜏.
(Most) Every estimation problem requires assumptions: our assumptions are SUTVA, and that we are in the context of a CRE.
We defined the population means, population variances, and population covariance a few slides ago.
We are now in a position to posit an estimator for the ACE!
Theorem
Under a CRE (and assuming SUTVA), an unbiased estimator for the ACE is given by \[ \widehat{\tau} := \widehat{\overline{Y^{(1)}}} - \widehat{\overline{Y^{(0)}}} \]
The variance of this estimator is given by
\[\begin{align*} \mathrm{Var}(\widehat{\tau}) & = \frac{S_{(1)}^2}{n_1} + \frac{S_{(0)}^2}{n_0} - \frac{S_{(\tau)}^2}{n} \\ & = \frac{n_0}{n_1 n} S_{(1)}^2 + \frac{n_1}{n_0 n} S_{(0)}^2 + \frac{2}{n} S_{(1)(0)} \end{align*}\]
Theorem
Define the following estimator for the variance of \(\widehat{\tau}\): \[ \widehat{V} := \frac{\widehat{S}_{(1)}^2}{n_1} + \frac{\widehat{S}_{(0)}^2}{n_0} \] This estimate is conservative for estimating \(\mathrm{Var}(\widehat{\tau})\) in the sense that \[ \mathbb{E}[\widehat{V}] - \mathrm{Var}(\widehat{\tau}) = \frac{S^2_{(\tau)}}{n} \geq 0 \]
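We can check both claims, unbiasedness of \(\widehat{\tau}\) and conservativeness of \(\widehat{V}\), by simulating repeated randomizations of a fixed, made-up table of potential outcomes:

```python
import random
from statistics import mean, pvariance

random.seed(0)

# Fixed (hypothetical) potential outcomes for n = 20 units
n, n1 = 20, 10
n0 = n - n1
Y1 = [i % 5 + 2.0 for i in range(n)]   # Y_i^(1)
Y0 = [i % 4 + 1.0 for i in range(n)]   # Y_i^(0)
tau = mean(y1 - y0 for y1, y0 in zip(Y1, Y0))   # true ACE

def one_draw():
    """One completely randomized assignment; return (tau_hat, V_hat)."""
    treated = set(random.sample(range(n), n1))
    yt = [Y1[i] for i in treated]
    yc = [Y0[i] for i in range(n) if i not in treated]
    tau_hat = mean(yt) - mean(yc)
    s1 = sum((y - mean(yt)) ** 2 for y in yt) / (n1 - 1)
    s0 = sum((y - mean(yc)) ** 2 for y in yc) / (n0 - 1)
    return tau_hat, s1 / n1 + s0 / n0   # conservative V-hat

draws = [one_draw() for _ in range(5000)]
tau_hats = [d[0] for d in draws]
v_hats = [d[1] for d in draws]

print("true ACE:", tau)
print("mean of tau_hat:", round(mean(tau_hats), 3))   # close to tau (unbiasedness)
print("E[V_hat] vs Var(tau_hat):", round(mean(v_hats), 3),
      "vs", round(pvariance(tau_hats), 3))            # V-hat larger on average
```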
Theorem
\[ \frac{\widehat{\tau} - \tau}{\sqrt{\Var(\widehat{\tau})}} \rightsquigarrow \mathcal{N}(0, 1) \] as \(n_1, n_0 \to \infty\).
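This normal limit is what licenses the usual Wald-style interval \(\widehat{\tau} \pm 1.96 \sqrt{\widehat{V}}\). Here is a sketch on hypothetical observed data:

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical observed data from a CRE
Y = [5.1, 3.8, 4.4, 6.0, 2.9, 4.7, 5.5, 3.2]
Z = [1, 1, 1, 1, 0, 0, 0, 0]

yt = [y for y, z in zip(Y, Z) if z == 1]
yc = [y for y, z in zip(Y, Z) if z == 0]

tau_hat = mean(yt) - mean(yc)
v_hat = variance(yt) / len(yt) + variance(yc) / len(yc)   # conservative V-hat

# Approximate 95% confidence interval via the normal limit
lo, hi = tau_hat - 1.96 * sqrt(v_hat), tau_hat + 1.96 * sqrt(v_hat)
print(f"tau_hat = {tau_hat:.3f}, 95% CI ~ ({lo:.3f}, {hi:.3f})")
```

Because \(\widehat{V}\) is conservative, intervals built this way tend to over-cover slightly.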
Let’s also quickly discuss how confounding enters our model.
Up until now, we’ve essentially been assuming the absence of confounders.
Because the confounder affects the treatment, it will affect our assignment indicator Zi.
Perhaps it’s useful to (again) think back to our Aspirin/headache example. One possible confounder might be level of exercise: heavy exercise will certainly affect pain levels, but it will also affect how likely someone is to take Aspirin. In other words, level of exercise is likely a confounding variable. So, here’s an idea: given a confounding variable X, why don’t we run a logistic regression of Zi onto X? The fitted probabilities \(\widehat{\Prob}(Z_i = 1 \mid X_i)\) are precisely what we call propensity scores.
Naturally, when there are many confounders, matching based on the raw confounder values can be very challenging and results in significant data loss.
A clever idea is to match based not on the raw values of the confounding variables, but rather the propensity scores.
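A minimal sketch of the propensity-score idea, with a hand-rolled logistic fit; the data-generating process, step sizes, and iteration count below are assumptions made purely for illustration:

```python
import math
import random

random.seed(1)

# Hypothetical data: confounder X (e.g. exercise level), treatment Z influenced by X
n = 200
X = [random.gauss(0, 1) for _ in range(n)]
Z = [1 if random.random() < 1 / (1 + math.exp(-0.8 * x)) else 0 for x in X]

# Fit P(Z = 1 | X) = sigmoid(a + b*X) by plain gradient ascent on the log-likelihood
a, b = 0.0, 0.0
for _ in range(2000):
    ga = gb = 0.0
    for x, z in zip(X, Z):
        p = 1 / (1 + math.exp(-(a + b * x)))
        ga += z - p
        gb += (z - p) * x
    a += 0.01 * ga / n
    b += 0.01 * gb / n

# Estimated propensity scores
e_hat = [1 / (1 + math.exp(-(a + b * x))) for x in X]
print("fitted slope b:", round(b, 2))   # positive: higher X -> more likely treated

# Match each treated unit to the untreated unit with the closest propensity score
controls = [i for i in range(n) if Z[i] == 0]
matches = {i: min(controls, key=lambda j: abs(e_hat[i] - e_hat[j]))
           for i in range(n) if Z[i] == 1}
```

Matching on the single scalar e_hat, rather than on all raw confounder values at once, is what makes this approach scale to many confounders.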
I encourage you to read more in [A First Course in Causal Inference] by Peng Ding, if you are interested.
To close out, I’ll briefly outline a relatively famous case study.
The National Supported Work Demonstration (NSW) was an employment program that ran between March 1975 and June 1977.
Essentially, the program offered employment training to participants in the hopes of decreasing disparities.
Initial findings seemed to indicate that those who underwent the training had lower average incomes than those who did not; as such, on the surface, it seemed like the training actually hurt people’s chances of high-level employment later in life.
In 1986, Robert J. LaLonde conducted a causal analysis of the findings of the study.
Please fill out Course Evaluations!
I’ll release the Bonus Lab (Lab 11) sometime today or tomorrow.
Tomorrow will be our final lecture (I’ll also bring some hex stickers with the PSTAT 100 course logo for everyone tomorrow!)
During Section tomorrow, you can either work on the bonus lab or work on the project.
PSTAT 100 - Data Science: Concepts and Analysis, Summer 2025 with Ethan P. Marzban