PSTAT 100: Lecture 21

A Very Brief Introduction to Causal Inference

Ethan P. Marzban

Department of Statistics and Applied Probability; UCSB

Summer Session A, 2025

\[ \newcommand\R{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\F}{\mathcal{F}} \newcommand{\1}{1\!\!1} \newcommand{\comp}[1]{#1^{\complement}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\SD}{\mathrm{SD}} \newcommand{\vect}[1]{\vec{\boldsymbol{#1}}} \newcommand{\tvect}[1]{\vec{\boldsymbol{#1}}^{\mathsf{T}}} \newcommand{\hvect}[1]{\widehat{\boldsymbol{#1}}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\tmat}[1]{\mathbf{#1}^{\mathsf{T}}} \newcommand{\Cov}{\mathrm{Cov}} \DeclareMathOperator*{\argmin}{\mathrm{arg} \ \min} \newcommand{\iid}{\stackrel{\mathrm{i.i.d.}}{\sim}} \]

UC Berkeley Admissions

  • In the 1970s, UC Berkeley conducted an observational study to determine whether or not there was gender bias in the graduate student admittance practices at the university.

    • A disclaimer: at the time, “gender” was treated as a binary variable with values male and female. I would like to also acknowledge that we now recognize that there are many more genders than simply “male” and “female”.
  • Overall, the survey included 8,442 men and 4,321 women.

  • Of the men, 44% were admitted; of the women, only 35% were admitted.

    • This difference was also deemed statistically significant.
  • So, on the surface, it does appear as though women are being disproportionately denied entry.

UC Berkeley Admissions

  • But, something puzzling happens when we take a look at the data after grouping by major:
Major   Men: Num. Applicants   Men: % Admitted   Women: Num. Applicants   Women: % Admitted
A       825                    62                108                      82
B       560                    63                25                       68
C       325                    37                593                      34
D       417                    33                375                      35
E       191                    28                393                      24
F       373                    6                 341                      7

UC Berkeley Admissions

  • Nearly none of the majors on their own display this bias against women.

    • In fact, in Major A there almost appears to be a bias against men.
  • So, what’s going on? How can it be that no major individually discriminates against women, yet the aggregated numbers appear to show discrimination against women?

  • The answer lies in how difficult each major was to get into.

  • For instance, Major A has an overall acceptance rate of roughly 64%, whereas Major E has an overall acceptance rate of only about 25%.

    • Major A seems to be much easier to get into than, say, Major E.
    • More generally, Majors A and B are easier to get into than Majors C through F.

UC Berkeley Admissions

  • Indeed, if we look at the Num. Applicants column within each gender, we see that, on the aggregate, men were applying to easier majors!

    • Over half of the men applied to Majors A and B (the “easy” majors): \[\frac{825 + 560}{825 + 560 + 325 + 417 + 191 + 373} \approx 51.5\% \] whereas over 90% of the women applied to Majors C through F (the “hard” majors): \[ \frac{593 + 375 + 393 + 341}{108 + 25 + 593 + 375 + 393 + 341} \approx 92.8\% \]
  • In other words, difficulty of major was a confounding variable that influenced the acceptance rates.
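We can verify this aggregation effect numerically. The following Python sketch recomputes the overall admission rates from the per-major table above (the percentages in the table are rounded, so the aggregates are approximate):

```python
import numpy as np

# Per-major applicant counts and admission rates, from the table above.
men_n   = np.array([825, 560, 325, 417, 191, 373])
men_pct = np.array([62, 63, 37, 33, 28, 6]) / 100
wom_n   = np.array([108, 25, 593, 375, 393, 341])
wom_pct = np.array([82, 68, 34, 35, 24, 7]) / 100

# Each overall admission rate is a weighted average of per-major rates,
# weighted by where each gender actually applied.
men_overall = (men_n * men_pct).sum() / men_n.sum()
wom_overall = (wom_n * wom_pct).sum() / wom_n.sum()

print(f"men overall:   {men_overall:.1%}")    # ≈ 44.5%
print(f"women overall: {wom_overall:.1%}")    # ≈ 30.3%

# Yet within most individual majors, women are admitted at >= the men's rate:
print(int((wom_pct >= men_pct).sum()), "of 6 majors")
```

The gap in the aggregate rates comes entirely from the weights, not from the per-major rates: women concentrated their applications in the low-acceptance majors.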

UC Berkeley Admissions

  • After controlling for this variable, it was actually found that there was no significant difference in admittance rates between men and women.

  • As an aside, this relates to what is known as Simpson’s Paradox, a well-documented statistical phenomenon in which relationships between percentages in subgroups can sometimes be reversed after the subgroups are aggregated.

  • But, for now, I use this example as a way to re-introduce us to the notion of confounding variables.

  • Intuitively, we can think of a confounding variable as a variable that affects a relationship of interest, but that is not explicitly modeled or controlled for.

  • This is a pretty vague definition; we’ll revisit the notion of confounding in a few slides, once we’ve gotten a few basics under our belt.

Causal Inference

Causality

  • We consider an outcome (or response) variable, which we denote by Y.

  • We also consider a treatment, whose effect on the response is what we are interested in exploring.

    • Other terms for “treatment” include intervention and manipulation.
  • As an example, suppose we let Y denote the pain rating (on a scale from 1 to 10) of a headache.

  • If we’re interested in the effect taking Aspirin has on this pain rating, our treatment is taking Aspirin or not.

  • If we’re interested in the effect a pilot program has on AP Calculus AB scores, our treatment is being a part of the program or not.

Causality

  • Now, note one important distinction: we are not, for example, asking “whether or not taking Aspirin causes a decrease in pain levels.”

    • Rather, we are asking what magnitude of effect taking/not taking Aspirin has on pain levels.
  • So, what do we mean by “effect”?

  • Here’s the general idea. Let Yi(1) denote the response value of the ith individual, assuming they have undergone treatment.

    • Analogously, let Yi(0) denote the response value of the ith individual, assuming they have not undergone treatment.
  • For example, in the context of our headache example, Yi(1) might denote John’s pain level on Aspirin and Yi(0) would denote John’s pain level off of Aspirin.

Causality

  • The true effect of treatment on the ith individual would then just be 𝜏i := Yi(1) - Yi(0).
    • Again, this represents, for example, the difference in pain levels John experiences on and off Aspirin; this difference is precisely the effect Aspirin has on reducing John’s pain.
w/ Treatment   w/o Treatment
Y1(1)          Y1(0)
Y2(1)          Y2(0)
⋮              ⋮
Yn(1)          Yn(0)
  • In practice, however, we have to contend with the fundamental problem of causal inference: for each individual, we only get to observe the response on or off treatment, never both.

Causality

  • To stress, in the headache example: Yi(1) and Yi(0) represent John’s pain levels on and off Aspirin at the same time, assuming no changes in John’s status other than his Aspirin usage.

    • We cannot observe both, because, at any given time, John is either on or off Aspirin.
  • To that end, we call Yi(1) and Yi(0) potential outcomes.

    • We call the 𝜏i individual treatment effects (ITE).
  • The ITE are unknown and unknowable, since (again) we never observe both potential outcomes.

  • So, what do we observe?

Causality

  • Each individual is either administered treatment or not.
    • Let Zi denote the assignment indicator of the ith individual; that is, (Zi = 1) if the ith individual is administered treatment and (Zi = 0) otherwise.
  • Then, our observed data might look like:
Yi(1)         Yi(0)         Zi
\(\bullet\)   NA            1
NA            \(\bullet\)   0
\(\bullet\)   NA            1
NA            \(\bullet\)   0
NA            \(\bullet\)   0
  • Again, as much as we would like to be able to determine the 𝜏i := Yi(1) - Yi(0), we are unable to do so because, for each i, one of these values is missing.
    • So, in a sense, the fundamental problem of causal inference is one about missing data!
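A small simulation may make this missing-data view concrete. In this hypothetical sketch, we generate a full table of potential outcomes (something nature never shows us), then reveal only one entry per unit; all numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# The complete (but unobservable) table of potential outcomes: pain levels
# off treatment, and a 2-point reduction on treatment.
y0 = rng.integers(4, 10, size=n).astype(float)   # Y_i(0)
y1 = y0 - 2.0                                    # Y_i(1)
tau_i = y1 - y0                                  # ITEs; here every tau_i = -2

# Randomly assign half the units to treatment (a CRE with n1 = n/2)...
z = np.zeros(n, dtype=int)
z[rng.choice(n, size=n // 2, replace=False)] = 1

# ...and observe only ONE potential outcome per unit (consistency):
y_obs = z * y1 + (1 - z) * y0

# The other potential outcome is missing -- the fundamental problem.
y1_seen = np.where(z == 1, y1, np.nan)
y0_seen = np.where(z == 0, y0, np.nan)
print(np.column_stack([y1_seen, y0_seen, z]))
```

Each row of the printed array mirrors the table above: one observed value, one NA, and the assignment indicator.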

Causality

Causality

  • We can think of the ITEs (𝜏i) as population parameters. However, they are not estimable.

  • Instead, we can focus on the average causal effect (ACE): \[ \tau := \frac{1}{n} \sum_{i=1}^{n} \tau_i = \frac{1}{n} \sum_{i=1}^{n} \left[Y_{i}^{(1)} - Y_{i}^{(0)} \right] \]

  • Our goal will be to estimate this; that is, we wish to determine an estimate for the average causal effect of treatment on the response.

    • E.g. the average causal effect Aspirin usage has on pain levels.
  • Let’s establish some assumptions and define some notation.

Causality

Assumptions

  • We make the following assumptions:

    1. No Interference: The potential outcomes of unit i do not depend on the treatment assignments of the other units.
    2. Consistency: Yi = Zi Yi(1) + (1 - Zi ) Yi(0) where Yi denotes the ith observed response (in other words, each observed response corresponds to either the on-treatment or off-treatment potential outcome).
  • These two assumptions are collectively referred to as the Stable Unit Treatment Value Assumption (SUTVA).

  • Another assumption we will make is that we are in the context of a completely randomized experiment (CRE).

Completely Randomized Experiments

  • We’ve talked briefly about experiments (as opposed to observational studies) before.

  • Essentially, we can think of an experiment as a study in which we (the designers) control who gets and doesn’t get treatment.

    • In the notation we’ve established today, this means we explicitly control the Zi values.
  • A CRE is one in which the Zi’s are, in a sense, completely random.

  • Here’s a more formal definition:

Completely Randomized Experiments

Definition: Completely Randomized Experiment

Let Zi denote the allocation indicator for the ith unit. Let n1 denote the number of units on treatment and let n0 denote the number of units off treatment; define n := n1 + n0. A completely randomized experiment is one for which \[ \mathbb{P}(\vect{Z} = \vect{z}) = \frac{1}{\binom{n}{n_1}}\] where \(\vect{z} = (z_1, \cdots, z_n)\) satisfies \(\sum_{i=1}^{n} z_i = n_1\) and \(\sum_{i=1}^{n} (1 - z_i) = n_0\).

  • In other words, “every assignment configuration is equally likely.”
    • Later, we’ll discuss some situations in which this assumption may not hold.
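In code, a completely randomized assignment amounts to choosing which n1 of the n units receive treatment, uniformly at random. A quick Python check (with hypothetical n = 5, n1 = 2) confirms that all \(\binom{5}{2} = 10\) configurations occur with roughly equal frequency:

```python
import numpy as np
from math import comb
from collections import Counter

rng = np.random.default_rng(1)
n, n1 = 5, 2   # small example: C(5, 2) = 10 possible assignments

def cre_assign(n, n1, rng):
    """One completely randomized assignment: exactly n1 treated units."""
    z = np.zeros(n, dtype=int)
    z[rng.choice(n, size=n1, replace=False)] = 1
    return tuple(z)

# Draw many assignments and tally how often each configuration appears.
draws = Counter(cre_assign(n, n1, rng) for _ in range(50_000))
probs = np.array(list(draws.values())) / 50_000

# Every configuration shows up, each with empirical probability near 1/10.
print(len(draws), "configurations;", probs.round(3))
```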

Causality

Notation

  • Okay, so that takes care of assumptions: we’ll assume SUTVA, and that we’re in the context of a CRE.

  • Let’s now establish some notation.

  • From the population, we define:

\[ \begin{align*} \textbf{Population Means:} & \qquad \overline{Y^{(1)}} := \frac{1}{n} \sum_{i=1}^{n} Y_i^{(1)}; \qquad \overline{Y^{(0)}} := \frac{1}{n} \sum_{i=1}^{n} Y_i^{(0)} \\ \textbf{Population Var's:} & \qquad S^2_{(j)} := \frac{1}{n - 1} \sum_{i=1}^{n} \left[ Y_i^{(j)} - \overline{Y^{(j)}} \right]^2, \ j = 0, 1 \\ \textbf{Population Cov's:} & \qquad S_{(1)(0)} := \frac{1}{n - 1} \sum_{i=1}^{n} \left[ Y_i^{(1)} - \overline{Y^{(1)}} \right] \left[ Y_i^{(0)} - \overline{Y^{(0)}} \right] \end{align*} \]

Causality

Notation

  • With this notation, we can reformulate the ACE as \(\tau := \overline{Y^{(1)}} - \overline{Y^{(0)}}\).
    • Again, this is the parameter we’re interested in estimating.
  • The variance of the ITEs is given by \[ S^2_{(\tau)} := \frac{1}{n - 1} \sum_{i=1}^{n} (\tau_i - \tau)^2 \]

Lemma 4.1

\[ 2 S_{(1)(0)} = S^2_{(1)} + S^2_{(0)} - S^2_{(\tau)}\]
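The lemma follows directly from expanding the variance of the ITEs. Since \(\tau_i - \tau = \left[ Y_i^{(1)} - \overline{Y^{(1)}} \right] - \left[ Y_i^{(0)} - \overline{Y^{(0)}} \right]\), expanding the square gives

\[ \begin{align*} S^2_{(\tau)} & = \frac{1}{n - 1} \sum_{i=1}^{n} \left( \left[ Y_i^{(1)} - \overline{Y^{(1)}} \right] - \left[ Y_i^{(0)} - \overline{Y^{(0)}} \right] \right)^2 \\ & = S^2_{(1)} + S^2_{(0)} - 2 S_{(1)(0)}, \end{align*} \]

and rearranging yields \(2 S_{(1)(0)} = S^2_{(1)} + S^2_{(0)} - S^2_{(\tau)}\).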

Causality

Notation

  • Now, remember what we actually observe: for any unit i, we only either observe Yi(1) or Yi(0), never both.

  • So, it seems natural to introduce some sample quantities (in contrast to the population quantities we defined above).

  • For example, suppose we want to compute the average observed response value among those on the treatment.

    • Using our assignment indicator Zi, we can cleverly define our sample means as:

\[ \widehat{\overline{Y^{(1)}}} := \frac{1}{n_1} \sum_{i=1}^{n} Z_i Y_i ; \qquad \widehat{\overline{Y^{(0)}}} := \frac{1}{n_0} \sum_{i=1}^{n} (1 - Z_i) Y_i \]

Causality

Notation

  • Allow me to expound upon this a bit further.

  • Recall that Yi denotes the observed response of the ith unit.

    • We are encoding information about whether this is an on- or off-treatment measurement using our assignment indicator Zi.
  • Now, our first sum does technically range over all indices i from 1 to n.

    • However, for any response values corresponding to an off-treatment observation, the associated indicator will be 0 and our sum effectively only ranges over the indices for on-treatment units.
  • Though this may seem convoluted at first, it is actually a very neat way to succinctly express our sample averages!
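Here is a tiny numerical illustration of the indicator trick, using made-up responses and assignments:

```python
import numpy as np

# Observed responses and assignment indicators for a small hypothetical CRE.
y = np.array([3., 7., 2., 8., 6., 4.])   # observed Y_i
z = np.array([1,  0,  1,  0,  0,  1])    # Z_i = 1 if unit i was treated
n1, n0 = z.sum(), (1 - z).sum()

# The indicator trick: summing Z_i * Y_i over ALL i keeps only treated units,
# since every off-treatment term is multiplied by zero.
ybar1_hat = (z * y).sum() / n1
ybar0_hat = ((1 - z) * y).sum() / n0

# Equivalent to subsetting directly:
assert ybar1_hat == y[z == 1].mean()
assert ybar0_hat == y[z == 0].mean()
print(ybar1_hat, ybar0_hat)   # 3.0 7.0
```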

Causality

Notation

  • With this in mind, we can define our sample variances:

\[\begin{align*} \widehat{S}^2_{(1)} & := \frac{1}{n_1 - 1} \sum_{i=1}^{n} Z_i \left[ Y_i - \widehat{\overline{Y^{(1)}}} \right]^2 \\ \widehat{S}^2_{(0)} & := \frac{1}{n_0 - 1} \sum_{i=1}^{n} (1 - Z_i) \left[ Y_i - \widehat{\overline{Y^{(0)}}} \right]^2 \end{align*}\]

Check Your Understanding

How might we define the sample covariance (if possible)?

Causality

Taking Stock

  • Whew, that’s a lot of setup! Before we proceed, let’s quickly take stock of what we’ve done.

  • For each unit i, we have associated potential outcomes Yi(1) and Yi(0) indicating response values on- and off-treatment, respectively.

    • The fundamental problem of causal inference is that, for any i, we only observe one of these.
  • This fundamental problem poses challenges for estimating the individual treatment effects (ITE) 𝜏i := Yi(1) - Yi(0) and the average causal effect (ACE, also called the average treatment effect or ATE) 𝜏.

    • Hence, we would like to develop an estimator for 𝜏.

Causality

Taking Stock

  • Nearly every estimation problem requires assumptions; ours are SUTVA, and that we are in the context of a CRE.

  • We define the population means, population variances, and population covariance as a few slides ago.

    • We also define sample analogs for some of these, also as outlined a few slides ago.
  • We are now in a position to posit an estimator for the ATE!

    • This particular estimator is due to Jerzy Spława-Neyman, one of the foremost statisticians of the 20th century.

Causality

Estimated ATE

Theorem

  1. Under a CRE (and assuming SUTVA), an unbiased estimator for the ACE is given by \[ \widehat{\tau} := \widehat{\overline{Y^{(1)}}} - \widehat{\overline{Y^{(0)}}} \]

  2. The variance of this estimator is given by

\[\begin{align*} \mathrm{Var}(\widehat{\tau}) & = \frac{S_{(1)}^2}{n_1} + \frac{S_{(0)}^2}{n_0} - \frac{S_{(\tau)}^2}{n} \\ & = \frac{n_0}{n_1 n} S_{(1)}^2 + \frac{n_1}{n_0 n} S_{(0)}^2 + \frac{2}{n} S_{(1)(0)} \end{align*}\]

  • Note that, as is typical with estimators, the variance of our estimator depends on unknown population parameters.

Causality

Estimated Variance of the Estimated ATE

Theorem

Define the following estimator for the variance of \(\widehat{\tau}\): \[ \widehat{V} := \frac{\widehat{S}_{(1)}^2}{n_1} + \frac{\widehat{S}_{(0)}^2}{n_0} \] This estimator is conservative for \(\mathrm{Var}(\widehat{\tau})\) in the sense that \[ \mathbb{E}[\widehat{V}] - \mathrm{Var}(\widehat{\tau}) = \frac{S^2_{(\tau)}}{n} \geq 0 \]
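We can check both theorems by simulation: fix a hypothetical finite population of potential outcomes, repeatedly draw CRE assignments, and compare the average of \(\widehat{\tau}\) to \(\tau\), and the average of \(\widehat{V}\) to the empirical variance of \(\widehat{\tau}\). All numbers below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# A fixed finite population of potential outcomes (hypothetical numbers).
n, n1 = 40, 20
y0 = rng.normal(5.0, 1.0, size=n)
y1 = y0 + 1.5 + rng.normal(0.0, 0.5, size=n)   # heterogeneous effects
tau = (y1 - y0).mean()                         # the true ACE

def one_cre(rng):
    """Draw one CRE assignment; return (tau_hat, V_hat)."""
    z = np.zeros(n, dtype=int)
    z[rng.choice(n, size=n1, replace=False)] = 1
    y = z * y1 + (1 - z) * y0                  # observed responses
    tau_hat = y[z == 1].mean() - y[z == 0].mean()
    v_hat = y[z == 1].var(ddof=1) / n1 + y[z == 0].var(ddof=1) / (n - n1)
    return tau_hat, v_hat

draws = np.array([one_cre(rng) for _ in range(20_000)])
print("true ACE:        ", round(tau, 3))
print("mean of tau_hat: ", round(draws[:, 0].mean(), 3))  # close to tau (unbiased)
print("Var(tau_hat):    ", round(draws[:, 0].var(), 4))
print("mean of V_hat:   ", round(draws[:, 1].mean(), 4))  # slightly larger (conservative)
```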

Causality

Proofs

  • I think it may be useful to go through the proof of the first part of the theorem.
    • Let’s do so on the board, together!

Causality

Asymptotic Normality

Theorem

Under suitable regularity conditions on the potential outcomes, as \(n \to \infty\), \[ \frac{\widehat{\tau} - \tau}{\sqrt{\Var(\widehat{\tau})}} \rightsquigarrow \mathcal{N}(0, 1) \]

  • This allows us to perform hypothesis testing, construct confidence intervals, etc.
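For example, combining \(\widehat{\tau}\), \(\widehat{V}\), and the normal approximation gives a (conservative) 95% confidence interval \(\widehat{\tau} \pm 1.96\sqrt{\widehat{V}}\). Here is a sketch using hypothetical pain-rating data; the normal approximation is dubious at such a small sample size, but it illustrates the mechanics:

```python
import numpy as np

def neyman_ci(y, z, crit=1.96):
    """Point estimate and conservative ~95% CI for the ACE under a CRE."""
    y, z = np.asarray(y, float), np.asarray(z, int)
    n1, n0 = z.sum(), (1 - z).sum()
    tau_hat = y[z == 1].mean() - y[z == 0].mean()   # Neyman's estimator
    v_hat = y[z == 1].var(ddof=1) / n1 + y[z == 0].var(ddof=1) / n0
    half = crit * np.sqrt(v_hat)
    return tau_hat, (tau_hat - half, tau_hat + half)

# Hypothetical pain ratings: treated units (z = 1) report lower pain.
y = np.array([3, 4, 2, 5, 3, 7, 6, 8, 5, 7], dtype=float)
z = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
tau_hat, (lo, hi) = neyman_ci(y, z)
print(round(tau_hat, 2))            # ≈ -3.2: an estimated 3.2-point reduction
print((round(lo, 2), round(hi, 2))) # interval excludes 0
```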

Confounding

  • Let’s also quickly discuss how confounding enters our model.

  • Up until now, we’ve essentially been assuming the absence of confounders.

    • Specifically, this is reflected in our CRE assumption: treatment is assigned completely at random, without regard to any characteristics of the units.
  • We can propose a slightly more causal-inference-specific definition for a confounding variable, as a variable that affects both treatment and the response.

[DAG: confounder → treatment; confounder → response; treatment → response]

Confounding

  • Because the confounder affects the treatment, it will affect our assignment indicator Zi.

  • Perhaps it’s useful to (again) think back to our Aspirin/headache example. One possible confounder might be level of exercise: heavy exercise will definitely affect pain levels, but it will also affect how likely someone is to take Aspirin.

    • Hence, level of exercise is likely a confounding variable.
  • So, here’s an idea: given a confounding variable X, why don’t we run a logistic regression of Zi onto X?

    • The resulting probability will be an estimate of the true assignment probability πi := ℙ(Zi = 1 | X), which we sometimes call a propensity score.
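As a sketch of this idea, the following simulates a single confounder and fits the logistic regression of Zi on X by gradient ascent in plain numpy (in practice you would use a library such as statsmodels or scikit-learn; all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical confounder: (standardized) exercise level, which drives
# treatment assignment. Here we only model its effect on Z_i.
n = 500
x = rng.normal(size=n)
true_pi = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))    # true assignment probabilities
z = rng.binomial(1, true_pi)                    # observed assignments

# Logistic regression of Z on X, fit by gradient ascent on the
# mean log-likelihood (a minimal stand-in for a library routine).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.5 * X.T @ (z - p) / n             # gradient of mean log-likelihood

# Estimated propensity scores: pi_i = P(Z_i = 1 | X_i).
propensity = 1 / (1 + np.exp(-X @ beta))
print(beta.round(2))   # close to the true coefficients (0.5, 1.2)
```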

Confounding

Propensity Scores

  • There’s a lot we can do with propensity scores!
    • We can perform Inverse-Propensity Weighting, a form of bias correction (like the Inverse-Probability Weighting scheme we discussed in Lab a few weeks back).
    • We can condition on these propensity scores to mitigate the effects of confounders.
  • A popular technique for overcoming the effects of confounding is called matching, in which individuals from the on-treatment group are matched with “similar” individuals from the off-treatment group.
    • Similarity is dictated by the values of the confounders; e.g., we might pick someone with the same age, sex, and race as, say, John.

Confounding

Propensity Scores

  • Naturally, when there are many confounders, matching on the raw confounder values can be very challenging and can result in significant data loss.

  • A clever idea is to match based not on the raw values of the confounding variables, but rather the propensity scores.

  • I encourage you to read more in [A First Course in Causal Inference] by Peng Ding, if you are interested.

  • To close out, I’ll briefly outline a relatively famous case study.
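The matching idea above can be illustrated in a few lines. In this sketch, each treated unit is paired (with replacement) with the control whose estimated propensity score is closest; all data are hypothetical:

```python
import numpy as np

def match_on_propensity(y, z, ps):
    """1:1 nearest-neighbor matching of treated units to controls on the
    propensity score (with replacement); returns the matched-pairs estimate."""
    y, z, ps = map(np.asarray, (y, z, ps))
    treated = np.where(z == 1)[0]
    controls = np.where(z == 0)[0]
    diffs = []
    for i in treated:
        # Pair unit i with the control closest in propensity score.
        j = controls[np.argmin(np.abs(ps[controls] - ps[i]))]
        diffs.append(y[i] - y[j])
    return float(np.mean(diffs))

# Tiny hypothetical example: units with similar scores get paired.
y  = np.array([10., 12., 9., 7., 8., 6.])
z  = np.array([1,   1,   1,  0,  0,  0])
ps = np.array([0.8, 0.6, 0.3, 0.75, 0.55, 0.35])
print(match_on_propensity(y, z, ps))   # mean of the matched-pair differences
```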

National Supported Work Demonstration

A Quick Case Study

  • The National Supported Work Demonstration (NSW) was an employment program that ran between March 1975 and June 1977.

  • Essentially, the program offered employment training to participants in the hopes of decreasing disparities.

  • Initial findings seemed to indicate that those who underwent the training had lower average incomes than those who did not; as such, on the surface, it seemed like the training actually hurt people’s chances of high-level employment later in life.

  • In 1986, Robert J. LaLonde conducted a causal analysis of the findings of the study.

    • His full paper, “Evaluating the Econometric Evaluations of Training Programs with Experimental Data,” appeared in The American Economic Review in 1986.

National Supported Work Demonstration

A Quick Case Study

  • LaLonde primarily pointed out that the original study was flawed in that it did not appropriately consider confounding variables!
    • For example, things like gender, race, and education level are likely confounders in this experiment as they affect both treatment (whether people were administered the training or not) and response (income levels).
  • Subsequent studies have shown that, after appropriately matching on propensity scores, the training actually had a net positive causal effect.

Next Time

  • Please fill out Course Evaluations!

  • I’ll release the Bonus Lab (Lab 11) sometime today or tomorrow.

    • It will be due at 11:59pm on Friday, August 1, 2025.
  • Tomorrow will be our final lecture (I’ll also bring some hex stickers with the PSTAT 100 course logo for everyone tomorrow!)

  • During Section tomorrow, you can either work on the bonus lab or work on the project.

    • A friendly reminder that Erika (our TA) is a great resource to talk to!