Examples of Statistical Modeling
Department of Statistics and Applied Probability; UCSB
Summer Session A, 2025
\[ \newcommand\R{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\F}{\mathcal{F}} \newcommand{\1}{1\!\!1} \newcommand{\comp}[1]{#1^{\complement}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\SD}{\mathrm{SD}} \newcommand{\vect}[1]{\vec{\boldsymbol{#1}}} \newcommand{\Cov}{\mathrm{Cov}} \DeclareMathOperator*{\argmin}{\mathrm{arg} \ \min} \]
We can think of a model as a mathematical or idealized representation of a system.
In statistical modeling, we adopt a three-step procedure: specify a model, fit the model (i.e. estimate its parameters), and assess its fit.
There are two main types of statistical models: parametric and nonparametric.
Today, let’s explore two examples of modeling: one nonparametric, and one parametric.
Density Estimation
Framework
Data: a collection of numerical values \(\vect{x} = (x_1, \cdots, x_n)\) that we believe to be a realization of an i.i.d. random sample \(\vect{X} = (X_1, \cdots, X_n)\) taken from a distribution with density \(f\).
Goal: to estimate the true value of \(f(x)\) at each point \(x\).
Important
There are actually two kinds of histograms: frequency histograms and density histograms.
If we let \(B_j\) denote the \(j\)th bin, each of common width \(h\), and if we have \(n\) observations, the height of the \(j\)th bar on a density histogram is given by \[ \mathrm{height}_j = \frac{1}{nh} \sum_{i=1}^{n} 1 \! \! 1_{\{x_i \in B_j\}} \]
Recall that the sum of indicators is a succinct way of writing “the number of observations that satisfy a condition”.
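As a quick sketch of this formula (using made-up numbers, not any dataset from the course), the bar heights of a density histogram can be computed directly from the indicator-sum definition:

```python
import numpy as np

def density_hist_heights(x, bins):
    """Height of the jth bar: (1 / (n*h)) * (count of observations in bin j)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = bins[1] - bins[0]                    # common bin width
    counts, _ = np.histogram(x, bins=bins)   # the sum of indicators, per bin
    return counts / (n * h)

# hypothetical data: 8 observations, 3 bins of width h = 1
x = np.array([0.1, 0.4, 0.5, 1.2, 1.9, 2.3, 2.4, 2.8])
bins = np.array([0.0, 1.0, 2.0, 3.0])
heights = density_hist_heights(x, bins)
print(heights)  # counts (3, 2, 3) -> heights (0.375, 0.25, 0.375)
```

Note that the bars of a density histogram always enclose a total area of 1, which is what lets us interpret the histogram as a density estimate.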
\[ \widehat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} \sum_{j} 1 \! \! 1_{\{x \in B_j\}} 1 \! \! 1_{\{x_i \in B_j\}} \]
In other words, our estimate takes the input x, finds the bin to which x belongs, and returns the height of that bin.
We call this procedure fixed binning, since, for a given x, we simply identify which, out of a set of fixed bins, x belongs to.
This leads us to the idea of local binning, in which we allow the height at x to be a normalization of the count in a neighborhood of x of width h (as opposed to the count in one of a set of fixed bins).
\[ \widehat{f}_{\mathrm{lb}}(x) = \frac{1}{nh} \sum_{i=1}^{n} 1 \! \! 1_{\{ \left| x_i - x \right| < \frac{h}{2} \}} \]
\[ \widehat{f}_{\mathrm{lb}}(x) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{1}{h} 1 \! \! 1_{\{ \left| x_i - x \right| < \frac{h}{2} \}} \right] \]
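The local-binning estimator above can be sketched in a few lines (the data values here are hypothetical):

```python
import numpy as np

def f_hat_lb(x, data, h):
    """Local-binning estimate: (1 / (n*h)) * #{x_i : |x_i - x| < h/2}."""
    data = np.asarray(data, dtype=float)
    return np.sum(np.abs(data - x) < h / 2) / (len(data) * h)

data = [0.1, 0.4, 0.5, 1.2, 1.9]
# window centered at x = 0.45 with width h = 1 captures 3 of the 5 points
print(f_hat_lb(0.45, data, h=1.0))  # 3 / (5 * 1) = 0.6
```

Unlike fixed binning, the window here is always centered at the query point \(x\), so the estimate no longer depends on where a fixed bin boundary happens to fall.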
\[ \widehat{f}_{\mathrm{KDE}}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\lambda} K_{\lambda}(x, x_i) \]
In general, we can use any valid probability density function as a kernel.
The estimate defined on the previous slide is called a kernel density estimate (KDE), and the general procedure of estimating a density as a weighted local average (with weights given by a kernel) of counts is called kernel density estimation.
Gaussian KDE utilizes a Gaussian kernel:
\[ \widehat{f}_{\mathrm{GKDE}}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\lambda} \phi\left( \frac{x_i - x}{\lambda} \right) \]
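A minimal from-scratch sketch of the Gaussian KDE formula above (again on hypothetical data; in practice one would typically reach for a library implementation such as `scipy.stats.gaussian_kde`):

```python
import numpy as np

def gaussian_kde(x, data, lam):
    """f_hat_GKDE(x) = (1/n) * sum_i (1/lam) * phi((x_i - x) / lam)."""
    data = np.asarray(data, dtype=float)
    z = (data - x) / lam
    phi = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)  # standard normal pdf
    return np.mean(phi / lam)

data = [0.1, 0.4, 0.5, 1.2, 1.9]
xs = np.linspace(-2, 4, 601)
fx = np.array([gaussian_kde(x, data, lam=0.5) for x in xs])
```

Because each kernel is itself a density, the resulting estimate is nonnegative and integrates to 1; the bandwidth \(\lambda\) plays the same smoothing role that \(h\) plays for the histogram-based estimates.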
Recall that, in the parametric setting of modeling, we express our model in terms of a series of estimable parameters.
For example, suppose we assume the weight Yi of a randomly selected cat to follow a Normal distribution with mean µ and variance 1, independently across observations.
\[ Y_i \stackrel{\mathrm{i.i.d.}}{\sim} \mathcal{N}(\mu, 1) \]
\[ Y_i = \mu + \varepsilon_i, \quad \text{for } \varepsilon_i \stackrel{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, 1) \]
This is precisely what is meant by Step 2 of the modeling procedure (sometimes called the model fitting stage): we identify estimators/estimates of the parameters that are ideal in some way.
In our coin tossing example, our choice of estimator was relatively straightforward: use the sample proportion as an estimator for the population proportion.
In many cases, however, there won’t necessarily be one obvious choice for parameter estimates.
In general, we can think of an estimator as a rule: a function \(\delta\) that maps observed data to a guess for the parameter.
A loss function is a mathematical quantification of the consequence paid in estimating a parameter θ by an estimator δ(Y).
The risk is the average loss: \(\mathbb{E}_{Y}[\mathcal{L}(\theta, \delta(Y))]\).
An “optimal” estimator for θ is therefore one whose rule δ minimizes the risk: \[ \widehat{\theta} := \argmin_{\delta} \left\{ \mathbb{E}_{Y}[\mathcal{L}(\theta, \delta(Y))] \right\} \]
As an illustration of this framework of estimation, let’s attempt to find the value c that “best” summarizes a set of observations (y1, …, yn).
As the distribution of Y is, in general, unknown, we can consider the empirical risk in place of the risk: \[ R(c) := \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, c) \]
Again, our “best” summary will be the value of c that minimizes R(c).
There are many choices for loss functions, each with pros and cons.
For numerical data, the two most common loss functions are:
Squared Error (aka L2)
\(\mathcal{L}(y_i, \theta) = (y_i - \theta)^2\)
Absolute Error (aka L1)
\(\mathcal{L}(y_i, \theta) = |y_i - \theta|\)
L1 loss tends to be more robust to outliers than L2 loss.
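This robustness claim can be checked numerically. The sketch below (with made-up data) grid-searches the empirical-risk-minimizing summary \(c\) under each loss, before and after a single outlier is appended:

```python
import numpy as np

def argmin_risk(y, loss, grid):
    """Grid-search the c minimizing the empirical risk (1/n) * sum_i loss(y_i, c)."""
    risks = [np.mean(loss(y, c)) for c in grid]
    return grid[int(np.argmin(risks))]

grid = np.linspace(0, 30, 3001)
y = np.array([2.0, 3.0, 3.5, 4.0])
y_out = np.append(y, 100.0)                 # contaminate with one large outlier

l2 = lambda y, c: (y - c) ** 2              # squared-error (L2) loss
l1 = lambda y, c: np.abs(y - c)             # absolute-error (L1) loss

print(argmin_risk(y, l2, grid), argmin_risk(y_out, l2, grid))  # L2 minimizer jumps
print(argmin_risk(y, l1, grid), argmin_risk(y_out, l1, grid))  # L1 minimizer barely moves
```

The single outlier drags the L2-optimal summary far to the right, while the L1-optimal summary stays near the bulk of the data.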
As an example of how loss functions help us construct estimators for parameters, let us consider the case of L2 loss.
Given data (y1, …, yn), our “optimal” (risk-minimizing) estimate for θ (under L2 loss) satisfies \[ \widehat{\theta}_n := \argmin_{\theta} \left\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2 \right\} \]
\[\begin{align*} \frac{\partial}{\partial \theta} R(\theta) & = \frac{\partial}{\partial \theta} \left[ \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta)^2 \right] \\ & = \frac{2}{n} \sum_{i=1}^{n} (\theta - y_i) = \frac{2}{n} \left[ n \theta - n \overline{y}_n \right] = 2(\theta - \overline{y}_n) \\ \implies 2(\widehat{\theta} - \overline{y}_n) & = 0 \ \implies \ \boxed{\widehat{\theta} = \overline{y}_n} \end{align*}\]
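A quick numerical sanity check of this derivation, on made-up data: minimizing the empirical L2 risk over a fine grid should recover the sample mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 7.0])
R = lambda theta: np.mean((y - theta) ** 2)   # empirical risk under L2 loss

thetas = np.linspace(0, 10, 10001)
theta_hat = thetas[np.argmin([R(t) for t in thetas])]
print(theta_hat, y.mean())  # theta_hat matches the sample mean, 3.5
```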
Chalkboard Example
Identify the summary statistic that minimizes the empirical risk under L1 loss.
Both of our examples up until now can be classified as “univariate” modeling, as they involve only one variable.
In data science, we are often interested in how two (or more) variables are related to one another.
As an example, it’s not difficult to surmise that houses built in different years sell for different prices.
As such, we can explore a dataset that tracks the median selling price of homes built in various years, as sold in King County (Washington State) between May 2014 and May 2015.
In other words: for every year xi we believe there is an associated “true” median selling price f(xi).
This true price is unobserved; what we actually observe is a median selling price yi that has been contaminated by some noise εi.
Mathematically: \(y_i = f(x_i) + \varepsilon_i\), for some zero-mean constant-variance random variable εi.
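The signal-plus-noise model can be simulated directly. The sketch below uses a hypothetical linear signal function and made-up units; it is an illustration of the model's structure, not the actual King County data:

```python
import numpy as np

rng = np.random.default_rng(100)

f = lambda x: 250 + 2.5 * (x - 1950)            # hypothetical signal: price vs. year built
x = np.arange(1900, 2015)                        # year built
eps = rng.normal(loc=0, scale=20, size=x.size)   # zero-mean, constant-variance noise
y = f(x) + eps                                   # observed (noise-contaminated) responses
```

Plotting `y` against `x` would show the unobserved signal `f` blurred by the scatter of `eps`; the modeling task is to recover `f` from such data.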
\[ y = f(x) + \text{noise} \]
Here, \(y\) is the response variable, \(f\) is the signal function, and \(x\) is the explanatory variable.
If \(y\) is numerical, the model is called a regression model; if \(y\) is categorical, the model is called a classification model.
The noise term can be thought of as a catch-all for any uncertainty present in our model.
Broadly speaking, uncertainty can be classified as either epistemic (aka “reducible”) or aleatoric (aka “irreducible”).
Epistemic uncertainty stems from a lack of knowledge about the world; with additional information, it could be reduced.
Aleatoric uncertainty, on the other hand, stems from randomness inherent in the world; no amount of additional information can reduce it.
As an example: errors arising from a misspecified model are epistemic, since, in principle, correcting the model would eliminate them.
On the other hand, measurement error is widely accepted as aleatoric; repeated measurements will not reduce the amount of measurement error.
Since the noise term in our model captures the uncertainty present, we treat it as a zero-mean random variable.
Later, we’ll need to add some additional specificity to this (e.g. what distribution does it follow? What assumptions do we need to make about its variance?) - for now, we’ll leave things fairly general.
Prediction
Inference
PSTAT 100 - Data Science: Concepts and Analysis, Summer 2025 with Ethan P. Marzban