An Introduction to Neural Networks
Department of Statistics and Applied Probability; UCSB
Summer Session A, 2025
\[ \newcommand\R{\mathbb{R}} \newcommand{\N}{\mathbb{N}} \newcommand{\E}{\mathbb{E}} \newcommand{\Prob}{\mathbb{P}} \newcommand{\F}{\mathcal{F}} \newcommand{\1}{1\!\!1} \newcommand{\comp}[1]{#1^{\complement}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\SD}{\mathrm{SD}} \newcommand{\vect}[1]{\vec{\boldsymbol{#1}}} \newcommand{\tvect}[1]{\vec{\boldsymbol{#1}}^{\mathsf{T}}} \newcommand{\hvect}[1]{\widehat{\boldsymbol{#1}}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\tmat}[1]{\mathbf{#1}^{\mathsf{T}}} \newcommand{\Cov}{\mathrm{Cov}} \DeclareMathOperator*{\argmin}{\mathrm{arg} \ \min} \newcommand{\iid}{\stackrel{\mathrm{i.i.d.}}{\sim}} \]
Let’s, for the moment, go back to our last lecture before the ICA, when we talked about logistic regression.
Our idea was that a linear combination of our covariate functions could be mapped to a value in [0, 1] by way of a transformation; namely, the logistic function.
Diagrammatically, we might represent this using the graph to the left.
This graph illustrates the “direct” relationship between the covariates and the output probability (albeit through the logistic function, not pictured on the graph).
By “graph,” we don’t mean the graph of a function but rather graph in the mathematical sense.
A graph consists of a collection of nodes (represented pictorially as circles) and edges (represented pictorially as lines connecting nodes).
We might imagine a situation in which, instead of having a binary output, we have a categorical output with more than two levels.
For example, back when we talked about PCA, we encountered the MNIST dataset in which each image belongs to one of 10 classes (each representing a digit).
We might update our diagram, then, to look something like that to the left.
For example, in the MNIST dataset, we would have \(K = 10\) output probabilities.
Recall our setup from that lecture: a linear combination of (possibly transformed) covariates \[ y(\vect{x}, \vect{\beta}) = \beta_0 + \sum_{j=1}^{p} \beta_j \phi_j(\vect{x})\]
is passed through an activation function \(g(\cdot)\) (e.g. the logistic function): \[ y(\vect{x}, \vect{\beta}) = g\left( \beta_0 + \sum_{j=1}^{p} \beta_j \phi_j(\vect{x}) \right) = g \left( \tvect{\beta} \vect{\phi}(\vect{x}) \right) \]
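As a quick illustration, here is a minimal sketch of this mapping in R; the covariate values and coefficients below are made up purely for illustration, and plogis() is R's built-in logistic function:

phi  <- c(2.5, -1.0, 0.3)              ## hypothetical values of phi_1(x), ..., phi_p(x)
beta <- c(0.5, 0.8, -0.4, 1.2)         ## hypothetical coefficients (beta_0, ..., beta_p)
eta  <- beta[1] + sum(beta[-1] * phi)  ## linear combination: beta_0 + sum_j beta_j phi_j(x)
plogis(eta)                            ## logistic transform; always lands in [0, 1]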
These are all examples of artificial neural networks (often referred to simply as neural nets).
Broadly speaking, neural nets are mathematical models that are motivated by the functioning of the brain.
The general idea is to recursively construct a model, whereby outputs of one portion are used as inputs in another portion.
The neural networks we’ve seen so far today are all examples of one-layer (sometimes called shallow) networks, in which we have just one input layer and one output layer, and no layers in between.
One input layer
One hidden layer
One output layer
One parameter per edge
Each node “feeds” into another one layer down; hence the name feedforward neural network (NN).
We first take linear combinations of our inputs, and pass each through an activation function \(\sigma_X(\cdot)\).
Each of the resulting quantities forms a node of the hidden layer; the hidden layer's values are then, in turn, combined linearly and passed through a final output activation function \(\sigma_Z(\cdot)\).
The resulting values are treated as the outputs of our NN.
Input Layer:
\(X = [x_1, \ \ldots, \ x_p]\)
Hidden Layer:
\(Z = [\sigma_X(X\alpha_1), \ \ldots, \ \sigma_X(X\alpha_M)]\)
Output Layer:
\(\E[Y] = \sigma_Z(Z\beta)\)
\[ Y = f(X) =: (\sigma_Z \circ h_{\beta} \circ \sigma_X \circ h_\alpha)(X)\]
Each function in the composition is either known (like the activation functions) or linear (what I’ve called the h functions above).
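To make the composition concrete, here is a minimal sketch of a forward pass in R; the dimensions, the random weights, and the choice of a logistic hidden activation with an identity output activation are all assumptions made for illustration:

set.seed(100)                            ## for reproducibility
p <- 4; M <- 3                           ## p inputs, M hidden nodes (hypothetical)
x     <- rnorm(p)                        ## one observation's inputs
alpha <- matrix(rnorm((p + 1) * M), nrow = p + 1)  ## h_alpha: linear map (with intercepts)
beta  <- rnorm(M + 1)                    ## h_beta: linear map (with intercept)
sigma_X <- plogis                        ## hidden activation (assumed logistic)
sigma_Z <- identity                      ## output activation (assumed identity)
Z <- sigma_X(drop(c(1, x) %*% alpha))    ## hidden layer: (sigma_X . h_alpha)(x)
Y <- sigma_Z(sum(c(1, Z) * beta))        ## output: (sigma_Z . h_beta)(Z)
Y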
So, two key questions become apparent: how flexible is this class of functions, and how do we actually estimate its parameters?
There are a few different versions of the so-called Universal Approximation Theorem, posited and proved by different people at different points in time, differing in the assumptions made.
The broadest version of this theorem states: a feedforward neural network with one hidden layer and a finite number of neurons can approximate any continuous function on a compact subset of \(\R^n\) to an arbitrary degree of closeness.
Crucially, though, the UAT does not tell us how to find such an approximation.
To find one, we proceed as we always have: by positing a loss function and minimizing it over the parameters. However, unlike with the simpler statistical models we considered (e.g. SLR), the minimization problems that arise in the context of training neural nets are often quite tricky and don’t always admit closed-form solutions.
As such, it is common to use iterative methods to solve the minimization problem.
One popular choice of such an algorithm is called gradient descent, which we’ll discuss in a bit.
Before we dive too deep into the general optimization, it may be useful to make concrete our goals in fitting a neural net.
That is: we begin with 50 input values \(x_1\) through \(x_{50}\).
What is it that we want out of our network?
So, here’s the trick: we imagine finding the values of the signal function at a very fine grid of points. The finer the grid, the “smoother” our final function will look.
That is, the output of our NN should be a set of values \(\{y_1, \ldots, y_K\}\) for some large value \(K\), where \(y_k\) denotes the value of the signal at some grid point \(x_k\).
We take linear combinations of our 50 input values: \[ a_j^{(1)} = \sum_{i=1}^{50} w_{ij}^{(1)} x_i + w_{j0}^{(1)}, \quad j = 1, 2, 3 \] and transform each by an activation function: \[ z_j^{(1)} = \sigma_1 (a_j^{(1)}) \]
The \(z_j^{(1)}\) then form the input to our hidden layer, and receive a treatment similar to the one given to the original \(x_i\).
That is: \[ a_k^{(2)} = \sum_{j=1}^{3} w_{jk}^{(2)} z_{j}^{(1)} + w_{k0}^{(2)} \] for k = 1, …, K (we can, somewhat arbitrarily, take K to be 1000).
Finally, these \(a_k^{(2)}\) are transformed by way of an output activation function to obtain the K output values: \[ y_k = \sigma_2 (a_{k}^{(2)}) \]
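Putting these three steps together, a sketch of the full forward pass in R might look as follows; the weights are random placeholders, and taking \(\sigma_1\) to be tanh and \(\sigma_2\) to be the identity is an assumption made purely for illustration:

set.seed(100)                           ## for reproducibility
K  <- 1000                              ## number of outputs
x  <- rnorm(50)                         ## placeholder values for the 50 inputs
W1 <- matrix(rnorm(3 * 51), nrow = 3)   ## row j holds (w_{j0}^{(1)}, ..., w_{j50}^{(1)})
W2 <- matrix(rnorm(K * 4), nrow = K)    ## row k holds (w_{k0}^{(2)}, ..., w_{k3}^{(2)})
a1 <- drop(W1 %*% c(1, x))              ## a_j^{(1)}: linear combinations of the inputs
z1 <- tanh(a1)                          ## z_j^{(1)} = sigma_1(a_j^{(1)}) (assumed tanh)
a2 <- drop(W2 %*% c(1, z1))             ## a_k^{(2)}: linear combinations of the z_j^{(1)}
y  <- a2                                ## y_k = sigma_2(a_k^{(2)}) (assumed identity)
length(y)                               ## K = 1000 output values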
Let’s take a step back and examine the various components of this NN.
First, there is the matter of selecting the two activation functions σ1() and σ2().
Then, there is the matter of estimating the parameters \[ \begin{align*} \{w_{j0}^{(1)}, \ w_{j1}^{(1)}, \ \cdots, \ w_{j50}^{(1)} \} & \quad j = 1, 2, 3 \\ \{w_{k0}^{(2)}, \ w_{k1}^{(2)}, \ w_{k2}^{(2)}, \ w_{k3}^{(2)} \} & \quad k = 1, \cdots, K \\ \end{align*} \]
For \(K = 1000\) outputs, this is a total of \(3(51) + 1000(4) = 4153\) parameters to estimate.
As you can imagine, the number of parameters in an arbitrary NN can be astronomically large.
In general, if we consider a “vanilla” NN (one hidden layer) with D input values, M hidden nodes, and K outputs, the parameters to estimate becomes \[ \begin{align*} \{w_{j0}^{(1)}, \ w_{j1}^{(1)}, \ \cdots, \ w_{jD}^{(1)} \} & \quad j = 1, \cdots, M \\ \{w_{k0}^{(2)}, \ w_{k1}^{(2)}, \ \cdots, \ w_{kM}^{(2)} \} & \quad k = 1, \cdots, K \\ \end{align*} \] leaving a total of [M (D + 1) + K (M + 1)] = M (D + K + 1) + K parameters.
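As a quick sanity check, we might wrap this count in a small helper function (hypothetical, just for verification):

n_params <- function(D, M, K) M * (D + 1) + K * (M + 1)  ## total parameter count
n_params(D = 50, M = 3, K = 1000)                        ## the example above: 153 + 4000 = 4153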
And this is all with only one hidden layer; for Deep Neural Nets, it is not uncommon for the number of parameters to surpass a million (or, in some cases, even a billion)!
Neural Networks are perhaps not as “recent” a phenomenon as people think - the earliest neural network model was the perceptron model, proposed by Frank Rosenblatt back in 1957.
A fair amount of work was dedicated towards Neural Networks through the turn of the millennium, with a “burst” in publications through the 90s and into the early 2000s.
Work on Neural Networks slowed a bit, however, mainly due to the intense computational challenges involved in training them.
Around 2010, however, advancements in computational power drove what is considered to be the most recent (“second” or “third”, depending on who you ask) wave of interest in Neural Networks.
Since around 2016, the number of publications relating to Neural Networks has skyrocketed, and NNs remain a popular area of research to this day.
This most recent wave of interest has been accompanied by interest in the newly-minted field of deep learning, which, again, has only become feasible to research thanks to recent and continued advancements in computing.
Figure 1.16 from Deep Learning by Bishop and Bishop
Gradient Descent (GD) is an algorithm that can be used to identify local minima of (possibly multivariate) functions.
You can imagine why this is useful: sometimes, minimization problems don’t admit closed-form expressions and as such we may need to resort to iterative algorithms to solve them.
The basic idea is as follows: the gradient (which we can think of as a multivariate derivative) gives the direction of greatest increase.
Perhaps a one-dimensional illustration may help:
More formally, in the case of a univariate function f(x), we start with an initial “guess” x1.
Then, we iteratively define \(x_i = x_{i-1} - \alpha f'(x_{i-1})\) for some step size \(\alpha > 0\).
We stop the algorithm once the difference between xi and xi-1 is small.
The step size is fairly important: if it is too small, the algorithm may take a long time to converge. If it is too large, we may “overshoot” the minimum.
Even if our algorithm converges, we need to be cautious that it may have converged at a local minimum.
Example: \(f(x) = x^2\)
xprev <- -1 ## initialize
h <- 0.1 ## step size
tol <- 10e-6 ## tolerance for convergence
iter <- 0 ## track the number of iterations
itermax <- 100 ## cap the number of iterations
repeat{
iter <- iter + 1
xnew <- xprev - h * (2*xprev) ## update step
if(abs(xnew - xprev) <= tol) {
break ## convergence condition
} else {
xprev <- xnew ## update and restart
}
if(iter >= itermax){
stop("Maximum Number of Iterations Exceeded")
}
}
cat("Final Answer:", xnew, "\n", "Iterations:", iter)
Final Answer: -3.484491e-05
Iterations: 46
Example: \(f(x) = x^2\)
xprev <- -1 ## initialize
h <- 0.05 ## step size
tol <- 10e-6 ## tolerance for convergence
iter <- 0 ## track the number of iterations
itermax <- 100 ## cap the number of iterations
repeat{
iter <- iter + 1
xnew <- xprev - h * (2*xprev) ## update step
if(abs(xnew - xprev) <= tol) {
break ## convergence condition
} else {
xprev <- xnew ## update and restart
}
if(iter >= itermax){
stop("Maximum Number of Iterations Exceeded")
}
}
cat("Final Answer:", xnew, "\n", "Iterations:", iter)
Final Answer: -8.46415e-05
Iterations: 89
Example: \(f(x) = x^2\)
xprev <- -1 ## initialize
h <- 1 ## step size
tol <- 10e-6 ## tolerance for convergence
iter <- 0 ## track the number of iterations
itermax <- 100 ## cap the number of iterations
repeat{
iter <- iter + 1
xnew <- xprev - h * (2*xprev) ## update step
if(abs(xnew - xprev) <= tol) {
break ## convergence condition
} else {
xprev <- xnew ## update and restart
}
if(iter >= itermax){
stop("Maximum Number of Iterations Exceeded")
}
}
Error: Maximum Number of Iterations Exceeded
If we have a multivariate function \(f(\vect{x})\), the gradient serves the role of the derivative: \[ \vec{\boldsymbol{\nabla}} f(\vect{x}) := \begin{bmatrix} \frac{\partial}{\partial x_1}f(\vect{x}) \\ \vdots \\ \frac{\partial}{\partial x_n}f(\vect{x}) \\ \end{bmatrix} \]
Our GD algorithm is relatively straightforward to update:
Initialize a starting vector \(\vect{x}^{(1)}\)
At step s, update according to \(\vect{x}^{(s)} = \vect{x}^{(s - 1)} - \alpha \vec{\boldsymbol{\nabla}}f(\vect{x}^{(s - 1)})\)
Iterate until \(\|\vect{x}^{(s)} - \vect{x}^{(s - 1)}\|^2\) is small.
As an example, consider \[ f(x, y) = \frac{1}{2}\left(x^2 + y^2\right) \]
We’ll compute the gradient on the board; it works out to \(\vec{\boldsymbol{\nabla}} f(x, y) = (x, \ y)^{\mathsf{T}}\), which is exactly what appears in the update step below.
xprev <- c(-1, -1); h <- 0.8; tol <- 10e-8
iter <- 0; itermax <- 100
repeat{
iter <- iter + 1
xnew <- xprev - h * xprev ## update step: the gradient of f at (x, y) is (x, y) itself
if(sum((xnew - xprev)^2) <= tol) { break }
else { xprev <- xnew }
if(iter >= itermax){
stop("Maximum Number of Iterations Exceeded")
}
}
cat("Final Answer:", xnew, "\n", "Iterations:", iter)
Final Answer: -1.28e-05 -1.28e-05
Iterations: 7
xprev <- c(-1, -1); h <- 1; tol <- 10e-8
iter <- 0; itermax <- 100
repeat{
iter <- iter + 1
xnew <- xprev - h * xprev ## update step: the gradient of f at (x, y) is (x, y) itself
if(sum((xnew - xprev)^2) <= tol) { break }
else { xprev <- xnew }
if(iter >= itermax){
stop("Maximum Number of Iterations Exceeded")
}
}
cat("Final Answer:", xnew, "\n", "Iterations:", iter)
Final Answer: 0 0
Iterations: 2
So, to summarize: if we want to minimize a function, we can use Gradient Descent to descend along the graph of the function in the direction opposite to the gradient.
At each iteration, we take a step of size \(\alpha\) in that direction.
We are only guaranteed convergence at a local minimum, not necessarily a global minimum.
So, why did I bring this up now?
Well, recall where we left off in our discussion on Neural Networks: we said that estimating the parameters can be accomplished by minimizing an appropriate loss function.
Indeed, most popular choices for loss functions lead to minimization problems that do not admit analytic solutions.
Explicitly computing the necessary gradients and partial derivatives leads to what is known as the backpropagation algorithm, which remains a very popular method for parameter estimation in Neural Networks.
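To give a flavor of how this plays out, here is a minimal backpropagation sketch for a one-hidden-layer network trained with squared-error loss via gradient descent; the simulated data, the network size, the logistic hidden activation, the identity output activation, and the step size are all assumptions made for illustration:

set.seed(100)                              ## for reproducibility
n <- 100; D <- 1; M <- 3                   ## hypothetical data size and network size
x <- runif(n, -2, 2)                       ## simulated inputs
y <- sin(2 * x) + rnorm(n, sd = 0.1)       ## simulated "signal plus noise" responses
X  <- cbind(1, x)                          ## design matrix (bias column + input)
W1 <- matrix(rnorm(M * (D + 1), sd = 0.5), nrow = M)  ## hidden weights (incl. biases)
w2 <- rnorm(M + 1, sd = 0.5)               ## output weights (incl. bias)
alpha <- 0.05                              ## step size (assumed)
for(s in 1:5000){
  ## forward pass
  A1 <- X %*% t(W1)                        ## n x M pre-activations
  Z1 <- 1 / (1 + exp(-A1))                 ## n x M hidden values (logistic activation)
  yhat <- drop(cbind(1, Z1) %*% w2)        ## n predictions (identity output activation)
  ## backward pass: chain rule, layer by layer
  err   <- yhat - y                        ## residuals (constant 2 absorbed into alpha)
  g_w2  <- drop(crossprod(cbind(1, Z1), err)) / n   ## gradient w.r.t. output weights
  delta <- (err %o% w2[-1]) * Z1 * (1 - Z1)         ## errors propagated to hidden layer
  g_W1  <- t(crossprod(X, delta)) / n               ## gradient w.r.t. hidden weights
  ## gradient descent updates
  w2 <- w2 - alpha * g_w2
  W1 <- W1 - alpha * g_W1
}
mean((yhat - y)^2)                         ## training MSE after the final forward pass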
In lab today, it will be Labubu time
Tomorrow, we’ll delve a bit into Causal Inference
Please keep working on your projects!
PSTAT 100 - Data Science: Concepts and Analysis, Summer 2025 with Ethan P. Marzban