Statistical Visualizations, Part I
Department of Statistics and Applied Probability; UCSB
Summer Session A, 2025
Make sure you render your lab to a PDF and submit the PDF to Gradescope (I’ll quickly show you how to do this on the server now).
FOLLOW THE INSTRUCTIONS ON THE LAB! (I already see some Gradescope submissions that haven’t followed all the instructions, especially when it comes to adding/removing names to your lab submission).
Please also don’t forget to submit by 11:59pm tonight (Wednesday) on Gradescope!
Data can be highly informative.
The information it provides, however, is oftentimes not immediately apparent.
One of the most important ways we can uncover this information is by producing appropriate summaries of our data.
Check your Understanding
What type of variable is standings (i.e. what is its classification)?
Freshman | Sophomore | Junior | Senior |
---|---|---|---|
5 | 5 | 6 | 2 |
How can we convert this to a graphical summary?
Here’s one idea: draw four rectangles (i.e. “bars”), one for each of the four possible standings.
We can make the height of each bar proportional to the corresponding frequency.
This type of plot is called a barplot (or bargraph), and is the ideal visualization for a categorical variable.
In general, for a categorical variable with $k$ categories $C_1$ through $C_k$ with corresponding frequencies $f_1$ through $f_k$, the resulting barplot will have $k$ bars, with the height of the $i$th bar given by $f_i$.
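As a minimal sketch of this rule, the following Python snippet draws a text-based barplot from the class-standing frequency table above. The function name `text_barplot` and the `#` bar character are illustrative assumptions, not something from the lecture; the bars are drawn horizontally, so bar length plays the role of height.

```python
# Frequencies for the four categories in the table above.
standing_counts = {"Freshman": 5, "Sophomore": 5, "Junior": 6, "Senior": 2}


def text_barplot(frequencies):
    """Return one text bar per category; bar length equals the frequency."""
    width = max(len(category) for category in frequencies)
    return [
        f"{category:<{width}} | {'#' * count} ({count})"
        for category, count in frequencies.items()
    ]


bars = text_barplot(standing_counts)
for bar in bars:
    print(bar)
```

There is one bar per category, and the only quantity that drives each bar is that category's frequency.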
Stick with a barplot!
If you really desire a dessert-themed plot, consider a donut plot:
So, that takes care of what type of plot to make when we have a single categorical variable. What about when we have a single numerical variable?
As another concrete example, consider the following mock dataset comprised of exam scores (reported as a percentage between 0 and 100):
There are different conventions for edge cases, but the most common is to have left-inclusive intervals.
By the way, we no longer call this table a frequency table; instead, we call it a distribution table.
We can, however, treat the distribution table in a similar manner to a frequency table: construct as many bars as we have cells, with heights proportional to the counts within each cell.
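The binning step can be sketched in Python as follows. The scores below are made-up values for illustration only (they are not the lecture's mock dataset), and the bin width of 10 and the helper name `distribution_table` are likewise assumptions. Each bin is left-inclusive, so a score of exactly 70 is counted in [70, 80); as one possible edge-case convention, the final bin also includes its right endpoint so that a score of 100 is not dropped.

```python
def distribution_table(values, low=0, high=100, width=10):
    """Count values in left-inclusive bins of equal width between low and high.

    Assumption: the final bin also includes its right endpoint, so a value
    equal to `high` is counted rather than dropped.
    """
    n_bins = (high - low) // width
    counts = [0] * n_bins
    for value in values:
        if not low <= value <= high:
            raise ValueError(f"{value} is outside [{low}, {high}]")
        index = min(int((value - low) // width), n_bins - 1)
        counts[index] += 1
    return {
        (low + i * width, low + (i + 1) * width): count
        for i, count in enumerate(counts)
    }


# Hypothetical exam scores, invented for this example.
scores = [55, 62, 68, 70, 71, 75, 79, 80, 84, 88, 90, 95, 100]
table = distribution_table(scores)

for (left, right), count in table.items():
    if count:
        closing = "]" if right == 100 else ")"
        print(f"[{left}, {right}{closing}: {'#' * count} ({count})")
```

Each printed row is one cell of the distribution table, and the run of `#` characters is the corresponding bar.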
Example: 100 people were asked to run one mile; their completion times (in minutes) were recorded, and the following boxplot was generated:
Most datasets are comprised of more than just one variable. As such, a common question among Data Scientists is: how do the different variables in a given dataset relate to one another?
We’ll tackle the case of comparing two variables today, and save our multivariate considerations for later.
Even in the two-variable case, there are three subcases to consider:
Commute.Dist. | Commute.Time |
---|---|
0.5 | 3 |
1 | 2 |
1.5 | 4 |
2 | 6 |
2.5 | 8 |
When considering scatterplots, certain patterns may become apparent.
Such patterns are called trends.
Most trends can be classified along two axes: positive/negative, and linear/nonlinear.
A positive trend is observed when, as x increases, so does y; a negative trend is observed when, as x increases, y decreases.
A trend whose rate of change is constant is said to be linear; a trend whose rate of change is nonconstant is said to be nonlinear.
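One rough numerical check on the direction of a trend is the sign of the sample covariance: it is positive when above-average x values tend to come with above-average y values, and negative in the opposite case. The sketch below applies this to the commute data from the table above. The function names are illustrative assumptions, and the covariance sign only speaks to direction; the consecutive rates of change are printed separately to give a feel for how constant (or not) the rate of change is.

```python
# Commute data from the table above: distance (x) and time (y).
distance = [0.5, 1, 1.5, 2, 2.5]
time = [3, 2, 4, 6, 8]


def sample_covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


def trend_direction(x, y):
    """Label the overall direction of y vs. x using the covariance sign."""
    cov = sample_covariance(x, y)
    if cov > 0:
        return "positive"
    if cov < 0:
        return "negative"
    return "no clear direction"


# Rates of change between consecutive points; a perfectly linear trend
# would give the same value every time.
points = list(zip(distance, time))
slopes = [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(points, points[1:])]

print(trend_direction(distance, time))
print(slopes)
```

Real data are noisy, so individual rates of change can disagree with the overall direction even when the trend is clear by eye.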
If a scatterplot of y vs. x displays a positive linear trend, we would say that x and y have a positive linear association, or that x and y are positively linearly associated.
Your Turn!
Turn to your neighbor(s), and come up with an example of a pair of variables you believe would exhibit a positive association, a pair that you believe would exhibit a negative association, and a pair you believe would exhibit no association.
ID | Group | Syst_BP |
---|---|---|
1 | Control | 145 |
2 | Control | 140 |
3 | Treatment | 120 |
4 | Control | 143 |
5 | Treatment | 115 |
6 | Treatment | 103 |
7 | Control | 146 |
8 | Treatment | 117 |
Ignoring the ID variable, rows of our dataframe are once again pairs of objects.
Now, however, these pairs are not pairs of numbers; hence, plotting them on a Cartesian Coordinate system doesn’t make a whole lot of sense.
Nevertheless, if we so desire, we can generate something resembling a scatterplot, called a dotplot:
This type of plot is called a side-by-side boxplot.
In general, a side-by-side boxplot has as many boxplots as categories, with the structure of each boxplot governed by the distribution of the numerical variable within each category.
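To make "the distribution of the numerical variable within each category" concrete, here is a minimal Python sketch that groups the blood pressure data from the table above by Group and computes a five-number summary for each group, which is the information each box is built from. The function name `five_number_summaries` and the choice of quartile method are assumptions; quartile conventions differ between tools, so box edges drawn by plotting software may differ slightly.

```python
from statistics import quantiles

# Rows (Group, Syst_BP) from the table above; the ID column is ignored.
rows = [
    ("Control", 145), ("Control", 140), ("Treatment", 120), ("Control", 143),
    ("Treatment", 115), ("Treatment", 103), ("Control", 146), ("Treatment", 117),
]


def five_number_summaries(pairs):
    """Group the numerical values by category and summarize each group."""
    groups = {}
    for category, value in pairs:
        groups.setdefault(category, []).append(value)

    summaries = {}
    for category, values in groups.items():
        # Assumption: quartiles use the "inclusive" interpolation method;
        # other conventions give slightly different box edges.
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        summaries[category] = {
            "min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values),
        }
    return summaries


summaries = five_number_summaries(rows)
for category, summary in summaries.items():
    print(category, summary)
```

Comparing the two summaries side by side is the numerical counterpart of comparing the two boxes.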
By the way, notice that we can still consider the notion of trend, even in a side-by-side boxplot!
Caution
Association does not imply causation.
Your Turn!
For each of the following scenarios, identify the type of graph you think is best.
Finally, we tackle the case of two categorical variables.
Instead of simulated data… let’s look at y’all’s data!
Animal | Number |
---|---|
Cats | Even |
Cats | Odd |
Dogs | Even |
Cats | Even |
Dogs | Odd |
Dogs | Odd |
Dogs | Odd |
Dogs | Even |
I asked you two questions: whether you prefer cats or dogs, and whether you prefer even or odd numbers.
Both variables (Animal and Number) are categorical.
But what does it mean to compare these variables?
We can’t even really make a dotplot.
The only possible points are (Cats, Even), (Cats, Odd), (Dogs, Even), (Dogs, Odd).
Sure, if some combinations of Animal and Number preferences were completely absent from the data, that would be something we could tell from the dotplot.
That’s not the case here, though; among all 25 points of data, all four combinations have been covered.
But, remember: even though it looks like there are only 4 points on our dotplot, there are actually 25; many of them are stacked on top of each other.
So, wouldn’t it be nice to incorporate information on how many points are stacked on top of each other?
Now, we “cheated” a bit.
Specifically, we introduced information about the number of observations corresponding to each (Animal, Number) combination.
That is, in essence, we’ve included information on our plot about a third variable!
This is one of the strange things about comparing two categorical variables: it is essentially impossible to make such a comparison without resorting to including cross-tabulated values.
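Cross-tabulated values are simply counts of how many observations fall into each combination of categories. As a minimal sketch, the Python snippet below tabulates only the eight rows shown in the table above (the full class dataset has 25 responses, which are not reproduced here); the variable names are illustrative assumptions.

```python
from collections import Counter

# The eight (Animal, Number) rows shown above; the remaining responses
# from the full 25-row dataset are not included.
responses = [
    ("Cats", "Even"), ("Cats", "Odd"), ("Dogs", "Even"), ("Cats", "Even"),
    ("Dogs", "Odd"), ("Dogs", "Odd"), ("Dogs", "Odd"), ("Dogs", "Even"),
]

# Count how many observations share each (Animal, Number) combination.
crosstab = Counter(responses)

animals = ["Cats", "Dogs"]
numbers = ["Even", "Odd"]

# Print the counts as a two-way table: rows are animals, columns are numbers.
print(f"{'':6}" + "".join(f"{number:>6}" for number in numbers))
for animal in animals:
    cells = "".join(f"{crosstab[(animal, number)]:>6}" for number in numbers)
    print(f"{animal:6}{cells}")
```

These counts are the "third variable" that has to be shown on the plot, for instance through the size or color of each of the four points.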
Violinplots
Hexagonal Heatmaps
Ridgeline Plot:
In tomorrow’s lecture, we’ll introduce a framework for producing graphics using computer software.
We’ll also discuss some multivariate plots (i.e. plots that incorporate information from more than 2 variables).
Finally, we’ll talk a little bit about color theory, and some principles of good visualizations.
Friendly Reminder: keep working on Homework 1!
Another Friendly Reminder: don’t forget to submit Lab 01 by 11:59pm tonight!
Final Friendly Reminder: please submit all required DSP paperwork ASAP (no later than tomorrow, to ensure it gets processed in time for the first ICA next week).
PSTAT 100 - Data Science: Concepts and Analysis, Summer 2025 with Ethan P. Marzban