Data Science Ethics
Department of Statistics and Applied Probability; UCSB
Summer Session A, 2025
“a set of moral principles : a theory or system of moral values” (source)
One of the foremost statisticians of the 20th century was Ronald Aylmer Fisher, who was also an outspoken proponent of eugenics.
There is some debate surrounding the term “eugenics,” especially in light of its usage by the Nazi Party.
Regardless, it was a purportedly “scientific” field that ultimately led to the subjugation of many groups.
This illustrates a variant of something known as p-hacking.
As we’ve seen, a p-value is essentially a measure of evidence against a null hypothesis and in favor of an alternative.
We’ve also seen that p-values are not immune to the effects of randomness.
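To see how randomness alone can hand us “significant” results, here is a minimal simulation sketch (my own illustration; the number of tests, the sample sizes, and the seed are arbitrary). Every comparison below is between two samples drawn from the same distribution, so the null hypothesis is true every single time; yet, just by chance, we should expect roughly 5% of the tests to come out significant at the 0.05 level, and reporting only those “wins” is the essence of p-hacking.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=100)

# Run 20 two-sample t-tests in which the null hypothesis is TRUE:
# both groups come from the same N(0, 1) distribution.
n_tests, n_per_group = 20, 30
p_values = []
for _ in range(n_tests):
    group_a = rng.normal(loc=0, scale=1, size=n_per_group)
    group_b = rng.normal(loc=0, scale=1, size=n_per_group)
    p_values.append(stats.ttest_ind(group_a, group_b).pvalue)
p_values = np.array(p_values)

# "p-hacking": run many tests, then report only the ones that look significant.
print(f"Tests run: {n_tests}")
print(f"Significant at the 0.05 level: {(p_values < 0.05).sum()}")
```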
Let’s Discuss!
What are some potential ramifications of p-hacking? Why might we consider p-hacking “unethical”?
As another example of bad (or, at least, needs-improvement) statistical practice, let’s consider a fairly recent paper by Drs. Olivia McGough, Daniela Witten, and Daniel Kessler.
Essentially, it is somewhat common statistical practice to report only “interesting” (i.e., significant) results, and to simply not report results that are not significant.
In the context of regression (specifically, determining which covariates in an MLR model are statistically significant), the authors call this practice F-screening.
The authors point out that F-screening can actually cause us to lose standard theoretical guarantees, such as control of the Type I error rate for subsequent tests.
They propose modified versions of standard statistical tests and practices to mitigate against these negative effects.
The paper itself is quite well-written, and I encourage you to read through it (provided you have some 120B and 126 knowledge).
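To get some intuition for why screening on significance breaks the usual guarantees, here is a rough simulation sketch. To be clear, this is not the authors’ analysis; it simply mimics the F-screening idea under a global null (all slope coefficients equal to zero) and checks how often the individual coefficient t-tests reject, both unconditionally and after keeping only the datasets whose overall F-test was significant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=100)

n, p, n_sims, alpha = 50, 3, 2000, 0.05
rejections_all = rejections_screened = n_screened = 0

for _ in range(n_sims):
    # Generate data under the global null: y is pure noise, unrelated to X.
    X = sm.add_constant(rng.normal(size=(n, p)))
    y = rng.normal(size=n)
    fit = sm.OLS(y, X).fit()

    # Count t-test rejections among the (truly null) slope coefficients.
    rejects = int((fit.pvalues[1:] < alpha).sum())
    rejections_all += rejects

    # F-screening: only "report" the t-tests when the overall F-test is significant.
    if fit.f_pvalue < alpha:
        n_screened += 1
        rejections_screened += rejects

print("Rejection rate per coefficient, all datasets:      ",
      round(rejections_all / (n_sims * p), 3))
print("Rejection rate per coefficient, after F-screening: ",
      round(rejections_screened / (n_screened * p), 3))
```

The first number should hover around 0.05, while the second is typically much larger, which is precisely the sense in which naive inference after F-screening is no longer valid.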
In some cases, these practices can actually create apparent biases that didn’t actually exist.
Hopefully these two examples (plus the examples outlined in the assigned reading for today’s lecture) demonstrate the need for a unified set of principles of data science ethics.
Indeed, there have been several attempts at producing such a set of principles.
Two particularly famous ones (that we will discuss today):
Use data to improve life for our users, customers, organizations, and communities
Create reproducible and extensible work
Build teams with diverse ideas, backgrounds, and strengths
Prioritize the continuous collection and availability of discussions and metadata
Clearly identify the questions and objectives that drive each project and use to guide both planning and refinement.
Be open to changing our methods and conclusions in response to new knowledge.
Recognize and mitigate bias in ourselves and in the data we use.
Present our work in ways that empower others to make better-informed decisions.
Consider carefully the ethical implications of choices we make when using data, and the impacts of our work on individuals and society.
Respect and invite fair criticism while promoting the identification and open discussion of errors, risks, and unintended consequences of our work.
Protect the privacy and security of individuals represented in our data.
Help others to understand the most useful and appropriate applications of data to solve real-world problems.
Your Turn! (Exercise 1 from Chapter 8 of our Textbook)
A researcher is interested in the relationship of weather to sentiment (positivity or negativity of posts) on Twitter. They want to scrape data from https://www.wunderground.com and join that to Tweets in that geographic area at a particular time. One complication is that Weather Underground limits the number of data points that can be downloaded for free using their API (application program interface). The researcher sets up six free accounts to allow them to collect the data they want in a shorter time-frame. What ethical guidelines are violated by this approach to data scraping?
A term you will hear a lot throughout statistics and data science is reproducibility.
Essentially, reproducible results are those that can be replicated (i.e., found again) given the same tools used to initially create them.
Part of ensuring your research is reproducible is documentation.
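As a small, concrete illustration (my own sketch, not from the textbook), two simple documentation habits go a long way: fix the random seed so a stochastic analysis produces the same numbers every time it is re-run, and record the software versions used alongside the results.

```python
import sys
import numpy as np

# Fix the seed so anyone re-running this script obtains the exact same sample.
rng = np.random.default_rng(seed=100)
sample = rng.normal(loc=0, scale=1, size=500)
print(f"Sample mean: {sample.mean():.4f}")

# Record the computational environment alongside the results,
# so others know which tools are needed to reproduce them.
print(f"Python {sys.version.split()[0]}, numpy {np.__version__}")
```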
On a related note, as the authors of our textbook point out, “Data science professionals have an ethical obligation to use tools that are reliable, verifiable, and conducive to reproducible data analysis.”
As a somewhat extreme example, consider the programming language SAS: pharmaceutical researchers almost exclusively program in SAS. Why?
On the flip side, R is touted (and often celebrated) as being very open source.
This has the distinct advantage of making R cutting-edge, but it has also been the source of some criticism against the programming language itself.
For those unaware, the Comprehensive R Archive Network (CRAN) is the widely-accepted sole database of “approved” R packages.
The process of getting an R package uploaded to CRAN is very extensive, involving lots of checks, meaning that packages uploaded to CRAN are often very reliable. (CRAN Repository Policy)
Another key concept in the realm of data science ethics is that of privacy.
In certain cases, the observational units (remember these?) of a particular dataset may not want to be uniquely revealed.
As such, you will often encounter data that has been privatized in some way (e.g. data that has been stripped of all potentially identifying information like name, gender, address, etc.)
Sometimes data is aggregated for privacy purposes.
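Here is a minimal sketch of both ideas using an entirely made-up dataset (the column names and values are invented for illustration): first strip directly identifying columns, then report only group-level aggregates rather than individual rows.

```python
import pandas as pd

# An entirely made-up dataset containing directly identifying columns.
raw = pd.DataFrame({
    "name":    ["Ada", "Ben", "Cam", "Dee", "Eli"],
    "address": ["1 A St", "2 B St", "3 C St", "4 D St", "5 E St"],
    "city":    ["Goleta", "Goleta", "Ventura", "Ventura", "Ventura"],
    "income":  [52_000, 61_000, 58_000, 75_000, 67_000],
})

# De-identification: drop columns that directly identify individuals.
deidentified = raw.drop(columns=["name", "address"])

# Aggregation: report only group-level summaries, not individual rows.
aggregated = deidentified.groupby("city").agg(
    n=("income", "size"),
    mean_income=("income", "mean"),
)
print(aggregated)
```

Keep in mind that dropping obvious identifiers is not, by itself, a guarantee of privacy; combinations of the remaining columns (so-called quasi-identifiers) can sometimes still single out individuals.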
To close out, I’d like to take a few minutes to explore a fairly famous (infamous?) case study.
In recent years, we have observed an increase in the use of machine learning algorithms to aid decisions about individuals in the justice system.
A major element of this is modeling recidivism rates (roughly, the rates at which released individuals re-offend).
One algorithm/software used in this context is the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system.
In 2016, ProPublica published a groundbreaking article highlighting several ways in which the COMPAS algorithm exhibited concerning signs of racial bias.
You can read the article here, and a more detailed breakdown of the authors’ analyses here.
With our PSTAT 100 knowledge, we actually have the tools to explore and replicate some of the authors’ findings, an endeavor we’ll start today in lecture and that you’ll finish on the Bonus Lab (should you choose to complete it).
First, let’s understand what the COMPAS system does.
Subjects are administered a 137-question questionnaire; the algorithm then takes their answers and returns a score from 1 to 10 indicating how likely they are to recidivate (1 being the least likely, 10 being the most likely).
One of the things that ProPublica did was examine the COMPAS ratings of over 10,000 criminal defendants in Broward County, Florida, and track whether or not they actually recidivated.
Given that the COMPAS algorithm is essentially a classification model, this leads us naturally into considering our classification error rates.
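Here is a small sketch of how such error rates could be computed with pandas. The column names (decile_score, recidivated, race), the file name, and the cutoff of 5 for calling someone “high risk” are assumptions for illustration, not a description of ProPublica’s exact pipeline.

```python
import pandas as pd

def error_rates(df, score_col="decile_score", outcome_col="recidivated", threshold=5):
    """False positive and false negative rates, treating scores >= threshold as 'high risk'."""
    predicted_high = df[score_col] >= threshold
    actual = df[outcome_col].astype(bool)
    fpr = (predicted_high & ~actual).sum() / (~actual).sum()  # flagged high risk, did not recidivate
    fnr = (~predicted_high & actual).sum() / actual.sum()     # flagged low risk, but did recidivate
    return pd.Series({"FPR": fpr, "FNR": fnr})

# Hypothetical usage, assuming a CSV of scores and observed outcomes:
# scores = pd.read_csv("compas_scores.csv")          # columns: decile_score, recidivated, race, ...
# print(error_rates(scores))                         # overall error rates
# print(scores.groupby("race").apply(error_rates))   # error rates broken down by group
```

Comparing these error rates across groups is exactly the kind of comparison at the heart of ProPublica’s findings.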
Check Your Understanding
In today’s Lab (the Bonus Lab), I ask you to delve a little deeper into ProPublica’s analysis.
One thing to note is that the COMPAS algorithm is considered proprietary and therefore is not released to the public.
“Black-box” algorithms like this are very prevalent in our modern-day society.
Dr. Cathy O’Neil has coined the term “Weapon of Math Destruction” (WMD) to refer to some of these types of algorithms.
In her book titled Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Dr. O’Neil proposes three criteria by which an algorithm can be classified as a WMD: Opacity (Are details available upon request?), Scale (Is the algorithm being used in broad and far-reaching situations?), and Damage (Who is hurt by this algorithm, and how?).
Let’s see how the COMPAS algorithm fares on these criteria:
Hence, the COMPAS algorithm would likely count as a WMD.
Finally, I would be remiss not to at least mention ChatGPT.
At its core, ChatGPT utilizes a model (algorithm).
OpenAI (the company behind ChatGPT) provides information on how the model is trained.
Food For Thought
Is ChatGPT a Weapon of Math Destruction?
As data scientists, it can be easy for us to mentally remove ourselves from the context in which our research will be applied.
I encourage you to combat this: always try to maintain a sense of how your research may impact others.
A quote from the ProPublica article, from one of COMPAS’ original creators, Tim Brennan:
“I wanted to stay away from the courts,” Brennan said, explaining that his focus was on reducing crime rather than punishment. “But as time went on I started realizing that so many decisions are made, you know, in the courts. So I gradually softened on whether this could be used in the courts or not.” (ProPublica, 2016)
Tip
Awareness can go a long way.
So… where do we go from here?
Well, as I’ve said a few times, my hope was to structure PSTAT 100 like a “table of contents” of Data Science, giving you a brief introduction to a variety of topics.
The good news is that there is a lot more to learn about these topics!
Indeed, there exist several classes in our own department that you can look into taking if you’re interested to learn more.
As John Tukey famously put it, “The best thing about being a statistician is that you get to play in everyone’s backyard.”
Please don’t forget to take a Hex Sticker!
PSTAT 100 - Data Science: Concepts and Analysis, Summer 2025 with Ethan P. Marzban