Project Walkthrough

Data Science Collaborative; Spring 2025

Ethan P. Marzban

Department of Statistics and Applied Probability; UCSB

April 24, 2025

Data Science Lifecycle

The Data Science Lifecycle (DSL) describes the general stages of a typical data science project.
Four main stages: questioning, collecting, analyzing, and interpreting.
Lots of variations of the DSL, some with more steps than others.
Main idea: data science projects are highly iterative.

In my opinion, this is a much better representation of the DSL! (Though the graphic is certainly too complex.)
Again, data science is a highly iterative field; we rarely proceed in a linear fashion from start to finish.
- Rather, we start, analyze our data, realize we need more data, collect more data, analyze our new data, realize we need to revise our original question, etc.

Sometimes we’ll begin with a question we want to answer.
E.g. “Has air quality in the US improved over time?”
E.g. “How has the distribution of wealth and income changed since the economic recession of 2008?”

In this case, our question will dictate what kind of data to collect.
- E.g. AQI data
- E.g. Income Data; Federal Bank Data; etc.
In other cases, we’ll start with a dataset, which will then inform what question we want to ask.
- Limitations in our dataset may also necessitate changes in our question; we’ll return to this point in a few lectures.

Once we have our data, we need to analyze it.
This might involve data cleaning or data tidying; this could also involve producing appropriate visualizations.
- At this stage, we may perform Exploratory Data Analysis (EDA).
- In certain cases, we may find it useful to apply techniques from machine learning to better understand our dataset.
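
As a rough illustration of the cleaning and EDA steps above, here is a minimal sketch in Python using only the standard library. The records, the field names (`city`, `aqi`), and the helpers `clean_records` and `summarize_by_city` are all made up for illustration; a real project would likely lean on a dedicated data library instead.

```python
import statistics

# Hypothetical raw records: made-up values, including a missing and a malformed entry.
raw_records = [
    {"city": "A", "aqi": "42"},
    {"city": "A", "aqi": ""},        # missing value
    {"city": "B", "aqi": "57"},
    {"city": "B", "aqi": "n/a"},     # malformed value
    {"city": "B", "aqi": "63"},
]


def clean_records(records):
    """Keep only records whose 'aqi' field parses as a number."""
    cleaned = []
    for record in records:
        try:
            value = float(record["aqi"])
        except ValueError:
            continue  # drop rows we cannot interpret
        cleaned.append({"city": record["city"], "aqi": value})
    return cleaned


def summarize_by_city(records):
    """Return {city: (count, mean AQI)} as a very simple EDA summary."""
    groups = {}
    for record in records:
        groups.setdefault(record["city"], []).append(record["aqi"])
    return {city: (len(vals), statistics.mean(vals)) for city, vals in groups.items()}


cleaned = clean_records(raw_records)
summary = summarize_by_city(cleaned)
print(summary)
```

Dropping unparseable rows is only one possible cleaning choice; depending on the question, it may be better to investigate why the values are missing.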

Finally, we need to understand what our data is saying.
This will typically involve answering our question(s); oftentimes we’ll take things a step further and see if we can use our data to make sense of the world.
A key component of this stage of the DSL is producing some sort of report or presentation.

Formulating a good research question is a balancing act.
- On the one hand, your question should be specific enough to be answerable.
- On the other, yes/no questions tend to make relatively uninteresting research questions.
  - Sometimes, a collection of yes/no questions can be combined to create a more interesting research question.
Finally, make sure you are setting reasonable expectations with your question.
- After all, we only have a finite number of hours in a day, and a finite number of days before the showcase!

Question: First Pass

Have global temperatures increased in the past decade?

Let’s start with positives.
- It’s definitely applicable!
- It includes a specific time frame for investigation (“past decade”).
However, it is a bit too specific.
- All we need to do is plot temperature over time, and we will certainly see the answer to be “yes”.
- We can be a bit more ambitious than this!

Question: Second Pass

How have global temperatures increased in the past decade?

This is a much “better” question.
- Somewhat open-ended; leaves something to actually be done!
- We can answer using a graph, using statistical hypothesis testing, or a wide array of other tools.

Question: First Pass

What have been the effects of Global Warming in the past decade?

This is definitely not too specific; in fact, it might be a bit too nonspecific…
Now, this might be a perfectly good question to start with.
- But, as you conduct your analyses, I’d encourage you to start fine-tuning the question a bit more.
- Are you going to focus on temperature? Carbon emissions? The impact on wildlife? The impact on people?
- Trying to answer all of these, as interesting as it may be, will not be feasible in the next few weeks.

Question: First Pass

Do UCLA students have longer commute times, on average, than UCSB students?
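
Once data is in hand, a question like this could be approached by comparing sample means. The sketch below computes Welch's t statistic by hand using the standard library; the commute times are made-up survey responses, and `welch_t` is a hypothetical helper written for this illustration.

```python
import math
import statistics

# Hypothetical survey responses: commute times in minutes (made-up values).
ucla_minutes = [35, 50, 20, 45, 60, 30, 40]
ucsb_minutes = [15, 25, 10, 30, 20, 35, 15]


def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference in means (unequal variances)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


difference = statistics.mean(ucla_minutes) - statistics.mean(ucsb_minutes)
t_stat = welch_t(ucla_minutes, ucsb_minutes)
print(f"Difference in sample means: {difference:.1f} minutes, t = {t_stat:.2f}")
```

Turning the t statistic into a p-value requires the t distribution, which the standard library does not provide; a statistics package would be the usual route.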

The next stage in the DSL is to collect data.
Google is a great place to start!
Another popular site is Kaggle.
If you’re struggling with creating a research question, you can always start by finding an interesting dataset, and then formulating a question from that!
Let’s run through an example of that.