Project Walkthrough

Data Science Collaborative; Spring 2025

Ethan P. Marzban

Department of Statistics and Applied Probability; UCSB

April 24, 2025

Data Science Lifecycle

First Version

  • The Data Science Lifecycle (DSL) seeks to describe the general lifecycle of a typical data science project.

  • Four main stages: questioning, collecting, analyzing, and interpreting.

  • Lots of variations of the DSL, some with more steps than others.

  • Main idea: data science projects are highly iterative.

Data Science Lifecycle

Second Version

  • In my opinion, this is a much better representation of the DSL! (Though the graphic is certainly too complex.)

  • Again, data science is a highly iterative field; we rarely proceed in a linear fashion from start to finish.

    • Rather, we start, analyze our data, realize we need more data, collect more data, analyze our new data, realize we need to revise our original question, etc.

Data Science Lifecycle

Starting the Cycle

  • Sometimes we’ll begin with a question we want to answer.

    • E.g. “Has air quality in the US improved over time?”

    • E.g. “How has the distribution of wealth and income changed since the economic recession of 2008?”

Data Science Lifecycle

Starting the Cycle

  • In this case, our question will dictate what kind of data to collect.
    • E.g. AQI data
    • E.g. Income Data; Federal Bank Data; etc.
  • In other cases, we’ll start with a dataset, which will then inform what question we want to ask.
    • Limitations in our dataset may also necessitate changes in our question; we’ll return to this point in a few lectures.

Data Science Lifecycle

Traveling Through the Cycle

  • Once we have our data, we need to analyze it.
  • This might involve data cleaning or data tidying; this could also involve producing appropriate visualizations.
    • At this stage, we may perform Exploratory Data Analysis (EDA).
    • In certain cases, we may find it useful to apply techniques from machine learning to better understand our dataset.
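
As a rough illustration of the cleaning and EDA steps above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the AQI readings are made up, and the column names and the helpers `clean_rows` and `mean_aqi_by_year` are hypothetical. It drops unusable readings, converts types, and then computes one simple summary per year.

```python
import statistics
from collections import defaultdict

# Hypothetical raw records: made-up values for illustration only.
RAW_ROWS = [
    {"year": "2019", "aqi": "52"},
    {"year": "2019", "aqi": ""},      # missing reading
    {"year": "2019", "aqi": "48"},
    {"year": "2020", "aqi": "n/a"},   # non-numeric placeholder
    {"year": "2020", "aqi": "41"},
    {"year": "2020", "aqi": "45"},
]


def clean_rows(rows):
    """Cleaning step: drop rows whose AQI is missing or non-numeric."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"year": int(row["year"]), "aqi": float(row["aqi"])})
        except ValueError:
            continue  # skip readings we cannot convert to numbers
    return cleaned


def mean_aqi_by_year(rows):
    """EDA step: a simple summary statistic (mean AQI) for each year."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["year"]].append(row["aqi"])
    return {year: statistics.mean(values) for year, values in sorted(groups.items())}


cleaned = clean_rows(RAW_ROWS)
summary = mean_aqi_by_year(cleaned)
print(len(cleaned), "usable rows out of", len(RAW_ROWS))
print(summary)
```

In a real project, silently dropping rows is itself a decision worth reporting: how many readings were lost, and from which years, can change how much we trust the summary.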

Data Science Lifecycle

Traveling Through the Cycle

  • Finally, we need to understand what our data is saying.
  • This will typically involve answering our question(s); oftentimes we’ll take things a step further and see if we can use our data to make sense of the world.
  • A key component of this stage of the DSL is producing some sort of a report or presentation.

Formulating a Research Question

Some Tips

  • Formulating a good research question is a balancing act.
    • On the one hand, your question should be specific enough to be answerable.
    • On the other, yes/no questions tend to make relatively uninteresting research questions.
      • Sometimes, a collection of yes/no questions can be combined to create a more interesting research question.
  • Finally, make sure you are setting reasonable expectations with your question.
    • After all, we only have a finite number of hours in a day, and a finite number of days before the showcase!

Formulating a Research Question

First Example

Question: First Pass

Have global temperatures increased in the past decade?

  • Let’s start with positives.
    • It’s definitely applicable!
    • It includes a specific time frame for investigation (“past decade”).
  • However, it is a bit too specific.
    • All we need to do is plot temperature over time, and the answer will almost certainly be “yes”.
    • We can be a bit more ambitious than this!

Formulating a Research Question

First Example

Question: Second Pass

How have global temperatures increased in the past decade?

  • This is a much “better” question.
    • Somewhat open-ended; leaves something to actually be done!
    • We can answer it using a graph, statistical hypothesis testing, or a wide array of other tools.
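
To make the "how" version of the question concrete, here is a minimal sketch of one possible approach: estimate the rate of change with a least-squares slope, then check it with a permutation test. This is only one of many reasonable tools, not a prescribed method. The temperature anomalies below are made-up numbers rather than real measurements, and the helpers `ols_slope` and `permutation_p_value` are hypothetical names introduced for this example.

```python
import random
import statistics

# Made-up yearly temperature anomalies (degrees C) for illustration only.
YEARS = list(range(2015, 2025))
ANOMALIES = [0.10, 0.14, 0.12, 0.19, 0.21, 0.20, 0.26, 0.27, 0.31, 0.33]


def ols_slope(x, y):
    """Least-squares slope of y on x (change in y per unit of x)."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """One-sided permutation test: how often does shuffling the y values
    produce a slope at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = ols_slope(x, y)
    shuffled = list(y)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ols_slope(x, shuffled) >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (count + 1) / (n_perm + 1)


slope = ols_slope(YEARS, ANOMALIES)
p_value = permutation_p_value(YEARS, ANOMALIES)
print(f"Estimated trend: {slope * 10:.3f} degrees C per decade")
print(f"Permutation p-value: {p_value:.4f}")
```

Notice that the output is a rate rather than a "yes" or "no", which is exactly what the second-pass question asks for.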

Formulating a Research Question

Second Example

Question: First Pass

What have been the effects of Global Warming in the past decade?

  • This is definitely not too specific; in fact, it might be a bit too nonspecific…

  • Now, this might be a perfectly good question to start with.

    • But, as you conduct your analyses, I’d encourage you to start fine-tuning the question a bit more.
    • Are you going to focus on temperature? Carbon emissions? The impact on wildlife? The impact on people?
    • Trying to answer all of these, as interesting as it may be, will not be feasible in the next few weeks.

Formulating a Research Question

Third Example

Question: First Pass

Do UCLA students tend to, on average, have longer commute times than UCSB students?

  • Let’s discuss this one together: what are your thoughts?

Collecting Data

  • The next stage in the DSL is to collect data.

  • Google is a great place to start!

  • Another popular site is Kaggle.

  • If you’re struggling with creating a research question, you can always start by finding an interesting dataset, and then formulating a question from that!

  • Let’s run through an example of that.