Lab 00: Dataframe Basics

PSTAT 100: Spring 2024 (Instructor: Ethan P. Marzban)

Author

Ethan P. Marzban

Published

October 15, 2024

Required Packages

library(tidyverse)

Lab Objectives

This lab covers the following topics:

Basics of dataframes in R

Relevant Textbook Chapters/Sections:

Portions of Chapter 27 in R4DS

Recap of Dataframe Basics

Recall that two of the data structures in R are dataframes and tibbles. (There are a few minor differences between tibbles and dataframes, but for the most part we can think of the two structures as equivalent.)
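To make the relationship between the two structures concrete, here is a minimal sketch. The object names demo_df and demo_tbl are made up for illustration, and the code assumes the tibble package (part of the tidyverse) is installed:

```r
library(tibble)  # part of the tidyverse

# A plain data frame and its tibble counterpart (hypothetical example objects)
demo_df  <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
demo_tbl <- as_tibble(demo_df)

class(demo_df)           # "data.frame"
class(demo_tbl)          # "tbl_df" "tbl" "data.frame"
is.data.frame(demo_tbl)  # TRUE: a tibble is a data frame with extra classes
```

Because a tibble still carries the data.frame class, most of what follows in this lab should apply to both structures.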

Loosely speaking, a dataframe is a tabular arrangement of values, consisting of rows and columns. We saw in the Intro 2 R lab that one way to create data frames is using the data.frame() function:

my_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

my_df

  col1  col2
1    2 hello
2    4 happy
3    6 world

Note that data frames are created by columns, not by rows.
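One consequence of building by columns is that each column keeps its own type. The sketch below checks this on a copy of the example (named demo_df here so that my_df is left untouched), assuming R 4.0 or later, where strings are not converted to factors by default:

```r
# Rebuild the example under a separate (hypothetical) name
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# Each column has its own class; in R >= 4.0 col1 is "numeric", col2 is "character"
sapply(demo_df, class)

# Number of rows, then number of columns
dim(demo_df)
# [1] 3 2
```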

Accessing Values

Once we have a dataframe created, we might like to access different elements of said dataframe. There are several ways to do this.

Using Slicing/Indexing

If we have a dataframe called df, the command df[i, j] extracts the entry at the i-th row and the j-th column. For example:

my_df[1, 2]

[1] "hello"

We can select multiple columns and/or rows by passing in a vector of values on either side of the comma:

my_df[c(1, 2), 2]

[1] "hello" "happy"
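Vectors can be supplied on both sides of the comma at once, and negative indices drop entries rather than select them. A small sketch on a separate copy of the example (demo_df is a made-up name):

```r
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# Rows 1 and 3, columns 1 and 2
demo_df[c(1, 3), c(1, 2)]

# A negative index removes row 2, leaving the same two rows
demo_df[-2, ]
```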

If we want to extract all elements of row i, we can simply leave the column index blank:

my_df[1, ]

  col1  col2
1    2 hello

If we want to extract all elements of column j, we can simply leave the row index blank:

my_df[, 2]

[1] "hello" "happy" "world"

Note

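It is important to note that a single column extracted from a dataframe using [, j] is stored as a vector, whereas an extracted row (which may mix types across columns) is returned as a one-row dataframe: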

is.vector(my_df[,2])

[1] TRUE
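We can check what each kind of extraction returns, and use the drop argument of the bracket operator to keep a single column as a dataframe. This is a sketch on a separate copy of the example (demo_df is a made-up name):

```r
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

is.data.frame(demo_df[1, ])                # TRUE: a row keeps its dataframe structure
is.vector(demo_df[, 2])                    # TRUE: a single column is simplified to a vector
is.data.frame(demo_df[, 2, drop = FALSE])  # TRUE: drop = FALSE prevents the simplification
```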

A useful thing to note is that the syntax a:b, where a and b are integers satisfying a < b, generates the sequence of consecutive integers starting at a and ending at b:

3:10

[1]  3  4  5  6  7  8  9 10

Another way to generate sequences in R is to use the seq() function, which allows you to specify a start value, a stop value, and either the amount of space between successive values in the sequence or the number of elements to be included in the sequence:

seq(0, 1, by = 0.1)

 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

seq(0, 1, length.out = 11)

 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

seq(0, 1, length.out = 10)

 [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
 [8] 0.7777778 0.8888889 1.0000000
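Sequences like these are handy as row or column indices. The sketch below applies them to a separate copy of the example dataframe (demo_df is a made-up name):

```r
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# Rows 2 through 3
demo_df[2:3, ]

# Every other row, starting from row 1 (here: rows 1 and 3)
demo_df[seq(1, 3, by = 2), ]

# seq_len() builds the indices 1, 2, ..., n from the number of rows
seq_len(nrow(demo_df))
# [1] 1 2 3
```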

Using Column Names

We can also access individual columns of a dataframe by using the $ operator, followed by the name of the column. For example:

my_df$col1

[1] 2 4 6

We can further subset by indexing on the selected column (again, remember that columns extracted from a dataframe are stored as vectors):

my_df$col1[2:3]

[1] 4 6
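A close relative of the $ operator is the double-bracket operator [[ ]], which takes the column name as a string. This matters when the name is stored in a variable. A sketch, with demo_df and col_name as made-up names:

```r
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# Same result as demo_df$col1
demo_df[["col1"]]
# [1] 2 4 6

# [[ ]] works with a name held in a variable...
col_name <- "col2"
demo_df[[col_name]]
# [1] "hello" "happy" "world"

# ...whereas $ looks for a column literally called "col_name" and finds none
demo_df$col_name
# NULL
```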

The $ operator extracts only one column at a time. If we want to select multiple columns by name, one convenient option is the select() function from the dplyr package (contained in the tidyverse):

select(my_df, c(col1, col2))

  col1  col2
1    2 hello
2    4 happy
3    6 world
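Base R can do the same job with bracket indexing and a character vector of column names. A sketch on a three-column example (demo_df and its columns are made up for illustration):

```r
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world"),
  col3 = c(TRUE, FALSE, TRUE)
)

# Select two columns by name; the result is still a dataframe
demo_df[, c("col1", "col3")]

# With no comma, a single name also returns a (one-column) dataframe
demo_df["col2"]
```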

Modifying Values

Updating/Replacing Values

To replace an already-existing element in a dataframe with another value, we can access the value and then use the variable assignment operator (<-) to overwrite the previous value. For example:

my_df[1, 2] <- "greetings"
my_df

  col1      col2
1    2 greetings
2    4     happy
3    6     world
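The same overwrite works through the $ syntax, and a logical condition can stand in for the index to replace several values at once. A sketch on a separate copy of the example (demo_df is a made-up name, and the threshold of 2 is arbitrary):

```r
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# Overwrite a single entry of a column selected by name
demo_df$col2[1] <- "greetings"

# Overwrite every entry of col1 that exceeds 2
demo_df$col1[demo_df$col1 > 2] <- 0

demo_df$col1
# [1] 2 0 0
```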

Adding Columns

To add a column, simply use the $ syntax to pretend you were accessing the column (even though it doesn’t exist yet), and then use the variable assignment operator to pass in a set of values:

my_df$col3 <- c("red", "green", "blue")
my_df

  col1      col2  col3
1    2 greetings   red
2    4     happy green
3    6     world  blue

What happens if we try to add a column that has more values than there are rows in our dataframe? Well, let’s see:

my_df$col4 <- c(TRUE, FALSE, TRUE, FALSE)

Error in `$<-.data.frame`(`*tmp*`, col4, value = c(TRUE, FALSE, TRUE, : replacement has 4 rows, data has 3

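So, this is something important to note: when adding a column to a dataframe, the number of values you are adding should match the number of rows in the dataframe. (The one exception is recycling: a shorter vector, such as a single value, is repeated to fill the column when its length evenly divides the number of rows.)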
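To see how base R treats replacement lengths, here is a small sketch on a separate three-row example. The names demo_df, flag, bad, and outcome are made up, and tryCatch() is used only so the failing assignment does not halt the script:

```r
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# A single value is recycled to fill all 3 rows
demo_df$flag <- TRUE
demo_df$flag
# [1] TRUE TRUE TRUE

# Two values cannot be recycled evenly into 3 rows, so the assignment fails
outcome <- tryCatch(
  {
    demo_df$bad <- c(1, 2)
    "assigned"
  },
  error = function(e) "error"
)
outcome
# [1] "error"
```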

Say we really wanted to add a fourth column to our my_df dataframe, with the values c(TRUE, FALSE, TRUE, FALSE). We could simply add a fourth row of missing values (NA) to the already-existing dataframe, and then append the column:

my_df[4,] <- c(NA, NA, NA)
my_df$col4 <- c(TRUE, FALSE, TRUE, FALSE)
my_df

  col1      col2  col3  col4
1    2 greetings   red  TRUE
2    4     happy green FALSE
3    6     world  blue  TRUE
4   NA      <NA>  <NA> FALSE

Caution

There are pros and cons to doing this. On the one hand, we’ve successfully added all the values we wanted into our new column. On the other hand, we have done so at the cost of injecting missingness into our data. Depending on what we plan to do with the dataframe, this may or may not be a big deal, so think critically before doing something like this.
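One way to gauge how much missingness has crept in is to count the NA values and see how they affect a summary. A sketch on a small made-up dataframe (demo_df) that mimics the padded row:

```r
demo_df <- data.frame(
  numbers = c(2, 4, 6, NA),
  words   = c("greetings", "happy", "world", NA)
)

# Number of missing values in each column (one per column here)
colSums(is.na(demo_df))

# Which rows have no missing values at all
complete.cases(demo_df)
# [1]  TRUE  TRUE  TRUE FALSE

# A single NA makes the mean NA unless missing values are removed first
mean(demo_df$numbers)
# [1] NA
mean(demo_df$numbers, na.rm = TRUE)
# [1] 4
```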

Changing Column Names

The column names of our my_df dataframe are pretty uninformative. Let’s see if we can give the columns more interesting names!

To access the column names of a dataframe, we can use either names() or colnames():

colnames(my_df)

[1] "col1" "col2" "col3" "col4"

To rename our columns, we can simply assign (using the variable assignment operator) a new character vector of names:

colnames(my_df) <- c("numbers", "words", "colors", "booleans")
my_df

  numbers     words colors booleans
1       2 greetings    red     TRUE
2       4     happy  green    FALSE
3       6     world   blue     TRUE
4      NA      <NA>   <NA>    FALSE
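If only one column needs a new name, we can index into the vector of names rather than retyping all of them. A sketch on a separate copy of the original example (demo_df is a made-up name):

```r
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# Rename only the column currently called "col1"
names(demo_df)[names(demo_df) == "col1"] <- "numbers"

names(demo_df)
# [1] "numbers" "col2"
```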

We can, if we like, do something similar to assign names to the rows of our dataframe:

rownames(my_df) <- c("row1", "row2", "row3", "row4")
my_df

     numbers     words colors booleans
row1       2 greetings    red     TRUE
row2       4     happy  green    FALSE
row3       6     world   blue     TRUE
row4      NA      <NA>   <NA>    FALSE
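Once rows have names, those names can be used in place of row numbers when indexing. A sketch on a separate copy of the original example (demo_df is a made-up name):

```r
demo_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)
rownames(demo_df) <- c("row1", "row2", "row3")

# Index by row name and column name
demo_df["row2", "col2"]
# [1] "happy"

# Several rows by name, all columns
demo_df[c("row1", "row3"), ]
```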

Some Selected Exercises

All problems refer to the following table, which is meant to represent a (fake) ice cream store’s earnings:

`flavor`	`ppu`	`units_sold`
`chocolate`	1.5	400
`vanilla`	1.5	200
`ube`	2.0	250
`strawberry`	1.5	300

1. Create a dataframe called ice_cream_df that stores the information in the table above.
2. Extract the second and third elements of the second and third columns.
3. Suppose that a flavor was originally missing from the dataset: the store actually also sold a mint_cc flavor, which has a ppu (price per unit) value of 2.0 and sold 275 units. Update ice_cream_df to incorporate this information.
4. Append a column called money_earned, which lists the amount of money earned from sales of each flavor. (Assume that the money earned is simply the product of the price per unit and the number of units sold.)