Lab 00: Dataframe Basics
PSTAT 100: Spring 2024 (Instructor: Ethan P. Marzban)
This lab is long! Use the floating table of contents (at the top-right of the screen) to jump to sections as needed.
Required Packages
library(tidyverse)
Lab Objectives
This lab covers the following topics:
- Basics of dataframes in R
Relevant Textbook Chapters/Sections:
- Portions of Chapter 27 in R4DS
Recap of Dataframe Basics
Recall that two of the data structures in R are dataframes and tibbles. (There are a few minor differences between tibbles and dataframes, but for the most part we can think of the two structures as equivalent.)
Loosely speaking, a dataframe is a tabular arrangement of values, consisting of rows and columns. We saw in the Intro 2 R lab that one way to create dataframes is using the data.frame() function:
my_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)
my_df
  col1  col2
1    2 hello
2    4 happy
3    6 world
Note that data frames are created by columns, not by rows.
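To make the column-wise construction concrete, here is a minimal sketch that rebuilds my_df and inspects its shape and column types. The choice of dim(), str(), and sapply() as inspection tools is ours, not something the lab prescribes.

```r
# Each argument to data.frame() is one column; all columns share the same length.
my_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

dim(my_df)             # number of rows, then number of columns
str(my_df)             # one line per column, showing its type and first values
col_classes <- sapply(my_df, class)  # class of each column, by name
col_classes
```

Because each column is a vector of a single type, a dataframe can mix types across columns but not within one.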
Accessing Values
Once we have a dataframe created, we might like to access different elements of said dataframe. There are several ways to do this.
Using Slicing/Indexing
If we have a dataframe called df, the command df[i, j] extracts the entry at the ith row and the jth column. For example:
my_df[1, 2]
[1] "hello"
We can select multiple columns and/or rows by passing in a vector of values on either side of the comma:
my_df[c(1, 2), 2]
[1] "hello" "happy"
If we want to extract all elements of row i, we can simply leave the column index blank:
my_df[1, ]
  col1  col2
1    2 hello
If we want to extract all elements of column j, we can simply leave the row index blank:
my_df[, 2]
[1] "hello" "happy" "world"
It is important to note that a single column extracted this way is returned as a vector, whereas a single row (such as my_df[1, ] above) is returned as a one-row dataframe:
is.vector(my_df[, 2])
[1] TRUE
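The following sketch checks what kind of object each extraction returns for a base R dataframe. The drop = FALSE argument is an addition of ours; it is a standard argument of bracket indexing on dataframes that keeps a single column as a dataframe. Tibbles behave differently here, which is one of the minor differences mentioned earlier.

```r
my_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

col_2 <- my_df[, 2]   # a single column: a plain character vector
row_1 <- my_df[1, ]   # a single row: still a dataframe with one row

is.vector(col_2)       # TRUE
is.data.frame(row_1)   # TRUE

# drop = FALSE keeps a single column as a one-column dataframe
col_2_df <- my_df[, 2, drop = FALSE]
is.data.frame(col_2_df)  # TRUE
```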
A useful thing to note is that the syntax a:b, where a and b are integers satisfying a < b, generates the sequence of consecutive integers starting at a and ending at b:
3:10
[1] 3 4 5 6 7 8 9 10
Another way to generate sequences in R is to use the seq() function, which allows you to specify a start value, a stop value, and either the amount of space between successive values in the sequence (by) or the number of elements to be included in the sequence (length.out, which may be abbreviated to length):
seq(0, 1, by = 0.1)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(0, 1, length = 11)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
seq(0, 1, length = 10)
[1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
[8] 0.7777778 0.8888889 1.0000000
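Sequences are handy as index vectors when slicing. The sketch below, which assumes the my_df from earlier in the lab, uses a:b to take consecutive rows and seq() to take every other row. The name odd_rows is purely illustrative.

```r
my_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# Rows 2 through 3, all columns
last_two <- my_df[2:3, ]
last_two

# Every other row, starting from the first: indices 1 and 3
odd_rows <- seq(1, nrow(my_df), by = 2)
odd_words <- my_df[odd_rows, "col2"]
odd_words
```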
Using Column Names
We can also access individual columns of a dataframe by using the $ operator, followed by the name of the column. For example:
my_df$col1
[1] 2 4 6
We can further subset by indexing on the selected column (again, remember that columns extracted from a dataframe are stored as vectors):
my_df$col1[2:3]
[1] 4 6
The $ operator selects only one column at a time. To select multiple columns by name, one option is the select() function from the dplyr package (contained in the tidyverse); base R indexing with a character vector, as in my_df[, c("col1", "col2")], also works:
select(my_df, c(col1, col2))
  col1  col2
1    2 hello
2    4 happy
3    6 world
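For comparison, here is a sketch of name-based selection using only base R, with no packages loaded. The third column, col3, is a hypothetical addition so that selecting two columns actually drops something.

```r
my_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world"),
  col3 = c(TRUE, FALSE, TRUE)   # hypothetical extra column for illustration
)

# A character vector of names in the column position selects those columns
both <- my_df[, c("col1", "col2")]
both

# Double brackets with a single name return the column as a vector, like $
one <- my_df[["col1"]]
one
```

The double-bracket form is useful when the column name is stored in a variable, which $ does not support.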
Modifying Values
Updating/Replacing Values
To replace an already-existing element in a dataframe with another value, we can access the value and then use the variable assignment operator (<-) to overwrite the previous value. For example:
my_df[1, 2] <- "greetings"
my_df
  col1      col2
1    2 greetings
2    4     happy
3    6     world
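The same access-then-assign pattern extends to replacing several values at once. As a sketch, a logical condition can stand in for the row index; the threshold of 3 and the replacement values below are arbitrary choices for illustration.

```r
my_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# Overwrite every col1 value greater than 3 (arbitrary threshold) with 0
my_df$col1[my_df$col1 > 3] <- 0

# Overwrite col2 in whichever rows currently hold "happy"
my_df[my_df$col2 == "happy", "col2"] <- "joyful"

my_df
```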
Adding Columns
To add a column, simply use the $ syntax to pretend you were accessing the column (even though it doesn’t exist yet), and then use the variable assignment operator to pass in a set of values:
my_df$col3 <- c("red", "green", "blue")
my_df
  col1      col2  col3
1    2 greetings   red
2    4     happy green
3    6     world  blue
What happens if we try to add a column that has more values than rows in our dataframe? Well, let’s see:
my_df$col4 <- c(TRUE, FALSE, TRUE, FALSE)
Error in `$<-.data.frame`(`*tmp*`, col4, value = c(TRUE, FALSE, TRUE, : replacement has 4 rows, data has 3
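So, this is something important to note: when adding a column to a dataframe, the number of values you are adding should match the number of rows in the dataframe. (The exception is that R recycles a shorter vector whose length evenly divides the number of rows, such as a single value.)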
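One nuance worth checking by hand: a single value is recycled to fill every row, while a vector that is too long is rejected. The sketch below wraps the failing assignment in tryCatch() so the script keeps running; the column name flag is hypothetical.

```r
my_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# A single value is recycled to all 3 rows
my_df$flag <- TRUE

# A length-4 vector does not fit 3 rows; capture the error message instead of stopping
result <- tryCatch(
  {
    my_df$col4 <- c(TRUE, FALSE, TRUE, FALSE)
    "ok"
  },
  error = function(e) conditionMessage(e)
)
result
names(my_df)   # col4 was not added
```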
Say we really wanted to add a fourth column to our my_df dataframe, with the values c(TRUE, FALSE, TRUE, FALSE). We could simply add a fourth row of missing values (NA) to the already-existing dataframe, and then append the column:
my_df[4, ] <- c(NA, NA, NA)
my_df$col4 <- c(TRUE, FALSE, TRUE, FALSE)
my_df
  col1      col2  col3  col4
1    2 greetings   red  TRUE
2    4     happy green FALSE
3    6     world  blue  TRUE
4   NA      <NA>  <NA> FALSE
There are pros and cons to doing this. On the one hand, we’ve successfully added all the values we wanted into our new column. However, we have done so at the cost of injecting missingness into our data. Depending on what we plan to do with the dataframe, this may or may not be a big deal, so think critically before doing something like this.
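To see what injected missingness costs in practice, the sketch below builds a small dataframe that already contains the NA row and shows how a summary such as mean() reacts. The use of na.rm and complete.cases() is our suggestion for coping with the missing values, not a step from the lab.

```r
my_df <- data.frame(
  col1 = c(2, 4, 6, NA),
  col2 = c("greetings", "happy", "world", NA)
)

mean(my_df$col1)                             # NA: one missing value poisons the mean
mean_without_na <- mean(my_df$col1, na.rm = TRUE)  # drops NA before averaging
mean_without_na

n_missing <- sum(is.na(my_df$col1))          # how many values are missing
n_missing

# Keep only the rows with no missing values in any column
complete_rows <- my_df[complete.cases(my_df), ]
complete_rows
```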
Changing Column Names
The column names of our my_df dataframe are pretty uninformative. Let’s see if we can give the columns more interesting names!
To access the column names of a dataframe, we can use either names() or colnames():
colnames(my_df)
[1] "col1" "col2" "col3" "col4"
To rename our columns, we can simply assign (using the variable assignment operator) a new vector of names:
colnames(my_df) <- c("numbers", "words", "colors", "booleans")
my_df
  numbers     words colors booleans
1       2 greetings    red     TRUE
2       4     happy  green    FALSE
3       6     world   blue     TRUE
4      NA      <NA>   <NA>    FALSE
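Renaming every column at once requires retyping all the names. As a sketch of one alternative, a single column can be renamed by indexing into names() with a logical condition; the two-column my_df here is a pared-down stand-in for the lab's dataframe.

```r
my_df <- data.frame(
  col1 = c(2, 4, 6),
  col2 = c("hello", "happy", "world")
)

# Replace only the name that currently equals "col2"
names(my_df)[names(my_df) == "col2"] <- "words"

names(my_df)
```

Matching on the old name, rather than on a position, keeps the code correct even if the column order changes later.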
We can, if we like, do something similar to assign names to the rows of our dataframe:
rownames(my_df) <- c("row1", "row2", "row3", "row4")
my_df
     numbers     words colors booleans
row1       2 greetings    red     TRUE
row2       4     happy  green    FALSE
row3       6     world   blue     TRUE
row4      NA      <NA>   <NA>    FALSE
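Once rows have names, those names can be used in the row position of the bracket syntax, just as column names can be used in the column position. A minimal sketch, using a three-row version of the renamed dataframe:

```r
my_df <- data.frame(
  numbers = c(2, 4, 6),
  words = c("greetings", "happy", "world")
)
rownames(my_df) <- c("row1", "row2", "row3")

# Look up a single entry by row name and column name
entry <- my_df["row2", "words"]
entry

# Select several rows by name, keeping all columns
subset_rows <- my_df[c("row1", "row3"), ]
subset_rows
```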
Some Selected Exercises
All problems refer to the following table, which is meant to represent a (fake) Ice Cream store’s earnings:
flavor | ppu | units_sold |
---|---|---|
chocolate | 1.5 | 400 |
vanilla | 1.5 | 200 |
ube | 2.0 | 250 |
strawberry | 1.5 | 300 |
1. Write a dataframe called ice_cream_df that stores the information in the table above.
2. Extract the second and third elements of the second and third columns.
3. Suppose that a flavor was originally missing from the dataset: the store actually also sold a mint_cc flavor which has a ppu (price per unit) value of 2.0, and sold 275 units. Update the ice_cream_df to incorporate this information.
4. Append a column called money_earned, which lists the amount of money earned from sales of each flavor. (Assume that the money earned is simply the product of the price per unit and the number of units sold.)