Section 2 Packages, Data Files and tidyverse

In Lab 1, we have only used data that we manually entered into R. However, most of the time, we will load data from an external file (e.g., txt, csv, dta, and rda). Before interacting with data files, we must ensure they reside in the working directory, which is the location on your computer where R will, by default, load data from and save data to. To display the current working directory, use the function getwd() without providing any input.

getwd() # check what your current working directory is

It is also possible to change the working directory using the setwd() function by specifying the full path to the folder of our choice as a character string. However, rather than using setwd() in code scripts, I believe data analysis should be organized in projects. In this case, you should keep the working directory as it is and provide the full path in each of your projects.
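
As a small sketch of writing out a full path without retyping the working directory, the base R function file.path() joins folder and file names with a separator that works on every platform. The folder name data and the file name UNpop.csv below are hypothetical placeholders, not files this lab creates.

```r
# Build a full path starting from the current working directory.
# "data" and "UNpop.csv" are hypothetical names used only for illustration.
wd <- getwd()
csv_path <- file.path(wd, "data", "UNpop.csv")

csv_path               # the full path as a character string
file.exists(csv_path)  # TRUE only if such a file is actually present
```

Checking file.exists() before reading is a cheap way to catch a mistyped path early.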

In this course we will make use of data from the quantitative social science (qss) package, which is available on GitHub. You can install it using the install_github() function from the devtools library (also called a package). Below, we assume that you don’t have the devtools library installed in your R. Thus, we use the install.packages() and library() functions to first install and then load the package. (Note that while install.packages() requires the package’s name to be placed in quotes, library() does not require the quotes.)

install.packages("devtools")
library(devtools)
devtools::install_github("kosukeimai/qss-package")
library("qss")
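
If you are unsure whether a package is already installed, one possible pattern is to install it only when it is missing. The sketch below uses the base R function requireNamespace() for the check. The package name stats is a stand-in chosen because it ships with R, so the example never triggers a download; in practice you would put "devtools" or another package name in its place.

```r
# Install a package only when it is missing, then load it.
# "stats" ships with R, so this sketch never downloads anything;
# swap in "devtools" (or another package name) in practice.
pkg <- "stats"

if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() accept the name stored in a string.
library(pkg, character.only = TRUE)
```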

It is good practice to load necessary libraries at the start of an R script. We will start by installing and loading the package tidyverse. The tidyverse is a collection of R packages designed for data science. It provides tools to simplify many common data wrangling, exploration, and visualization tasks. At the heart of the tidyverse is the principle of “tidy data,” which promotes a consistent structure for data sets where each variable is a column, each observation is a row, and each type of observational unit is a table. Key packages in the tidyverse include dplyr for data wrangling, ggplot2 for data visualization, tidyr for data tidying, readr for data import, purrr for functional programming, and tibble for tibble data structures (an evolution of R’s data frames), among others. These packages are designed to work together seamlessly, creating a comprehensive and coherent toolkit for data analysis. For more information on tidyverse, consult the book R for Data Science: Import, tidy, transform, visualize and model data.

devtools::install_github("tidyverse/tidyverse")
library(tidyverse)

We can also load these libraries individually.

library(dplyr)
library(readr)
library(ggplot2)

Datasets can be distributed with R packages. These are often smaller datasets used in the packages’ examples and tutorials. They are loaded with the data() function. For example, you can load UN data on demographic statistics from the qss library, which distributes the data sets used in the QSS textbook. (The function data() called without any arguments will list the datasets in the packages that are currently attached with library().)

data("UNpop", package = "qss")
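
To see which datasets a single package provides, data() also accepts only the package argument. The sketch below assumes nothing beyond base R: it queries datasets, a package bundled with every R installation, and pulls the names out of the results component of the returned object. Once qss is installed, the same call with package = "qss" should work the same way.

```r
# List the datasets shipped with one package.
# "datasets" is bundled with R; replace it with "qss" once qss is installed.
listing <- data(package = "datasets")

# The results component is a character matrix; the "Item" column holds the names.
dataset_names <- listing$results[, "Item"]

head(dataset_names)    # the first few dataset names
length(dataset_names)  # how many datasets the package provides
```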

Another way to access datasets in R is to load them from external files, including both stored R objects (.RData, .rda) and other formats (.csv, .dta, .sav). To read a csv file into R, we use the read_csv() function from the readr library, part of the tidyverse.

UNpop_URL <- "https://raw.githubusercontent.com/kosukeimai/qss/master/INTRO/UNpop.csv"
UNpop <- read_csv(UNpop_URL)

Note that in the previous code we loaded the file directly from a URL, but we could also work with local files on your computer.

UNpop <- read_csv("INTRO/UNpop.csv")
## Error: 'INTRO/UNpop.csv' does not exist in current working directory.

Oops. It appears something went wrong. The issue is that R cannot locate the file UNpop.csv in your working directory. To resolve this, we will first learn how to save our UNpop dataset as a .csv file and then attempt to read it again.

write_csv(UNpop, file = "~/Downloads/UNpop.csv")

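Remember to adjust the file path in the code above to match your directories. Also, Windows users who write paths with backslashes must double them (\\), because a single backslash starts an escape sequence in an R string; forward slashes (/) also work on Windows.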

Now we can load our dataset UNpop from a local file on your computer.

UNpop <- read_csv("~/Downloads/UNpop.csv")
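
The write-then-read round trip can also be tried without touching your own folders. The following is a minimal sketch that assumes the readr package is installed; UNpop_small and tmp_path are illustrative names, and the three rows are copied from the UNpop data. The argument show_col_types = FALSE only silences the column-type message that read_csv() prints in recent readr versions.

```r
library(readr)

# A small data frame standing in for UNpop (its first three rows).
UNpop_small <- data.frame(
  year = c(1950, 1960, 1970),
  world.pop = c(2525779, 3026003, 3691173)
)

# tempfile() returns a path inside R's temporary directory.
tmp_path <- tempfile(fileext = ".csv")
write_csv(UNpop_small, file = tmp_path)

file.exists(tmp_path)  # confirm the file is there before reading it
UNpop_back <- read_csv(tmp_path, show_col_types = FALSE)
UNpop_back
```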

Now that we’ve loaded UNpop, let’s see what we have. We can think of a data.frame or tibble object as a spreadsheet. We can view a table-like representation of data.frame or tibble objects in RStudio by clicking on the object name in the Environment tab in the upper-right window (see figure 1.1). In tabular data like this, we often think of (and refer to) the columns as variables and the rows as observations. You will notice that we use the terms column and variable interchangeably, just as we use row and observation interchangeably.

Alternatively, we can view our data set using the View() function with the object name as the input argument. This will open a new tab displaying the data.

View(UNpop)

Useful functions for this object include the names() function to return a vector of variable names, the nrow() function to return the number of rows, the ncol() function to return the number of columns, and the dim() function to combine the outputs of nrow() and ncol(), in that order, into a vector (also known as the dimensions of the data).

names(UNpop)  
## [1] "year"      "world.pop"
nrow(UNpop)
## [1] 7
ncol(UNpop)
## [1] 2
dim(UNpop)
## [1] 7 2

The read_csv() function returns a tibble rather than a plain data frame. We can verify this by invoking the class() function on the object.

class(UNpop)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
UNpop
## # A tibble: 7 × 2
##    year world.pop
##   <dbl>     <dbl>
## 1  1950   2525779
## 2  1960   3026003
## 3  1970   3691173
## 4  1980   4449049
## 5  1990   5320817
## 6  2000   6127700
## 7  2010   6916183

The <dbl> annotation you see in the output when printing a tibble represents a “double” data type, which is a type of numerical data. Thus, the above output is telling us that both our variables year and world.pop are numerical data. We can check this by applying the class() function to a single variable, which we extract from the data.frame object with the $ operator provided by base R.

class(UNpop$year)
## [1] "numeric"

By using the $ operator, preceded by our dataset’s name and followed by the variable name of interest, we can extract information specific to that variable from the data.frame object. Thus, R returns a vector containing the desired variable.

UNpop$year
## [1] 1950 1960 1970 1980 1990 2000 2010

We could also use indexing [ ], as done for a vector. Since a data.frame object is a two-dimensional array, we need two indexes, one for rows and the other for columns. Using brackets with a comma [rows, columns] allows users to call specific rows and columns by either row/column numbers or row/column names. If we do not specify a row (column) index, then the syntax will return all rows (columns). Here are some syntax examples, which show how this indexing works.

# subset all rows for column called "world.pop" from UNpop data  
UNpop[, "world.pop"]  
## # A tibble: 7 × 1
##   world.pop
##       <dbl>
## 1   2525779
## 2   3026003
## 3   3691173
## 4   4449049
## 5   5320817
## 6   6127700
## 7   6916183
# subset the first three rows (and all columns)  
UNpop[c(1, 2, 3),] 
## # A tibble: 3 × 2
##    year world.pop
##   <dbl>     <dbl>
## 1  1950   2525779
## 2  1960   3026003
## 3  1970   3691173
# subset the first three rows of the "year" column  
UNpop[1:3, "year"]   
## # A tibble: 3 × 1
##    year
##   <dbl>
## 1  1950
## 2  1960
## 3  1970

In the tidyverse syntax, extracting subsets of data looks a bit different. Instead of using lots of brackets, we can use functions to select rows by number, to select rows by certain criteria, or to select columns. The slice() function returns rows (observations) by number (or other criteria). For example, to select rows 1–3 of our data set:

slice(UNpop, 1:3)
## # A tibble: 3 × 2
##    year world.pop
##   <dbl>     <dbl>
## 1  1950   2525779
## 2  1960   3026003
## 3  1970   3691173

The select() function returns columns (variables) by name, number, or other criteria. For example, to extract/subset the world.pop variable (column) within our data set we can use the select() function (remember to provide the name of your data set [object] in the first parameter of the function).

select(UNpop, world.pop)
## # A tibble: 7 × 1
##   world.pop
##       <dbl>
## 1   2525779
## 2   3026003
## 3   3691173
## 4   4449049
## 5   5320817
## 6   6127700
## 7   6916183

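Notice that the $ operator returns a vector, while select() always returns a tibble (data frame), even if only one column is selected. Indexing a tibble with [ ] also returns a tibble, as the output above shows; with a base R data.frame, selecting a single column with [ ] would return a vector instead.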
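
What comes back depends both on the operator and on whether the object is a plain data.frame or a tibble. The sketch below, which assumes the tibble package is installed, compares the two on a two-row stand-in for UNpop; the object names df and tb are illustrative.

```r
library(tibble)

df <- data.frame(year = c(1950, 1960), world.pop = c(2525779, 3026003))
tb <- as_tibble(df)

class(df[, "world.pop"])  # plain data.frame: a single column drops to a vector
class(tb[, "world.pop"])  # tibble: the result is still a tibble
class(tb$world.pop)       # $ returns a vector in both cases
```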

Let’s say we wanted to subset the first three rows just for the variable year. We could do that in any of the following ways. Notice that we can nest a slice() command inside a select() command: R evaluates the inner slice() first, taking the first three rows, and then select() keeps the year column.

# base R subset the first three rows of the year variable  
UNpop[1:3, "year"]  
## # A tibble: 3 × 1
##    year
##   <dbl>
## 1  1950
## 2  1960
## 3  1970
# or in tidyverse, combining slice() and select()  
select(slice(UNpop, 1:3), year) 
## # A tibble: 3 × 1
##    year
##   <dbl>
## 1  1950
## 2  1960
## 3  1970

Instead of nesting slice() inside select(), we could use the pipe operator %>%. In the tidyverse syntax, the pipe operator is used to chain together multiple functions in a sequence of operations. When you use the pipe operator, the result of the expression or function to its left is used as the first argument to the function on its right. This allows you to write code in a more readable, left-to-right fashion, which can make complex sequences of operations easier to understand.

UNpop %>% # take the UNpop data we have loaded, and then...  
  slice(1:3) %>% # subset the first three rows, and then...  
  select(year) # subset the year column
## # A tibble: 3 × 1
##    year
##   <dbl>
## 1  1950
## 2  1960
## 3  1970

This example may seem verbose, but chaining functions together using the pipe operator, %>%, will become very useful as our tasks become more complicated.
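
To check that the nested and piped versions do the same work, we can store both results and compare them. This is a minimal sketch assuming dplyr is installed; UNpop_small is an illustrative stand-in holding the first four rows of UNpop.

```r
library(dplyr)

UNpop_small <- tibble(
  year = c(1950, 1960, 1970, 1980),
  world.pop = c(2525779, 3026003, 3691173, 4449049)
)

# Nested call: the inner slice() runs first, then select().
nested <- select(slice(UNpop_small, 1:3), year)

# Piped call: the same two steps, written left to right.
piped <- UNpop_small %>%
  slice(1:3) %>%
  select(year)

identical(nested, piped)  # compares the two results
```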

As another subsetting example, imagine that we want to extract every other row of the world.pop column from UNpop (i.e., we want rows 1, 3, 5, etc. for the world.pop variable). We could use an additional helper function, n(), which returns the number of rows in the data.frame or tibble.

UNpop %>%  
  slice(seq(1, n(), by = 2)) %>% # using a sequence from 1 to n()  
  select(world.pop) 
## # A tibble: 4 × 1
##   world.pop
##       <dbl>
## 1   2525779
## 2   3691173
## 3   5320817
## 4   6916183

A final example of how to subset these specific rows and column uses the filter() function. filter() is to rows what select() is to columns: it keeps the rows that satisfy a condition. In the example below, filter() says to subset rows if their row number divided by 2 gives a remainder of 1. The %% operator returns the modulus, i.e., division remainder. The function row_number() returns the row number of an observation.

UNpop %>%  
  filter(row_number() %% 2 == 1) %>%  
  select(world.pop)  
## # A tibble: 4 × 1
##   world.pop
##       <dbl>
## 1   2525779
## 2   3691173
## 3   5320817
## 4   6916183

The filter(row_number() %% 2 == 1) in the above code makes use of what is called a conditional or logical statement. We will discuss these more in depth in Week 3. For now, think of these as “if” conditions, telling R to do something if a condition is met. The condition might be that something is equal to something else, such as the modulus being equal to 1 as in the example above. The “equal to” condition is indicated in R with == (note that this is not the same as a single =). In the example, we are telling R to return the subset of rows where the modulus of dividing the row number by 2 is equal to 1 (in other words, returning the odd-numbered observations 1, 3, 5, etc.). For conditional statements, we can also use the “less than” (<), “less than or equal to” (<=), “greater than” (>), and “greater than or equal to” (>=) syntax. An exclamation point in a conditional indicates negation, so != means “is not equal to.”
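
Each of these conditions produces a vector of TRUE and FALSE values, one per element. The base R sketch below applies them to a plain vector of the years in UNpop (the vector name year is chosen for illustration) so that the logical results can be inspected directly.

```r
year <- c(1950, 1960, 1970, 1980, 1990, 2000, 2010)

year == 1970               # TRUE only where year equals 1970
year != 1970               # the negation of the line above
year >= 1980               # TRUE for 1980 and every later year
seq_along(year) %% 2 == 1  # TRUE for the odd positions 1, 3, 5, 7

# Inside [ ], a logical vector keeps the elements where it is TRUE.
year[year >= 1980]
```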

The following code uses filter(), select(), and a conditional statement with the function pull() to extract a specific value from our data as a vector instead of a tibble. Let’s say, for example, that we wanted to know what the world population was in 1970. We could use the following commands.

pop.1970 <- UNpop %>% # take the UNpop data and then...  
  filter(year == 1970) %>% # subset rows where year is equal to 1970  
  select(world.pop) %>% # subset just the world.pop column  
  pull() # return a vector, not a tibble  

#print the vector to the console to see it  
print(pop.1970) 
## [1] 3691173
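
As a possible shortcut, pull() accepts a column name, so the separate select() step can be skipped. The sketch below assumes dplyr is installed and also shows one base R way to reach the same value; UNpop_small, pop.1970.pull, and pop.1970.base are illustrative names.

```r
library(dplyr)

UNpop_small <- tibble(
  year = c(1960, 1970, 1980),
  world.pop = c(3026003, 3691173, 4449049)
)

# pull() with a column name extracts that column as a vector.
pop.1970.pull <- UNpop_small %>%
  filter(year == 1970) %>%
  pull(world.pop)

# Base R: index the world.pop vector with a logical condition on year.
pop.1970.base <- UNpop_small$world.pop[UNpop_small$year == 1970]

pop.1970.pull
pop.1970.base
```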

2.1 Adding Variables

Suppose we wanted to take the population data and add an additional variable (column) based on a current variable. For example, perhaps we want to have the world population in millions, instead of the raw figure in the original data. We can use the mutate() function to create that variable, which we call world.pop.mill, and add it to the tibble. We can then drop the original world.pop variable using the select() function with a - and the column name. In the example below, we also use <- to save a new version of the data that contains the new column. Note that if we put the same object name on both sides of the <-, that would overwrite the existing data. If you run the code below, you should have both UNpop and UNpop.mill in your Environment. You may want to look at the new data to confirm that the new variable is as you expect.

UNpop.mill <- UNpop %>% # create a new tibble from UNpop  
  mutate(world.pop.mill = world.pop/1000) %>%  # create new variable world.pop.mill 
  select(-world.pop) # drop the original world.pop column  

The mutate() function is a very useful command in tidyverse. We used it above to do an arithmetic operation on a column. We can also use it to combine columns based on our specifications, as in the example below. Let’s say we wanted a variable that took the world population and divided it by the year (why we would want to do this is unclear, but let’s go with it for now). The following code shows how we could do that by using the column names.

# adding a nonsense variable to the UNpop.mill data  
UNpop.mill <- UNpop.mill %>%  
  mutate(nonsense.var = world.pop.mill/year) 

We can combine the mutate() function with conditional statements in useful ways by using the function if_else(). The function tells R to do something if a conditional statement is met and to do something else if the statement is not met. Say we wanted a new variable that indicates whether or not a row contains data from 1980 or later. We’ll call this new variable after.1980. We want this variable to be dichotomous and, therefore, have two possible values: 1 if the row is from 1980 or later and 0 if it’s not. It is usually a good idea to check that your new variable looks the way you expect.

# adding a variable with if_else  
UNpop.mill <- UNpop.mill %>%  
  mutate(after.1980 = if_else(year >= 1980, 1, 0)) 
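
One way to check the new variable is to list the years that received the value 1. The sketch below assumes dplyr is installed and rebuilds only the year column so that it runs on its own; UNpop_years and coded.one are illustrative names. Because the condition is year >= 1980, the year 1980 itself is coded 1.

```r
library(dplyr)

UNpop_years <- tibble(year = c(1950, 1960, 1970, 1980, 1990, 2000, 2010)) %>%
  mutate(after.1980 = if_else(year >= 1980, 1, 0))

# Which years were coded 1?
coded.one <- UNpop_years %>%
  filter(after.1980 == 1) %>%
  pull(year)

coded.one
```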

Another example with mutate() and if_else() uses a very helpful conditional symbol: %in%. We can follow %in% with a vector of values. R will then check whether a specific value matches something in that vector. For example, let’s say that we also wanted to add a variable noting whether a row was from the following specific set of years (imagine that these years are of particular interest to us): 1950, 1980, and 2000. In the following code, we first create a vector of those years, then we reference it within if_else() to create a new variable, year.of.interest.

# creating a vector of the years of interest  
specific.years <- c(1950, 1980, 2000)  

# adding a variable with if_else and %in%  
UNpop.mill <- UNpop.mill %>%  
  mutate(year.of.interest = if_else(year %in% specific.years, 1, 0)) 

# Viewing your new data frame (tibble)
UNpop.mill
## # A tibble: 7 × 5
##    year world.pop.mill nonsense.var after.1980 year.of.interest
##   <dbl>          <dbl>        <dbl>      <dbl>            <dbl>
## 1  1950          2526.         1.30          0                1
## 2  1960          3026.         1.54          0                0
## 3  1970          3691.         1.87          0                0
## 4  1980          4449.         2.25          1                1
## 5  1990          5321.         2.67          1                0
## 6  2000          6128.         3.06          1                1
## 7  2010          6916.         3.44          1                0

2.2 Data Frames: Summarizing

Having loaded our data and created some new variables, we now turn to some ways to summarize the data. The summary() function is useful for this and for extracting some descriptive statistics from our data. The summary() function yields, for each variable in the input data.frame or tibble object, the minimum value, the first quartile (or 25th percentile), the median (or 50th percentile), the mean, the third quartile (or 75th percentile), and the maximum value. Remember, we can also use functions like mean() to compute summary values for specific variables in the data and nrow() to get the number of observations in a data frame. The number of observations and variables within a data frame can also be viewed in the upper-right pane (Environment/History tab) of RStudio.

summary(UNpop.mill) 
##       year      world.pop.mill  nonsense.var     after.1980    
##  Min.   :1950   Min.   :2526   Min.   :1.295   Min.   :0.0000  
##  1st Qu.:1965   1st Qu.:3359   1st Qu.:1.709   1st Qu.:0.0000  
##  Median :1980   Median :4449   Median :2.247   Median :1.0000  
##  Mean   :1980   Mean   :4580   Mean   :2.305   Mean   :0.5714  
##  3rd Qu.:1995   3rd Qu.:5724   3rd Qu.:2.869   3rd Qu.:1.0000  
##  Max.   :2010   Max.   :6916   Max.   :3.441   Max.   :1.0000  
##  year.of.interest
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4286  
##  3rd Qu.:1.0000  
##  Max.   :1.0000
mean(UNpop.mill$world.pop.mill) 
## [1] 4579.529
nrow(UNpop.mill)
## [1] 7

In R, missing values are represented by NA. When applied to an object with missing values, functions may or may not automatically remove those values before performing operations.

We will discuss the details of handling missing values later. Here we note that for many functions, like mean(), the argument na.rm = TRUE will remove missing data before operations occur. Below we use the add_row() function to add a row with NA values to demonstrate the issue that arises with mean() and how to fix it.

# add a row where value for each column is NA  
UNpop.mill.wNAs <- UNpop.mill %>%  
  add_row(year = NA, world.pop.mill = NA,  nonsense.var = NA, after.1980 = NA,  year.of.interest = NA)  

#take the mean of world.pop.mill (returns NA)
mean(UNpop.mill.wNAs$world.pop.mill)  
## [1] NA
#take the mean of world.pop.mill (ignores the NA) 
mean(UNpop.mill.wNAs$world.pop.mill, na.rm = TRUE)  
## [1] 4579.529
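
Before deciding how to handle missing values, it helps to know how many there are. The base R sketch below uses is.na(), which is not covered above, on a short illustrative vector pop whose first three values are copied from world.pop.mill.

```r
pop <- c(2525.779, 3026.003, 3691.173, NA)

is.na(pop)       # TRUE marks the missing entry
sum(is.na(pop))  # counts the missing values

mean(pop)                # NA, because one value is missing
mean(pop, na.rm = TRUE)  # mean of the three observed values
```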

The tidyverse offers a useful way to generate summaries with the summarize() function (or summarise(), both spellings are accepted). With this function, you can specify multiple functions to apply to variables within a data set, returning the results as new columns in a tibble. The example below returns both the median and the mean of world.pop.mill.

UNpop.mill %>%  
  summarize(mean.pop = mean(world.pop.mill),  
            median.pop = median(world.pop.mill))  
## # A tibble: 1 × 2
##   mean.pop median.pop
##      <dbl>      <dbl>
## 1    4580.      4449.

What if we wanted to know the average (mean) world population but not for the full time period? For example, we might want to know what the average population was before 1980 and from 1980 onward. To do this, we can combine the summarize() function with the group_by() function, which tells R to treat subsets of the data separately. We need to tell R which variable to use to group the rows. We can use the variable we created earlier, after.1980, for this purpose.

UNpop.mill %>%    
  group_by(after.1980) %>%  # create subset group for each value of after.1980  
  summarize(mean.pop = mean(world.pop.mill)) # calculate mean for each group
## # A tibble: 2 × 2
##   after.1980 mean.pop
##        <dbl>    <dbl>
## 1          0    3081.
## 2          1    5703.
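
A grouped summary can report several quantities at once. The sketch below assumes dplyr is installed and rebuilds the needed columns so it runs on its own; UNpop_demo, by.period, n.years, and first.year are illustrative names. It adds the group size with n() and the earliest year in each group alongside the mean.

```r
library(dplyr)

UNpop_demo <- tibble(
  year = c(1950, 1960, 1970, 1980, 1990, 2000, 2010),
  world.pop.mill = c(2525.779, 3026.003, 3691.173, 4449.049,
                     5320.817, 6127.700, 6916.183)
) %>%
  mutate(after.1980 = if_else(year >= 1980, 1, 0))

by.period <- UNpop_demo %>%
  group_by(after.1980) %>%
  summarize(n.years = n(),                     # rows in each group
            first.year = min(year),            # earliest year in each group
            mean.pop = mean(world.pop.mill))   # group mean

by.period
```

Reporting the group size next to each mean makes it easier to judge how many observations each average rests on.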

2.3 Loading and Saving Data in Other Formats

Often we wish to load or save a data file produced by another statistical software program such as Stata or SPSS. The foreign and haven packages are useful when dealing with files from other statistical software. Remember that you will first need to install and load a package before you can use it.

#install packages - note the syntax for multiple packages  
install.packages(c("foreign", "haven"))  
library("foreign") #load package  
library("haven")  

Once the packages are loaded, we can use the appropriate functions to load or save the data file. For example, the read.dta() and read.spss() functions from foreign can read Stata and SPSS data files, respectively (the syntax below assumes the existence of the UNpop.dta and UNpop.sav files in the working directory).

read.dta("UNpop.dta")  
read.spss("UNpop.sav")  

Alternatively, you could use the read_dta() function from haven. Try it out!

It is also possible to save a data.frame object as a data file that can be directly loaded into another statistical software package. For example, the write.dta() function from foreign will save a data.frame object as a Stata data file. Or you can use the write_dta() function from haven.

write.dta(UNpop, file = "UNpop.dta")  
write_dta(rename(UNpop, world_pop = world.pop), "UNpop.dta") # haven rejects variable names containing dots, so rename first

2.4 Saving Objects

The objects we create in an R session will be temporarily saved in the workspace, which is the current working environment. As mentioned earlier, the ls() function displays the names of all objects currently stored in the workspace. In RStudio, all objects in the workspace appear in the Environment tab in the upper-right corner. However, these objects will be lost once we terminate the current session. This can be avoided if we save the workspace at the end of each session as an RData file. When we quit R, we will be asked whether we would like to save the workspace. We should answer no to this so that we get in the habit of explicitly saving only what we need (particularly, the R script). In fact, you should uncheck the options in RStudio to avoid saving and restoring from .RData files (go to Tools > Global Options > General). This will help ensure that your R code runs the way you think it does, instead of depending on some long forgotten code that is only saved in the workspace image. Everything important should be in a script. Anything saved or loaded from file should be done explicitly.
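
Saving and loading explicitly might look like the following base R sketch, which writes one named object to an .RData file with save() and restores it with load(). UNpop_small and rdata_path are illustrative names, and tempfile() is used only so the example leaves nothing behind in your own folders.

```r
UNpop_small <- data.frame(year = c(1950, 1960),
                          world.pop = c(2525779, 3026003))

# Write only the named object, not the whole workspace.
rdata_path <- tempfile(fileext = ".RData")
save(UNpop_small, file = rdata_path)

rm(UNpop_small)        # remove the object, as if starting a fresh session
exists("UNpop_small")  # check whether the object is still in the workspace

load(rdata_path)       # restores the object under its original name
UNpop_small
```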

In R, you can save a script with the “.R” extension, which indicates that it’s an R script file. Here’s a step-by-step guide to saving a script in R:

  1. Write your R code: Start typing or pasting your R code. Your R script can contain multiple lines of code, functions, comments, and any other valid R commands.

  2. Open the save dialog: Once you have written your R script, it’s time to save it. Go to “File” in the menu bar and select “Save As” or “Save.”

  3. Choose the file name: In the save dialog, choose a name for your R script. It’s common to use a “.R” extension at the end of the file name.

  4. Choose the location: Navigate to the directory where you want to save your R script. You can create a new folder or choose an existing one.

  5. Save the file: Click the “Save” button, and your R script will be saved with the chosen name and location.

By saving the script as an “.R” file, you can easily open and run it in R or RStudio later. Additionally, having your code in script files makes it more organized and allows for better version control and collaboration with others.

2.5 Help Files

In R, we use help(), ?, and ?? to access package documentation and obtain information about functions, objects, datasets, and other topics related to R programming. These commands are crucial to discover, explore, and understand the functionalities of packages, functions, and objects, saving time and aiding in package and function selection for your tasks.

For instance, help(package = "tidyverse"), ?tidyverse, and ??tidyverse are used to access package documentation and information related to the “tidyverse” package and functions. Let’s look at each of these commands:

  1. help(package = "tidyverse"): This command is used to display the help page for the “tidyverse” package, which provides an overview and information about the package. It includes a brief description of the “tidyverse” and lists all the packages included in it. Additionally, it may contain other useful details, such as the authors, version, and dependencies.
help(package = "tidyverse")
  2. ?tidyverse: This command is used to directly access the help page of a specific function within the “tidyverse” package. Replace “tidyverse” with the name of the function you want to learn more about. For example, ?mutate will display the help page for the mutate() function, which is part of the “dplyr” package, a core component of the “tidyverse.”
?mutate
  3. ??tidyverse: The double question marks (??) are used to perform a broader search across all installed packages for any help pages or documentation that contain the term “tidyverse.” This includes not only the package-level help but also any functions, datasets, or other topics related to the “tidyverse” available in different packages. It can be useful for finding additional functions or information related to the “tidyverse” provided by packages beyond the core “tidyverse” package itself.
??tidyverse

2.6 Interactive R Learning: Lab 2

By now, swirl should be installed on your machine (refer to Section 1.6 if you’re unsure).

# Load the swirl package:
library(swirl)

There’s no need to reinstall the qss-swirl lessons. Simply commence a qss-swirl lesson after loading the swirl package.

# Start a qss-swirl lesson (after loading the swirl package):
swirl()

For Lab 2, we’ll work on Lesson 2: INTRO2.