So you’re About to be a Grad Student?

Day 1 of the RI workshop, Summer 2023

Austin Cutler

FSU

What’s this workshop?

  • This workshop is meant to be an introduction to using R
  • No prior knowledge is assumed
  • The goal is to make your time in POS 5737 a little easier
    • R is fairly complicated to use, but we’re going to get as far as we can this week!

Introductions!

  • Austin
  • Buffalo, NY
  • Americanist
  • I came to FSU because my Master’s Thesis adviser said it was a good idea
  • I also coach high school track (hurdles)!

Ground Rules for the Class

  • Please ask questions!
  • Look into Stack Exchange
  • Use the help function in RStudio
  • Please don’t use ChatGPT…yet

Today’s Class:

  • Welcoming you all to grad school/some tips on how to make this whole thing manageable
  • What is R?
    • The different parts of RStudio
    • Do you have it all installed?
    • How does it work?
  • Using R as a calculator
  • What are R scripts and functions?
  • What’s an object?
  • How to read data into R
  • What does that data look like?
  • What to do with the data once it’s in there?

Welcome to FSU!

This part of the class will be a little different from the rest. Here’s what we’ll cover:

  • What grad school is/is not
  • Expectations
  • How do you manage this?

What Grad School Is/Isn’t

  • This is not undergrad 2.0
  • This is a lot of work
  • It is very rewarding, and you’ll likely have a lot more freedom than you’re used to

Expectations

Go to class (unless you actually can’t)

  • Participate when you do go to class
  • Actually do the reading, as much of it as you can do (it is a lot)

Go to department functions

  • Chris’ll say more about this at orientation, but ~3 hours of your contract is allotted to professionalization events

Fulfilling your GA and RA responsibilities

  • There is heterogeneity in what will be asked by faculty, but they’re all usually good about not using more of your time than they’re given
  • If there is a problem here, talk to Chris

Managing Grad School

  • Imposter syndrome is real, do your best to not let it get you too much
  • FSU has therapy!
  • Work together!
    • In most methods classes, working together on assignments is not only welcomed but arguably necessary (just make sure to turn in your own work)
  • Keep a schedule
    • Find what times are best for you to complete certain tasks and lean into that; maintaining a schedule is helpful for getting things done (see How to Write a Lot)
  • Do things other than this
    • Tallahassee is a mediumish sized city, there’s lots to do around and at FSU

Now For Some R

Raise your hand if you do not have R/RStudio installed yet

Changing the RStudio Theme

Using R as a Calculator

1 + 1
[1] 2
1 + 3
[1] 4
100000000000000 * 21000000000000
[1] 2.1e+27
sqrt(400)
[1] 20

Practice with R as a calculator

Take a moment and perform the following calculations:

  1. \(9+10\)
  2. \(9*10\)
  3. \(9/10\)
  4. \(9^2\)
  5. \(\sqrt{32}\)
  6. \(\log 9\)

The Different Parts of RStudio

Object Oriented Computing

Scalars

#you can use the # symbol to leave comments in your code
## assigning 2 to the letter b
b <- 2

## adding b and 2
b + 2
[1] 4
  • Obviously, just using 2+2 instead of assigning that value to b would be easier, this is just to demonstrate how objects work
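
To see why storing values in objects pays off, here is a small sketch. The object names (votes_dem, votes_rep, and so on) and the numbers are made up for illustration.

```r
# hypothetical vote counts, stored as objects
votes_dem <- 5200
votes_rep <- 4800

# objects can be combined to build new objects
votes_total <- votes_dem + votes_rep
share_dem <- votes_dem / votes_total

share_dem
# [1] 0.52

# reassigning an object overwrites its old value...
votes_dem <- 6000

# ...but objects computed earlier do not update on their own
share_dem
# [1] 0.52
```

If you want share_dem to reflect the new value, you have to rerun the line that creates it.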

Functions

# taking the squareroot of 400
sqrt(400)
[1] 20
# taking the squareroot of 400, assigning it
sq400 <- sqrt(400)

sq400
[1] 20
sqrt(sq400)
[1] 4.472136

Strings

# words are represented as character strings
'Hello World'
[1] "Hello World"
# understanding this is useful for understanding how data is structured
## strings cannot interact with numeric vectors
'Hello World' + 2
Error in "Hello World" + 2: non-numeric argument to binary operator
#the same is true even when the character vector is actually a number
'3' + 2
Error in "3" + 2: non-numeric argument to binary operator
#there are functions that we can use to fix this issue, however
as.numeric('3') + 2
[1] 5
#loading in a library (more on this later)
library(tidyverse)

#using parse_number to treat character as a number
parse_number('3') + 2
[1] 5

Strings

#the benefit of parse_number is that it will pull the number out of a string
parse_number('himothy316') + 2
[1] 318
#as.numeric, however, cannot do that
as.numeric('himothy316') + 2
[1] NA
  • This difference is useful in situations where you have numbers being used to label things in your data
parse_number('ideo2')
[1] 2
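
The same idea works on a whole vector of labels at once. This is a minimal sketch that assumes the readr package (part of the tidyverse) is installed; the labels are made up.

```r
library(readr)

# made-up variable labels that mix text and numbers
labels <- c("ideo1", "ideo2", "ideo7")

# parse_number() pulls the number out of each label
label_numbers <- parse_number(labels)
label_numbers
# [1] 1 2 7

# as.numeric() returns NA (with a warning) for the same input
label_failed <- suppressWarnings(as.numeric(labels))
label_failed
# [1] NA NA NA
```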

Vectors

  • Working with only a single number is not realistic
  • Instead, we can work with a collection of values; let’s start with vectors

There are a few different types of vectors, they are:

  • numeric: contain only numeric values such as 1.1, 2, 100, etc.
  • character: contain only strings like we went over in the previous sections, such as "Republican", "Democratic", "Independent"
  • factor: categorical values with a fixed set of levels that you can put in a meaningful order, think a scale that goes "Very Liberal" to "Very Conservative" or "Complete Autocracy" to "Democracy"
  • logical: contains only TRUE or FALSE values
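
If you are ever unsure what type a vector is, class() will tell you. A quick sketch with one made-up vector of each type:

```r
# one small vector of each type (values are made up for illustration)
num_vec <- c(1.1, 2, 100)
chr_vec <- c("Republican", "Democratic", "Independent")
fct_vec <- factor(c("Low", "High", "Low"), levels = c("Low", "High"))
lgl_vec <- c(TRUE, FALSE, TRUE)

class(num_vec)  # "numeric"
class(chr_vec)  # "character"
class(fct_vec)  # "factor"
class(lgl_vec)  # "logical"

# mixing numbers and strings in one vector converts everything to character
mixed_vec <- c(1, "two", 3)
class(mixed_vec)  # "character"
```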

Numeric Vectors

  • Here is an example of code you’d use to make a numeric vector
a <- c(5,10,15,100)
  • Just like before, you can simply run the name of the object to call the values
a
[1]   5  10  15 100
  • You can also apply functions to vectors just like with scalars
a/5
[1]  1  2  3 20
exp(a)
[1] 1.484132e+02 2.202647e+04 3.269017e+06 2.688117e+43
mean(a) #congrats, you're now doing statistics
[1] 32.5

Character Vectors

p <- c("Republican", "Democrat", "Republican", "Independent")

p
[1] "Republican"  "Democrat"    "Republican"  "Independent"
  • Just like with individual strings, you can’t apply numeric functions to character vectors
p*3
Error in p * 3: non-numeric argument to binary operator
  • Brief note: within a string, “escapes” can change the spacing of the text.
    • You use \ to apply escapes: \n is a new line and \t is a tab, for instance
cat("Repu\nblican")
Repu
blican
cat("Demo\tcrat")
Demo    crat

Factor Vectors

p1 <- c("Republican", "Republican", "Democrat", "Other")
class(p1)
[1] "character"
table(p1)
p1
  Democrat      Other Republican 
         1          1          2 
  • Problem: The ordering is weird, and independents are missing
#adding levels to p1
pid <- factor(p1, levels = c("Republican",
                              "Independent",
                              "Democrat",
                              "Other"))

Factor Vectors

  • Now we can check our work and see that both our problems are solved
#checking our work
class(pid)
[1] "factor"
table(pid)
pid
 Republican Independent    Democrat       Other 
          2           0           1           1 
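
Levels matter beyond tables, too. The sketch below, using the same made-up party data, compares how a character vector and a factor are sorted; party_chr and party_fct are names chosen just for this example.

```r
# the party data as a plain character vector
party_chr <- c("Republican", "Republican", "Democrat", "Other")

# and as a factor with an explicit level order
party_fct <- factor(party_chr,
                    levels = c("Republican", "Independent", "Democrat", "Other"))

# character vectors sort alphabetically
sort(party_chr)
# [1] "Democrat"   "Other"      "Republican" "Republican"

# factors sort by the order of their levels
as.character(sort(party_fct))
# [1] "Republican" "Republican" "Democrat"   "Other"

# levels() lists every allowed category, even ones with no observations
levels(party_fct)
# [1] "Republican"  "Independent" "Democrat"    "Other"
```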

Practice with Vectors

  • For all of the following, include comments throughout your code

Numeric:

  • Make a new vector, named my_vector with the values 2,6,4,3,5,17.
  • Find the sum of the vector (use the sum() function)
  • Find the square root (sqrt()) of the elements of the vector

Character:

  • Create a vector named sex with the following elements: Male, Female, Male, Male, and Female
  • Create a vector ideology with the following observations: Liberal, Moderate, Moderate, Conservative, and Liberal

Factor:

  • Take the character vector sex and add levels. Assign the levels such that the order is Male then Female.
  • Take the ideology vector from before and order them in a way that makes sense.

Missing Values

  • Missing values are denoted as NA
  • Functions handle these differently; some will just pass them through
x <- c(1,4,7,NA,2)
log(x)
[1] 0.0000000 1.3862944 1.9459101        NA 0.6931472
  • Some will return NA if any value is missing
sum(x)
[1] NA
  • Unless you change some of the arguments in the function
sum(x, na.rm = TRUE)
[1] 14
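
Two more things worth knowing, sketched with the same x: is.na() finds the missing values, and mean() treats them the same way sum() does.

```r
x <- c(1, 4, 7, NA, 2)

# is.na() returns TRUE wherever a value is missing
is.na(x)
# [1] FALSE FALSE FALSE  TRUE FALSE

# count the missing values (TRUE counts as 1)
n_missing <- sum(is.na(x))
n_missing
# [1] 1

# mean() behaves like sum(): NA unless you ask it to drop missing values
mean(x)
# [1] NA
mean_x <- mean(x, na.rm = TRUE)
mean_x
# [1] 3.5
```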

Logical Operators

  • Logical operators are used in R to test if certain conditions hold
Operator                      Syntax
“less than”                   <
“less than or equal to”       <=
“exactly equal to”            ==
“greater than or equal to”    >=
“greater than”                >
“not equal to”                !=
“or”                          |
“and”                         &
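
Here is a short sketch of these operators applied to a vector. The ideo scores are made up; each test returns one TRUE or FALSE per element.

```r
ideo <- c(1, 4, 7, 2)  # made-up 7-point ideology scores

ideo > 4
# [1] FALSE FALSE  TRUE FALSE
ideo == 4
# [1] FALSE  TRUE FALSE FALSE
ideo != 4
# [1]  TRUE FALSE  TRUE  TRUE

# combine conditions with & (and) and | (or)
in_middle <- ideo >= 2 & ideo <= 4
in_middle
# [1] FALSE  TRUE FALSE  TRUE
at_ends <- ideo < 2 | ideo > 6
at_ends
# [1]  TRUE FALSE  TRUE FALSE
```

These TRUE/FALSE results are what functions like filter() and case_when() use behind the scenes.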

Dataframes and Tibbles

  • The data you work with will largely be organized into objects known as data frames or tibbles (for our purposes, they work the same way)
  • Think of a data frame/tibble as a collection of vectors
  • The same logic from before with vectors can be applied here
  • Below is how you would make a dataframe
data.frame('name' = c('John', 'Jacob', 'Jingleheimer Schmidt'),
           'ideo' = c(1, 4, 7),
           'sex'  = c('Male', 'Male', 'Male')) -> his_name

Dataframes and Tibbles

  • Reminder to run the following in your console to install the tidyverse package
install.packages('tidyverse')
  • The tidyverse equivalent to a data.frame
library(tidyverse)

tibble('name' = c('John', 'Jacob', 'Jingleheimer Schmidt'),
       'ideo' = c(1, 4, 7),
       'sex'  = c('Male', 'Male', 'Male')) -> his_name
  • If you want to call a specific column from a dataframe, you can use the $ operator, which will give you that column as a vector
his_name$name
[1] "John"                 "Jacob"                "Jingleheimer Schmidt"
his_name$ideo/2
[1] 0.5 2.0 3.5
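
A few base R functions are handy for checking what a data frame contains. This sketch rebuilds the same made-up data as a base R data frame; his_name_df is a name chosen just for this example.

```r
# a base R data frame with the same made-up values as above
his_name_df <- data.frame(name = c("John", "Jacob", "Jingleheimer Schmidt"),
                          ideo = c(1, 4, 7),
                          sex  = c("Male", "Male", "Male"),
                          stringsAsFactors = FALSE)

nrow(his_name_df)   # number of rows
# [1] 3
ncol(his_name_df)   # number of columns
# [1] 3
names(his_name_df)  # column names
# [1] "name" "ideo" "sex"

# a column pulled out with $ is an ordinary vector, so vector functions work on it
mean(his_name_df$ideo)
# [1] 4
class(his_name_df$name)
# [1] "character"
```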

Dataframes and Tibbles

  • You can also collect separate vectors into a dataframe/tibble if they are all the same length
name <- c('John', 'Jacob', 'Jingleheimer Schmidt')
ideo <- c(1, 4, 7)
sex  <- c('Male', 'Male', 'Male')

tibble(name,ideo,sex)
# A tibble: 3 × 3
  name                  ideo sex  
  <chr>                <dbl> <chr>
1 John                     1 Male 
2 Jacob                    4 Male 
3 Jingleheimer Schmidt     7 Male 

Making Data Usable

  • The data we get is often not usable right off the bat
  • For instance, if we want to make a new variable, we can use the mutate function to accomplish this
#calling the data from above "his name"
his_name %>% 
  mutate(ideo_cat = c('Very Conservative',
                      'Moderate',
                      'Very Liberal')) 
# A tibble: 3 × 4
  name                  ideo sex   ideo_cat         
  <chr>                <dbl> <chr> <chr>            
1 John                     1 Male  Very Conservative
2 Jacob                    4 Male  Moderate         
3 Jingleheimer Schmidt     7 Male  Very Liberal     
  • Note the operator %>%: this is a pipe, and you can insert it with Ctrl (or Cmd on a Mac) + Shift + M

Mutate and case_when

  • The code above only works because we knew exactly how many rows the data had and what order they were in, which is unrealistic. Instead, we can create a new variable conditional on the value of another using case_when() with mutate() (think of it as a glorified if-then)
his_name %>% 
  mutate(ideo_cat = case_when(ideo == 1 ~ 'Very Conservative',
                              ideo == 4 ~ 'Moderate',
                              ideo == 7 ~ 'Very Liberal')) -> his_name_2

his_name_2
# A tibble: 3 × 4
  name                  ideo sex   ideo_cat         
  <chr>                <dbl> <chr> <chr>            
1 John                     1 Male  Very Conservative
2 Jacob                    4 Male  Moderate         
3 Jingleheimer Schmidt     7 Male  Very Liberal     
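
case_when() is not limited to exact matches. The sketch below uses logical operators to bin a made-up 1 to 7 scale into three groups; the data, the cutoffs, and the object names are assumptions for illustration, and it assumes dplyr (part of the tidyverse) is installed.

```r
library(dplyr)

# made-up respondents on a 1-7 ideology scale (1 = most conservative, as above)
respondents <- tibble(id   = 1:5,
                      ideo = c(1, 2, 4, 6, 7))

# conditions are checked top to bottom; the first one that is TRUE wins
respondents %>%
  mutate(ideo_cat = case_when(ideo <= 2 ~ "Conservative",
                              ideo <= 5 ~ "Moderate",
                              ideo >= 6 ~ "Liberal")) -> respondents_cat

respondents_cat$ideo_cat
# [1] "Conservative" "Conservative" "Moderate"     "Liberal"      "Liberal"
```

Any row that matches none of the conditions gets NA in the new variable.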

Filter

  • If we want a subset of the data, we can use the filter function
his_name_2 %>% 
  filter(ideo_cat == 'Very Liberal')
# A tibble: 1 × 4
  name                  ideo sex   ideo_cat    
  <chr>                <dbl> <chr> <chr>       
1 Jingleheimer Schmidt     7 Male  Very Liberal
  • Note that filter() keeps the rows that meet the condition you specify; it does not remove them.
  • To remove those rows instead, we could do the following:
his_name_2 %>% 
  filter(ideo_cat != 'Very Liberal')
# A tibble: 2 × 4
  name   ideo sex   ideo_cat         
  <chr> <dbl> <chr> <chr>            
1 John      1 Male  Very Conservative
2 Jacob     4 Male  Moderate         
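
filter() also accepts conditions combined with & and |. A minimal sketch with made-up data (the voters tibble is an assumption for illustration, and dplyr is assumed to be installed):

```r
library(dplyr)

# made-up data for illustration
voters <- tibble(name = c("A", "B", "C", "D"),
                 ideo = c(1, 4, 7, 6),
                 sex  = c("Male", "Female", "Female", "Male"))

# both conditions must hold (and)
voters %>%
  filter(ideo > 4 & sex == "Female") -> high_ideo_women

# either condition may hold (or)
voters %>%
  filter(ideo == 1 | ideo == 7) -> scale_ends

high_ideo_women$name
# [1] "C"
scale_ends$name
# [1] "A" "C"
```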

Variable Names

  • Many of the data sets we want to use in our research come in terrible condition, and this is most evident in the naming of variables
tibble('X1' = c(123,124,125,126),
       'X2' = c(1,2,4,7),
       'X3' = c('abortion', 'health care', 'guns', 'police')) -> data
  • What does each of these variables measure?

Variable Names

  • When renaming variables, it is important to consider two things:
    • Is the name informative?
    • Is the name easy to work with?
data %>% 
  rename('id'    = X1,
         'ideo'  = X2,
         'issue' = X3) 
# A tibble: 4 × 3
     id  ideo issue      
  <dbl> <dbl> <chr>      
1   123     1 abortion   
2   124     2 health care
3   125     4 guns       
4   126     7 police     

Variable Names

  • Other data will have informative names that are too hard to work with, such as:
tibble('ID Number' = c(123,124,125,126),
       'Ideology Numeric' = c(1,2,4,7),
       'Most Important Issue' = c('abortion', 'health care', 'guns', 'police')) -> data
  • We can use the clean_names() function from the janitor package to fix this
    • Note that you will have to install janitor first (pop quiz: how would you do that?)
Answer:
install.packages('janitor')

Variable Names

  • Fixing the names can be done this way
  • Note that we can use :: to call one function from a package
data %>% 
  janitor::clean_names() -> data

data
# A tibble: 4 × 3
  id_number ideology_numeric most_important_issue
      <dbl>            <dbl> <chr>               
1       123                1 abortion            
2       124                2 health care         
3       125                4 guns                
4       126                7 police              
  • Without janitor:: in front, and without the package loaded, the code will not work
data %>%
  clean_names()
Error in clean_names(.): could not find function "clean_names"

Select and Slice

  • Data also usually have variables we don’t want
  • We can use the select() function with either the variable’s name or position
data %>% 
  select(most_important_issue)
# A tibble: 4 × 1
  most_important_issue
  <chr>               
1 abortion            
2 health care         
3 guns                
4 police              
data %>% 
  select(3)
# A tibble: 4 × 1
  most_important_issue
  <chr>               
1 abortion            
2 health care         
3 guns                
4 police              

Select and Slice

  • We can do the same thing for rows using the slice() function
  • Say we have data where the first row of the data is the names of the variables
tibble(country = c('country', 'USA', 'China', 'Germany'),
       wars    = c('wars', 2, 4, 5),
       pres    = c('pres', 1,0,1),
       par     = c('par', 'Congress', 'None', 'Parliament')) -> country

country
# A tibble: 4 × 4
  country wars  pres  par       
  <chr>   <chr> <chr> <chr>     
1 country wars  pres  par       
2 USA     2     1     Congress  
3 China   4     0     None      
4 Germany 5     1     Parliament

Select and Slice

  • Using the slice function, we can remove that row
country %>% 
  slice(-1) -> country
  • Note that the minus sign removes the row; the same trick removes columns in select()
country %>% 
  select(-wars)
# A tibble: 3 × 3
  country pres  par       
  <chr>   <chr> <chr>     
1 USA     1     Congress  
2 China   0     None      
3 Germany 1     Parliament
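
Both functions can take more than one row or column at a time. A short sketch with made-up data (toy and the other object names are assumptions for illustration; dplyr is assumed to be installed):

```r
library(dplyr)

# made-up data for illustration
toy <- tibble(country = c("USA", "China", "Germany", "Chile"),
              wars    = c(2, 4, 5, 1),
              pres    = c(1, 0, 1, 1))

# keep rows 1 and 2
toy %>% slice(1:2) -> first_two

# drop rows 1 and 4
toy %>% slice(-c(1, 4)) -> middle_rows

# keep two columns by name
toy %>% select(country, pres) -> two_cols

# drop two columns by name
toy %>% select(-c(wars, pres)) -> one_col

first_two$country
# [1] "USA"   "China"
middle_rows$country
# [1] "China"   "Germany"
names(two_cols)
# [1] "country" "pres"
names(one_col)
# [1] "country"
```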

More on Cleaning

  • One of the last steps for cleaning data is making sure it is in the proper format
  • Look at the variable types in our sample data
country
# A tibble: 3 × 4
  country wars  pres  par       
  <chr>   <chr> <chr> <chr>     
1 USA     2     1     Congress  
2 China   4     0     None      
3 Germany 5     1     Parliament
  • The variables wars and pres are character variables, when they need to be numeric
  • Using all of the functions we previously learned, we can fix this issue

Practice with Data Cleaning

  • In R, make the same data I’m using with the following code:
tibble(country = c('country', 'USA', 'China', 'Germany'),
       wars    = c('wars', 2, 4, 5),
       pres    = c('pres', 1,0,1),
       par     = c('par', 'Congress', 'None', 'Parliament')) -> country
  • Once the data is made, use slice(), parse_number() and mutate() to clean the data
Answer:
country %>% 
  slice(-1) %>% 
  mutate(wars = parse_number(wars),
         pres = parse_number(pres))
# A tibble: 3 × 4
  country  wars  pres par       
  <chr>   <dbl> <dbl> <chr>     
1 USA         2     1 Congress  
2 China       4     0 None      
3 Germany     5     1 Parliament

Reading data into R

  • You usually won’t make your own data; instead, the data you want will be in some file, typically a csv
  • We can read in csvs like this
library(tidyverse)
#using base R
anes <- read.csv('anes.csv')

#using tidyverse
anes <- read_csv('anes.csv')
Note that running this code back to back will result in only the second version of `anes` remaining in your global environment.
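
If you do not have a csv handy, you can try the round trip yourself. This sketch writes a tiny made-up data set to a temporary file and reads it back with base R; the file and its contents are assumptions for illustration.

```r
# write a tiny made-up data set to a temporary csv file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = c(123, 124), issue = c("guns", "police")),
          tmp, row.names = FALSE)

# read it back in with base R
toy_csv <- read.csv(tmp)

class(toy_csv)
# [1] "data.frame"
names(toy_csv)
# [1] "id"    "issue"
nrow(toy_csv)
# [1] 2
```

Reading the same file with read_csv() would give you a tibble instead of a plain data frame.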

Reading data into R

  • The code above only works if the file you’re trying to load in is already in your working directory
  • There are a few ways of dealing with this
    • The setwd() function allows you to manually set your working directory
setwd('C:/Users/Austin/path/to/file')

anes <- read.csv('anes.csv')
Note the forward slashes, and that the path is passed as a string. This example is on Windows.
  • We can work in R projects (don’t have time for specifics here)
  • We can specify the path inside the code
library(tidyverse)

anes <- read_csv('C:/Users/Austin/path/to/file/anes.csv')
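
When a file will not load, it usually helps to check where R is looking. A minimal sketch using base R; the folder name "data" and the file name are placeholders, not real files.

```r
# where is R currently looking for files?
getwd()

# file.path() builds a path with forward slashes on any operating system
# ("data" and "anes.csv" are placeholder names)
anes_path <- file.path("data", "anes.csv")
anes_path
# [1] "data/anes.csv"

# check that the file is really there before trying to read it
file.exists(anes_path)
```

If file.exists() returns FALSE, the path is wrong or the working directory is not what you expected.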

Practice

  • On the course website, under Day 1, download the Olympics data and do the following:

    1. Read the data into R using both the tidyverse (read_csv()) and base R (read.csv()) versions of the function, and note the differences
    2. Use select() so the data only has the country, winter, and summer variables.
    3. Rename the variables to something that makes sense
    4. Make a new variable that is the total number of medals:
    • hint: you will need to use rowwise() to sum within each row; I will put that on the board when everyone is ready
    5. Use filter() to show the results for only one country
    6. Use filter() to remove one country

Rowwise Code

    #remember to put the name of your own data here
    data %>% 
      rowwise() %>% 
      mutate()
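
To see what rowwise() changes, here is a sketch on made-up medal counts (the medals tibble is an assumption for illustration, not the Olympics file; dplyr is assumed to be installed).

```r
library(dplyr)

# made-up medal counts for illustration
medals <- tibble(country = c("A", "B", "C"),
                 summer  = c(10, 0, 3),
                 winter  = c(2, 5, 0))

# without rowwise(), sum() adds up both entire columns
medals %>%
  mutate(total = sum(summer, winter)) -> without_rowwise

# with rowwise(), sum() is computed one row at a time
medals %>%
  rowwise() %>%
  mutate(total = sum(summer, winter)) %>%
  ungroup() -> with_rowwise

without_rowwise$total
# [1] 20 20 20
with_rowwise$total
# [1] 12  5  3
```

The ungroup() at the end turns the row-by-row behavior back off so later steps work on whole columns again.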
library(tidyverse)

# 1. Reading in the data
read.csv('olympics.csv') -> olympics

# here I'm not saving the read_csv() version because I like how read.csv handles
# the variable names better
read_csv('olympics.csv')

# 2. keeping only the variables that we want
olympics %>% 
  select(X0, X1, X6) -> olympics

# 3. Renaming the variables to something that makes sense
olympics %>% 
  rename('country' = X0,
         'summer'  = X1,
         'winter'  = X6) -> olympics

# 4. Making new variable for the total
olympics %>% 
  slice(-1) %>% 
  rowwise() %>% 
  mutate(total = sum(parse_number(summer), parse_number(winter))) -> olympics

# 5. Filtering for only Germany
olympics %>% 
  filter(country == 'Germany')

# 6. Filtering to remove Chile
olympics %>% 
  filter(country != 'Chile')