So you’re About to be a Grad Student?

Day 1 of the RI workshop, Summer 2023

Austin Cutler

FSU

What’s this workshop?

This workshop is meant to be an introduction to using R
No prior knowledge is assumed
The goal is to make your time in POS 5737 a little easier
- R is fairly complicated to use, but we’re going to get as far as we can this week!

Introductions!

Austin
Buffalo, NY
Americanist
I came to FSU because my Master’s Thesis adviser said it was a good idea
I also coach high school track (hurdles)!

Ground Rules for the Class

Please ask questions!
Look into Stack Exchange
Use the help function in RStudio
Please don’t use ChatGPT…yet

Today’s Class:

Welcoming you all to grad school/some tips on how to make this whole thing manageable
What is R?
- The different parts of RStudio
- Do you have it all installed?
- How does it work?
Using R as a calculator
What are R scripts and functions?
What’s an object?
How to read data into R
What does that data look like?
What to do with the data once it’s in there?

Welcome to FSU!

This part of the class will be a little different from the rest. Here’s what we’ll cover:

What grad school is/is not
Expectations
How do you manage this

What Grad School Is/Isn’t

This is not undergrad 2.0
This is a lot of work
It is very rewarding, and you’ll likely have a lot more freedom than you’re used to

Expectations

Go to class (unless you actually can’t)

Participate when you do go to class
Actually do the reading, as much of it as you can do (it is a lot)

Go to department functions

Chris’ll say more about this at orientation, but ~3 hours of your contract is allotted to professionalization events

Fulfilling your GA and RA responsibilities

There is heterogeneity in what will be asked by faculty, but they’re all usually good about not using more of your time than they’re given
If there is a problem here, talk to Chris

Managing Grad School

Imposter syndrome is real; do your best not to let it get to you too much
FSU has therapy!
Work together!
- In most methods classes, working together on assignments is not only welcome but arguably necessary (just make sure to turn in your own work)
Keep a schedule
- Find what times are best for you to complete certain tasks and lean into that; maintaining a schedule is helpful for getting things done (How to Write a Lot)
Do things other than this
- Tallahassee is a mediumish-sized city, and there’s lots to do around town and at FSU

Now For Some R

Raise your hand if you do not have R/RStudio installed yet

Changing the RStudio Theme

Using R as a Calculator

1 + 1

[1] 2

1 + 3

[1] 4

100000000000000 * 21000000000000

[1] 2.1e+27

sqrt(400)

[1] 20

Practice with R as a calculator

Take a moment and perform the following calculations:

$9+10$
$9*10$
$9/10$
$9^2$
$\sqrt{32}$
$\log 9$
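For the last calculation in the list above, it helps to know that log() in R is the natural log unless you tell it otherwise. Here is a small sketch of the difference using only base R functions:

```r
# log() is the natural log (base e) by default
log(9)

# other bases are available through log10() or the base argument
log10(100)          # base 10
log(8, base = 2)    # base 2

# exp() undoes the natural log
exp(log(9))
```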

The Different Parts of RStudio

Object Oriented Computing

Scalars

b <- 2

b + 2

[1] 4

#you can use the # symbol to leave comments in your code
## assigning 2 to the letter b
b <- 2

## adding b and 2
b + 2

[1] 4

Obviously, just using 2+2 instead of assigning that value to b would be easier; this is just to demonstrate how objects work

Functions

# taking the square root of 400
sqrt(400)

[1] 20

# taking the square root of 400 and assigning it to an object
sq400 <- sqrt(400)

sq400

[1] 20

sqrt(sq400)

[1] 4.472136

Strings

# words are represented as character strings
'Hello World'

[1] "Hello World"

# understanding this is useful for understanding how data is structured
## strings cannot interact with numeric vectors
'Hello World' + 2

Error in "Hello World" + 2: non-numeric argument to binary operator

#the same is true even when the character vector is actually a number
'3' + 2

Error in "3" + 2: non-numeric argument to binary operator

#there are functions that we can use to fix this issue, however
as.numeric('3') + 2

[1] 5

#loading in a library (more on this later)
library(tidyverse)

#using parse_number to treat character as a number
parse_number('3') + 2

[1] 5

Strings

#the benefit of parse_number is that it will pull the number out of a string
parse_number('himothy316') + 2

[1] 318

#as.numeric, however, cannot do that
as.numeric('himothy316') + 2

[1] NA

This difference is useful in situations where you have numbers being used to label things in your data

parse_number('ideo2')

[1] 2

Vectors

Working with only a single number is not realistic
Instead, we can work with a collection of values. Let’s start with vectors.

There are a few different types of vectors:

numeric: contain only numeric values such as 1.1, 2, 100, etc.
character: contain only strings like we went over in the previous sections, such as "Republican", "Democratic", "Independent"
factor: these are character strings with a set order; think of a scale that goes "Very Liberal" to "Very Conservative" or "Complete Autocracy" to "Democracy"
logical: contains only TRUE or FALSE values
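If you are ever unsure which of these types you are holding, class() will tell you. A quick sketch with made-up values:

```r
# class() reports what type of vector you have
class(c(1.1, 2, 100))                          # "numeric"
class(c("Republican", "Democratic"))           # "character"
class(factor(c("Very Liberal", "Moderate")))   # "factor"
class(c(TRUE, FALSE))                          # "logical"

# mixing types in one vector quietly converts everything to character
class(c(1, "Republican", TRUE))                # "character"
```

The last line is worth remembering: a single stray word in a column of numbers is enough to turn the whole thing into character strings.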

Numeric Vectors

Here is an example of code you’d use to make a numeric vector

a <- c(5,10,15,100)

Just like before, you can simply run the name of the object to call the values

a

[1]   5  10  15 100

You can also apply functions to vectors just like with scalars

a/5

[1]  1  2  3 20

exp(a)

[1] 1.484132e+02 2.202647e+04 3.269017e+06 2.688117e+43

mean(a) #congrats, you're now doing statistics

[1] 32.5

Character Vectors

p <- c("Republican", "Democrat", "Republican", "Independent")

p

[1] "Republican"  "Democrat"    "Republican"  "Independent"

Just like with individual strings, you can’t apply numeric functions to character vectors

p*3

Error in p * 3: non-numeric argument to binary operator

Brief note: within a string, “escapes” can change the spacing of the text.
- You use \ to apply escapes; \n is a new line and \t is a tab, for instance

cat("Repu\nblican")

Repu
blican

cat("Demo\tcrat")

Demo    crat

Factor Vectors

p1 <- c("Republican", "Republican", "Democrat", "Other")
class(p1)

[1] "character"

table(p1)

p1
  Democrat      Other Republican 
         1          1          2

Problem: The ordering is weird, and independents are missing

#adding levels to p1
pid <- factor(p1, levels = c("Republican",
                              "Independent",
                              "Democrat",
                              "Other"))

Factor Vectors

Now we can check our work and see that both our problems are solved

#checking our work
class(pid)

[1] "factor"

table(pid)

pid
 Republican Independent    Democrat       Other 
          2           0           1           1
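One thing to watch out for when setting levels by hand: any value you leave out of levels is turned into NA. A short sketch, reusing p1 from above (pid_short is just a name made up for this example):

```r
p1 <- c("Republican", "Republican", "Democrat", "Other")

# levels() shows the categories in the order they were given
pid <- factor(p1, levels = c("Republican", "Independent", "Democrat", "Other"))
levels(pid)

# caution: "Other" is not listed here, so it becomes NA
pid_short <- factor(p1, levels = c("Republican", "Democrat"))
pid_short
is.na(pid_short)
```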

Practice with Vectors

For all of the following, include comments throughout your code

Numeric:

Make a new vector, named my_vector with the values 2,6,4,3,5,17.
Find the sum of the vector (use the sum() function)
Find the square root (sqrt()) of the elements of the vector

Character:

Create a vector named sex with the following elements: Male, Female, Male, Male, and Female
Create a vector ideology with the following observations: Liberal, Moderate, Moderate, Conservative, and Liberal

Factor:

Take the character vector sex and add levels. Assign the levels such that the order is Male then Female.
Take the ideology vector from before and order them in a way that makes sense.

Missing Values

Missing values are denoted as NA
Functions handle these differently; some just pass them through, returning NA for that element

x <- c(1,4,7,NA,2)
log(x)

[1] 0.0000000 1.3862944 1.9459101        NA 0.6931472

Some will return NA for the whole result if any value is missing

sum(x)

[1] NA

Unless you change some of the arguments in the function

sum(x, na.rm = TRUE)

[1] 14
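It is often useful to find out how many values are missing before deciding what to do about them. A minimal sketch using the same x and only base R functions:

```r
x <- c(1, 4, 7, NA, 2)

# is.na() flags which elements are missing
is.na(x)

# TRUE counts as 1, so summing the flags counts the missing values
sum(is.na(x))

# mean() behaves like sum(): NA unless na.rm = TRUE
mean(x)
mean(x, na.rm = TRUE)
```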

Logical Operators

Logical operators are used in R to test if certain conditions hold

Operator	Syntax
“less than”	`<`
“less than or equal to”	`<=`
“exactly equal to”	`==`
“greater than or equal to”	`>=`
“greater than”	`>`
“not equal to”	`!=`
“or”	`|`
“and”	`&`
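To see the operators in the table above in action, here is a small sketch on a made-up numeric vector. Each test is applied to every element and returns TRUE or FALSE:

```r
ideo <- c(1, 4, 7)

# each comparison is checked element by element
ideo > 4
ideo == 4
ideo != 4

# & requires both conditions to hold, | requires at least one
ideo >= 2 & ideo <= 6
ideo == 1 | ideo == 7

# TRUE counts as 1, so sum() tells you how many elements pass the test
sum(ideo >= 4)
```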

Dataframes and Tibbles

The data you work with will largely be organized into objects known as data frames or tibbles (functionally identical for our purposes)
Think of a data frame/tibble as a collection of vectors
The same logic from before with vectors can be applied here
Below is how you would make a dataframe

data.frame('name' = c('John', 'Jacob', 'Jingleheimer Schmidt'),
           'ideo' = c(1, 4, 7),
           'sex'  = c('Male', 'Male', 'Male')) -> his_name

Dataframes and Tibbles

Reminder to run the following in your console to install the tidyverse package

install.packages('tidyverse')

The tidyverse equivalent to a data.frame

library(tidyverse)

tibble('name' = c('John', 'Jacob', 'Jingleheimer Schmidt'),
       'ideo' = c(1, 4, 7),
       'sex'  = c('Male', 'Male', 'Male')) -> his_name

If you want to call a specific column from a dataframe, you can use the $ operator, which will give you that column as a vector

his_name$name

[1] "John"                 "Jacob"                "Jingleheimer Schmidt"

his_name$ideo/2

[1] 0.5 2.0 3.5

Dataframes and Tibbles

You can also collect separate vectors into a dataframe/tibble if they are all the same length

name <- c('John', 'Jacob', 'Jingleheimer Schmidt')
ideo <- c(1, 4, 7)
sex  <- c('Male', 'Male', 'Male')

tibble(name,ideo,sex)

# A tibble: 3 × 3
  name                  ideo sex  
  <chr>                <dbl> <chr>
1 John                     1 Male 
2 Jacob                    4 Male 
3 Jingleheimer Schmidt     7 Male
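If the vectors are not the same length, R will refuse to combine them. Here is a sketch of what that looks like with base R's data.frame(); ideo_short is a hypothetical vector that is one value too short, and try() is only there so the script keeps running after the error:

```r
name <- c('John', 'Jacob', 'Jingleheimer Schmidt')
ideo_short <- c(1, 4)   # hypothetical: only two values for three names

# length() is a quick check before combining vectors
length(name)
length(ideo_short)

# try() captures the error instead of stopping the script
result <- try(data.frame(name, ideo_short), silent = TRUE)
class(result)
```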

Making Data Usable

The data we get is often not usable right off the bat
For instance, if we want to make a new variable, we can use the mutate function to accomplish this

#calling the data from above "his name"
his_name %>% 
  mutate(ideo_cat = c('Very Conservative',
                      'Moderate',
                      'Very Liberal'))

# A tibble: 3 × 4
  name                  ideo sex   ideo_cat         
  <chr>                <dbl> <chr> <chr>            
1 John                     1 Male  Very Conservative
2 Jacob                    4 Male  Moderate         
3 Jingleheimer Schmidt     7 Male  Very Liberal

Note the operator %>%: this is a pipe, which you can insert using Ctrl (or Command) + Shift + M
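If the pipe feels mysterious, it may help to see that it just hands whatever is on its left to the function on its right. A small sketch with a made-up vector:

```r
library(tidyverse)

x <- c(1, 4, 7)

# these two lines do the same thing
mean(x)
x %>% mean()

# pipes pay off when you chain steps: read top to bottom
x %>%
  sqrt() %>%
  sum() %>%
  round(2)

# the same calculation without pipes has to be read inside-out
round(sum(sqrt(x)), 2)
```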

Mutate and case_when

The code above only works because we knew the exact dimensions of the data, which is unrealistic. We can create a new variable conditional on the value of another using case_when() with mutate() (think of it as a glorified if-then)

his_name %>% 
  mutate(ideo_cat = case_when(ideo == 1 ~ 'Very Conservative',
                              ideo == 4 ~ 'Moderate',
                              ideo == 7 ~ 'Very Liberal')) -> his_name_2

his_name_2

# A tibble: 3 × 4
  name                  ideo sex   ideo_cat         
  <chr>                <dbl> <chr> <chr>            
1 John                     1 Male  Very Conservative
2 Jacob                    4 Male  Moderate         
3 Jingleheimer Schmidt     7 Male  Very Liberal
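Real data rarely has just three tidy values, so conditions usually cover ranges. The sketch below uses made-up respondents on the same 1 to 7 scale, and the cut points are arbitrary choices for illustration. Conditions are checked in order and the first match wins; the final TRUE line catches anything left over, including missing values:

```r
library(tidyverse)

# hypothetical respondents, including one missing value
tibble(ideo = c(1, 2, 4, 6, 7, NA)) %>%
  mutate(ideo_cat = case_when(ideo <= 2 ~ 'Conservative',
                              ideo <= 5 ~ 'Moderate',
                              ideo <= 7 ~ 'Liberal',
                              TRUE      ~ 'Unknown')) -> ideo_example

ideo_example
```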

Filter

If we want a subset of the data, we can use the filter function

his_name_2 %>% 
  filter(ideo_cat == 'Very Liberal')

# A tibble: 1 × 4
  name                  ideo sex   ideo_cat    
  <chr>                <dbl> <chr> <chr>       
1 Jingleheimer Schmidt     7 Male  Very Liberal

Note that filter() keeps the rows that match the condition you specify; it does not remove them.
To remove the matching rows instead, we could do the following:

his_name_2 %>% 
  filter(ideo_cat != 'Very Liberal')

# A tibble: 2 × 4
  name   ideo sex   ideo_cat         
  <chr> <dbl> <chr> <chr>            
1 John      1 Male  Very Conservative
2 Jacob     4 Male  Moderate
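filter() also accepts the logical operators from earlier, so several conditions can be combined. A sketch that rebuilds a small version of his_name_2 so it runs on its own:

```r
library(tidyverse)

his_name_2 <- tibble(name     = c('John', 'Jacob', 'Jingleheimer Schmidt'),
                     ideo     = c(1, 4, 7),
                     ideo_cat = c('Very Conservative', 'Moderate', 'Very Liberal'))

# keep rows that satisfy both conditions
his_name_2 %>%
  filter(ideo >= 4 & name != 'Jacob')

# keep rows that satisfy at least one condition
his_name_2 %>%
  filter(ideo_cat == 'Very Liberal' | ideo_cat == 'Very Conservative')

# %in% is shorthand for several "or" comparisons on the same column
his_name_2 %>%
  filter(ideo_cat %in% c('Very Liberal', 'Very Conservative'))
```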

Variable Names

Many of the data sets we want to use in our research come in terrible condition, and this is most evident in the naming of variables

tibble('X1' = c(123,124,125,126),
       'X2' = c(1,2,4,7),
       'X3' = c('abortion', 'health care', 'guns', 'police')) -> data

What does each of these variables represent?

Variable Names

When renaming variables, it is important to consider two things:
- Is the name informative?
- Is the name easy to work with?

data %>% 
  rename('id'    = X1,
         'ideo'  = X2,
         'issue' = X3)

# A tibble: 4 × 3
     id  ideo issue      
  <dbl> <dbl> <chr>      
1   123     1 abortion   
2   124     2 health care
3   125     4 guns       
4   126     7 police

Variable Names

Other data will have informative names that are too hard to work with, such as:

tibble('ID Number' = c(123,124,125,126),
       'Ideology Numeric' = c(1,2,4,7),
       'Most Important Issue' = c('abortion', 'health care', 'guns', 'police')) -> data

We can use the clean_names() function from the janitor package to fix this
- Note that you will have to install janitor first using the following (pop quiz, how would I do that?)

Show Answer

install.packages('janitor')

Variable Names

Fixing the names can be done this way
Note that we can use :: to call one function from a package

data %>% 
  janitor::clean_names() -> data

data

# A tibble: 4 × 3
  id_number ideology_numeric most_important_issue
      <dbl>            <dbl> <chr>               
1       123                1 abortion            
2       124                2 health care         
3       125                4 guns                
4       126                7 police

Without janitor:: in front, the code will not work unless the package has been loaded with library()

data %>%
  clean_names()

Error in clean_names(.): could not find function "clean_names"

Select and Slice

Data also usually have variables we don’t want
We can use the select() function with either the variable’s name or position

data %>% 
  select(most_important_issue)

# A tibble: 4 × 1
  most_important_issue
  <chr>               
1 abortion            
2 health care         
3 guns                
4 police

data %>% 
  select(3)

# A tibble: 4 × 1
  most_important_issue
  <chr>               
1 abortion            
2 health care         
3 guns                
4 police

Select and Slice

We can do the same thing for rows using the slice() function
Say we have data where the first row of the data is the names of the variables

tibble(country = c('country', 'USA', 'China', 'Germany'),
       wars    = c('wars', 2, 4, 5),
       pres    = c('pres', 1,0,1),
       par     = c('par', 'Congress', 'None', 'Parliament')) -> country

country

# A tibble: 4 × 4
  country wars  pres  par       
  <chr>   <chr> <chr> <chr>     
1 country wars  pres  par       
2 USA     2     1     Congress  
3 China   4     0     None      
4 Germany 5     1     Parliament

Select and Slice

Using the slice function, we can remove that row

country %>% 
  slice(-1) -> country

Note that using the minus sign will remove the row; the same approach works for columns with select()

country %>% 
  select(-wars)

# A tibble: 3 × 3
  country pres  par       
  <chr>   <chr> <chr>     
1 USA     1     Congress  
2 China   0     None      
3 Germany 1     Parliament
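Both functions can handle more than one column or row at a time. A sketch that rebuilds the cleaned country data so it runs on its own:

```r
library(tidyverse)

country <- tibble(country = c('USA', 'China', 'Germany'),
                  wars    = c('2', '4', '5'),
                  pres    = c('1', '0', '1'),
                  par     = c('Congress', 'None', 'Parliament'))

# keep several columns at once
country %>% select(country, par)

# drop several columns at once
country %>% select(-wars, -pres)

# keep rows 1 through 2
country %>% slice(1:2)

# drop rows 1 and 3
country %>% slice(-c(1, 3))
```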

More on Cleaning

One of the last steps for cleaning data is making sure it is in the proper format
Look at the variable types in our sample data

country

# A tibble: 3 × 4
  country wars  pres  par       
  <chr>   <chr> <chr> <chr>     
1 USA     2     1     Congress  
2 China   4     0     None      
3 Germany 5     1     Parliament

The variables wars and pres are character variables when they need to be numeric
Using all of the functions we previously learned, we can fix this issue

Practice with Data Cleaning

In R, make the same data I’m using with the following code:

tibble(country = c('country', 'USA', 'China', 'Germany'),
       wars    = c('wars', 2, 4, 5),
       pres    = c('pres', 1,0,1),
       par     = c('par', 'Congress', 'None', 'Parliament')) -> country

Once the data is made, use slice(), parse_number() and mutate() to clean the data

Show Answer

country %>% 
  slice(-1) %>% 
  mutate(wars = parse_number(wars),
         pres = parse_number(pres))

# A tibble: 3 × 4
  country  wars  pres par       
  <chr>   <dbl> <dbl> <chr>     
1 USA         2     1 Congress  
2 China       4     0 None      
3 Germany     5     1 Parliament

Reading data into R

You usually won’t make your own data; instead, the data you want will be in a file, typically a csv
We can read in csvs like this

library(tidyverse)
#using base R
anes <- read.csv('anes.csv')

#using tidyverse
anes <- read_csv('anes.csv')

Note that running this code back to back will result in only the second version of `anes` remaining in your global environment.
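The two functions do not return quite the same thing. The sketch below writes a tiny throwaway csv to a temporary file (so it does not depend on anes.csv, and the column names are made up) and reads it both ways:

```r
library(tidyverse)

# write a small temporary csv with spaces in the column names
tmp <- tempfile(fileext = '.csv')
writeLines(c('ID Number,Ideology Numeric',
             '123,1',
             '124,2'), tmp)

base_version <- read.csv(tmp)
tidy_version <- read_csv(tmp, show_col_types = FALSE)

# read.csv() returns a plain data.frame, read_csv() returns a tibble
class(base_version)
class(tidy_version)

# read.csv() replaces the spaces in names with dots, read_csv() keeps them
names(base_version)
names(tidy_version)
```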

Reading data into R

The code above only works if the file you’re trying to load in is already in your working directory
There are a few ways of dealing with this
- The setwd() function allows you to manually set your working directory

setwd('C:/Users/Austin/path/to/file')

anes <- read.csv('anes.csv')

Note the forward slashes and that the path is written as a string; this example is on Windows

We can work in R projects (don’t have time for specifics here)
We can specify the path inside the code

library(tidyverse)

anes <- read_csv('C:/Users/Austin/path/to/file/anes.csv')
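When a file will not load, the first thing to check is where R is actually looking. A minimal sketch using base R helpers; the 'data' folder is a hypothetical name used only for illustration:

```r
# where is R currently looking for files?
getwd()

# list the files in that folder
list.files()

# file.path() joins the pieces of a path with forward slashes
# ('data' is a hypothetical folder name)
path <- file.path('data', 'anes.csv')
path

# check that the file exists before trying to read it
file.exists(path)
```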

Practice

On the course website, under Day 1, download the Olympics data and do the following:
1. Read the data into R using both the tidyverse (read_csv()) and base R (read.csv()) versions of the function, and note the differences
2. Use select() so the data only has the country, winter, and summer variables.
3. Rename the variables to something that makes sense
4. Make a new variable that is the total number of medals:
- hint: you will need to use rowwise() to sum within each row; I will put that on the board when everyone is ready
5. Use filter() to show the results for only one country
6. Use filter() to remove one country

Rowwise Code

    #remember to put the name of your own data here
    data %>% 
      rowwise() %>% 
      mutate()
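To see why rowwise() matters, here is a sketch on made-up medal counts (not the real Olympics data). Without rowwise(), sum() adds up both whole columns into a single number; with it, the sum is computed one row at a time:

```r
library(tidyverse)

# made-up medal counts for three hypothetical countries
medals <- tibble(country = c('A', 'B', 'C'),
                 summer  = c(10, 0, 3),
                 winter  = c(2, 5, 0))

# without rowwise(): every row gets the grand total of both columns
medals %>%
  mutate(total = sum(summer, winter)) -> no_rowwise

# with rowwise(): each row gets its own total
medals %>%
  rowwise() %>%
  mutate(total = sum(summer, winter)) %>%
  ungroup() -> with_rowwise

no_rowwise
with_rowwise
```

The ungroup() at the end simply switches the row-by-row behavior back off so later steps work on the whole data again.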

library(tidyverse)

# 1. Reading in the data
read.csv('olympics.csv') -> olympics

# here I'm not saving the read_csv() version because I like how read.csv() handles
# the variable names better
read_csv('olympics.csv')

# 2. keeping only the variables that we want
olympics %>% 
  select(X0, X1, X6) -> olympics

# 3. Renaming the variables to something that makes sense
olympics %>% 
  rename('country' = X0,
         'summer'  = X1,
         'winter'  = X6) -> olympics

# 4. Making new variable for the total
olympics %>% 
  slice(-1) %>% 
  rowwise() %>% 
  mutate(total = sum(parse_number(summer), parse_number(winter))) -> olympics

# 5. Filtering for only Germany
olympics %>% 
  filter(country == 'Germany')

# 6. Filtering to remove Chile
olympics %>% 
  filter(country != 'Chile')