Reading Data into R and Aggregating Data

Day 2 of the RI workshop, Summer 2024

Austin Cutler

FSU

Class Today

  • Going over the HW
  • Different types of data and getting them into R
  • Reinforcing what we’ve learned so far
    • Different types of data
    • Functions used to clean data
    • How we might want to clean data
  • Learning more advanced techniques for cleaning data
    • grouping data using group_by()
    • creating summary tables using summarize()
  • Learning how to summarize data quickly
  • How to call a single variable
  • Lists

Questions about the homework?

HW Answers

library(tidyverse)

#load in the data
read_csv('class_toy_data.csv') -> data

#this is an easier way to accomplish what we need
#read_csv('class_toy_data.csv', skip = 1) -> data

#using rename instead
## first we remove the unnecessary row and variables
## then we use rename
data %>% 
  slice(-1) %>% 
  select(-c(x1,x11,x12,x13,x14)) %>% 
  rename('age'     = x2,
         'gender'  = x3,
         'educ'    = x4,
         'income'  = x5,
         'app_rat' = x6,
         'pid'     = x7,
         'opp_tex' = x8,
         'gun_opp' = x9,
         'abo_opp' = x10) -> data

# now lets get the data formatted properly
data %>% 
  mutate(age     = parse_number(age),
         income  = parse_number(income),
         app_rat = parse_number(app_rat),
         gun_opp = parse_number(gun_opp),
         abo_opp = parse_number(abo_opp)) -> data

# changing the negative value from approval to NA
data %>% 
  mutate(app_rat = case_when(app_rat < 0 ~ NA_real_,
                             TRUE ~ app_rat)) -> data

# keeping only republicans
data %>% 
  filter(pid == 'Republican') -> data

# saving the data
#write_csv(data, 'cutler_hw1_data')

# note that this could all technically be done in one big chunk
data %>% 
  slice(-1) %>% 
  select(-c(x1,x11,x12,x13,x14)) %>% 
  rename('age'     = x2,
         'gender'  = x3,
         'educ'    = x4,
         'income'  = x5,
         'app_rat' = x6,
         'pid'     = x7,
         'opp_tex' = x8,
         'gun_opp' = x9,
         'abo_opp' = x10) %>% 
  mutate(age     = parse_number(age),
         income  = parse_number(income),
         app_rat = parse_number(app_rat),
         gun_opp = parse_number(gun_opp),
         abo_opp = parse_number(abo_opp),
         app_rat = case_when(app_rat < 0 ~ NA_real_,
                             TRUE ~ app_rat)) %>% 
  filter(pid == 'Republican')

#alternative ways to accomplish getting the data in the right format:
read_csv('class_toy_data.csv', skip = 1)

#or if we hadn't done that
data %>% 
  mutate(across(c(age,income,app_rat,gun_opp,abo_opp), ~parse_number(.)))

#or we could even just
data %>% 
  mutate(across(where(is.numeric), parse_number))

Brief review

  1. What are the types of data we can encounter?

    Show Answer

    Numeric, string/character, factor, logical

  2. What is an object in R and where are they shown?

    Show Answer

    Created data, can be either a vector or data frame; global environment

  3. What is this %>% called?

    Show Answer

    A “Pipe”

  4. What functions did we learn about last class to clean data?

    Show Answer

    mutate(), select(), slice(), filter(), parse_number()

  5. What functions can be used to rename variables?

    Show Answer

    rename(), clean_names()

Other forms of Data

  • Data can come in forms other than a csv
  • Below are some common data formats
Data Format Function
.rds readRDS()
.RDATA load()
.txt' read.delim()
.dta read_dta()
.sav read_sav()
  • Note that the functions read_dta() and read_sav() come from the haven package

Cleaning Data

  • The main movers for cleaning data are mutate(), select(), and filter()
  • Functions can be used within other functions, and case_when() is very useful for shaping data when combined with mutate()

Cleaning Data Practice

library(tidyverse)

tibble("Full Name" = c("John Smith", "Jimmy Dean", "Robert Williams", 
                       "Emily Davis", "Michael Brown"),
  "Political Affiliation" = c("Democratic", "Republican", NA, "Democratic",
                              "Libertarian"),
  "Represented State" = c("California", "Texas", "New York", NA, "Florida"),
  "Politician Age" = c(45, 65, 60, 41, 20),
  "Years Served" = c(6, NA, 2, 4, 12),
  "Votes Received" = c(24000, NA, 15000, 20000, 32000),
  "Legislation Passed" = c(12, 10, NA, 6, 15)) -> real_congress

real_congress
# A tibble: 5 × 7
  `Full Name`     `Political Affiliation` `Represented State` `Politician Age`
  <chr>           <chr>                   <chr>                          <dbl>
1 John Smith      Democratic              California                        45
2 Jimmy Dean      Republican              Texas                             65
3 Robert Williams <NA>                    New York                          60
4 Emily Davis     Democratic              <NA>                              41
5 Michael Brown   Libertarian             Florida                           20
# ℹ 3 more variables: `Years Served` <dbl>, `Votes Received` <dbl>,
#   `Legislation Passed` <dbl>

Cleaning Data Practice

real_congress %>% 
  janitor::clean_names() %>% 
  rename('party' = political_affiliation,
         'state' = represented_state,
         'age' = politician_age) %>% 
  filter(!is.na(party)) %>% 
  mutate(age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+")) %>% 
  select(full_name,party,state,age_cat) -> real_congress_2

real_congress_2
# A tibble: 4 × 4
  full_name     party       state      age_cat
  <chr>         <chr>       <chr>      <chr>  
1 John Smith    Democratic  California 30-60  
2 Jimmy Dean    Republican  Texas      60+    
3 Emily Davis   Democratic  <NA>       30-60  
4 Michael Brown Libertarian Florida    <30    

Live Coding

Now that we’ve covered the basics again, let’s go through a problem together, download the Live Coding 1 data on the course materials page

Live Coding

As a group, we are going to:

  1. Rename all the variables
  2. Make Categorical Variables for age, experience, legislator’s activeness using committee membership and bill sponsorship, and electoral outcome (let’s assume each district has 60,000 voters)
  3. Remove NAs from any variables where there are NAs
  4. Make the data set only have their name, party, years served, bills sponsored, and the categorical variables from above

Probably a break here (?)

Workshop Part 2

Aggregating Data

  • In some applications, it is useful to get aggregate level information about our data
  • We can use group_by() and summarize() to accomplish this
  • group_by works similarly to row_wise() from the homework, let’s start there

Aggregating Data

real_congress %>% 
  janitor::clean_names() %>% 
  rename('party' = political_affiliation,
         'state' = represented_state,
         'age' = politician_age) %>% 
  filter(!is.na(party)) -> real_congress

real_congress
# A tibble: 4 × 7
  full_name     party state   age years_served votes_received legislation_passed
  <chr>         <chr> <chr> <dbl>        <dbl>          <dbl>              <dbl>
1 John Smith    Demo… Cali…    45            6          24000                 12
2 Jimmy Dean    Repu… Texas    65           NA             NA                 10
3 Emily Davis   Demo… <NA>     41            4          20000                  6
4 Michael Brown Libe… Flor…    20           12          32000                 15
  • What are some ways we’d be interested in grouping this data?

Aggregating Data

  • Finding group numbers
real_congress %>% 
  group_by(party) %>% 
  summarize(sample = n())
# A tibble: 3 × 2
  party       sample
  <chr>        <int>
1 Democratic       2
2 Libertarian      1
3 Republican       1
  • What do you notice about the data produced here?

Aggregating Data

real_congress %>% 
  mutate(age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+")) %>% 
  group_by(age_cat) %>% 
  summarize(n())
# A tibble: 3 × 2
  age_cat `n()`
  <chr>   <int>
1 30-60       2
2 60+         1
3 <30         1

Aggregating Data

  • We can also use this same style of coding to apply functions to categorical groups
real_congress %>% 
  group_by(party) %>% 
  summarize(age = mean(age))
# A tibble: 3 × 2
  party         age
  <chr>       <dbl>
1 Democratic     43
2 Libertarian    20
3 Republican     65

Aggregating Data

real_congress %>% 
    mutate(age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+")) %>% 
  group_by(age_cat) %>% 
  summarize(leg_pro = mean(legislation_passed))
# A tibble: 3 × 2
  age_cat leg_pro
  <chr>     <dbl>
1 30-60         9
2 60+          10
3 <30          15
  • Notice how the data is ordered in a weird way, how could we fix that?

Show Answer

Change the variable from character to factor

Aggregating Data

real_congress %>% 
    mutate(age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+"),
           age_cat = factor(age_cat,
                            levels = c('<30', '30-60', '60+'))) %>% 
  group_by(age_cat) %>% 
  summarize(leg_pro = mean(legislation_passed))
# A tibble: 3 × 2
  age_cat leg_pro
  <fct>     <dbl>
1 <30          15
2 30-60         9
3 60+          10
  • Note that the order matters! If we try reversing this, the code won’t work

Aggregating Data

real_congress %>% 
    mutate(age_cat = factor(age_cat,
                            levels = c('<30', '30-60', '60+')),
           age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+")) %>% 
  group_by(age_cat) %>% 
  summarize(leg_pro = mean(legislation_passed))
Error in `mutate()`:
ℹ In argument: `age_cat = factor(age_cat, levels = c("<30", "30-60",
  "60+"))`.
Caused by error:
! object 'age_cat' not found
  • That is because the code is sequential, and the age_cat variable isn’t in our data until we make it

Data Structure

Data Structure

  • When performing operations with data, it is important to understand the Dimensions of the data
  • To do so, we can use several functions such as summary(), or names()
  • It is also important to be able to take a closer look at a single variable
  • Data can also be stored in different types of objects (we’ve covered vectors and dataframes/tibbles, lists are another way data can be stored)

Summary

  • Using summary() on a full dataset will give you summary statistics for each variable in the data
real_congress %>% 
  summary()
  full_name            party              state                age       
 Length:4           Length:4           Length:4           Min.   :20.00  
 Class :character   Class :character   Class :character   1st Qu.:35.75  
 Mode  :character   Mode  :character   Mode  :character   Median :43.00  
                                                          Mean   :42.75  
                                                          3rd Qu.:50.00  
                                                          Max.   :65.00  
                                                                         
  years_served    votes_received  legislation_passed
 Min.   : 4.000   Min.   :20000   Min.   : 6.00     
 1st Qu.: 5.000   1st Qu.:22000   1st Qu.: 9.00     
 Median : 6.000   Median :24000   Median :11.00     
 Mean   : 7.333   Mean   :25333   Mean   :10.75     
 3rd Qu.: 9.000   3rd Qu.:28000   3rd Qu.:12.75     
 Max.   :12.000   Max.   :32000   Max.   :15.00     
 NA's   :1        NA's   :1                         
  • Note that this works exactly the same as:
summary(real_congress)

Data Dimensions

  • If we just want the names of columns for each variable, we can use the names() function
names(real_congress)
[1] "full_name"          "party"              "state"             
[4] "age"                "years_served"       "votes_received"    
[7] "legislation_passed"
  • Additionally, it might be useful to know the dimensions of our data. To do this, we can use the nrow() and ncol() functions
nrow(real_congress)
[1] 4
ncol(real_congress)
[1] 7

Summarizing Single Variable

  • If we want to analyze just a single variable, or apply a function to a single variable, we can use the $ to pull a single variable from a dataset:
summary(real_congress$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.00   35.75   43.00   42.75   50.00   65.00 
  • We can also use this format to create frequency tables of our data, using the table() function
table(real_congress$party)

 Democratic Libertarian  Republican 
          2           1           1 
  • To get the dimensions of a single variable, we can use the length() function:
length(real_congress$party)
[1] 4

Practice and Live Coding

  • Using the same data from earlier, answer the following questions (feel free to work together!):
  1. How many rows does our practice data have?
  2. How many columns?
  3. What is the mean, median, and 3rd quantile of number bills a member of Congress has sponsored, and number of years served in congress?
  4. What is the number of members of Congress in each party?
  5. What is the average number of bills sponsored by age category?
  6. What is the median number of bills sponsored by party?

Lists

Lists

  • Lists are incredibly useful for certain tasks
  • One benefit of lists is that multiple dataframes can be stored together, which is helpful for any iterative process (covered more in day 4)
  • Lists can also store multiple vectors together
nums <- list(c(1,2,3,4,5), c(6,7,8,9,10))

nums
[[1]]
[1] 1 2 3 4 5

[[2]]
[1]  6  7  8  9 10

Lists Using Lists

  • Using lists, we can extract a single item from the list, using brackets like so:
nums[[1]]
[1] 1 2 3 4 5
  • If needed, we can also merge them using bind_rows() and/or bind_cols() depending on our application (bind_rows() only works with dataframes in this context)
bind_cols(nums)
# A tibble: 5 × 2
   ...1  ...2
  <dbl> <dbl>
1     1     6
2     2     7
3     3     8
4     4     9
5     5    10

Adding items to lists

  • We can also add items to lists using the following approach:
list[[index]] <- item to add to list
  • Using the example we were just working with
nums[[3]] <- c(11,12,13,14,15)

nums
[[1]]
[1] 1 2 3 4 5

[[2]]
[1]  6  7  8  9 10

[[3]]
[1] 11 12 13 14 15

Dataframes and Lists

  • Lists can be made of dataframes or vectors
a_list <- list(tibble(x = c(1,2,3,4), y = c(1,1,1,1)), 
               tibble(x = c(100,100,100,2), y = c(2,2,2,2)))

a_list
[[1]]
# A tibble: 4 × 2
      x     y
  <dbl> <dbl>
1     1     1
2     2     1
3     3     1
4     4     1

[[2]]
# A tibble: 4 × 2
      x     y
  <dbl> <dbl>
1   100     2
2   100     2
3   100     2
4     2     2

Lists and Dataframes

  • We can add in objects the exact same way we did with vectors
  • We can also combine these dataframes using bind_rows() or bind_cols()
bind_rows(a_list)
# A tibble: 8 × 2
      x     y
  <dbl> <dbl>
1     1     1
2     2     1
3     3     1
4     4     1
5   100     2
6   100     2
7   100     2
8     2     2
bind_cols(a_list)
# A tibble: 4 × 4
  x...1 y...2 x...3 y...4
  <dbl> <dbl> <dbl> <dbl>
1     1     1   100     2
2     2     1   100     2
3     3     1   100     2
4     4     1     2     2

Even more on Lists

  • If our dataframes in a given list don’t have the same variables, it will still work, but those rows will be filled in as NA
a_list[[3]] <- tibble(x = c(4,5,6,7), z = c('alpha', 'delta', 'bravo', 'echo'))

bind_rows(a_list)
# A tibble: 12 × 3
       x     y z    
   <dbl> <dbl> <chr>
 1     1     1 <NA> 
 2     2     1 <NA> 
 3     3     1 <NA> 
 4     4     1 <NA> 
 5   100     2 <NA> 
 6   100     2 <NA> 
 7   100     2 <NA> 
 8     2     2 <NA> 
 9     4    NA alpha
10     5    NA delta
11     6    NA bravo
12     7    NA echo