Reading Data into R and Aggregating Data

Day 2 of the RI workshop, Summer 2024

Austin Cutler

FSU

Class Today

Going over the HW
Different types of data and getting them into R
Reinforcing what we’ve learned so far
- Different types of data
- Functions used to clean data
- How we might want to clean data
Learning more advanced techniques for cleaning data
- grouping data using group_by()
- creating summary tables using summarize()
Learning how to summarize data quickly
How to call a single variable
Lists

Questions about the homework?

HW Answers

library(tidyverse)

#load in the data
read_csv('class_toy_data.csv') -> data

#this is an easier way to accomplish what we need
#read_csv('class_toy_data.csv', skip = 1) -> data

#using rename instead
## first we remove the unnecessary row and variables
## then we use rename
data %>% 
  slice(-1) %>% 
  select(-c(x1,x11,x12,x13,x14)) %>% 
  rename('age'     = x2,
         'gender'  = x3,
         'educ'    = x4,
         'income'  = x5,
         'app_rat' = x6,
         'pid'     = x7,
         'opp_tex' = x8,
         'gun_opp' = x9,
         'abo_opp' = x10) -> data

# now lets get the data formatted properly
data %>% 
  mutate(age     = parse_number(age),
         income  = parse_number(income),
         app_rat = parse_number(app_rat),
         gun_opp = parse_number(gun_opp),
         abo_opp = parse_number(abo_opp)) -> data

# changing the negative value from approval to NA
data %>% 
  mutate(app_rat = case_when(app_rat < 0 ~ NA_real_,
                             TRUE ~ app_rat)) -> data

# keeping only republicans
data %>% 
  filter(pid == 'Republican') -> data

# saving the data
#write_csv(data, 'cutler_hw1_data')

# note that this could all technically be done in one big chunk
data %>% 
  slice(-1) %>% 
  select(-c(x1,x11,x12,x13,x14)) %>% 
  rename('age'     = x2,
         'gender'  = x3,
         'educ'    = x4,
         'income'  = x5,
         'app_rat' = x6,
         'pid'     = x7,
         'opp_tex' = x8,
         'gun_opp' = x9,
         'abo_opp' = x10) %>% 
  mutate(age     = parse_number(age),
         income  = parse_number(income),
         app_rat = parse_number(app_rat),
         gun_opp = parse_number(gun_opp),
         abo_opp = parse_number(abo_opp),
         app_rat = case_when(app_rat < 0 ~ NA_real_,
                             TRUE ~ app_rat)) %>% 
  filter(pid == 'Republican')

#alternative ways to accomplish getting the data in the right format:
read_csv('class_toy_data.csv', skip = 1)

#or if we hadn't done that
data %>% 
  mutate(across(c(age,income,app_rat,gun_opp,abo_opp), ~parse_number(.)))

#or we could even just
data %>% 
  mutate(across(where(is.numeric), parse_number))

Brief review

What are the types of data we can encounter?

Show Answer

Numeric, string/character, factor, logical
What is an object in R and where are they shown?

Show Answer

Created data, can be either a vector or data frame; global environment
What is this %>% called?

Show Answer

A “Pipe”
What functions did we learn about last class to clean data?

Show Answer

mutate(), select(), slice(), filter(), parse_number()
What functions can be used to rename variables?

Show Answer

rename(), clean_names()

Other forms of Data

Data can come in forms other than a csv
Below are some common data formats

Data Format	Function
`.rds`	`readRDS()`
`.RDATA`	`load()`
`.txt'`	`read.delim()`
`.dta`	`read_dta()`
`.sav`	`read_sav()`

Note that the functions read_dta() and read_sav() come from the haven package

Cleaning Data

The main movers for cleaning data are mutate(), select(), and filter()
Functions can be used within other functions, and case_when() is very useful for shaping data when combined with mutate()

Cleaning Data Practice

library(tidyverse)

tibble("Full Name" = c("John Smith", "Jimmy Dean", "Robert Williams", 
                       "Emily Davis", "Michael Brown"),
  "Political Affiliation" = c("Democratic", "Republican", NA, "Democratic",
                              "Libertarian"),
  "Represented State" = c("California", "Texas", "New York", NA, "Florida"),
  "Politician Age" = c(45, 65, 60, 41, 20),
  "Years Served" = c(6, NA, 2, 4, 12),
  "Votes Received" = c(24000, NA, 15000, 20000, 32000),
  "Legislation Passed" = c(12, 10, NA, 6, 15)) -> real_congress

real_congress

# A tibble: 5 × 7
  `Full Name`     `Political Affiliation` `Represented State` `Politician Age`
  <chr>           <chr>                   <chr>                          <dbl>
1 John Smith      Democratic              California                        45
2 Jimmy Dean      Republican              Texas                             65
3 Robert Williams <NA>                    New York                          60
4 Emily Davis     Democratic              <NA>                              41
5 Michael Brown   Libertarian             Florida                           20
# ℹ 3 more variables: `Years Served` <dbl>, `Votes Received` <dbl>,
#   `Legislation Passed` <dbl>

Cleaning Data Practice

real_congress %>% 
  janitor::clean_names() %>% 
  rename('party' = political_affiliation,
         'state' = represented_state,
         'age' = politician_age) %>% 
  filter(!is.na(party)) %>% 
  mutate(age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+")) %>% 
  select(full_name,party,state,age_cat) -> real_congress_2

real_congress_2

# A tibble: 4 × 4
  full_name     party       state      age_cat
  <chr>         <chr>       <chr>      <chr>  
1 John Smith    Democratic  California 30-60  
2 Jimmy Dean    Republican  Texas      60+    
3 Emily Davis   Democratic  <NA>       30-60  
4 Michael Brown Libertarian Florida    <30

Live Coding

Now that we’ve covered the basics again, let’s go through a problem together, download the Live Coding 1 data on the course materials page

Live Coding

As a group, we are going to:

Rename all the variables
Make Categorical Variables for age, experience, legislator’s activeness using committee membership and bill sponsorship, and electoral outcome (let’s assume each district has 60,000 voters)
Remove NAs from any variables where there are NAs
Make the data set only have their name, party, years served, bills sponsored, and the categorical variables from above

Probably a break here (?)

Workshop Part 2

Aggregating Data

In some applications, it is useful to get aggregate level information about our data
We can use group_by() and summarize() to accomplish this
group_by works similarly to row_wise() from the homework, let’s start there

Aggregating Data

real_congress %>% 
  janitor::clean_names() %>% 
  rename('party' = political_affiliation,
         'state' = represented_state,
         'age' = politician_age) %>% 
  filter(!is.na(party)) -> real_congress

real_congress

# A tibble: 4 × 7
  full_name     party state   age years_served votes_received legislation_passed
  <chr>         <chr> <chr> <dbl>        <dbl>          <dbl>              <dbl>
1 John Smith    Demo… Cali…    45            6          24000                 12
2 Jimmy Dean    Repu… Texas    65           NA             NA                 10
3 Emily Davis   Demo… <NA>     41            4          20000                  6
4 Michael Brown Libe… Flor…    20           12          32000                 15

What are some ways we’d be interested in grouping this data?

Aggregating Data

Finding group numbers

real_congress %>% 
  group_by(party) %>% 
  summarize(sample = n())

# A tibble: 3 × 2
  party       sample
  <chr>        <int>
1 Democratic       2
2 Libertarian      1
3 Republican       1

What do you notice about the data produced here?

Aggregating Data

real_congress %>% 
  mutate(age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+")) %>% 
  group_by(age_cat) %>% 
  summarize(n())

# A tibble: 3 × 2
  age_cat `n()`
  <chr>   <int>
1 30-60       2
2 60+         1
3 <30         1

Aggregating Data

We can also use this same style of coding to apply functions to categorical groups

real_congress %>% 
  group_by(party) %>% 
  summarize(age = mean(age))

# A tibble: 3 × 2
  party         age
  <chr>       <dbl>
1 Democratic     43
2 Libertarian    20
3 Republican     65

Aggregating Data

real_congress %>% 
    mutate(age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+")) %>% 
  group_by(age_cat) %>% 
  summarize(leg_pro = mean(legislation_passed))

# A tibble: 3 × 2
  age_cat leg_pro
  <chr>     <dbl>
1 30-60         9
2 60+          10
3 <30          15

Notice how the data is ordered in a weird way, how could we fix that?

Show Answer

Change the variable from character to factor

Aggregating Data

real_congress %>% 
    mutate(age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+"),
           age_cat = factor(age_cat,
                            levels = c('<30', '30-60', '60+'))) %>% 
  group_by(age_cat) %>% 
  summarize(leg_pro = mean(legislation_passed))

# A tibble: 3 × 2
  age_cat leg_pro
  <fct>     <dbl>
1 <30          15
2 30-60         9
3 60+          10

Note that the order matters! If we try reversing this, the code won’t work

Aggregating Data

real_congress %>% 
    mutate(age_cat = factor(age_cat,
                            levels = c('<30', '30-60', '60+')),
           age_cat = case_when(age < 30 ~ "<30",
                             (age >= 30 & age < 60) ~ "30-60",
                             age >= 60 ~ "60+")) %>% 
  group_by(age_cat) %>% 
  summarize(leg_pro = mean(legislation_passed))

Error in `mutate()`:
ℹ In argument: `age_cat = factor(age_cat, levels = c("<30", "30-60",
  "60+"))`.
Caused by error:
! object 'age_cat' not found

That is because the code is sequential, and the age_cat variable isn’t in our data until we make it

Data Structure

When performing operations with data, it is important to understand the Dimensions of the data
To do so, we can use several functions such as summary(), or names()
It is also important to be able to take a closer look at a single variable
Data can also be stored in different types of objects (we’ve covered vectors and dataframes/tibbles, lists are another way data can be stored)

Summary

Using summary() on a full dataset will give you summary statistics for each variable in the data

real_congress %>% 
  summary()

  full_name            party              state                age       
 Length:4           Length:4           Length:4           Min.   :20.00  
 Class :character   Class :character   Class :character   1st Qu.:35.75  
 Mode  :character   Mode  :character   Mode  :character   Median :43.00  
                                                          Mean   :42.75  
                                                          3rd Qu.:50.00  
                                                          Max.   :65.00  
                                                                         
  years_served    votes_received  legislation_passed
 Min.   : 4.000   Min.   :20000   Min.   : 6.00     
 1st Qu.: 5.000   1st Qu.:22000   1st Qu.: 9.00     
 Median : 6.000   Median :24000   Median :11.00     
 Mean   : 7.333   Mean   :25333   Mean   :10.75     
 3rd Qu.: 9.000   3rd Qu.:28000   3rd Qu.:12.75     
 Max.   :12.000   Max.   :32000   Max.   :15.00     
 NA's   :1        NA's   :1

Note that this works exactly the same as:

summary(real_congress)

Data Dimensions

If we just want the names of columns for each variable, we can use the names() function

names(real_congress)

[1] "full_name"          "party"              "state"             
[4] "age"                "years_served"       "votes_received"    
[7] "legislation_passed"

Additionally, it might be useful to know the dimensions of our data. To do this, we can use the nrow() and ncol() functions

nrow(real_congress)

[1] 4

ncol(real_congress)

[1] 7

Summarizing Single Variable

If we want to analyze just a single variable, or apply a function to a single variable, we can use the $ to pull a single variable from a dataset:

summary(real_congress$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20.00   35.75   43.00   42.75   50.00   65.00

We can also use this format to create frequency tables of our data, using the table() function

table(real_congress$party)


 Democratic Libertarian  Republican 
          2           1           1

To get the dimensions of a single variable, we can use the length() function:

length(real_congress$party)

[1] 4

Practice and Live Coding

Using the same data from earlier, answer the following questions (feel free to work together!):

How many rows does our practice data have?
How many columns?
What is the mean, median, and 3rd quantile of number bills a member of Congress has sponsored, and number of years served in congress?
What is the number of members of Congress in each party?
What is the average number of bills sponsored by age category?
What is the median number of bills sponsored by party?

Lists

Lists are incredibly useful for certain tasks
One benefit of lists is that multiple dataframes can be stored together, which is helpful for any iterative process (covered more in day 4)
Lists can also store multiple vectors together

nums <- list(c(1,2,3,4,5), c(6,7,8,9,10))

nums

[[1]]
[1] 1 2 3 4 5

[[2]]
[1]  6  7  8  9 10

Lists Using Lists

Using lists, we can extract a single item from the list, using brackets like so:

nums[[1]]

[1] 1 2 3 4 5

If needed, we can also merge them using bind_rows() and/or bind_cols() depending on our application (bind_rows() only works with dataframes in this context)

bind_cols(nums)

# A tibble: 5 × 2
   ...1  ...2
  <dbl> <dbl>
1     1     6
2     2     7
3     3     8
4     4     9
5     5    10

Adding items to lists

We can also add items to lists using the following approach:

list[[index]] <- item to add to list

Using the example we were just working with

nums[[3]] <- c(11,12,13,14,15)

nums

[[1]]
[1] 1 2 3 4 5

[[2]]
[1]  6  7  8  9 10

[[3]]
[1] 11 12 13 14 15

Dataframes and Lists

Lists can be made of dataframes or vectors

a_list <- list(tibble(x = c(1,2,3,4), y = c(1,1,1,1)), 
               tibble(x = c(100,100,100,2), y = c(2,2,2,2)))

a_list

[[1]]
# A tibble: 4 × 2
      x     y
  <dbl> <dbl>
1     1     1
2     2     1
3     3     1
4     4     1

[[2]]
# A tibble: 4 × 2
      x     y
  <dbl> <dbl>
1   100     2
2   100     2
3   100     2
4     2     2

Lists and Dataframes

We can add in objects the exact same way we did with vectors
We can also combine these dataframes using bind_rows() or bind_cols()

bind_rows(a_list)

# A tibble: 8 × 2
      x     y
  <dbl> <dbl>
1     1     1
2     2     1
3     3     1
4     4     1
5   100     2
6   100     2
7   100     2
8     2     2

bind_cols(a_list)

# A tibble: 4 × 4
  x...1 y...2 x...3 y...4
  <dbl> <dbl> <dbl> <dbl>
1     1     1   100     2
2     2     1   100     2
3     3     1   100     2
4     4     1     2     2

Even more on Lists

If our dataframes in a given list don’t have the same variables, it will still work, but those rows will be filled in as NA

a_list[[3]] <- tibble(x = c(4,5,6,7), z = c('alpha', 'delta', 'bravo', 'echo'))

bind_rows(a_list)

# A tibble: 12 × 3
       x     y z    
   <dbl> <dbl> <chr>
 1     1     1 <NA> 
 2     2     1 <NA> 
 3     3     1 <NA> 
 4     4     1 <NA> 
 5   100     2 <NA> 
 6   100     2 <NA> 
 7   100     2 <NA> 
 8     2     2 <NA> 
 9     4    NA alpha
10     5    NA delta
11     6    NA bravo
12     7    NA echo