Lab #4

Author

Andrew Weatherman

Published

September 28, 2022

Preparation

Load the data:

load(url('https://dssoc.github.io/datasets/congress.RData'))

Load the libraries:

library(dplyr)
library(data.table)
library(testthat)

Question 1:

Functions are self-contained, reusable blocks of code that help to automate repetitive tasks. In data science, for example, they make it easy to share reproducible analyses.
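As a small illustration of the point about repetition, here is a minimal sketch of a helper that is written once and then reused on different inputs. The name rescale01 is hypothetical and not part of the lab.

```r
# Hypothetical helper: rescale a numeric vector to the 0-1 range.
rescale01 <- function(x) {
  # range() returns c(min, max); na.rm = TRUE ignores missing values
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# The same logic is reused without retyping it
rescale01(c(2, 4, 6))         # returns 0.0 0.5 1.0
rescale01(c(10, 15, 20, NA))  # returns 0.0 0.5 1.0 NA
```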

Question 2:

Inside RStudio (or any R console), function documentation can be retrieved by placing ? in front of a function name. Outside of R, Stack Overflow is a good resource, as are vignettes on GitHub or a package’s website.
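For reference, a few base R ways to reach documentation from the console are sketched below. substr and dplyr are used only as examples, and the vignette call assumes dplyr is installed.

```r
# Two equivalent ways to open the help page for substr
?substr
help("substr")

# Show a function's arguments without opening the help page
args(substr)

# Search installed help pages by keyword when the function name is unknown
help.search("substring")

# List the vignettes shipped with an installed package
vignette(package = "dplyr")
```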

Question 3:

I mean, just use substr here from base R. Writing a function is a waste of time.

sentence <- c('you', 'only', 'understand', 'data', 'if', 'data', 'is', 'tidy')
substr(sentence, 1, 2)
[1] "yo" "on" "un" "da" "if" "da" "is" "ti"

If a function is required, for whatever reason, just wrap substr, but why would you…

first_two <- function(...) {
  substr(..., 1, 2)
}
first_two(sentence)
[1] "yo" "on" "un" "da" "if" "da" "is" "ti"
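If the wrapper is going to exist anyway, a slightly more explicit version is possible. The sketch below is a hypothetical variant (first_n and its n argument are not part of the lab) that names its input and lets the caller choose how many characters to keep.

```r
# Hypothetical variant: explicit argument names and a configurable length.
first_n <- function(x, n = 2) {
  # substr is vectorized, so this works on a whole character vector
  substr(x, 1, n)
}

sentence <- c('you', 'only', 'understand', 'data', 'if', 'data', 'is', 'tidy')
first_n(sentence)         # first two characters of each word
first_n(sentence, n = 3)  # first three characters of each word
```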

Question 4:

Instead of using mutate directly on the data frame, I am creating a function that calculates age given a column of years and then binds that result to the passed data frame. This is both more versatile, because the column containing years is not restricted to being named birthyear, and requires less typing down the road. The dots argument is not necessary for this application; I included it out of habit. With very light tweaking, the dots could, for example, be used to rename the appended age column.

mutate_age <- function(data, birthyear, ...) {
  
  # get birth years of each person
  b_year <- data[[deparse(substitute(birthyear))]]
  # set current year
  current_year <- as.numeric(format(Sys.time(), "%Y"))
  # calculate age for each birthyear
  age <- tibble(age=current_year - b_year)
  # bind columns to create data set
  bind_cols(data, age)
  
}

This can be called using the function itself (mutate_age(congress, birthyear)), or the data can be piped through (congress |> mutate_age(birthyear)). Either works.

congress <- mutate_age(congress, birthyear)
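The renaming idea mentioned above could also be handled with an ordinary argument rather than dots. The sketch below is one possible approach, not the version used in this lab: it relies on dplyr's tidy evaluation (the embrace operator and glue-style names with :=), and the function name mutate_age_tidy, the age_col argument, and the toy data are all assumptions made for illustration.

```r
library(dplyr)

# Hypothetical variant: age_col sets the name of the appended column.
mutate_age_tidy <- function(data, birthyear, age_col = "age") {
  current_year <- as.numeric(format(Sys.Date(), "%Y"))
  # {{ birthyear }} forwards the unquoted column name to mutate();
  # "{age_col}" := builds the new column name from a string
  mutate(data, "{age_col}" := current_year - {{ birthyear }})
}

# Toy data, made up for this example
toy <- tibble(name = c("A", "B"), born = c(1950, 1980))
mutate_age_tidy(toy, born, age_col = "age_years")
```

As in the original function, age here is simply the current year minus the birth year, so it can be off by one for anyone whose birthday has not yet occurred this year.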

To calculate the average age by gender, we can now use group_by and summarize.

congress |> 
  group_by(gender) |> 
  summarize(avg_age = mean(age))
# A tibble: 2 × 2
  gender avg_age
  <fct>    <dbl>
1 F         60.6
2 M         60.5

Question 5:

This function will operate similarly to the previous one.

mutate_area_code <- function(data, number, ...) {
  
  # get phone number of each person
  number <- data[[deparse(substitute(number))]]
  # extract area code
  area_code <- tibble(area_code=as.numeric(substr(number, 1, 3)))
  # bind columns
  bind_cols(data, area_code)
  
}

Apply it to the congress_contact data frame:

congress_contact <- mutate_area_code(congress_contact, phone)
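Taking the first three characters works when every number starts directly with its area code. If some numbers were formatted differently, a regular expression could be a more forgiving option. The sketch below uses made-up phone numbers and a hypothetical helper name (extract_area_code); the actual format of the phone column is not assumed here.

```r
# Hypothetical helper: take the first run of three digits as the area code.
extract_area_code <- function(number) {
  # skip any leading non-digits, capture three digits, drop the rest
  as.numeric(sub("^\\D*(\\d{3}).*$", "\\1", number, perl = TRUE))
}

# Made-up numbers in two common formats
phones <- c("202-555-0143", "(212) 555-0199")
extract_area_code(phones)  # returns 202 212
```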

Questions 6 and 7:

for loops are often slower than vectorized code in R. Using purrr, or furrr/foreach with parallel processing if you have the cores to make it go brrr, is more scalable, but I think that just using data.table in this context might get the best results.

k_oldest_congress <- function(data, k = 5, job = 'rep', ...) {
  
  # restrict k to five rows or fewer
  k <- min(k, 5)
  
  # copy into a data table for efficiency without modifying the caller's data
  data <- data.table::as.data.table(data)
  
  # set current date
  date <- Sys.Date()
  
  # determine real age, keep the requested job type, and sort oldest first
  data <- data[, real_age := difftime(date, birthdate, units='days')][type==job][order(-real_age)]
  
  # return name and age for the k oldest members
  data[, .(full_name, real_age)][, head(.SD, k)]
  
}
k_oldest_congress(congress)
               full_name   real_age
1:             Don Young 32648 days
2: Eddie Bernice Johnson 31741 days
3:   Grace F. Napolitano 31374 days
4:    Bill Pascrell, Jr. 31322 days
5: Eleanor Holmes Norton 31183 days
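For comparison, the same logic can be sketched with dplyr verbs. This is an alternative written for illustration, not the approach used above; the function name k_oldest_dplyr and the toy data are assumptions, and the column names (full_name, type, birthdate) are taken from the function above.

```r
library(dplyr)

# Hypothetical dplyr version of the same idea, still capped at five rows.
k_oldest_dplyr <- function(data, k = 5, job = "rep") {
  data |>
    filter(type == job) |>                      # keep the requested job type
    arrange(birthdate) |>                       # earliest birthdate = oldest
    mutate(real_age = Sys.Date() - birthdate) |>
    slice_head(n = min(k, 5)) |>
    select(full_name, real_age)
}

# Toy data, made up for this example
toy <- tibble(
  full_name = c("A", "B", "C", "D"),
  type      = c("rep", "rep", "sen", "rep"),
  birthdate = as.Date(c("1940-01-01", "1950-06-15", "1935-03-02", "1960-12-31"))
)
k_oldest_dplyr(toy, k = 2)
```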

Unit testing to check if k restricts to five rows or fewer:

test_that('k restricts to five rows', {
  
  # default -> k= 5
  expect_equal(nrow(k_oldest_congress(congress)), 5)
  # k > 5
  expect_equal(nrow(k_oldest_congress(congress, k = 100)), 5)
  # k < 5
  expect_equal(nrow(k_oldest_congress(congress, k = 1)), 1)
  
})
Test passed 🎊

Question 8:

I am still not sure exactly what I want to do my final project on. Potential data sources are FiveThirtyEight’s GitHub page, NYT’s GitHub page, or scouring Kaggle for sets of interest.