Week 1: SOC 223

Author

Andrew Weatherman

Published

September 5, 2022

Question 1:

Run:

packages <- c('causact', 'dplyr', 'igraph')

installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

invisible(lapply(packages, library, character.only = TRUE))

Question 2:

R’s solution to overlapping object names is “masking.” That is, if two or more packages share object names – in this case, as_data_frame is an exported object from both dplyr and igraph – R will default to using the object from the last loaded package. Such that igraph was loaded after dplyr in the example, simply calling as_data_frame() will run the function pulled from igraph.

Instead of loading an entire library, the best practice is to utilize namespaces to ‘pluck’ individual functions from a package. If you need to reference just one chapter from a book, you would not read the entire book – rather just the chapter. In the same vein, if I were to just need the map function from the purrr package, I would use purrr::map to reference that single function, not load the entire package. Not only does this increase code clarity, but it lessens the chance of accidental masking errors.

When writing open-source code, it is best to leverage namespaces (package::function) to make it abundantly clear what function you are referencing where.

Masking conflicts can be checked with the base conflicts function. Often times, however, these might be unnecessarily verbose. Instead of adding extra dependencies – which should always be kept at a minimum during package development – authors will add common utilities that might clutter the conflicts output: The magrittr pipe (%>%) is a common example; while this would return in the conflicts function, it is not functionally different cross-package. A work-around is to use the conflict_scout() function from conflicted.

Note: This example is a bit wonky because as_data_frame() is in the tibble() namespace and not dplyr. Loading tibble here will show the intended conflict.

library(tibble)
library(igraph)

# if ('conflicted' %in% rownames(installed.packages()) == FALSE) {
  # install.packages('conflicted')
# }

conflicted::conflict_scout()
3 conflicts:
* `as_data_frame`: igraph, tibble
* `decompose`    : igraph, stats
* `spectrum`     : igraph, stats

Similarly, it is possible to check where individual conflicts are originating from:

library(dplyr)
library(igraph)

getAnywhere("as_data_frame")$where
[1] "package:dplyr"    "package:igraph"   "package:tibble"   "namespace:igraph"
[5] "namespace:tibble"

Question 3:

n_distinct() returns the number of distinct rows found in the object.

Question 4:

# number of rows
nrow(causact::baseballData)
[1] 12145
# nummber of columns
ncol(causact::baseballData)
[1] 5
# type of `Home` and `HomeScore` column
sapply(causact::baseballData[c('Home', 'HomeScore')], class)
     Home HomeScore 
 "factor" "integer" 

Question 5:

One row in the data set represents a single game observation, while the Home and Visitor columns indicate who were the home and away teams during a game.

Question 6:

name <-
  c(
    "Wayne Gretzky",
    "Gordie Howe",
    "Jaromir Jagr",
    "Brett Hull",
    "Marcel Dionne",
    "Phil Esposito" ,
    "Mike Gartner",
    "Alex Ovechkin",
    "Mark Messier" ,
    "Steve Yzerman"
  )
goals <- c(894, 801, 766, 741, 731, 717, 708, 700, 694, 692)
year_started <- c(1979, 1946, 1990, 1986, 1971, 1963, 1979, 2005, 1979, 1983)
data <- dplyr::tibble(
  name = name,
  goals = goals,
  year_started = year_started
)
dplyr::glimpse(data)
Rows: 10
Columns: 3
$ name         <chr> "Wayne Gretzky", "Gordie Howe", "Jaromir Jagr", "Brett Hu…
$ goals        <dbl> 894, 801, 766, 741, 731, 717, 708, 700, 694, 692
$ year_started <dbl> 1979, 1946, 1990, 1986, 1971, 1963, 1979, 2005, 1979, 1983