Week 3: SOC 223

Author

Andrew Weatherman

Published

September 16, 2022

Preparation:

Load the data:

library(tidyverse)
library(data.table)
mario_kart <- read_csv("https://raw.githubusercontent.com/NicolasRestrep/223_course/main/Data/world_records.csv")
drivers <- read_csv("https://raw.githubusercontent.com/NicolasRestrep/223_course/main/Data/drivers.csv")

Question 1:

three_laps <- 
  subset(mario_kart, type == 'Three Lap')

Creating data sets with and without Rainbow Road:

rr <-
  subset(three_laps, track == 'Rainbow Road')

n_rr <-
  subset(three_laps, track != 'Rainbow Road')

Question 2:

func <-
  function(x) {
    x |> 
      summarize(
        avg_time = mean(time),
        sd_time = sd(time)
      )
  }
sapply(list(rr, n_rr), func)
         [,1]     [,2]    
avg_time 275.6336 113.7984
sd_time  91.81962 52.97595

Rainbow Road records average 275.6, while records on the other tracks average 113.8, which is less than half as long and understandable given the difficulty of that track. The standard deviation is also lower without Rainbow Road (53.0 versus 91.8), so there is less spread in the recorded times. In other words, times on Rainbow Road vary more than times on the other tracks, which suggests that Rainbow Road decreases parity among players.
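As a quick check on the comparison above, the ratios can be computed directly from the printed summary values. This is only a sketch that hard-codes the numbers from the output; the object names `rr_stats`, `n_rr_stats`, and `ratios` are my own and do not appear elsewhere in this document.

```r
# Summary values copied from the sapply() output above
rr_stats   <- c(avg_time = 275.6336, sd_time = 91.81962)
n_rr_stats <- c(avg_time = 113.7984, sd_time = 52.97595)

# Ratio of the non-Rainbow Road value to the Rainbow Road value
ratios <- n_rr_stats / rr_stats
round(ratios, 2)
```

The average falls to roughly 41% of the Rainbow Road value and the standard deviation to roughly 58%.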

Question 3:

For the next few questions, I will use the data.table package. It is generally much faster than dplyr on very large data sets, and I want some extra practice with it here. I hope you can trust that I know how to do this with dplyr as well.

setDT(three_laps)
head(three_laps[, .(.N), by = track][order(-N)])
                   track   N
1:       Toad's Turnpike 124
2:          Rainbow Road  99
3:       Frappe Snowland  92
4: D.K.'s Jungle Parkway  86
5:        Choco Mountain  84
6:         Mario Raceway  82

Toad’s Turnpike is the track where the most records have been set.
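For comparison, the data.table count above has a direct dplyr counterpart. The sketch below uses a small made-up table (the track labels and the object names `toy`, `dt_counts`, and `dplyr_counts` are hypothetical) to show that both approaches give the same counts.

```r
library(data.table)
library(dplyr)

# Hypothetical toy data, not the course data set
toy <- data.frame(track = c("A", "A", "B", "C", "C", "C"))

# data.table: count rows per track, largest count first
dt_counts <- as.data.table(toy)[, .(.N), by = track][order(-N)]

# dplyr: the same count, sorted in descending order
dplyr_counts <- toy |> count(track, sort = TRUE)

dt_counts
dplyr_counts
```

The only visible difference is the name of the count column: `N` in data.table and `n` in dplyr.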

Question 4:

head(three_laps[, .(.N), by = .(player, track)][order(-N)])
     player                 track  N
1:    Penev        Choco Mountain 26
2:    Lacey D.K.'s Jungle Parkway 24
3: abney317          Rainbow Road 21
4:       MR       Toad's Turnpike 20
5:    Penev       Toad's Turnpike 18
6:       MR       Frappe Snowland 18

Penev has set the most records at an individual track (Choco Mountain; 26).

Question 5:

head(three_laps[, .(mean_time = mean(time)), by = track][order(-mean_time)])
               track mean_time
1:      Rainbow Road  275.6336
2:     Wario Stadium  213.9587
3:     Royal Raceway  158.3582
4:   Bowser's Castle  134.0685
5:   Kalimari Desert  125.9253
6: Banshee Boardwalk  125.9215

Rainbow Road has the highest average record time of any individual track.

The best time recorded at each track can be found with:

head(three_laps[order(time), head(time,1), by=track])
                   track    V1
1:         Wario Stadium 14.59
2:        Choco Mountain 17.29
3: D.K.'s Jungle Parkway 21.35
4:       Frappe Snowland 23.61
5:         Luigi Raceway 25.30
6:       Toad's Turnpike 30.31
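The same per-track minimum can also be written with `min()` and an explicit column name, which avoids the default `V1` label. This is a minimal sketch on made-up data; the table `toy`, its values, and the column name `best_time` are hypothetical.

```r
library(data.table)

# Hypothetical toy data, not the course data set
toy <- data.table(
  track = c("A", "A", "B", "B", "C"),
  time  = c(30.5, 28.1, 90.2, 95.0, 61.3)
)

# Fastest time per track, with a named column instead of the default V1
best <- toy[, .(best_time = min(time)), by = track][order(best_time)]
best
```

Sorting after the aggregation, rather than before it, keeps the grouping step independent of row order.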

Question 6:

three_laps[, over_100 := fifelse(record_duration > 100, 1, 0)]

head(three_laps[, .(long_dur = sum(over_100)), by = .(player)][order(-long_dur)])
     player long_dur
1:       MR       81
2:       MJ       50
3:    Penev       27
4:      VAJ       26
5: abney317       26
6: Zwartjes       24

MR holds the most records that stood for more than 100 days, with 81.
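The indicator column is not strictly needed, because summing a logical vector counts its TRUE values. The sketch below shows this on made-up data; the table `toy`, the player labels, and the object name `long_runs` are hypothetical.

```r
library(data.table)

# Hypothetical toy data, not the course data set
toy <- data.table(
  player          = c("P1", "P1", "P2", "P2", "P2"),
  record_duration = c(150, 20, 101, 300, 100)
)

# sum() of a logical vector counts the TRUE values,
# so no intermediate 0/1 column is required
long_runs <- toy[, .(long_dur = sum(record_duration > 100)), by = player][order(-long_dur)]
long_runs
```

Note that a duration of exactly 100 days is not counted, matching the strict `>` comparison used above.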

Question 7:

left_join(
  three_laps,
  drivers,
  by = 'player'
) |> 
  filter(!is.na(nation)) |> 
  group_by(nation) |> 
  summarize(
    records = n()
  ) |> 
  arrange(desc(records)) |> 
  mutate(
    label = ifelse(
      records >= 4000,
      scales::comma(records),
      NA_character_
    ),
    color = ifelse(
      row_number() == 1,
      '#DB5461',
      '#5C6F70'
    )
  ) |> 
  ggplot(
    aes(records, reorder(nation, records), fill = color)
  ) +
  geom_col() +
  geom_text(aes(label = label),
            hjust = 1, nudge_x = -0.5,
            fontface = "bold",
            family = 'Avenir Next',
            color = "white",
            size = 3.25) +
  scale_x_continuous(labels = scales::comma) +
  scale_fill_identity(guide = "none") +
  pilot::theme_pilot() +  # theme from the pilot package
  theme(plot.title.position = 'plot',
        panel.grid.minor.y = element_blank()) +
  labs(
    x = '',
    y = '',
    title = 'The United States dominates the Mario Kart record books',
    subtitle = 'Number of track records held by country since 1997',
    caption = 'Visualization by @andreweatherman'
  )
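One thing worth checking with a join like this is whether the lookup table has exactly one row per key. If drivers has more than one row per player (for example, one row per year, which is an assumption here and not something verified above), the join repeats each record once per matching row and inflates the counts. The sketch below illustrates the effect on made-up data; `records`, `drivers_toy`, `lookup`, and all of their values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not the course data sets
records <- data.frame(
  player = c("P1", "P1", "P2"),
  track  = c("A", "B", "A")
)
drivers_toy <- data.frame(
  player = c("P1", "P1", "P2"),
  year   = c(1997, 1998, 1997),
  nation = c("USA", "USA", "Germany")
)

# Joining directly repeats each P1 record once per matching drivers_toy row
# (newer dplyr versions may warn about a many-to-many relationship here)
inflated <- left_join(records, drivers_toy, by = "player")

# Keeping one row per player first preserves the number of records
lookup <- distinct(drivers_toy, player, nation)
clean  <- left_join(records, lookup, by = "player")

nrow(inflated)  # 5 rows: P1's two records are each duplicated
nrow(clean)     # 3 rows: one per original record
```

Comparing `nrow()` before and after the join is a cheap way to confirm that the counts behind the plot are counts of records rather than of record-by-row matches.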