Week 3: SOC 223
Preparation:
Load the data:
library(tidyverse)   # read_csv(), dplyr verbs, ggplot2
library(data.table)  # setDT(), fifelse()
mario_kart <- read_csv("https://raw.githubusercontent.com/NicolasRestrep/223_course/main/Data/world_records.csv")
drivers <- read_csv("https://raw.githubusercontent.com/NicolasRestrep/223_course/main/Data/drivers.csv")
Question 1:
three_laps <- subset(mario_kart, type == 'Three Lap')
Creating data sets with and without Rainbow Road:
rr <- subset(three_laps, track == 'Rainbow Road')
n_rr <- subset(three_laps, track != 'Rainbow Road')
Question 2:
func <- function(x) {
  x |>
    summarize(
      avg_time = mean(time),
      sd_time = sd(time)
    )
}
sapply(list(rr, n_rr), func)
[,1] [,2]
avg_time 275.6336 113.7984
sd_time 91.81962 52.97595
Both the average time and the standard deviation drop by roughly half once all Rainbow Road observations are removed, which is understandable given the difficulty of that track. The lower standard deviation means there is less spread in the recorded times on the other tracks. In other words, Rainbow Road adds a great deal of variability to the record times, which suggests it decreases parity among players.
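As a quick check on the "roughly half" reading, here is a small sketch that divides the two columns of the output above. The numbers are copied from that output; the object names (rr_stats, n_rr_stats, ratios) are only illustrative.

```r
# Summary values copied from the sapply() output above
rr_stats   <- c(avg_time = 275.6336, sd_time = 91.81962)  # Rainbow Road only
n_rr_stats <- c(avg_time = 113.7984, sd_time = 52.97595)  # all other tracks

# Ratio of "other tracks" to "Rainbow Road" for each statistic
ratios <- n_rr_stats / rr_stats
round(ratios, 2)
# avg_time is about 0.41 and sd_time about 0.58
```

So the mean falls to a bit less than half and the standard deviation to a bit more than half.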
Question 3:
So for the next few questions, I will be using the data.table package. It is generally much quicker than dplyr on massive data sets, and I want to get some extra practice with it here. I hope you can trust that I know how to do this using dplyr.
setDT(three_laps)
head(three_laps[, .(.N), by = track][order(-N)])
track N
1: Toad's Turnpike 124
2: Rainbow Road 99
3: Frappe Snowland 92
4: D.K.'s Jungle Parkway 86
5: Choco Mountain 84
6: Mario Raceway 82
Toad’s Turnpike is the track where the most records have been set.
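For readers less used to data.table, the count above follows the general DT[i, j, by] pattern. The sketch below shows the same idiom on a small made-up table (toy and its values are hypothetical, not the course data).

```r
library(data.table)

# Hypothetical toy data standing in for three_laps
toy <- data.table(
  track = c("A", "B", "A", "C", "A", "B"),
  time  = c(100, 120, 98, 150, 97, 119)
)

# DT[i, j, by]: no row filter (i is empty), count rows with .N in j,
# grouped by track; a second [ ] is chained to sort by the count, largest first
counts <- toy[, .N, by = track][order(-N)]
counts
```

The chained brackets play roughly the role that a pipe plays in dplyr: each one operates on the result of the previous one.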
Question 4:
head(three_laps[, .(.N), by = .(player, track)][order(-N)])
player track N
1: Penev Choco Mountain 26
2: Lacey D.K.'s Jungle Parkway 24
3: abney317 Rainbow Road 21
4: MR Toad's Turnpike 20
5: Penev Toad's Turnpike 18
6: MR Frappe Snowland 18
Penev has set the most records at an individual track (Choco Mountain; 26).
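For comparison, a dplyr version of the same kind of two-key count might look like the sketch below. It runs on a small made-up table (toy is hypothetical), so the numbers are not the course results.

```r
library(dplyr)

# Hypothetical toy data: one row per record
toy <- tibble(
  player = c("P1", "P1", "P2", "P1", "P2", "P2", "P2"),
  track  = c("A",  "A",  "A",  "B",  "B",  "B",  "B")
)

# count() groups by both columns and tallies rows; sort = TRUE puts
# the largest player/track combination first
per_pair <- toy |>
  count(player, track, sort = TRUE)
per_pair
```

Here count(player, track, sort = TRUE) does the work of .N, by = .(player, track) followed by order(-N).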
Question 5:
head(three_laps[, .(mean_time = mean(time)), by = track][order(-mean_time)])
track mean_time
1: Rainbow Road 275.6336
2: Wario Stadium 213.9587
3: Royal Raceway 158.3582
4: Bowser's Castle 134.0685
5: Kalimari Desert 125.9253
6: Banshee Boardwalk 125.9215
Rainbow Road has the highest average record time of any individual track.
The best time recorded at each track can be found with:
head(three_laps[order(time), head(time,1), by=track])
track V1
1: Wario Stadium 14.59
2: Choco Mountain 17.29
3: D.K.'s Jungle Parkway 21.35
4: Frappe Snowland 23.61
5: Luigi Raceway 25.30
6: Toad's Turnpike 30.31
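The result column above is called V1 because the expression in j is unnamed. A possible alternative, sketched here on made-up data (toy is hypothetical), is to take the minimum per track and name it explicitly.

```r
library(data.table)

# Hypothetical toy data standing in for three_laps
toy <- data.table(
  track = c("A", "A", "B", "B", "C"),
  time  = c(100, 95, 30, 32, 60)
)

# Name the summary column inside .() and sort by the fastest time
best <- toy[, .(best_time = min(time)), by = track][order(best_time)]
best
```

This should give the same ordering as sorting by time first and taking the first row per track, but the output column is labeled best_time instead of V1.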
Question 6:
three_laps[, over_100 := fifelse(record_duration > 100, 1, 0)]
head(three_laps[, .(long_dur = sum(over_100)), by = .(player)][order(-long_dur)])
player long_dur
1: MR 81
2: MJ 50
3: Penev 27
4: VAJ 26
5: abney317 26
6: Zwartjes 24
MR holds the most records that stood for over 100 days, with 81.
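The helper column is not strictly needed: summing a logical condition counts the rows where it is TRUE. A minimal sketch on made-up data (toy is hypothetical) shows the idea.

```r
library(data.table)

# Hypothetical toy data: record durations in days
toy <- data.table(
  player          = c("P1", "P1", "P2", "P2", "P2"),
  record_duration = c(150, 20, 101, 100, 300)
)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of
# records per player that lasted strictly more than 100 days
long_records <- toy[, .(long_dur = sum(record_duration > 100)), by = player][order(-long_dur)]
long_records
```

Note that a duration of exactly 100 is not counted, matching the strict > in the original code.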
Question 7:
left_join(
  three_laps,
  drivers,
  by = 'player'
) |>
  filter(nation != 'NA') |>
  group_by(nation) |>
  summarize(
    records = n()
  ) |>
  arrange(desc(records)) |>
  mutate(
    # Label only the largest bars
    label = ifelse(
      records >= 4000,
      scales::comma(records),
      NA
    ),
    # Highlight the top-ranked nation
    color = ifelse(
      row_number() == 1,
      '#DB5461',
      '#5C6F70'
    )
  ) |>
  ggplot(
    aes(records, reorder(nation, records), fill = color)
  ) +
  geom_bar(stat = 'identity', size = 3) +
  geom_text(aes(label = label),
            hjust = 1, nudge_x = -0.5,
            fontface = "bold",
            family = 'Avenir Next',
            color = "white",
            size = 3.25) +
  scale_x_continuous(labels = scales::comma) +
  scale_fill_identity(guide = "none") +
  theme_pilot() +  # assumed to come from the pilot package, loaded beforehand
  theme(plot.title.position = 'plot',
        panel.grid.minor.y = element_blank()) +
  labs(
    x = '',
    y = '',
    title = 'The United States dominates the Mario Kart record books',
    subtitle = 'Number of track records held by country since 1997',
    caption = 'Visualization by @andreweatherman'
  )
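One thing worth checking before trusting the counts in this plot: if drivers has more than one row per player (an assumption, not something confirmed in this write-up; anyDuplicated(drivers$player) would tell), the join repeats each record once per matching driver row and inflates the totals. The sketch below illustrates the effect in base R on made-up tables (records, drivers_toy, and their values are hypothetical).

```r
# Hypothetical toy tables: three records, and a drivers table
# with two rows for player P1 (for example, one per year)
records <- data.frame(
  player = c("P1", "P1", "P2"),
  track  = c("A", "B", "A")
)
drivers_toy <- data.frame(
  player = c("P1", "P1", "P2"),
  year   = c(1997, 1998, 1997),
  nation = c("USA", "USA", "Canada")
)

# Joining directly repeats P1's records once per driver row
joined_raw <- merge(records, drivers_toy, by = "player", all.x = TRUE)
nrow(joined_raw)    # 5 rows, although there are only 3 records

# Keeping one player/nation row first preserves the record count
nations <- unique(drivers_toy[, c("player", "nation")])
joined_dedup <- merge(records, nations, by = "player", all.x = TRUE)
nrow(joined_dedup)  # 3 rows, one per record
```

If the real drivers table turns out to have one row per player, the join above is already fine and this step is unnecessary.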