Week 9: SOC 223

Author

Andrew Weatherman

Published

October 27, 2022

Preparation

Load the data:

data(bikes, package = "bayesrules")

Question 0:

There are 500 rows in the data set. Each row represents one day. The data set covers bike rentals by day over a date range between 2011-01-01 and 2012-12-31, with several dates missing. The highest observed ridership in the data set is 6946 on 2012-09-26. The highest windspeed occured on 2011-02-19 at 34 mph.

Question 1:

glue::glue('The R-squared between rides and feels-like temperature is {round(cor(bikes$rides, bikes$temp_feel), 2)}. There is a positive association between the two variables.

The R-squared between rides and windspeed is {round(cor(bikes$rides, bikes$windspeed), 2)}. There is a negative association between the two variables.')

The R-squared between rides and feels-like temperature is 0.58. There is a positive association between the two variables.

The R-squared between rides and windspeed is -0.19. There is a negative association between the two variables.

Question 2:

bikes <- bikes |> 
  mutate(wind_kph = windspeed * 1.61)

glue::glue('The R-squared between rides and windspeed, now in Kph, is {round(cor(bikes$rides, bikes$wind_kph), 2)}. There is a negative association between the two variables. As we are scaling the variable by an a stable factor, the coorelation is not expected to change.')

The R-squared between rides and windspeed, now in Kph, is -0.19. There is a negative association between the two variables. As we are scaling the variable by an a stable factor, the coorelation is not expected to change.

Question 3:

wind_mph <- lm(data = bikes, rides ~ windspeed)
wind_kph <- lm(data = bikes, rides ~ wind_kph)

broom::tidy(wind_mph)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   4205.      177.      23.8  5.99e-84
2 windspeed      -55.5      12.5     -4.44 1.13e- 5

broom::tidy(wind_kph)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   4205.     177.       23.8  5.99e-84
2 wind_kph       -34.5      7.78     -4.44 1.13e- 5

glue::glue('According to the model, as windspeed increases by one mile per hour, the number of riders are expected to decrease by {abs(round(broom::tidy(wind_mph)[2,2], 2))}. And as windspeed increases by one kilometer per hour, the number of rides are expected to decrease by {abs(round(broom::tidy(wind_kph)[2,2], 2))}. 
           
Since we are scaling by a constant, the estimate and standard error are expected to be different. Because an increase in one mile per hour and one kilometer per hour are not equivalent -- while still measuring the same explanatory variable -- our estimate must be different to compensate for an equivalent prediction. In fact, the difference between the two is equal to the scale (1.6x).

On the other hand, our model outputs analogous measures for P-value and statistic. Since these, again, are measuring the same variable but are simply scaled by a value, we would not expect either of these variables to be different. Similar to how our R-squared value was the same in question two when shifting to kilometers per hour, the raw statistical measures will be the same here.')

According to the model, as windspeed increases by one mile per hour, the number of riders are expected to decrease by 55.52. And as windspeed increases by one kilometer per hour, the number of rides are expected to decrease by 34.49. 
           
Since we are scaling by a constant, the estimate and standard error are expected to be different. Because an increase in one mile per hour and one kilometer per hour are not equivalent -- while still measuring the same explanatory variable -- our estimate must be different to compensate for an equivalent prediction. In fact, the difference between the two is equal to the scale (1.6x).

On the other hand, our model outputs analogous measures for P-value and statistic. Since these, again, are measuring the same variable but are simply scaled by a value, we would not expect either of these variables to be different. Similar to how our R-squared value was the same in question two when shifting to kilometers per hour, the raw statistical measures will be the same here.

Question 4:

glue::glue('At 20mph, the predicted number of riders is approximately {round(predict(wind_mph, tibble(windspeed=20)), 0)}. At 20kph, the predicted number of riders is approximately {round(predict(wind_kph, tibble(wind_kph=20)), 0)}.')

At 20mph, the predicted number of riders is approximately 3095. At 20kph, the predicted number of riders is approximately 3515.

Question 5:

bikes <- bikes |> 
  mutate(temp_c = (temp_feel - 32) / 1.8)

kph_c <- lm(data = bikes, rides ~ wind_kph + temp_c)
broom::tidy(kph_c)

# A tibble: 3 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    942.     256.        3.68 2.55e- 4
2 wind_kph       -19.8      6.46     -3.07 2.24e- 3
3 temp_c         143.       9.24     15.5  1.65e-44

glue::glue('For a one kilometer per hour increase in wind speed, our model predicts a {abs(round(broom::tidy(kph_c)[2, 2], 1))}-person decrease in riders. For a one degree increase in temperture (C), our model predicts a {abs(round(broom::tidy(kph_c)[3, 2], 1))}-person increase in riders.
           
           When the temperature is zero degrees celsius and there is no wind blowing, our model estimates that there will be {round(broom::tidy(kph_c)[1, 2], 0)} riders.')

For a one kilometer per hour increase in wind speed, our model predicts a 19.8-person decrease in riders. For a one degree increase in temperture (C), our model predicts a 143.2-person increase in riders.

When the temperature is zero degrees celsius and there is no wind blowing, our model estimates that there will be 942 riders.

Question 6:

pred_df <- tibble(
  situation = 1:3,
  temp_c = c(25, 15, 10),
  wind_kph = c(15, 5, 40)
)

for (i in 1:nrow(pred_df)) {
  temp <- pred_df[i, 2]
  wind <- pred_df[i, 3]
  
  print(glue::glue('When the temperature is {temp}C and the wind speed is {wind}kph, our model predicts that there will be approximately {round(predict(kph_c, pred_df)[i], 0)} riders.'))
}

When the temperature is 25C and the wind speed is 15kph, our model predicts that there will be approximately 4226 riders.
When the temperature is 15C and the wind speed is 5kph, our model predicts that there will be approximately 2992 riders.
When the temperature is 10C and the wind speed is 40kph, our model predicts that there will be approximately 1581 riders.

Question 7:

bikes <- bikes |> mutate(weekend = as.numeric(weekend))

with_wk <- lm(rides ~ wind_kph + temp_c + weekend, data = bikes)

broom::tidy(with_wk)

# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   1215.     252.        4.82 1.93e- 6
2 wind_kph       -20.4      6.26     -3.26 1.20e- 3
3 temp_c         140.       8.96     15.7  3.26e-45
4 weekend       -714.     122.       -5.83 1.02e- 8

glue::glue('For a one kilometer per hour increase in wind speed, our model predicts a {abs(round(broom::tidy(with_wk)[2, 2], 1))}-person decrease in riders. For a one degree increase in temperture (C), our model predicts a {abs(round(broom::tidy(with_wk)[3, 2], 1))}-person increase in riders. When we are estimating on a weekend, our model predicts that there will be a {abs(round(broom::tidy(with_wk)[4, 2], 1))}-person decrease in riders.')

For a one kilometer per hour increase in wind speed, our model predicts a 20.4-person decrease in riders. For a one degree increase in temperture (C), our model predicts a 140.3-person increase in riders. When we are estimating on a weekend, our model predicts that there will be a 713.6-person decrease in riders.

Question 8:

Calculating average temperature and wind speed:

on_weekend <- predict(
  with_wk,
  tibble(
    wind_kph = mean(bikes$wind_kph),
    temp_c = mean(bikes$temp_c),
    weekend = 1
  )
)

on_weekday <- predict(
  with_wk,
  tibble(
    wind_kph = mean(bikes$wind_kph),
    temp_c = mean(bikes$temp_c),
    weekend = 0
  )
)

glue::glue('On a weekday with average wind speed and average temperature, our model predicts that there will be {round(on_weekday, 0)} riders. While those same conditions hold on a weekend, our model suggests that there will be {round(on_weekend, 0)} riders.')

On a weekday with average wind speed and average temperature, our model predicts that there will be 3683 riders. While those same conditions hold on a weekend, our model suggests that there will be 2970 riders.

biggest_diff <- bikes |> 
  mutate(
  pred=predict(with_wk, bikes),
  resid = abs(rides - pred)
) |> 
  slice_max(resid)

glue::glue('The model was most incorrect on {biggest_diff$date}. On this day, the model predicted {round(biggest_diff$pred, 0)} riders but there were only {biggest_diff$rides} riders observed (-{round(biggest_diff$resid, 0)}). Hurricane Sandy, the largest Atlantic hurricane on record, made landfall on this day. The combination of wind speed and excess weather events not covered by the model create an outlier that our model is not equipped to handle.')

The model was most incorrect on 2012-10-29. On this day, the model predicted 3510 riders but there were only 20 riders observed (-3490). Hurricane Sandy, the largest Atlantic hurricane on record, made landfall on this day. The combination of wind speed and excess weather events not covered by the model create an outlier that our model is not equipped to handle.