From our exploratory analysis on bike and microvehicle incident trends over the last 3 years, we observed that these number of these incidents had an overall seasonal pattern … just like weather.
So, naturally, we wanted to see if we could build a model to try and predict the number of bike and microvehicle incidents in NYC. To do this, on top of the data we obtained from the New York City Motor-Vehicle-Collisions-Crashes web page, we also incorporate weather data from NOAA.
In specific, from the weather data we obtained three key variables - average daily maximum temperature, average daily minimum temperature, and average daily precipitation. The ‘average’ here refers to how we averaged these measurements from the 3 weather stations located in New York City: The one in Central Park, one by JFK Airport, and another one by La Guardia Airport.
The main quest of this page is to see if we could:
Subset crash_dat to isolate only the vairables we need.
Here’s a sneak peak of our this subset-ed data:
| date | n_incidents | n_injured | n_killed | ny_tmin | ny_tmax | ny_prcp | 
|---|---|---|---|---|---|---|
| 2017-01-01 | 11 | 9 | 0 | 3.133333 | 9.800000 | 0.00000 | 
| 2017-01-02 | 8 | 7 | 0 | 2.400000 | 5.566667 | 59.33333 | 
| 2017-01-03 | 8 | 5 | 0 | 4.433333 | 7.400000 | 126.00000 | 
| 2017-01-04 | 10 | 6 | 0 | 1.666667 | 12.033333 | 0.00000 | 
| 2017-01-05 | 2 | 2 | 0 | -2.300000 | 1.666667 | 0.00000 | 
Here’s a graph showing how the number of incidents tracks remarkably well with NYC’s daily minimum and daily maximum temperature.
df %>% 
  ggplot(aes(x = date, y = n_incidents, color = "Number of Incidents")) + 
  geom_point(alpha = .5) + 
  geom_point(aes(x = date, y = ny_tmin, color = "Minimum Temp"), alpha = .5) +
geom_point(aes(x = date, y = ny_tmax, color = "Maximum Temp"), alpha = .5) + 
    labs(
    title = "New York's weather pattern and Number of Incidents", 
    y = "Number of Incidents",
    x = "Time"
  )
Create a cross validation dataset that splits the data: 80% for training and 20% for testing
cv_df = 
  crossv_mc(df, 100)
cv_df = cv_df %>% 
  mutate(
    train = map(train, as_tibble),
    test = map(test, as_tibble) 
    ) Fit different models on the train dataset and compute RMSE values of the corresponding models
cv_df = cv_df %>% 
  mutate(
    linear_mod = map(.x = train, ~lm(n_incidents ~ ny_prcp + ny_tmin + ny_tmax, data = .x)),
    smooth_mod = map(.x = train, ~gam(n_incidents ~ s(ny_tmax) + s(ny_tmin) + s(ny_prcp), data = .x))
  ) %>% 
  mutate(
    rmse_linear = map2_dbl(.x = linear_mod, .y = test, ~rmse(model = .x, data =.y)),
    rmse_smooth = map2_dbl(.x = smooth_mod, .y = test, ~rmse(model = .x, data =.y))
  )Plot RMSE’s to compare the different models
cv_df %>% 
  select(starts_with("rmse")) %>% 
  pivot_longer(
    everything(),
    names_to = "model",
    values_to = "rmse",
    names_prefix =  "rmse_"
  ) %>% 
  ggplot(aes(x = model, y =rmse, fill = model)) + 
  geom_boxplot() + 
  labs(
    title = "Comparing RMSEs of the linear model to the smooth model"
  )
We note that the Smooth model has a lower RMSE, so we will build a model based off that.
Fitting a smooth model using daily average maximum temperature and average daily precipitation to predict number of incidents
smooth = gam(n_incidents ~ s(ny_tmax) + + s(ny_tmin) + s(ny_prcp), data = df) df %>% 
  add_predictions(smooth) %>% 
  ggplot(aes(x = date, y = n_incidents, color = "Number of Incidents")) + 
  geom_point(alpha = .5) + 
  geom_smooth(aes(y = pred, color = "Predicted Number of Incidents")) + 
  labs(
    title = "Predicted Vs. Actual Number of Incidents",
    x = "Date",
    y = "Number of Incidents"
  )
We see that the smooth line tracks really well the observed number of predictions. Note however, this is not the most thorough model for a couple of reasons discussed in the summary section of this page.
We first assessing the relationship between Number of Incidents and our predictor variables
p1 = df %>% 
  ggplot(aes(x = ny_tmax, y = n_incidents)) + 
  geom_point() + 
  labs(
    y = "Number of Incidents",
    x = "Daily Maximum Temperature"
  )
p2 = df %>% 
  ggplot(aes(x = ny_tmin, y = n_incidents)) + 
  geom_point() + 
    labs(
    y = "Number of Incidents",
    x = "Daily Minimum Temperature"
  )
p3 = df %>% 
  ggplot(aes(x = ny_prcp, y = n_incidents)) + 
  geom_point() + 
    labs(
    y = "Number of Incidents",
    x = "Daily Precipitation"
  )
p1 / p2 / p3
Bootstrapping using modelr
df_boot_results = df %>% 
  bootstrap(1000, id = "strap_number") %>% 
  mutate(
    strap = map(strap, as_tibble), 
    models = map(.x = strap, ~gam(n_incidents ~ s(ny_tmax) + s(ny_tmin) + s(ny_prcp), data = .x)),
    results = map(models, broom::tidy)
  ) %>% 
  select(strap_number, results) %>% 
  unnest(results) Computing mean parameter estimates and constructing confidence intervals based on bootstrap confidence intervals
smooth_model = df_boot_results %>% 
  group_by(term) %>% 
  summarize(
    mean_est = mean(edf),
    sd_est = sd(edf),
     ci_lower = quantile(edf, 0.025),
    ci_upper = quantile(edf, 0.975)
  )
smooth_model %>% knitr::kable()| term | mean_est | sd_est | ci_lower | ci_upper | 
|---|---|---|---|---|
| s(ny_prcp) | 5.161344 | 1.591514 | 2.773001 | 8.314164 | 
| s(ny_tmax) | 5.068845 | 2.113530 | 1.000000 | 8.473610 | 
| s(ny_tmin) | 4.949790 | 1.147207 | 2.972163 | 7.571552 | 
Model Equation:
\[Where:\] \[tmax= average\:daily\:max\:temp\:in\:NYC\] \[tmin= average\:daily\:min\:temp\:in\:NYC\] \[prcp= average\:daily\:precipiation\:in\:NYC\] Lastly, we show here that under repeated sampling, using bootstrap, we see that the distribution of our estimate is skewed
df_boot_results %>% 
  filter(term == "s(ny_prcp)")  %>% 
  ggplot(aes(x = edf)) + 
  geom_density()
In this section we follow up on one of our earlier observations on how incidents involving microvehicles and bikes (overall) displayed seasonality (like the weather.) We subsequently try to build a predictive model, using these (weather related) predictor: daily maximum temperature, daily minimum temperature, and precipitation.
To that end, we first use cross validation to compare a linear model and a smooth model. Using RMSE as a discriminator, we decide to use the smooth model to build our predictive equation. Lastly, we use bootsrapping to obtain non-parametric estimates and confidence intervals of the final model equation.
N.B:
Some (but not all) of the reasons that our model is not the best possible predictive model:
number_of_incidents, is a count variable that we’re treating as continuous.We would appreciate any feedback you may have on this and other sections.