This project was produced as part of the University of Pennsylvania’s Master of Urban Spatial Analytics (MUSA) Spring 2023 Practicum course (MUSA 801) taught by Michael Fichman and Matthew Harris. Along with our instructors, we would like to give special thanks to Ian Smith and Laura Culp of Bicycle Transit Systems, and Waffiyah Murray of OTIS for providing insight and support over the course of the semester.

Ⅰ. Introduction

1-1. Project Overview

Philadelphia’s bike share system, Indego, has provided communities across the city access to public bikes for more than eight years. Today, the entire network consists of more than 200 stations offering more than 2,000 bikes to residents and visitors alike. The service area, however, is heavily concentrated in Center City with pockets of stations in surrounding neighborhoods. To reach more residents, in 2022 the Office of Transportation, Infrastructure, and Sustainability (OTIS) launched Indego’s multi-year expansion plan in partnership with the system’s operator, Bicycle Transit Systems (BTS). Guided by more than 1,300 responses to Indego’s suggest-a-station feature, the expansion plan aims to further densify areas of high ridership and includes proposals to enlarge the service area by fixing bike share docks in outer parts of the city and infilling areas on the fringe.

We created Station Planner for Indego’s planners. As they plan for Indego’s expansion, our tool lets them evaluate proposed station configurations in terms of:

1. Expected ridership: The total estimated annual ridership,

2. Accessibility: The total population with access to at least one station within a 5-minute walk, and;

3. Equity of the system: The socio-economic characteristics of those with access to the network.

This report will present the process and code our team used to create Station Planner’s predictive model. First, we discuss our data exploration and feature engineering process. Through qualitative research including discussions with OTIS and BTS, we engineered to features that correlate with customers’ choice and ability to use the Indego bike share system. Then we detail our model building processes, including the evaluation of our models in terms of accuracy and generalizability. Finally, our model is tested on Indego’s 2023 Q1 ridership data to demonstrate the applicability of our model.

Ⅱ. Exploring the Data & Feature Engineering

2-1. Set up

2-1-1. Analysis Grid

To facilitate our spatial analysis, we began by creating a fishnet grid across Philadelphia, with each cell a thousand feet square, to reflect a 5-minute walk, which is the estimated maximum amount of time someone would walk to get a bike (NACTO, 2015). Dividing Philadelphia’s geography into these grid cell squares allow us to model how spatial and demographic features vary over the city, and will ultimately help us predict future ridership across space.

# ======== Packages and Functions Set Up ================

# Rmarkdown global setting
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(fig.align = 'center')

# Modelling packages
library(reticulate)
use_condaenv("keras-tf", required = T)

library(performanceEstimation)
library(keras)
library(tensorflow)
library(viridis)
library(gridExtra)
library(tidyverse)
library(sf)
library(LiblineaR)
library(poissonreg)
library(vip)
library(mapboxapi)

#Package installs -------------------------------------------------------------
load.fun <- function(x) { 
  x <- as.character(x) 
  if(isTRUE(x %in% .packages(all.available=TRUE))) { 
    eval(parse(text=paste("require(", x, ")", sep=""))) 
    print(paste(c(x, " : already installed; requiring"), collapse=''))
  } else { 
    #update.packages()
    print(paste(c(x, " : not installed; installing"), collapse=''))
    eval(parse(text=paste("install.packages('", x, "')", sep=""))) 
    print(paste(c(x, " : installed and requiring"), collapse=''))
    eval(parse(text=paste("require(", x, ")", sep=""))) 
  } 
} 

########### Required Packages ###########
packages = c("bayesplot", "lme4","RcppEigen",
             "tidyverse", "tidyr", "AmesHousing", "broom", "caret", "dials", "doParallel", "e1071", "earth",
             "ggrepel", "glmnet", "ipred", "klaR", "kknn", "pROC", "rpart", "randomForest",
             "sessioninfo", "tidymodels","ranger", "recipes", "workflows", "themis","xgboost",
             "sf", "nngeo", "mapview", "spatialsample", "DiagrammeR", "vetiver", "pins")

for(i in seq_along(packages)){
  packge <- as.character(packages[i])
  load.fun(packge)
}

session_info()

#########################################

set.seed(717)
theme_set(theme_bw())

"%!in%" <- Negate("%in%")
g <- glimpse

# Import Packages (if you don't have one of them please do install.packages("name of packages"))
library(tidyverse)
library(sf)
library(tidycensus)
library(riem)
library(gridExtra)
library(rgeos)
library(FNN)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(ggcorrplot) # plot correlation plot
library(corrplot)
library(corrr)  # another way to plot correlation plot
library(kableExtra)
library(jtools) # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr) # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(knitr)
library(rmarkdown)
library(RSocrata)
library(viridis)
library(ggplot2)
library(stargazer)
library(XML)
library(data.table)
#library(ggpmisc)
library(patchwork)
library(raster)
library(classInt)   # for KDE and ML risk class intervals
library(tableHTML)
library(exactextractr)
library(sp)
library(units)
library(lubridate)
library(pscl)
library(cvms)
library(yardstick)
library(plotROC)
library(gganimate)
library(gifski)
library(mapview)
library(ggspatial)
library(RColorBrewer)
library(leaflet)
library(leafsync)
library(leaflet.extras)
library(leafem)
library(shiny)

# Etc
options(scipen=999)
options(tigris_class = "sf")

# Themes

mapTheme_a <- function(base_size = 10, title_size = 12, small_size = 8) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black", hjust = 0.5),
    plot.subtitle=element_text(size = base_size, colour = "black", hjust = 0.5, face="italic"),
    plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.text.x = element_text(size = small_size, face="italic"),
    strip.text.y = element_text(size = small_size, face="italic"),
    strip.background = element_rect(colour="transparent", fill="transparent"),
    legend.title = element_text(size = small_size),
    legend.text = element_text(size = small_size),
    legend.key.size = unit(0.4, "cm"))
}

mapTheme_b <- function(base_size = 8, title_size = 10, small_size = 6) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black", hjust = 0.5),
    plot.subtitle=element_text(size = base_size, colour = "black", hjust = 0.5, face="italic"),
    plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.text.x = element_text(size = base_size),
    strip.text.y = element_text(size = base_size),
    strip.background = element_rect(colour="transparent", fill="transparent"),
    legend.title = element_text(size = small_size),
    legend.text = element_text(size = small_size),
    legend.key.size = unit(0.25, "cm"))
}

plotTheme_a <- function(base_size = 10, title_size = 12, small_size = 8){
  theme(axis.text =  element_blank(), 
        axis.ticks = element_blank(), 
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
        strip.background = element_rect(colour="transparent", fill="transparent"))
}

plotTheme_b <- function(base_size = 10, title_size = 12, small_size = 8){
  theme(axis.text =  element_text(size = small_size),
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size,  face="italic"),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
        strip.background = element_rect(colour="transparent", fill="transparent"),
        strip.text.x = element_text(size = small_size, face="italic"),
        strip.text.y = element_text(size = small_size, face="italic"))
}

plotTheme_c <- function(base_size = 8, title_size = 10, small_size = 6){
  theme(axis.text =  element_text(size = small_size),
        text = element_text(size = 8),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size,  face="italic"),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
        strip.background = element_rect(colour="transparent", fill="transparent"),
        strip.text.x = element_text(size = small_size, face="italic"),
        strip.text.y = element_text(size = small_size, face="italic"))
}

#Functions

q5 <- function(variable) {as.factor(ntile(variable, 5))}

qBr_a <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

qBr_b <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]]*100,0)/100,
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <- get.knnx(measureTo, measureFrom, k)$nn.dist
  output <- as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)  
}

myCrossValidate <- function(dataset, id, dependentVariable, indVariables) {
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  for (i in cvID_list) {
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    regression <-
      glm(countMVTheft ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    allPredictions <-
      rbind(allPredictions, thisPrediction)
  }
  return(st_sf(allPredictions))
}


`%!in%` <- Negate(`%in%`)

source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

#color pallettes
bluePalette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
blue2Palette5 <- c("#08519c","#3182bd","#6baed6","#bdd7e7","#eff3ff")
orangePalette5 <- c("#FFFBCB","#FFDCA0","#FEBE75","#FE9F4A","#FD801F")
greyPalette5 <- c("#f7f7f7","#cccccc","#969696","#636363","#252525")
greenPalette5 <- c("#edf8e9","#bae4b3","#74c476","#31a354","#006d2c")
purplePalette5 <- c("#f2f0f7","#cbc9e2","#9e9ac8","#756bb1","#54278f")
indego_green <- "#AFF708"
igreenPalette52 <- c("#F2FFA1","#C2ED79","#92DC51","#61CA28","#31B800")
bluePalette10 <- c('white', '#f7fbff','#deebf7','#c6dbef','#9ecae1','#6baed6','#4292c6','#2171b5','#08519c','#08306b', '#081d58') #, '#000000')
blue2Palette10 <- c('#000000', '#081d58', '#08306b', '#08519c', '#2171b5', '#4292c6', '#6baed6', '#9ecae1', '#c6dbef', '#deebf7', '#f7fbff')

#mapview palletes

MV_greenPalette <- colorRampPalette(c("#edf8e9", "#006d2c"))


# ============= Set up analysis grid =====================

# Read in Philadelphia boundary
phl_boundary <- st_read("https://opendata.arcgis.com/datasets/063f5f85ef17468ebfebc1d2498b7daf_0.geojson") %>%
  st_transform("ESRI:102729") #%>% # transform so measurements are in feet 
  #st_union()

# set grid cell size (in feet)
gridSize <- 1000  #1000 feet

# Create grid based on Philadelphia boundary
fishnet <- st_make_grid(phl_boundary,
                        cellsize = gridSize,
                        square = TRUE) %>%
  .[phl_boundary] %>%
  st_sf() %>%
  mutate(unique_id = 1:n()) %>%
  st_transform("EPSG:4326")

# Create centroids of fishnet for later measuring
fishnet_centroids <- st_centroid(fishnet)

# Plot grid
ggplot() +
  annotation_map_tile("cartolight", forcedownload = FALSE, zoom = 12)+
  geom_sf(data = fishnet, fill = "transparent", color=bluePalette5[3], size=1)+
  labs(title = "Analysis Grid",
       subtitle = "size 1000*1000ft")+
  mapTheme_a()

2-1-2. Census data

As a brief introduction to Philadelphia, we collected 2020 census data for all block groups. By clicking on the layers tab in the right hand corner of the map below, you can interact with various data including population density, employment rates, and types of transit users. In terms of racial distribution, we can see that there is a higher density of white residents in Center City, in certain East neighborhoods facing the Delaware river and in the North-West of the city moving closer towards Wissahikon Valley Park. Most Black residents live in West Philadelphia and in central northern parts of the city. The majority of Hispanic and Latinx communities reside in North Philadelphia, while Asian residents concentrate in central and southern neighborhoods.

# Call the census to view a list of available variables:
acs_variable_list.2020 <- load_variables(2020, #year
                                         "acs5", #five year ACS estimates
                                         cache = TRUE)

acs_vars <- c("B01001_001E", #ACS total Pop estimate
              "B03002_012E", # Estimated total Hispanic or Latino
              "B03002_003E", # Estimated total White, not Hispanic or Latino
              "B03002_004E", # Estimated total Black, not Hispanic or Latino
              "B03002_005E", # Estimated total American Indian and Alaska Native population, not Hispanic or Latino
              "B03002_006E", # Estimated total Asian population, not Hispanic or Latino
              "B03002_007E", # Estimated total Native Hawaiian and Other Pacific Islander alone, not Hispanic or Latino
              "B03002_008E", # Estimated total other race alone, not Hispanic or Latino
              "B03002_009E", # Estimated total two or more races, not Hispanic or Latino
              "B03002_010E", # Estimated total two or more races including other race, not Hispanic or Latino
              "B03002_011E", # Estimated total two or more races excluding other race & 3 or more races, not Hispanic or Latino
              "B23025_004E", #Estimated Total population who are employed
              "B19013_001E", #Estimated Median Household Income
              "B01002_001E", # Median age
              "B01001_031E", "B01001_032E", "B01001_033E", "B01001_034E",  # Female bachelor age 18 to 24
              "B01001_007E", "B01001_008E", "B01001_009E", "B01001_010E",  # Male bachelor age 18 to 24
              "B08301_001E", # Means of transportation to work total
              "B08301_002E", # Means of transportation to work: Car
              "B08301_010E", # Means of transportation to work: Public transit (bus, subway, rail)
              "B08301_018E", # Means of transportation to work: Bicycle
              "B08301_019E", # Means of transportation to work: walked
              "B14007_002E" # Total:!!Enrolled in school
              ) 

blockgrps <- get_acs(geography = "block group",
                             year = 2020, 
                             variables = acs_vars, 
                             geometry = TRUE, 
                             state = 42, 
                             county = 101,
                             output = "wide") %>%
  dplyr::select (GEOID, all_of(acs_vars)) %>%
  mutate (total_other_race = B03002_007E + B03002_005E + B03002_008E + B03002_009E + B03002_010E + B03002_011E,
          fml_Bch = B01001_031E + B01001_032E + B01001_033E + B01001_034E, # Female bachelor age 18 to 24
          ml_Bch = B01001_007E + B01001_008E + B01001_009E + B01001_010E) %>%
  dplyr::select(-B01001_031E, -B01001_032E, -B01001_033E, -B01001_034E, -B01001_007E, -B01001_008E, -B01001_009E, -B01001_010E, -B03002_007E, -B03002_005E, -B03002_008E, -B03002_009E, -B03002_010E, -B03002_011E) %>%
  rename (total_pop = B01001_001E, #ACS total Pop estimate
          total_white = B03002_003E, #Estimated total White only population
          total_Black = B03002_004E, # Estimated total Black population
          total_Asian = B03002_006E, # Estimated total Asian population
          total_HispLat = B03002_012E, # Estimated total Hispanic or Latino
          total_employed = B23025_004E, #Total employed
          medIncome = B19013_001E, #Median HH income
          medAge = B01002_001E, #Median age
          total_commutes = B08301_001E, # Means of transportation to work total
          commute_car = B08301_002E, # Means of transportation to work: Car
          commute_pub = B08301_010E, # Means of transportation to work: public transit
          commute_bic = B08301_018E, # Means of transportation to work: Bicycle
          commute_wlk = B08301_019E, # Means of transportation to work: Walked
          total_school = B14007_002E # Total:!!Enrolled in school
          )

blockgrps <- blockgrps %>% 
  mutate(area = st_area(blockgrps),
         pct_white = round((total_white / total_pop) * 100,2),
         pct_Black = round((total_Black / total_pop) * 100,2),
         pct_Asian = round((total_Asian / total_pop) * 100,2),
         pct_HispLat = round((total_HispLat / total_pop) * 100, 2),
         pct_other_race = round((total_other_race / total_pop) * 100, 2),
         pct_employed = round((total_employed / total_pop) *100,2),
         pct_commute_car = round((commute_car/total_commutes) * 100,2),
         pct_commute_pub = round((commute_pub/total_commutes) * 100,2),
         pct_commute_bic = round((commute_bic/total_commutes) * 100,2),
         pct_commute_wlk = round((commute_wlk/total_commutes) * 100,2),
         pct_school = round((total_school / total_pop)*100,2),
         pop_density = round(total_pop / area * 1000000,2) #km2
         ) %>% 
  st_transform("EPSG:4326") %>%
  drop_units() %>%
  dplyr::select(-total_white, -total_Black, - total_Asian, -total_HispLat, -total_employed, -total_commutes, -commute_car, -commute_pub, -commute_bic, -commute_wlk, -total_school, -total_other_race, -fml_Bch, -ml_Bch)



# Extract centroids of block groups
blockgrps_centroids <- st_centroid(blockgrps) %>%
  st_sf() %>%
  st_transform(st_crs(blockgrps))


# ==== Plot Census Variables ==========

pal <- colorNumeric(palette = "Blues", domain = c(0,100), na.color = NA)
pal2 <- colorNumeric(palette = "Blues", blockgrps$pop_density, na.color = NA)


m222 <- leaflet(blockgrps) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>%
  addControl(html = "<p style='font-size: 10pt; text-align:center; margin:0px; color:black;'>Figure 2-2-2. Census data</p>", position = "topright") %>% 
  
  #pct_school
  addPolygons(fillColor = ~pal(pct_school), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_school, "%"), group = "Percentage of Students") %>%
  
  #pct_commute_wlk
  addPolygons(fillColor = ~pal(pct_commute_wlk), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_commute_wlk, "%"), group = "Percentage of walk commuters") %>%
  
  #pct_commute_bic
  addPolygons(fillColor = ~pal(pct_commute_bic), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_commute_bic, "%"), group = "Percentage of bicycle commuters") %>%
  
  #pct_commute_pub
  addPolygons(fillColor = ~pal(pct_commute_pub), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_commute_pub, "%"), group = "Percentage of transit commuters") %>%
  
  #pct_commute_car
  addPolygons(fillColor = ~pal(pct_commute_car), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_commute_car, "%"), group = "Percentage of car commuters") %>%
  
  #pct_employed
  addPolygons(fillColor = ~pal(pct_employed), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_employed, "%"), group = "Percentage of employed") %>%
  
  #pct_white
  addPolygons(fillColor = ~pal(pct_white), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_white, "%"), group = "Percentage of white residents") %>%
  
  #pct_Black
  addPolygons(fillColor = ~pal(pct_Black), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_Black, "%"), group = "Percentage of Black Residents") %>%
  
   #pct_Asian
  addPolygons(fillColor = ~pal(pct_Asian), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_Asian, "%"), group = "Percentage of Asian Residents") %>%
  
   #pct_HispLat
  addPolygons(fillColor = ~pal(pct_HispLat), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_HispLat, "%"), group = "Percentage of Hispanic and Latino Residents") %>%
  
   #pct_other_race
  addPolygons(fillColor = ~pal(pct_other_race), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pct_other_race, "%"), group = "Percentage of other race") %>%
  
   #pop_density
  addPolygons(fillColor = ~pal2(pop_density), fillOpacity = 0.8, weight = 1, color = "transparent", highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE), label = paste0(blockgrps$pop_density, "ppl/km2"), group = "Population density") %>%
  
      addLayersControl(overlayGroups = c("Percentage of Students", "Percentage of walking commuters", "Percentage of bicycle commuters", "Percentage of transit commuters", "Percentage of car commuters", "Percentage of rent spent",  "Percentage of employed", "Percentage of white Residents", "Percentage of Black Residents", "Percentage of Asian Residents", "Percentage of Hispanic and Latino Residents", "Percentage of other race", "Population density"), options = layersControlOptions(collapsed = TRUE)) %>%
  hideGroup(c("Percentage of walk commuters", "Percentage of bicycle commuters", "Percentage of transit commuters", "Percentage of car commuters", "Percentage of employed", "Percentage of white Residents", "Percentage of Black Residents", "Percentage of Asian Residents", "Percentage of Hispanic and Latino Residents", "Percentage of other race", "Population density"))  %>%
  addMeasure(position = "bottomleft")  %>%
  addMouseCoordinates()

#sync(m222, m222)

m222

2-3. Indego Bike Share Ridership

The first set of data explored corresponds to the Indego bike share program’s ridership data from 2022, made up of more than 900,000 bike trips logged across 183 stations. As previously shown, the individual stations are highly concentrated in Center City and its immediate surrounding neighborhoods. The formal bike lanes extend throughout the city covering major roads, with the majority of buffered or protected lanes situated in the busy urban core.

To get a better gage of high and low performing stations, we chose to measure ridership by week. Given that several of our variables contained NA values, we imputed these null values with the spatial lag of the relevant variable to ensure that we can use this data for later predictions.

For better visualization, we aggregated station ridership data into quantiles, where the 5th quantile includes all stations with the highest weekly ridership. In other words, these stations on average saw more than 150 trips per week in 2022, with the top station seeing an average of 317 rides per week. The highest performing stations are located predominantly in the core of Center City and extend to parts of University City and Old City in the West and East respectively. The stations with the lowest average weekly ridership tend to be newer, and tend to be located on the outskirts of the existing Indego network. Besides the time necessary to build up a consistent user base, these areas also have lower population density and have more limited nearby options to dock their bikes at the end of a rider’s trip.

### data source(https://www.rideindego.com/about/data/)

## Indego Trip Data
q1_2022 <- st_read("./data/indego-trips-2022-q1.csv")
q2_2022 <- st_read("./data/indego-trips-2022-q2.csv")
q3_2022 <- st_read("./data/indego-trips-2022-q3.csv")
q4_2022 <- st_read("./data/indego-trips-2022-q4.csv")
indego_trips_2022 <- rbind(q1_2022, q2_2022, q3_2022, q4_2022) %>%
mutate(start_station = case_when(
    start_station == "3013" ~ "3286",
    start_station == "3108" ~ "3004",
    start_station == "3202" ~ "3296",
    start_station == "3206" ~ "3294",
    start_station == "3167" ~ "3295",
    start_station == "3011" ~ "3265",
    TRUE ~ start_station
  )) %>% filter(start_station != "3246" & start_station != "3241" & start_station != "3271" &  start_station != "3287")

## Indego Station Information
indego_stations <- st_read("data/indego-stations-2022-10-01.csv") %>% filter(Status == "Active")

## Indego Geojson
indego_Geojson <- st_read("data/Indego.json") %>% filter(kioskPublicStatus == "Active") %>% dplyr::select(-bikes)
print(length(unique(indego_Geojson$id))) 


# Aggregate ridership by week
ridership_by_week <- indego_trips_2022 %>% 
  mutate(week = lubridate::week(as.POSIXct(start_time, tz = "", "%m/%d/%Y %H:%M"))) %>%
  filter(passholder_type %!in% c("NULL", "Walk-up")) %>%
  group_by(start_station, week) %>%
  summarize(stn_lat = median(as.double(start_lat), na.rm = T),
            stn_lon = median(as.double(start_lon), na.rm = T),
            ridership_week_total = length(unique(trip_id)),
            ebike_ridership_week_total = length(unique(trip_id[bike_type == "electric"])), 
            standard_ridership_week_total = length(unique(trip_id[bike_type == "standard"])),
            guest_pass_total = length(unique(trip_id[passholder_type == "Day Pass"])),
            month_pass_total = length(unique(trip_id[passholder_type == "Indego30"])),
            annual_pass_total = length(unique(trip_id[passholder_type == "Indego365"]))) %>%
  group_by(start_station) %>%
  summarize(stn_lat = median(as.double(stn_lat), na.rm = T),
            stn_lon = median(as.double(stn_lon), na.rm = T),
            ridership_week = median(ridership_week_total, na.rm = T),
            ebike_ridership_week = median(ebike_ridership_week_total, na.rm = T),
            standard_ridership_week = median(standard_ridership_week_total, na.rm = T),
            guest_pass_week = median(guest_pass_total, na.rm = T),
            month_pass_week = median(month_pass_total, na.rm = T),
            annual_pass_week = median(annual_pass_total, na.rm = T),
            weeks_active = length(unique(week))) %>%
  left_join(indego_stations, by = c("start_station" = "Station_ID")) 

# Join to stations information
stations_ridership_by_week <- indego_Geojson %>% 
  mutate(id=as.character(id)) %>% 
  full_join(ridership_by_week, by=c("id"="start_station")) %>% 
  # create geometry
  st_as_sf(coords = c(x="longitude", y="latitude"), crs=st_crs("EPSG:4326")) %>%
  # remove certain stations
  filter(id != 3000 & id != 3121 & id != 3186) %>%
  rename("go_live" = "Day.of.Go_live_date",
         "station" = "id",
         "station_name" = "Station_Name",
         "status" = "Status") %>%
  mutate(go_live = as.POSIXct(go_live, tz="", format = "%m/%d/%Y")) %>% 
  mutate(go_live_year = as.numeric(format(go_live, "%Y"))) %>%
  subset(., !is.na(ridership_week)) %>%
  # remove unneccessary columns
  dplyr::select(-docksAvailable,
                -bikesAvailable,
                -classicBikesAvailable,
                -smartBikesAvailable, 
                -electricBikesAvailable, 
                -rewardBikesAvailable,
                -rewardDocksAvailable,
                -kioskStatus,
                -kioskPublicStatus,
                -kioskConnectionStatus, 
                -kioskType,
                -station_name,
                -addressStreet,           
                -addressCity,            
                -addressState,            
                -addressZipCode,         
                -closeTime,               
                -eventEnd,               
                -eventStart,              
                -isEventBased,           
                -isVirtual,               
                -kioskId,                
                -notes,                  
                -openTime,               
                -publicText,              
                -timeZone,               
                -trikesAvailable,         
                -latitude,               
                -longitude,               
                -coordinates,         
                -stn_lat,                 
                -stn_lon,
                -field_5,                 
                -field_6,                
                -field_7,                
                -field_8,               
                -field_9,                 
                -field_10,               
                -field_11,                
                -field_12,               
                -field_13,                
                -field_14,               
                -field_15,               
                -field_16,               
                -field_17,                
                -field_18,               
                -field_19)


# Join grid to ridership data
grid <- stations_ridership_by_week %>%
  dplyr::select(-station,
                -name,
                -go_live,
                -status,
                -go_live_year) %>%
  mutate(station_count = 1) %>%
  aggregate(., fishnet, sum) %>%
  mutate(ridership_week = replace_na(ridership_week, 0),
         ebike_ridership_week = replace_na(ebike_ridership_week, 0),
         standard_ridership_week = replace_na(standard_ridership_week, 0),
         station_count = replace_na(station_count, 0),
         guest_pass_week = replace_na(guest_pass_week, 0),
         month_pass_week = replace_na(month_pass_week, 0),
         annual_pass_week = replace_na(annual_pass_week, 0),
         totalDocks = replace_na(totalDocks, 0),
         weeks_active = replace_na(round((weeks_active/station_count),1), 0),
         unique_id = 1:n())

# Join grid to census data
grid_centroids <- st_centroid(grid) %>%
  dplyr::select(unique_id)

grid_join <- st_join(grid_centroids, blockgrps, join = st_intersects)

grid <- left_join(grid, st_drop_geometry(grid_join), by = "unique_id") %>%
  mutate(border_cell = ifelse(is.na(total_pop), "BORDER_CELL", "NOT_BORDER_CELL"))


# Calculate population per grid cell
# join grid centroids to blockgroups
joined_grid_blockgrps <- st_join((st_centroid(grid) %>%
                                    dplyr::select(unique_id)),
                                  blockgrps,
                                 join = st_intersects)

# group by geoid to find number of grid centroids per block group
cells_per_bg <- joined_grid_blockgrps %>%
  group_by(GEOID) %>%
  summarize(gridcells_per_bg = length(unique_id))

# join number of grid centroids per blockgroup to grid
grid <- left_join(grid, st_drop_geometry(cells_per_bg), by = "GEOID") %>%
  mutate(pop_per_cell = as.integer(total_pop / gridcells_per_bg)) # calculate total_pop/number_grid_centroids


# Set up stations df
stations <- st_join(stations_ridership_by_week, blockgrps,
                          join=st_intersects,
                          left = TRUE)


# Impute grid with zeros
grid_imputed_zeros <- grid %>%
  mutate(total_pop = ifelse(is.na(total_pop), 0, total_pop),
    pct_white = ifelse(total_pop > 0, pct_white,
                            0),
    pct_Black = ifelse(total_pop > 0, pct_Black,
                            0),
    pct_Asian = ifelse(total_pop > 0, pct_Asian,
                                0),
    pct_HispLat = ifelse(total_pop > 0, pct_HispLat,
                                0),
    pct_other_race = ifelse(total_pop > 0, pct_other_race,
                                0),
    pct_employed = ifelse(total_pop > 0, pct_employed,
                                0),
    pct_commute_car = ifelse(!is.na(pct_commute_car), pct_commute_car,
                                0),
    pct_commute_pub = ifelse(!is.na(pct_commute_pub), pct_commute_pub,
                                0),
    pct_commute_bic = ifelse(!is.na(pct_commute_bic), pct_commute_bic,
                                0),
    pct_commute_wlk = ifelse(!is.na(pct_commute_wlk), pct_commute_wlk,
                                0),
    pct_school = ifelse(total_pop > 0, pct_school,
                                0),
    medAge = ifelse(total_pop > 0, medAge,
                                0),
    pop_density = ifelse(!is.na(pop_density), pop_density, 0))

# Create coordinates for KNN measurement
grid_for_measuring_imputed_zeros <- st_centroid(grid_imputed_zeros) %>%
  dplyr::select() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Set up KNN
grid_neighbor_list <- knn2nb(knearneigh(grid_for_measuring_imputed_zeros, 8))

grid_spatial_weights <- nb2listw(grid_neighbor_list, style = "W")

# Lag variables
grid$lag_total_pop <- lag.listw(grid_spatial_weights, grid_imputed_zeros$total_pop, NAOK = FALSE)
grid$lag_pct_white <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_white, NAOK = FALSE)
grid$lag_pct_Black <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_Black, NAOK = FALSE)
grid$lag_pct_Asian <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_Asian, NAOK = FALSE)
grid$lag_pct_HispLat <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_HispLat, NAOK = FALSE)
grid$lag_pct_other_race <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_other_race, NAOK = FALSE)
grid$lag_pct_employed <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_employed, NAOK = FALSE)
grid$lag_pct_commute_car <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_commute_car, NAOK = FALSE)
grid$lag_pct_commute_pub <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_commute_pub, NAOK = FALSE)
grid$lag_pct_commute_bic <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_commute_bic, NAOK = FALSE)
grid$lag_pct_commute_wlk <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_commute_wlk, NAOK = FALSE)
grid$lag_pct_school <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pct_school, NAOK = FALSE)
grid$lag_medAge <- lag.listw(grid_spatial_weights, grid_imputed_zeros$medAge, NAOK = FALSE)
grid$lag_pop_density <- lag.listw(grid_spatial_weights, grid_imputed_zeros$pop_density, NAOK = FALSE)

# Impute NAs with lagged variables
grid <- grid %>%
  mutate(total_pop_imputed = ifelse(is.na(total_pop), lag_total_pop, total_pop),
          pct_white_imputed = ifelse(is.na(pct_white), lag_pct_white, pct_white),
          pct_Black_imputed = ifelse(is.na(pct_Black), lag_pct_Black, pct_Black),
          pct_Asian_imputed = ifelse(is.na(pct_Asian), lag_pct_Asian, pct_Asian),
          pct_HispLat_imputed = ifelse(is.na(pct_HispLat), lag_pct_HispLat, pct_HispLat),
          pct_other_race_imputed = ifelse(is.na(pct_other_race), lag_pct_other_race, pct_other_race),
          pct_employed_imputed = ifelse(is.na(pct_employed), lag_pct_employed, pct_employed),
          pct_commute_car_imputed = ifelse(is.na(pct_commute_car), lag_pct_commute_car, pct_commute_car),
          pct_commute_pub_imputed = ifelse(is.na(pct_commute_pub), lag_pct_commute_pub, pct_commute_pub),
          pct_commute_bic_imputed = ifelse(is.na(pct_commute_bic), lag_pct_commute_bic, pct_commute_bic),
          pct_commute_wlk_imputed = ifelse(is.na(pct_commute_wlk), lag_pct_commute_wlk, pct_commute_wlk),
          pct_school_imputed = ifelse(is.na(pct_school), lag_pct_school, pct_school),
          medAge_imputed = ifelse(is.na(medAge), lag_medAge, medAge),
         pop_density_imputed = ifelse(is.na(pop_density), lag_pop_density, pop_density))

# Plot

ggplot()+
  geom_sf(data = blockgrps, fill=greyPalette5[1], color=greyPalette5[2], size = 0.1)+
  geom_sf(data=stations, shape=21, aes(size = q5(ridership_week), fill = q5(ridership_week)), alpha = 0.8, color=greyPalette5[4], stroke=1) +
  scale_fill_manual(values = igreenPalette52 ,labels = qBr(stations, "ridership_week")) +
  scale_size_discrete(labels = qBr(stations, "ridership_week"))+
  ylim(39.89, 40.01)+
  xlim(-75.23, -75.12)+
  labs(title = "Indego Stations by Weekly Ridership" ,
       subtitle = '2022',
       size = 'Weekly Ridership\n(quantile)',
       fill = 'Weekly Ridership\n(quantile)') +
  mapTheme_a()+
  theme(panel.background = element_blank(),
        legend.key.size = unit(0.25, "cm"),
        legend.key = element_rect(colour = NA,fill=NA))

It is positive to see the increasing trajectory of bike rides over the course of 2022, outside of winter months when poor weather conditions naturally limit the Indego bike share program. Ridership of electric bikes follow a similar pattern of standard bikes suggesting further demand for this new service.

#Join census data to individual rides' df

rides <- indego_trips_2022 %>%
  left_join(., stations, by = c("start_station"="station")) %>%
  rename("start_GEOID" = "GEOID",
         "start_total_pop" = "total_pop",
         "start_pct_white" = "pct_white",
         "start_pct_Black" = "pct_Black",
         "start_pct_Asian" = "pct_Asian",
         "start_pct_HispLat" = "pct_HispLat",
         "start_pct_other_race" = "pct_other_race",
         "start_pct_employed" = "pct_employed",
        "start_weeks_active" = "weeks_active") %>%
  dplyr::select( -ridership_week) %>%
  st_drop_geometry() %>%
  left_join(., stations, by = c("end_station"="station")) %>%
  rename("end_GEOID" = "GEOID",
         "end_total_pop" = "total_pop",
         "end_pct_white" = "pct_white",
         "end_pct_Black" = "pct_Black",
         "end_pct_Asian" = "pct_Asian",
         "end_pct_HispLat" = "pct_HispLat",
         "end_pct_other_race" = "pct_other_race",
         "end_pct_employed" = "pct_employed",
        "end_weeks_active" = "weeks_active") %>%
  dplyr::select(-ridership_week) %>%
  mutate(start_time = as.POSIXct(start_time, tz="", format = "%m/%d/%Y %H:%M"),
         end_time = as.POSIXct(end_time, tz="", format = "%m/%d/%Y %H:%M"),
         interval60 = lubridate::ymd_h(substr(start_time,1,13)),
         week = lubridate::week(interval60),
         dotw = lubridate::wday(interval60, label=TRUE),
         duration = as.numeric(duration))


# Plot
ggplot(rides)+
  geom_freqpoly(aes(week, color = bike_type), binwidth = 1)+
  scale_color_manual(values = c("#6baed6", "#006d2c"), name = "Bike Type")+
  labs(title = "Rides by bike type over the course of 2022",
       y = "Number of rides",
       x = "Week")+
  plotTheme_b()

In addition to the above, we chose to investigate the spatial lag of ridership, as it demonstrates how inter-connected the network is. In other words, we expect areas of high ridership are likely to be next to areas of similarly high ridership.

To determine the spatial lag of ridership, we calculated the average ridership of the five nearest stations for all stations and grid cell centroids.

#Calculate the spatial lag of ridership of the 5 nearest stations - for stations dataset

# Prepare stations for measuring distance
stations_for_measuring <- stations %>%
  dplyr::select() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()


neighbor_list <- knn2nb(knearneigh(stations_for_measuring, 5))

spatial_weights <- nb2listw(neighbor_list , style = "W")

stations$lag_knn5_ridership_week  <- lag.listw(spatial_weights, stations$ridership_week) #all ridership
stations$lag_knn5_ebike_ridership_week  <- lag.listw(spatial_weights, stations$ebike_ridership_week) # e-bike ridership
stations$lag_knn5_standard_ridership_week <- lag.listw(spatial_weights, stations$standard_ridership_week) # standard bike ridership


#Calculate the spatial lag of ridership of the 5 nearest grid cells - for grid

# Prepare stations for measuring distance
fishnet_for_measuring <- fishnet_centroids %>%
  dplyr::select() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()


grid_neighbor_list <- knn2nb(knearneigh(fishnet_for_measuring, 5))

grid_spatial_weights <- nb2listw(grid_neighbor_list, style = "W")

grid$lag_knn5_ridership_week <- lag.listw(grid_spatial_weights, grid$ridership_week) #all ridership

grid$lag_knn5_ebike_ridership_week <- lag.listw(grid_spatial_weights, grid$ebike_ridership_week) # e-bike ridership

grid$lag_knn5_standard_ridership_week <- lag.listw(grid_spatial_weights, 
                                                     grid$standard_ridership_week) # standard bike ridership

# Plot
ggplot()+
  geom_sf(data = grid, aes(fill = (lag_knn5_ridership_week)))+
  scale_fill_gradient(low = "#eff3ff", high = "#08519c",
                      name = "Spatial lag of ridership")+
  geom_sf(data = stations, color = "black")+
  labs(title = "Spatial lag of ridership, 5 nearest cells")+
  ylim(39.89, 40.01)+
  xlim(-75.23, -75.12)+
  mapTheme_a()

Given that Indego limits bike rides to a maximum of 60 minutes before bikes must be returned to a station, an important factor for ridership may be how easy it would be to reach another station within an hour. However, not all rides are 60 minutes long - in fact, the average trip duration in 2022 was just over 17 minutes. Therefore, it’s important to look at the variable amount of time that a bike trip may take, and ensure there are stations available within that distance from a station. In doing so, we measured the number of stations available via bike ride from each grid cell centroid in 10 minute increments.

#Set up destinations - other stations (for NN_stations, these are also our trip origins)

station_destinations <- stations %>% # should have 183 stations
     dplyr::select(station) %>%
     rename(id = station) %>%
     mutate(opportunities = 1) %>%
     st_transform("EPSG:4326")

grid_origins <- st_centroid(grid) %>%
  dplyr::select(unique_id) %>%
  rename(id = unique_id) %>%
  st_transform("EPSG:4326")


# NN Stations
# Set up destinations - other stations (for NN_stations, these are also our trip origins)
station_destinations <- stations %>% # should have 183 stations
     dplyr::select(station) %>%
     rename(id = station) %>%
     mutate(opportunities = 1) %>%
     st_transform("EPSG:4326")

grid_origins <- st_centroid(grid) %>%
  dplyr::select(unique_id) %>%
  rename(id = unique_id) %>%
  st_transform("EPSG:4326")


## From stations to stations
#call on accessibility function to calculate travel times between each stations  and the other stations. Save the resulting dataset to be read back in.

# Initialize list of stations
stations_to_stations <- station_destinations %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = station_destinations # origins
  desination_points = station_destinations # destinations
  column_labels = "nn_stations"
  stations_to_stations_access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(stations_to_stations_access_i) <- paste0(column_labels, i) # rename column
  stations_to_stations <- append(stations_to_stations, stations_to_stations_access_i) # append each column
}

stations_to_stations<- as.data.frame(stations_to_stations) # convert to df

#save as csv
stations_to_stations %>% write_csv("data/accessibility/nn_stations_from_stations.csv")

## From grid cells to stations
# Initialize list of grid cells
grid_to_stations <- grid_origins %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, grid centroids to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = grid_origins # origins
  desination_points = station_destinations # destinations
  column_labels = "nn_stations"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  grid_to_stations <- append(grid_to_stations, access_i) # append each column
}

grid_to_stations <- as.data.frame(grid_to_stations) # convert to df

# Save as CSV
grid_to_stations %>% write_csv("data/accessibility/nn_stations_from_grid.csv")

# Read CSV back in
combo_bike_access <- read_csv("data/accessibility/nn_stations_from_grid.csv") %>% 
  merge(grid, ., by.y = "id", by.x = "unique_id") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326")

grid <- combo_bike_access


#Do the same, but for the current stations df - so that we can plot correlations later.
station_combo_bike_access <- read_csv("data/accessibility/nn_stations_from_stations.csv") %>%
  merge(stations, ., by.y = "id", by.x = "station") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326")

stations <- station_combo_bike_access

# Plot

nn_stations_vars <- c("nn_stations10", "nn_stations20", "nn_stations30", "nn_stations40", "nn_stations50", "nn_stations60")

nn_stations_long <- dplyr::select(grid, all_of(nn_stations_vars)) %>%
  rename("10 minutes" = nn_stations10,
         "20 minutes" = nn_stations20,
         "30 minutes" = nn_stations30,
         "40 minutes" = nn_stations40,
         "50 minutes" = nn_stations50,
         "60 minutes" = nn_stations60) %>%
  gather(Variable, value, -geometry)

ggplot()+
  geom_sf(data = nn_stations_long, aes(fill=value), color = NA)+
  scale_fill_gradient(low = "#eff3ff", high = "#08519c")+
  labs(title = "Number of stations accessible within N minutes' bike ride",
       guide = "Number of stations")+
  facet_wrap(~Variable)+
  mapTheme_a()

2-4. Accessibility

When discussing the accessibility of a bike share network, national and local transportation associations predominantly reference the time or distance it takes for an individual to access a station to start their bike trip. For the purpose of this study, we define accessibility as the number of people who live within a 5-minute walk of at least one Indego station. This was selected based on research from the National Association of City Transportation Officials (2015) highlighting that 5 minutes is the maximum amount of time someone is willing walk to use a bike. Here, we felt that tract-level studies would not be granular enough to draw meaningful conclusions.

As shown in the map below, we calculated the number of people living within a 5-minute walk of all of Indego’s stations. Indego’s accessible service area is located predominantly in the central neighborhoods of the city with some extensions towards the south.

# calculate walking distance isochrones of all grid cells that contain a station
indego_isochrones <- mb_isochrone(st_centroid(grid %>% filter(station_count > 0)),
                                 profile = "walking",
                                 time = c(5))


# intersect the polygon with the grid, filter for grid cells that are at least 30% covered by isochrones
intersect_grid <- st_intersection(grid, indego_isochrones) %>%
  mutate(area_intersect = as.numeric(st_area(st_transform(., "ESRI:102729"))),
         proportion_covered = area_intersect / (1000*1000)) %>%
  filter(proportion_covered >= .3) %>%
  st_drop_geometry(.) %>%
  group_by(unique_id) %>%
  summarize(pop_per_cell = mean(pop_per_cell),
            proportion_covered = mean(proportion_covered))

# join to grid
grid_accessibility <- inner_join(grid,
                                 st_drop_geometry(intersect_grid %>%
                                                    dplyr::select(unique_id)),
                                 by = "unique_id")

# plot
ggplot()+
  geom_sf(data = blockgrps, fill = greyPalette5[1], color = greyPalette5[2])+
  geom_sf(data = grid, color=greyPalette5[2], fill = "transparent", size = 0.1, alpha = 0.3)+
  geom_sf(data = grid_accessibility, fill = bluePalette5[2], color=greyPalette5[2], size = 0.1, alpha = 0.5)+
  geom_sf(data = stations, shape=21, color=greyPalette5[4], alpha=0.8, size=1.5, stroke=1, fill="#AFF708")+
  ylim(39.89, 40.01)+
  xlim(-75.23, -75.12)+
  labs(title = "Areas with access to an Indego station\nwithin a 5-minute walk")+
  mapTheme_a()

Calculated this way, about 290,000 Philadelphians currently have access to the Indego system. This is equivalent to about 18% of the city’s 2020 population. Of course, Indego’s expansion plan is a multi-year plan and understandably will require incremental infills and expansions to maintain a robust and highly connected bike share network. This means that only certain areas of the city can be incorporated into Indego’s operating area at a time. Nevertheless, these numbers do suggest disparities in bike share accessibility across the city. The next section will further investigate the racial and socio-economic equity of the system.

2-5. Equity

According to OTIS, Indego’s equity goal is for the bikeshare service area to mimic the ethnic and racial distribution of the city. As previously discussed, we calculated the number of people that live in grid cells within a 5-minute walk of existing Indego stations. In this section, we aggregate the racial and socio-economic data that characterize the different geographies. For instance, on average, residents of areas that have access to at least one Indego station are 46% white, compared to the city’s average of 34%. The results are shown in the table below.

# All of Philadelphia
grid_all_phl <- st_drop_geometry(blockgrps) %>%
  mutate(Category = "All Philadelphia") %>%
  group_by(Category) %>%
  summarize(Mean_Pct_White = round(mean(pct_white, na.rm = TRUE), 1),
            Mean_Pct_Black = round(mean(pct_Black, na.rm = TRUE), 1),
            Mean_Pct_Asian = round(mean(pct_Asian, na.rm = TRUE), 1),
            Mean_Pct_HispLat = round(mean(pct_HispLat, na.rm = TRUE), 1),
            Mean_Pct_Other_Race = round(mean(pct_other_race, na.rm = TRUE), 1),
            Mean_Inc = round(mean(medIncome, na.rm = TRUE)),
            Mean_Pct_Employed = round(mean(pct_employed, na.rm = TRUE), 1),
            Mean_Age = round(mean(medAge, na.rm = TRUE)),
            Total_Population = sum(total_pop)) %>%
  mutate(pct_phl_population = round(Total_Population / (sum(blockgrps$total_pop))*100, 1))

# grid cells with access
grid_ids_with_access <- unique(grid_accessibility$unique_id)

# grid cells by access status
grid_equity_summary <- st_drop_geometry(grid) %>%
  mutate(Category = case_when(
    unique_id %in% grid_ids_with_access ~ "Has Access",
    unique_id %!in% grid_ids_with_access ~ "No Access")) %>%
  filter(Category == "Has Access") %>%
  group_by(Category) %>%
  summarize(Mean_Pct_White = round(mean(pct_white, na.rm = TRUE), 1),
            Mean_Pct_Black = round(mean(pct_Black, na.rm = TRUE), 1),
            Mean_Pct_Asian = round(mean(pct_Asian, na.rm = TRUE), 1),
            Mean_Pct_HispLat = round(mean(pct_HispLat, na.rm = TRUE), 1),
            Mean_Pct_Other_Race = round(mean(pct_other_race, na.rm = TRUE), 1),
            Mean_Inc = round(mean(medIncome, na.rm = TRUE)),
            Mean_Pct_Employed = round(mean(pct_employed, na.rm = TRUE), 1),
            Mean_Age = round(mean(medAge, na.rm = TRUE)),
            Total_Population = sum(pop_per_cell, na.rm = TRUE)) %>%
  mutate(pct_phl_population = round(Total_Population / (sum(blockgrps$total_pop))*100, 1)) %>%
  rbind(., grid_all_phl)

# Kable table
grid_equity_summary %>%
  st_drop_geometry() %>%
  kbl(format.args = list(big.mark = ","),
      col.names = c("Category",
                    "% White", 
                    "% Black",
                    "% Asian",
                    "% Hispanic or Latino",
                    "% Other Race",
                    "Median Income",
                    "% Employed",
                    "Median Age",
                    "Population",
                    "% Philadelphia Population"),
      caption = "Station Accessibility* by Demographics - 2022 System") %>%
  kable_classic_2(full_width = F) %>%
  footnote(general_title = "\n",
           general = "Accessibility is defined as the number of people who live within a 5-minute walk of at least 1 Indego station.\nValues rounded to nearest whole number")

Station Accessibility* by Demographics - 2022 System
Category	% White	% Black	% Asian	% Hispanic or Latino	% Other Race	Median Income	% Employed	Median Age	Population	% Philadelphia Population
Has Access	46.1	32.9	9.1	8.1	6.9	69,710	52.5	35	290,038	18.3
All Philadelphia	33.7	42.8	6.7	13.7	5.7	57,699	46.1	37	1,581,531	100.0

Accessibility is defined as the number of people who live within a 5-minute walk of at least 1 Indego station. Values rounded to nearest whole number

From the below visualizations, we can observe that areas inside the Indego bike share network tend to be whiter, wealthier, and more employed than areas outside of the network. While these plots cannot speak directly to the causality of these trends, they still suggest racial and economic disparities in terms of access to the bike share system in Philadelphia.

equity1 <- ggplot(data = grid_equity_summary %>%
    mutate(Category  = fct_relevel(Category, "Has Access", "All Philadelphia")))+
  geom_bar(aes(x = Category, y = Mean_Pct_White), stat = "identity", 
           fill= c(bluePalette5[3], bluePalette5[5]),width = 0.5)+
  labs(title = "Proportion of White Residents",
       subtitle = "Indego Service Area : 5-minute walk",
       x = "Accessibility Status",
       y = "Median proportion of white residents")+
  plotTheme_c()

# Min
equity2 <- ggplot(data = grid_equity_summary %>%
    mutate(Category  = fct_relevel(Category, "Has Access", "All Philadelphia")))+
  geom_bar(aes(x = Category, y = Mean_Pct_Black), stat = "identity", 
           fill= c( bluePalette5[3], bluePalette5[5]), width = 0.5)+
  labs(title = "Proportion of Black Residents",
       subtitle = "Indego Service Area : 5-minute walk",
       x = "Accessibility Status",
       y = "Median Proportion of Black residents")+
  plotTheme_c()

# Min
equity3 <- ggplot(data = grid_equity_summary %>%
    mutate(Category  = fct_relevel(Category, "Has Access", "All Philadelphia")))+
  geom_bar(aes(x = Category, y = Mean_Inc), stat = "identity", 
           fill= c(bluePalette5[3],bluePalette5[5]), width = 0.5)+
  labs(title = "Median Household Income",
       subtitle = "Indego Service Area : 5-minute walk",
       x = "Accessibility Status",
       y = "Median household income")+
  plotTheme_c()

# Min
equity4 <- ggplot(data = grid_equity_summary %>%
    mutate(Category  = fct_relevel(Category, "Has Access", "All Philadelphia")))+
  geom_bar(aes(x = Category, y = Mean_Pct_Employed), stat = "identity",
           fill= c(bluePalette5[3], bluePalette5[5]), width = 0.5)+
  labs(title = "Proportion of Employed Residents",
       subtitle = "Indego Service Area : 5-minute walk",
       x = "Accessibility Status",
       y = "Median proportion of employed residents")+
  plotTheme_c()


grid.arrange(equity1, equity2, equity3, equity4, nrow = 2)

2.6 Additional Explanatory Features

In this section, we investigate features believed to contribute to ridership levels as well as aspects of accessibility and equity. We feature engineered new variables to understand what affects bike share ridership in Philadelphia. These features include distance to various amenities and services, including the bike network, SEPTA subway stops and regional rail stations, the high injury network, parks and outdoor spaces, restaurants, grocery stores, tourist attractions, and jobs. The final results are shown below.

For each feature, we calculate distance with respect to stations, e.g. measure the distance from each station to the nearest park, in order to understand how these features affect ridership. We also calculate distance with respect to each analysis grid cell, e.g. measure the distance from each grid cell centroid to the nearest park. This ultimately lets us predict future ridership for each grid cell.

2-6-1. Bike Path Network

We hypothesize that proximity to the bike network boosts station ridership. To evaluate this, we sourced the latest available bike network data (2016) from Open Data Philly and calculated distances from bike path network to existing bike stations. For later modeling, we also calculated the distances from the bike path network to each fishnet grid cell.

bike_network <- st_read("https://opendata.arcgis.com/datasets/b5f660b9f0f44ced915995b6d49f6385_0.geojson") %>%
  st_transform(crs= 4326)

#Calculate distance from each station to nearest bike network length

# Prepare stations for measuring distance
stations_for_measuring <- stations %>%
  dplyr::select() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Prepare for measuring distance
bike_network_for_measuring <- bike_network %>%
  dplyr::select() %>%
  st_centroid() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Measure distance to nearest bike newtork length centroid
nn_bike_network <- get.knnx(bike_network_for_measuring, # measure to
                    stations_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
stations <- cbind(stations, nn_bike_network) %>%
  rename("bike_network_dist" = "nn_bike_network") # distance in feet to nearest bike network length centroid

#Calculate distance from bike network to each grid cell

# Measure distance to nearest park
nn_net_bike_network <- get.knnx(bike_network_for_measuring, # measure to
                    fishnet_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to grid df
grid <- cbind(grid, nn_net_bike_network) %>%
  rename("bike_network_dist" = "nn_net_bike_network")


# Plot
ggplot()+
  geom_sf(data = blockgrps, fill=greyPalette5[1], color="transparent")+
  geom_sf(data = bike_network, color = bluePalette5[2], size=2)+
  geom_sf(data = stations, shape=21, color=greyPalette5[4], alpha=0.8, size=1.5, stroke=1, fill="#AFF708")+
    ylim(39.89, 40.01)+
    xlim(-75.23, -75.12)+
  labs(title = "Bike Path Network Data",
       subtitle = "Open Data Philly, 2016",
       #caption = 'Figure 2-2-4-1'
       )+
  mapTheme_a()

2-6-2. Parks

Similarly, we suspect that proximity to parks may be associated with higher ridership, since parks are an attractive destination for outdoorsy people, who are likely to also be interested in biking. We sourced park data from Open Data Philly and calculated the minimum distances from stations to parks, and from parks to grid cells for later modeling.

parks <- st_read("https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson") %>%
  st_transform("EPSG:4326") %>%
  dplyr::select(PUBLIC_NAME, PROPERTY_CLASSIFICATION, ACREAGE, PPR_USE) %>%
  filter(PROPERTY_CLASSIFICATION != "POCKET_PARK") #remove pocket parks, since they are so small


# Calculate distance from each station to nearest park

# Prepare parks for measuring distance
parks_for_measuring <- parks %>%
  dplyr::select() %>%
  st_centroid() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Prepare stations for measuring distance
stations_for_measuring <- stations %>%
  dplyr::select() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Measure distance to nearest park
nn_park <- get.knnx(parks_for_measuring, # measure to
                    stations_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
stations <- cbind(stations, nn_park) %>%
  rename("Park_dist" = "nn_park")


#Calculate distance from each grid cell centroid to nearest park centroid

# Prepare stations for measuring distance
fishnet_for_measuring <- fishnet_centroids %>%
  dplyr::select() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Measure distance to nearest park
nn_net_park <- get.knnx(parks_for_measuring, # measure to
                    fishnet_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
grid <- cbind(grid, nn_net_park) %>%
  rename("Park_dist" = "nn_net_park")


# Plot
ggplot()+
  geom_sf(data = blockgrps, fill=greyPalette5[1], color=greyPalette5[2], size = 0.1)+
  geom_sf(data = parks, aes(fill = PROPERTY_CLASSIFICATION), color=greyPalette5[1], size = 0.1, alpha=0.8)+
  scale_fill_manual(values = c(bluePalette5[3],bluePalette5[5],bluePalette5[4])) +
  geom_sf(data = stations, shape=21, color=greyPalette5[4], alpha=0.8, size=1.5, stroke=1, fill="#AFF708")+
    ylim(39.89, 40.01)+
    xlim(-75.23, -75.12)+
  labs(title = "Parks",
       subtitle = "Open Data Philly, 2016",
       fill = "Park Type")+
  mapTheme_a()

2-6-3. High Injury Network

The High Injury Network is made up of all the streets with the highest rates of fatalities and serious injuries per mile in Philadelphia. Just 12% of Philly’s streets see 80% of all traffic deaths. We hypothesize that bikers may be sensitive to the safety of a street, and therefore may be less inclined to rent a bike close to one of these network lengths. To investigate this, we calculated distance from stations and grid cells to their nearest High Injury Network length in 2020.

high_injury <- st_read("https://phl.carto.com/api/v2/sql?filename=high_injury_network_2020&format=geojson&skipfields=cartodb_id&q=SELECT+*+FROM+high_injury_network_2020") %>%
  st_transform("EPSG:4326")

#Calculate distance from each station to nearest high injury network length

# Prepare for measuring distance
high_injury_for_measuring <- high_injury %>%
  dplyr::select() %>%
  st_centroid() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Measure distance to nearest high injury network centroid
nn_high_injury <- get.knnx(high_injury_for_measuring, # measure to
                    stations_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
stations <- cbind(stations, nn_high_injury) %>%
  rename("high_injury_dist" = "nn_high_injury") # distance in feet to nearest high injury network length

# Calculate distance from high injury network to each grid cell

# Measure distance to nearest park
nn_net_high_injury <- get.knnx(high_injury_for_measuring, # measure to
                    fishnet_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
grid <- cbind(grid, nn_net_high_injury) %>%
  rename("high_injury_dist" = "nn_net_high_injury")

# Plot
ggplot()+
  geom_sf(data = blockgrps, fill=greyPalette5[1], color=greyPalette5[2], size = 0.1)+
  geom_sf(data = high_injury, linetype = 1, color = bluePalette5[4], lwd=0.7)+
  geom_sf(data = stations, shape=21, color=greyPalette5[4], alpha=0.8, size=1.5, stroke=1, fill="#AFF708")+
  ylim(39.89, 40.01)+
  xlim(-75.23, -75.12)+
  labs(title = "High Injury Network" ,
       subtitle = 'Open Data Philly 2020')+
  mapTheme_a()

2-6-4. SEPTA

With the growing trend in multimodal mobility, we expect that some system customers may use Indego to connect to other transit options. To look into this, we calculated the distance from stations and grid cells to their nearest SEPTA subway station, their nearest bus or trolley stop, and their nearest regional rail station.

septa_stops <- st_read("data/septa_stops/septa_stops.geojson") %>%
  st_transform("EPSG:4326")%>%
  mutate(type = sub("(.*_)([a-z]+)(:.*)", "\\2", stop_id)) # extract bus/rail label

subway_routes <- st_read("data/septa_stops/septa_routes.geojson") %>%
  st_transform("EPSG:4326") %>%
  filter(mode == "SUBWAY")

# ID subway stops
subway_stop_names <- c("Fern Rock Transportation Center",
                      "Olney Transportation Center",
                      "Logan Station - BSL",
                      "Wyoming Station - BSL", 
                      "Hunting Park Station - BSL",
                      "Erie Station - BSL", 
                      "Allegheny Station - BSL", 
                      "North Philadelphia Station - BSL", 
                      "Susquehanna Dauphin Station - BSL",
                      "Cecil B Moore Station - BSL",
                      "Girard Station - BSL", 
                      "Fairmount Station - BSL", 
                      "Spring Garden Station - BSL", 
                      "Race Vine Station - BSL", 
                      "City Hall Station - BSL", 
                      "Walnut Locust Station - BSL", 
                      "Lombard South Station - BSL", 
                      "Ellsworth Federal Station - BSL", 
                      "Tasker Morris Station - BSL", 
                      "Snyder Station - BSL", 
                      "Oregon Station - BSL", #
                      "NRG Station - BSL",
                      "8th St & Market St - BSL",
                      "Chinatown Station - BSL",
                      "69th Street Transportation Center",
                      "Millbourne Station - MFL", 
                      "63rd St Station - MFL",
                      "60th St Station - MFL", 
                      "56th St Station - MFL", 
                      "52nd St Station - MFL", 
                      "46th St Station - MFL", 
                      "40th St Station - MFL", 
                      "34th St Station - MFL",
                      "30th St Station - MFL", 
                      "15th St Station - MFL", 
                      "13th St Station - MFL", 
                      "11th St Station - MFL", 
                      "8th St Station - MFL",
                      "5th St Independence Hall Station - MFL", 
                      "2nd St Station - MFL",
                      "Spring Garden Station - MFL", 
                      "Berks Station - MFL", 
                      "Girard Station - MFL",
                      "York Dauphin Station - MFL", 
                      "Huntingdon Station - MFL", 
                      "Somerset Station - MFL", 
                      "Allegheny Station - MFL", 
                      "Tioga Station - MFL", 
                      "Erie Torresdale Station - MFL", 
                      "Church Station - MFL", 
                      "Arrott Transportation Center - MFL",
                      "Frankford Transportation Center - MFL")

subway_stops <- septa_stops %>%
  filter(stop_name %in% subway_stop_names)

bus_trolley_stops <- septa_stops %>%
  filter(type == "bus") %>%
  filter(stop_name %!in% subway_stop_names)
  
rail_stops <- septa_stops %>%
  filter(type == "rail")


# Distance to subway stops

# Prepare for measuring distance
subway_stops_for_measuring <- subway_stops %>%
  #filter(linked_to_street == "TRUE") %>% # remove null points
  dplyr::select() %>%
  st_centroid() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Measure distance to nearest septa stop centroid
nn_subway_stops <- get.knnx(subway_stops_for_measuring, # measure to
                    stations_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
stations <- cbind(stations, nn_subway_stops) %>%
  rename("subway_stop_dist" = "nn_subway_stops") # distance in feet to nearest septa stop


#Calculate distance from subway stops to each grid cell
# Measure distance to nearest park
nn_net_subway_stops <- get.knnx(subway_stops_for_measuring, # measure to
                    fishnet_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
grid<- cbind(grid, nn_net_subway_stops) %>%
  rename("subway_stop_dist" = "nn_net_subway_stops")


# Distance to Bus and Trolley stops

# Prepare for measuring distance
bus_trolley_stops_for_measuring <- bus_trolley_stops %>%
  filter(linked_to_street == "TRUE") %>% # remove null points
  dplyr::select() %>%
  st_centroid() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Measure distance to nearest septa stop centroid
nn_bus_trolley_stops <- get.knnx(bus_trolley_stops_for_measuring, # measure to
                    stations_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
stations <- cbind(stations, nn_bus_trolley_stops) %>%
  rename("bus_trolley_stop_dist" = "nn_bus_trolley_stops") # distance in feet to nearest septa stop

# Calculate distance from bus and trolley stops to each grid cell
# Measure distance to nearest park
nn_net_bus_trolley_stops <- get.knnx(bus_trolley_stops_for_measuring, # measure to
                    fishnet_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
grid <- cbind(grid, nn_net_bus_trolley_stops) %>%
  rename("bus_trolley_stop_dist" = "nn_net_bus_trolley_stops")


# Distance to Rail stops

# Prepare for measuring distance
rail_stops_for_measuring <- rail_stops %>%
  filter(linked_to_street == "TRUE") %>% # remove null points
  dplyr::select() %>%
  st_centroid() %>%
  st_transform("ESRI:102729") %>% # transform so measurements are in feet
  st_coordinates()

# Measure distance to nearest rail stop centroid
nn_rail_stops <- get.knnx(rail_stops_for_measuring, # measure to
                    stations_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
stations <- cbind(stations, nn_rail_stops) %>%
  rename("rail_stop_dist" = "nn_rail_stops") # distance in feet to nearest rail stop


# Calculate distance from rail stops to each grid cell
# Measure distance to nearest rail stop
nn_net_rail_stops <- get.knnx(rail_stops_for_measuring, # measure to
                    fishnet_for_measuring, # measure from
                    k=1)$nn.dist # k = 1, return the distance

# join to stations df
grid <- cbind(grid, nn_net_rail_stops) %>%
  rename("rail_stop_dist" = "nn_net_rail_stops")

# Plot
ggplot()+
  geom_sf(data = blockgrps, fill=greyPalette5[1], color=greyPalette5[2], size = 0.1)+
  geom_sf(data=subway_routes,  linetype = 1, color = bluePalette5[4], lwd=0.7)+
  geom_sf(data = stations, shape=21, color=greyPalette5[4], alpha=0.8, size=1.5, stroke=1, fill="#AFF708")+
  geom_sf(data = subway_stops, shape=21, color=greyPalette5[4], alpha=0.8, size=2.2, stroke=1, fill=bluePalette5[4])+
  labs(title = "SEPTA Subway Stations" ,
       subtitle = '2022')+
  ylim(39.904, 40.04)+
  xlim(-75.28, -75.05)+
  mapTheme_a()

2-6-5. Job Opportunities

Additionally, we expect that some customers use Indego to travel to and from work, and therefore ridership may be higher in areas with more jobs. We used the LODES latest available (2019) dataset of job origin and destinations to identify jobs across the city. We looked both at job density across census block groups, as shown in the first map below, as well as the number of jobs available within a certain distance from each grid cell. To dive a bit deeper, we calculated the number of jobs from each station and grid cell accessible via bike within different increments of time. The grid maps below highlight that hte concentration of jobs accessible by bike are located in Center City. This extends outwards as the minutes of bike rides increase. We understand that since the latest available data is from 2019, it may be misleading given the impact of the pandemic. However, we still think it may be a useful tool to understand where users are generally drawn to.

jobs <- st_read("data/loeds_2019/lehd_pa.csv") %>%
  filter(substr(w_geocode, 1, 5) == "42101") %>% # filter for work GEOIDs within Philadelphia
  mutate(dest_tract = substr(as.character(w_geocode), 1, 12)) %>% #extract block group
  group_by(dest_tract) %>%
  summarise(total_jobs = sum(as.numeric(S000))) %>%
  rename(GEOID = dest_tract)

# join to blockgroups df

blockgrps <- left_join(blockgrps, jobs, by = "GEOID")%>%
  mutate(job_density = round(total_jobs / area * 1000000,2))


ggplot() +
  #annotation_map_tile("cartodark", forcedownload = FALSE)+
  geom_sf(data = blockgrps, fill=greyPalette5[1], color=greyPalette5[2], size = 0.1)+
  geom_sf(data = blockgrps, aes(fill = q5(job_density)), color = greyPalette5[1], size = 0.2, alpha = 0.8) +
  scale_fill_manual(values = bluePalette5,
                    labels = qBr(blockgrps, "job_density"),
                    name = "Jobs per km2\n(quantile)",
                    na.translate=FALSE) +
  labs(title = "Job Density",
       subtitle = "LODES Employment Data 2019")  +
  mapTheme_a()

## Stations to # Jobs - Biking
jobs_available <- stations %>%
     dplyr::select(station, total_jobs) %>%
     rename(id = station,
            opportunities = total_jobs) %>%
     st_transform("EPSG:4326")

# Initialize list of stations
stations_to_jobs_bike <- station_destinations %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = station_destinations # origins
  desination_points = jobs_available # destinations
  column_labels = "nn_jobs"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  stations_to_jobs_bike <- append(stations_to_jobs_bike, access_i) # append each column
}

stations_to_jobs_bike <- as.data.frame(stations_to_jobs_bike) # convert to df
g(stations_to_jobs_bike) # glimpse

stations_to_jobs_bike %>% write_csv("data/accessibility/nn_jobs_from_stations_bike.csv")

## Grid to jobs
# Initialize list of stations
grid_to_jobs_bike <- grid_origins %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = grid_origins # origins
  desination_points = jobs_available # destinations
  column_labels = "nn_jobs"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  grid_to_jobs_bike <- append(grid_to_jobs_bike, access_i) # append each column
}

grid_to_jobs_bike <- as.data.frame(grid_to_jobs_bike) # convert to df
g(grid_to_jobs_bike) # glimpse

# write to csv
grid_to_jobs_bike %>% write_csv("data/accessibility/nn_jobs_from_grid_bike.csv")


# from grid cells - read back in
combo_bike_nn_jobs_from_grid <- read_csv("data/accessibility/nn_jobs_from_grid_bike.csv") %>%
  merge(grid, ., by.y = "id", by.x = "unique_id") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326")

grid <- combo_bike_nn_jobs_from_grid

# from stations
combo_bike_nn_jobs_from_stations <- read_csv("data/accessibility/nn_jobs_from_stations_bike.csv") %>%
  merge(stations, ., by.y = "id", by.x = "station") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326")

stations <- combo_bike_nn_jobs_from_stations

# Plot
nn_jobs_vars <- c("nn_jobs10",
                         "nn_jobs20",
                         "nn_jobs30",
                         "nn_jobs40",
                         "nn_jobs50",
                         "nn_jobs60")

nn_jobs_long <- dplyr::select(grid, all_of(nn_jobs_vars)) %>%
  rename("10 minutes" = nn_jobs10,
         "20 minutes" = nn_jobs20,
         "30 minutes" = nn_jobs30,
         "40 minutes" = nn_jobs40,
         "50 minutes" = nn_jobs50,
         "60 minutes" = nn_jobs60) %>%
  gather(Variable, value, -geometry)

ggplot()+
  geom_sf(data = nn_jobs_long, aes(fill=value), color = NA)+
  scale_fill_gradient(low = "#eff3ff", high = "#08519c")+
  labs(title = "Number of jobs accessible within N minutes' bike ride",
       guide = "Number of jobs")+
  facet_wrap(~Variable)+
  mapTheme_a()

2-6-6. Restaurants

Furthermore, we looked at locations of entertainment, such as restaurants, to see how this may affect Indego ridership. Similarly to above, we explored the number of restaurants within various distances from each grid cell. We calculated the number of restaurants nearby in walking distance and biking distance over time intervals. The results are shown below. Download OSM’s restaurant points for Philadelphia

# Restaurants in walking distance
# OSM download Code by Michael Fichman - github.com/mafichman

# Check out the wiki for the key/value pairs for
# Different Amenities - that lets you figure out what you
# Can call from OSM https://wiki.openstreetmap.org/wiki/Category:Tag_descriptions_by_value

# set bounding box - the maximum x-y extent you are interested in
q0 <- opq(bbox = c(-75.3,39.85,-74.9,40.15)) 

restaurant <- add_osm_feature(opq = q0, key = 'amenity', value = "restaurant") %>%
  osmdata_sf(.)

restaurant.sf <- st_geometry(restaurant$osm_points) %>%
  st_transform(4326) %>%
  st_sf() %>%
  cbind(., restaurant$osm_points$name) %>%
  cbind(., restaurant$osm_points$osm_id) %>%
  cbind(., restaurant$osm_points$cuisine)

# Remove "restaurant.osm_points." string from column names
colnames(restaurant.sf) <- gsub("restaurant.osm_points.","",colnames(restaurant.sf))

# remove points that aren't in Philly

restaurants <- st_join(restaurant.sf, 
                       (phl_boundary %>% st_transform(4326) %>% mutate(PHL = "yes")),
                       join = st_intersects) %>%
  filter(PHL == "yes") %>%
  dplyr::select(osm_id) %>%
  st_transform("EPSG:4326") %>%
  rename(id = osm_id) %>% # for R5R calculation
  mutate(opportunities = 1) # for R5R calculation

### Walking - Stations to restaurants
# Initialize list of stations
stations_to_restaurants <- station_destinations %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "WALK"
  origin_points = station_destinations # origins
  desination_points = restaurants # destinations
  column_labels = "nn_restaurants"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  stations_to_restaurants <- append(stations_to_restaurants, access_i) # append each column
}

stations_to_restaurants <- as.data.frame(stations_to_restaurants) # convert to df
g(stations_to_restaurants) # glimpse

# stations_to_restaurants %>% write_csv("data/accessibility/nn_restaurants_from_stations_walking.csv")


### Walking - Grid to restaurants
# Initialize list of stations
grid_to_restaurants <- grid_origins %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "WALK"
  origin_points = grid_origins # origins
  desination_points = restaurants # destinations
  column_labels = "nn_restaurants"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  grid_to_restaurants <- append(grid_to_restaurants, access_i) # append each column
}

grid_to_restaurants <- as.data.frame(grid_to_restaurants) # convert to df
g(grid_to_restaurants) # glimpse

# Write
grid_to_restaurants %>% write_csv("data/accessibility/nn_restaurants_from_grid_walking.csv")


# Read back in
combo_nn_restaurant_by_grid <- read_csv("data/accessibility/nn_restaurants_from_grid_walking.csv") %>%
  merge(grid, ., by.y = "id", by.x = "unique_id") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326") %>%
  rename(
    "nn_restaurants_walk_10" = nn_restaurants10,
    "nn_restaurants_walk_20" = nn_restaurants20,
    "nn_restaurants_walk_30" = nn_restaurants30,
    "nn_restaurants_walk_40" = nn_restaurants40,
    "nn_restaurants_walk_50" = nn_restaurants50,
    "nn_restaurants_walk_60" = nn_restaurants60
     )

grid <- combo_nn_restaurant_by_grid  


combo_nn_restaurant_by_stations <- read_csv("data/accessibility/nn_restaurants_from_stations_walking.csv") %>%
  merge(stations, ., by.y = "id", by.x = "station") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326") %>%
  rename(
    "nn_restaurants_walk_10" = nn_restaurants10,
    "nn_restaurants_walk_20" = nn_restaurants20,
    "nn_restaurants_walk_30" = nn_restaurants30,
    "nn_restaurants_walk_40" = nn_restaurants40,
    "nn_restaurants_walk_50" = nn_restaurants50,
    "nn_restaurants_walk_60" = nn_restaurants60
     )

stations <- combo_nn_restaurant_by_stations

# Plot
nn_restaurants_vars <- c("nn_restaurants_walk_10",
                         "nn_restaurants_walk_20",
                         "nn_restaurants_walk_30",
                         "nn_restaurants_walk_40",
                         "nn_restaurants_walk_50",
                         "nn_restaurants_walk_60")

nn_restaurants_long <- dplyr::select(grid, all_of(nn_restaurants_vars)) %>%
  rename(
         "10 minutes" = nn_restaurants_walk_10,
         "20 minutes" = nn_restaurants_walk_20,
         "30 minutes" = nn_restaurants_walk_30,
         "40 minutes" = nn_restaurants_walk_40,
         "50 minutes" = nn_restaurants_walk_50,
         "60 minutes" = nn_restaurants_walk_60,) %>%
  gather(Variable, value, -geometry)

ggplot()+
  geom_sf(data = nn_restaurants_long, aes(fill=value), color = NA)+
  scale_fill_gradient(low = "#eff3ff", high = "#08519c")+
  labs(title = "Number of restaurants accessible within N minutes' walking distance",
       guide = "Number of restaurants")+
  facet_wrap(~Variable)+
  mapTheme_a()

### Biking - Stations to restaurants
# Initialize list of stations
stations_to_restaurants_bike <- station_destinations %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = station_destinations # origins
  desination_points = restaurants # destinations
  column_labels = "nn_restaurants"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  stations_to_restaurants_bike <- append(stations_to_restaurants_bike, access_i) # append each column
}

stations_to_restaurants_bike <- as.data.frame(stations_to_restaurants_bike) # convert to df
g(stations_to_restaurants_bike) # glimpse

stations_to_restaurants_bike %>% write_csv("data/accessibility/nn_restaurants_from_stations_bike.csv")

### Grid to Restaurants via bike
# Initialize list of stations
grid_to_restaurants_bike <- grid_origins %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = grid_origins # origins
  desination_points = restaurants # destinations
  column_labels = "nn_restaurants"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  grid_to_restaurants_bike <- append(grid_to_restaurants_bike, access_i) # append each column
}

grid_to_restaurants_bike <- as.data.frame(grid_to_restaurants_bike) # convert to df
g(grid_to_restaurants_bike) # glimpse

# write
grid_to_restaurants_bike %>% write_csv("data/accessibility/nn_restaurants_from_grid_bike.csv")

# Read in
combo_bike_nn_restaurants_from_grid <- read_csv("data/accessibility/nn_restaurants_from_grid_bike.csv") %>%
  merge(grid, ., by.y = "id", by.x = "unique_id") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326") %>%
  rename(
    "nn_restaurants_bike_10" = nn_restaurants10,
    "nn_restaurants_bike_20" = nn_restaurants20,
    "nn_restaurants_bike_30" = nn_restaurants30,
    "nn_restaurants_bike_40" = nn_restaurants40,
    "nn_restaurants_bike_50" = nn_restaurants50,
    "nn_restaurants_bike_60" = nn_restaurants60
     )

grid <- combo_bike_nn_restaurants_from_grid


combo_bike_nn_restaurants_from_stations <- read_csv("data/accessibility/nn_restaurants_from_stations_bike.csv") %>%
  merge(stations, ., by.y = "id", by.x = "station") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326") %>%
  rename(
    "nn_restaurants_bike_10" = nn_restaurants10,
    "nn_restaurants_bike_20" = nn_restaurants20,
    "nn_restaurants_bike_30" = nn_restaurants30,
    "nn_restaurants_bike_40" = nn_restaurants40,
    "nn_restaurants_bike_50" = nn_restaurants50,
    "nn_restaurants_bike_60" = nn_restaurants60
     )

stations <- combo_bike_nn_restaurants_from_stations

# Plot
nn_bike_restaurants_vars <- c(
                         "nn_restaurants_bike_10",
                         "nn_restaurants_bike_20",
                         "nn_restaurants_bike_30",
                         "nn_restaurants_bike_40",
                         "nn_restaurants_bike_50",
                         "nn_restaurants_bike_60")

nn_bike_restaurants_long <- dplyr::select(grid, all_of(nn_bike_restaurants_vars)) %>%
  rename(
         "10 minutes" = nn_restaurants_bike_10,
         "20 minutes" = nn_restaurants_bike_20,
         "30 minutes" = nn_restaurants_bike_30,
         "40 minutes" = nn_restaurants_bike_40,
         "50 minutes" = nn_restaurants_bike_50,
         "60 minutes" = nn_restaurants_bike_60) %>%
  gather(Variable, value, -geometry)

ggplot()+
  geom_sf(data = nn_bike_restaurants_long, aes(fill=value), color = NA)+
  scale_fill_gradient(low = "#eff3ff", high = "#08519c")+
  labs(title = "Number of restaurants accessible within N minutes' bike ride",
       guide = "Number of restaurants")+
  facet_wrap(~Variable)+
  mapTheme_a()

2-6-7. Grocery Stores Accessible via Bike

Access to essential amenities such as grocery stores may be very important to Indego customers. Once again, we calculated the number of grocery stores available from each station and grid cell via bike, in various increments of time.

#Download OSM's grocery store points for Philadelphia
q0 <- opq(bbox = c(-75.3,39.85,-74.9,40.15)) 

groceries <- add_osm_feature(opq = q0, key = 'shop', value = "supermarket")  %>%
  osmdata_sf(.)

groceries.sf <- st_geometry(groceries$osm_points) %>%
  st_transform(4326) %>%
  st_sf() %>%
  cbind(., groceries$osm_points$name) %>%
  cbind(., groceries$osm_points$osm_id)

# Remove "groceries.osm_points." string from column names
colnames(groceries.sf) <- gsub("groceries.osm_points.","",colnames(groceries.sf))

# remove points that aren't in Philly
groceries <- st_join(groceries.sf, 
                       (phl_boundary %>% st_transform(4326) %>% mutate(PHL = "yes")),
                       join = st_intersects) %>%
  filter(PHL == "yes") %>%
  dplyr::select(osm_id) %>%
  st_transform("EPSG:4326") %>%
  rename(id = osm_id) %>% # for R5R calculation
  mutate(opportunities = 1) # for R5R calculation

### Biking - Stations to groceries
# Initialize list of stations
stations_to_groceries <- station_destinations %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = station_destinations # origins
  desination_points = groceries # destinations
  column_labels = "nn_groceries"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  stations_to_groceries <- append(stations_to_groceries, access_i) # append each column
}

stations_to_groceries <- as.data.frame(stations_to_groceries) # convert to df
g(stations_to_groceries) # glimpse

stations_to_groceries %>% write_csv("data/accessibility/nn_groceries_from_stations.csv")

### Grid to Groceries via bike
# Initialize list of groceries
grid_to_groceries <- grid_origins %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = grid_origins # origins
  desination_points = groceries # destinations
  column_labels = "nn_groceries"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  grid_to_groceries <- append(grid_to_groceries, access_i) # append each column
}

grid_to_groceries <- as.data.frame(grid_to_groceries) # convert to df
g(grid_to_groceries) # glimpse

# write
grid_to_groceries %>% write_csv("data/accessibility/nn_groceries_from_grid.csv")

# Read in - by grid cell centroid
nn_groceries_from_grid <- read_csv("data/accessibility/nn_groceries_from_grid.csv") %>%
  merge(grid, ., by.y = "id", by.x = "unique_id") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326")

grid <- nn_groceries_from_grid


# Do the same, but for the current stations df - so that we can plot correlations later.
nn_groceries_from_stations <- read_csv("data/accessibility/nn_groceries_from_stations.csv") %>%
  merge(stations, ., by.y = "id", by.x = "station") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326")

stations <- nn_groceries_from_stations

# Plot
nn_groceries_vars <- c("nn_groceries10",
                         "nn_groceries20",
                         "nn_groceries30",
                         "nn_groceries40",
                         "nn_groceries50",
                         "nn_groceries60")

nn_groceries_long <- dplyr::select(grid, all_of(nn_groceries_vars)) %>%
  rename("10 minutes" = nn_groceries10,
         "20 minutes" = nn_groceries20,
         "30 minutes" = nn_groceries30,
         "40 minutes" = nn_groceries40,
         "50 minutes" = nn_groceries50,
         "60 minutes" = nn_groceries60) %>%
  gather(Variable, value, -geometry)

ggplot()+
  geom_sf(data = nn_groceries_long, aes(fill=value), color = NA)+
  scale_fill_gradient(low = "#eff3ff", high = "#08519c")+
  labs(title = "Number of grocery stores accessible within N minutes' bike ride",
       guide = "Number of grocery stores")+
  facet_wrap(~Variable)+
  mapTheme_a()

2-6-8. Tourist Attractions

Lastly, tourist attractions available by bike may be important for some riders visiting Philadelphia. Using data pulled from Open Street Map, we calculated the number of tourist attractions available from each station and grid cell via bike over different time intervals.

# Download OSM's landmarks points for Philadelphia
#Code by Michael Fichman - github.com/mafichman

# Check out the wiki for the key/value pairs for
# Different Amenities - that lets you figure out what you
# Can call from OSM https://wiki.openstreetmap.org/wiki/Category:Tag_descriptions_by_value

q0 <- opq(bbox = c(-75.3,39.85,-74.9,40.15)) 

landmarks <- add_osm_feature(opq = q0, key = 'tourism', value = c("attraction", "museum"))  %>%
  osmdata_sf(.)

landmarks.sf <- st_geometry(landmarks$osm_points) %>%
  st_transform(4326) %>%
  st_sf() %>%
  cbind(., landmarks$osm_points$name) %>%
  cbind(., landmarks$osm_points$osm_id)

# Remove "groceries.osm_points." string from column names
colnames(landmarks.sf) <- gsub("landmarks.osm_points.","",colnames(landmarks.sf))

# remove points that aren't in Philly
landmarks <- st_join(landmarks.sf, 
                       (phl_boundary %>% st_transform(4326) %>% mutate(PHL = "yes")),
                       join = st_intersects) %>%
  filter(PHL == "yes") %>%
  dplyr::select(osm_id) %>%
  st_transform("EPSG:4326") %>%
  rename(id = osm_id) %>% # for R5R calculation
  mutate(opportunities = 1) # for R5R calculation

### Biking - Stations to landmarks
# Initialize list of stations
stations_to_landmarks <- station_destinations %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = station_destinations # origins
  desination_points = landmarks # destinations
  column_labels = "nn_landmarks"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  stations_to_landmarks <- append(stations_to_landmarks, access_i) # append each column
}

stations_to_landmarks <- as.data.frame(stations_to_landmarks) # convert to df
g(stations_to_landmarks) # glimpse

stations_to_landmarks %>% write_csv("data/accessibility/nn_landmarks_from_stations.csv")

### Grid to landmarks via bike
# Initialize list of groceries
grid_to_landmarks <- grid_origins %>%
  dplyr::select(id) %>%
  st_drop_geometry()

# for loop: 10-60 minutes in 10-minute intervals, stations to stations via bike
for (i in seq(10, 60, by = 10)) { # start, stop, step
  mode_selected = "BICYCLE"
  origin_points = grid_origins # origins
  desination_points = landmarks # destinations
  column_labels = "nn_landmarks"
  access_i <- accessibility(r5r_core,
                                                   origins = origin_points ,
                                                   destinations = desination_points ,
                                                   mode = mode_selected,
                                                   time_window = i,
                                                   percentiles = 75,
                                                   cutoffs = i,
                                                   progress = TRUE) %>%
    dplyr::select(accessibility)
  colnames(access_i) <- paste0(column_labels, i) # rename column
  grid_to_landmarks <- append(grid_to_landmarks, access_i) # append each column
}

grid_to_landmarks <- as.data.frame(grid_to_landmarks) # convert to df
g(grid_to_landmarks) # glimpse

# Write
grid_to_landmarks %>% write_csv("data/accessibility/nn_landmarks_from_grid.csv")

#By grid cell centroid
nn_attractions_from_grid <- read_csv("data/accessibility/nn_landmarks_from_grid.csv") %>%
  merge(grid, ., by.y = "id", by.x = "unique_id") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326")

grid <- nn_attractions_from_grid


# Do the same, but for the current stations df - so that we can plot correlations later.
nn_attractions_from_stations <- read_csv("data/accessibility/nn_landmarks_from_stations.csv") %>%
  merge(stations, ., by.y = "id", by.x = "station") %>% # join with unique ids for geometries
  st_as_sf() %>%
  st_transform("EPSG:4326")

stations <- nn_attractions_from_stations

# Plot
nn_attractions_vars <- c("nn_landmarks10",
                         "nn_landmarks20",
                         "nn_landmarks30",
                         "nn_landmarks40",
                         "nn_landmarks50",
                         "nn_landmarks60")

nn_attractions_long <- dplyr::select(grid, all_of(nn_attractions_vars)) %>%
  rename("10 minutes" = nn_landmarks10,
         "20 minutes" = nn_landmarks20,
         "30 minutes" = nn_landmarks30,
         "40 minutes" = nn_landmarks40,
         "50 minutes" = nn_landmarks50,
         "60 minutes" = nn_landmarks60) %>%
  gather(Variable, value, -geometry)

ggplot()+
  geom_sf(data = nn_attractions_long, aes(fill=value), color = NA)+
  scale_fill_gradient(low = "#eff3ff", high = "#08519c")+
  labs(title = "Number of tourist attractions accessible within N minutes' bike ride",
       guide = "Number of tourist attractions")+
  facet_wrap(~Variable)+
  mapTheme_a()

2-7. Correlations

To deepen our analysis, we looked at correlations by plotting weekly ridership as a function of station features. Looking at the plots below, we are able to see some strong correlations with ridership, notably with the spatial lag of the 5 nearest stations which has a Pearson coefficient of 0.72. Similarly, there is a high positive correlation between ridership and accessibility, namely the number of stations within a 20 min bike ride which has a coefficient of 0.67. Beyond a 20-minute bike ride, the Pearson coefficient in relation to ridership begins to decrease. Interestingly, we see that as distance to SEPTA subway stations increases, ridership decreases suggesting there is a demand to use Indego to connect to other transit systems. For reference, we have listed prominent predictor variables used in our correlation study below, along with their respective plots.

bike_network_dist : distance (in feet) from a station to nearest bike network length
lag_knn5_ridership: the average number of rides taken from a station’s 5 nearest neighbors
nn_groceries20: the number of grocery stores that one could reach via bike from a station in 20 minutes.
nn_jobs20: the number of jobs that one could reach via bike from a station in 20 minutes.
nn_landmarks20: the number of tourist attractions that one could reach via bike from a station in 20 minutes.
nn_stations20: the number of Indego bike stations that one could reach via bike from a station in 20 minutes.
pct_commute_wlk: the proportion of residents in a station’s block group who walk to work.
rail_stop_dist: the distance (in feet) from a station to the nearest regional rail station.
subway_stop_dist: distance (in feet) from a station to nearest subway stop.

  # Prepare grid data for modelling
  
  # Extract grid cells' coordinates for our later model
grid_lat_lon <- st_centroid(grid) %>%
  dplyr::select(unique_id) %>%
  mutate(centroid_lon = sf::st_coordinates(.)[,1],
                centroid_lat = sf::st_coordinates(.)[,2]) %>%
  st_drop_geometry(.)

grid <- grid %>%
  left_join(., grid_lat_lon, by = "unique_id")

# Remove grid cells that are in the rivers
rivers <- st_read("https://opendata.arcgis.com/datasets/7c17a8e7685b404e8bcfbc7ae1b62de3_0.geojson") %>%
  st_transform("EPSG:4326")%>%
  dplyr::select(CREEK_NAME)

remove_rivers <- st_join(st_centroid(grid), rivers, join=st_intersects)

grid <- left_join(grid, st_drop_geometry(remove_rivers %>% dplyr::select(unique_id, CREEK_NAME)), by = "unique_id") %>% 
  filter(CREEK_NAME %!in% c("Delaware River", "Schuylkill River", "Reserve Basin"),
         border_cell == "NOT_BORDER_CELL") %>%
  dplyr::select(-CREEK_NAME, -border_cell)

# ==== Correlation Plots

correlation.long <- st_drop_geometry(stations) %>%
  dplyr::select(ridership_week,
                #weeks_active,
                
                #totalDocks,
                #ebike_ridership_week,
                #standard_ridership_week,
                #go_live,
                # guest_pass_week,
                # month_pass_week,
                # annual_pass_week,
                
                # demographics
                #total_pop,
                #medRent,
                #medIncome,
                # medAge,
                # pct_white,
                # pct_Black,
                # pct_Asian,
                # pct_HispLat,
                # pct_other_race,
                # pct_employed,
                #pct_rent_spent,
                # pct_commute_car,
                # pct_commute_pub,
                # pct_commute_bic,
                pct_commute_wlk,
                # pct_school,
                #pop_density,
                
                # infrastructure
                #Park_dist,                
                bike_network_dist,
                subway_stop_dist,
                #bus_trolley_stop_dist,
                rail_stop_dist,     
                #high_injury_dist,
                
                # spatial lag
                lag_knn5_ridership_week,   
                # lag_knn5_ebike_ridership_week,
                # lag_knn5_standard_ridership_week,
                
                # nn stations by bike
               #nn_stations10,
               nn_stations20,
               # nn_stations30,
               # nn_stations40,
               # nn_stations50,
               # nn_stations60,
                
                # restaurants walking distance
                #nn_restaurants_walk_10,
                # nn_restaurants_walk_20,
                # nn_restaurants_walk_30,
                # nn_restaurants_walk_40,
                # nn_restaurants_walk_50,
                # nn_restaurants_walk_60,
                
                # restaurants biking distance  
                # nn_restaurants_bike_10,
                # nn_restaurants_bike_20,
                # nn_restaurants_bike_30,
                # nn_restaurants_bike_40,
                # nn_restaurants_bike_50,
                # nn_restaurants_bike_60,
                
                # jobs in biking distance
                # nn_jobs10,
                nn_jobs20,
                # nn_jobs30,
                # nn_jobs40,
                # nn_jobs50,
                # nn_jobs60,
               
               # groceries in biking distance
               # nn_groceries10,
               nn_groceries20,
               # nn_groceries30,
               # nn_groceries40,
               # nn_groceries50,
               # nn_groceries60,
               
               # tourist attractions in biking distance
               # nn_landmarks10,
               nn_landmarks20,
               # nn_landmarks30,
               # nn_landmarks40,
               # nn_landmarks50,
               # nn_landmarks60
                ) %>%
  gather(Variable, Value, - ridership_week) 

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, ridership_week, use = "complete.obs"))
 
  ggplot(correlation.long, aes(Value, ridership_week))+
  geom_point(size = 1, color = bluePalette5[3])+
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.2, size = 3.5, color = bluePalette5[5]) +
  geom_smooth(method = lm,  se = FALSE, colour = bluePalette5[5], size = 1.2)+
  facet_wrap(~Variable, ncol = 3, scales = "free")+
  labs(title = "Ridership as a function of station features")+
  plotTheme_b()

ⅠⅠⅠ. Model

In this next section, we elaborate on the overall process of model selection and building to predict future Indego ridership. As previously mentioned, Indego has plans to build new stations in infill zones of high ridership, in areas on the fringe of the network and in surrounding expansion zones. These three focus areas are depicted below using our fishnet grid. To clarify, our final model aims to predict ridership for each grid cell taking into consideration grid cell features. Additionally, in this portion of the report we walk through various evaluative steps to help determine the accuracy and generalizability of our selected model.

#Indego Area - for train/test split
indego_area <- grid %>%
   filter(station_count > 0)

not_indego_area <- grid %>%
  filter(station_count == 0)

# Set up expansion
expansion_plans <- st_read("data/Secnario_1.shp") %>% st_transform("EPSG:4326")

# join grid to expansion plan areas
grid_expansion_plan <- st_join(st_centroid(model_grid), expansion_plans, join = st_intersects) %>%
  left_join(x=(model_grid %>% dplyr::select(unique_id)), y=st_drop_geometry(.), by = "unique_id") %>%
  mutate(expansion_plan = case_when(
    Type == "Expansion" ~ "Expansion",
     Type == "High Ridership Infill" ~ "High Ridership Infill",
     Type == "Targeted/Fringe Infill" ~ "Targeted/Fringe Infill"),
    station_count = case_when( # insert a station in each scenario grid cell unless there is already a station there
      station_count < 1 & !is.na(expansion_plan) ~ 1, # if station count is zero and it IS a scenario, return 1
      station_count < 1 & is.na(expansion_plan) ~ 0, # id station count is zero and it is NOT a scenario, return 0
      station_count > 0 ~ station_count)) %>% # if station count is not zero, return original station count
  dplyr::select(-id, -Number)

# Plot
ggplot()+
  geom_sf(data = grid_expansion_plan, aes(fill = expansion_plan), color = greyPalette5[3])+
  scale_fill_manual(values = c(bluePalette5[2], bluePalette5[4], bluePalette5[5]), na.value="transparent")+
  labs(title = "Expansion Plans",
       fill = "Expansion Plan")+
  mapTheme_a()

3-1. Model Building

data_split <- initial_split(indego_area, strata = "ridership_week", prop = 0.6)
indego_train <- training(data_split)
indego_test  <- testing(data_split)

# Set up spatial cross validation
spatial_cv <- spatial_clustering_cv(indego_area, v = 5)

# Variables for model
identification_vars <- c(
  "unique_id",
  "GEOID",
  #"geometry",
  "gridcells_per_bg"
)

indego_system_vars <- c(
  #"station_count",
  #"weeks_active",
  "totalDocks",
  "annual_pass_week",
  "ebike_ridership_week",
  "standard_ridership_week",
  "guest_pass_week",
  "lag_knn5_ebike_ridership_week",
  "lag_knn5_ridership_week",
  "lag_knn5_standard_ridership_week",
  "month_pass_week",
  "nn_stations10",
  "nn_stations20",
  "nn_stations30",
  "nn_stations40",
  "nn_stations50",
  "nn_stations60"
)


demographic_vars <- c(
  "total_pop",
  "pop_per_cell",
  "medAge",
  "pct_white",
  "pct_Asian",
  "pct_employed",
  "pct_commute_pub",
  "pct_commute_wlk",
  "pop_density",
  "job_density",
  "medIncome",
  "pct_Black",
  "pct_HispLat",
  "pct_other_race",
  "pct_commute_car",
  "pct_commute_bic",
  "pct_school",
  "total_jobs",
  "lag_pct_white",
  "lag_pct_Asian",
  "lag_pct_employed",
  "lag_pct_commute_pub",
  "lag_pct_commute_wlk",
  "lag_medAge",
  "lag_total_pop",
  "lag_pct_Black",
  "lag_pct_HispLat",
  "lag_pct_other_race",
  "lag_pct_commute_car",
  "lag_pct_commute_bic",
  "lag_pct_school",
  "lag_pop_density",
  
  "total_pop_imputed",
  "pct_Black_imputed",
  "pct_HispLat_imputed",
  "pct_other_race_imputed",
  "pct_commute_car_imputed",
  "pct_commute_bic_imputed",
  "pct_school_imputed",
  #"pop_density_imputed",
  "pct_white_imputed",
  "pct_Asian_imputed",
  "pct_employed_imputed",
  "pct_commute_pub_imputed",
  "pct_commute_wlk_imputed",
  "medAge_imputed"
)

spatial_vars <- c(
"area",
#"bike_network_dist",
"bus_trolley_stop_dist",
"centroid_lat",
"centroid_lon",
"high_injury_dist",
#"Park_dist",
"protected_bike_network_dist"
#"rail_stop_dist",
#"subway_stop_dist"
)

nn_amenities_vars <- c(
  "nn_groceries10",
  #"nn_groceries20",
  "nn_groceries30",
  "nn_groceries40",
  "nn_groceries50",
  "nn_groceries60",
  "nn_jobs10",
  #"nn_jobs20",
  "nn_jobs30",
  "nn_jobs40",
  "nn_jobs50",
  "nn_jobs60",
  "nn_landmarks10",
  #"nn_landmarks20",
  "nn_landmarks30",
  "nn_landmarks40",
  "nn_landmarks50",
  "nn_landmarks60",
  "nn_restaurants_bike_10",
  #"nn_restaurants_bike_20",
  "nn_restaurants_bike_30",
  "nn_restaurants_bike_40",
  "nn_restaurants_bike_50",
  "nn_restaurants_bike_60",
  #"nn_restaurants_walk_10",
  "nn_restaurants_walk_20",
  "nn_restaurants_walk_30",
  "nn_restaurants_walk_40",
  "nn_restaurants_walk_50",
  "nn_restaurants_walk_60"
)

vars_for_prediction <- c(
  "pop_density_imputed",
  "bike_network_dist",
  "Park_dist",
  "rail_stop_dist",
  "subway_stop_dist",
  "station_count",
  "weeks_active",
  "nn_groceries20",
  "nn_jobs20",
  "nn_landmarks20",
  "nn_restaurants_bike_20",
  "nn_restaurants_walk_10"
)

# Set model recipe
model_recipe <- recipe(ridership_week ~ ., data = st_drop_geometry(indego_train)) %>%
  update_role(identification_vars, new_role = "IDs") %>%
  update_role(c(indego_system_vars, demographic_vars, spatial_vars, nn_amenities_vars), new_role = "not_for_prediction") %>%
  step_center(all_predictors(), -ridership_week) %>%
  step_scale(all_predictors(), -ridership_week) %>%
  step_ns(centroid_lat, centroid_lon, options = list(df = 4))

# See the data after all transformations
glimpse(model_recipe %>% prep() %>% juice())

# Set model specifications
# lm_plan <-
#   linear_reg() %>%
#   set_engine("lm")

# glmnet_plan <-
#   linear_reg() %>%
#   set_args(penalty  = tune()) %>%
#   set_args(mixture  = tune()) %>%
#   set_engine("glmnet")

rf_plan <- rand_forest() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 1000) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

XGB_plan <- boost_tree() %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 100) %>% 
  set_engine("xgboost") %>% 
  set_mode("regression")

# nnet_plan <- mlp(
#   mode = "regression",
#   engine = "nnet",
#   hidden_units = 5,
#   penalty = NULL,
#   dropout = 0.1,
#   epochs = 50,
#   activation = "linear",
#   learn_rate = NULL
# ) %>%
#   set_engine("keras", act_funs = "linear")

# svml_plan <- svm_linear(
#   mode = "regression",
#   engine = "LiblineaR",
#   cost = NULL,
#   margin = NULL) %>%
#   set_engine("LiblineaR") %>%
#   set_mode("regression")

poi_plan <- poisson_reg(
  mode = "regression",
  penalty = 1,
  mixture = 1, # lasso
  engine = "glmnet") %>% 
  set_engine("glmnet") %>%
  set_mode("regression")

xp_plan <- boost_tree(
  mode = "regression") %>%
  set_args(mtry  = tune()) %>%
  set_args(min_n = tune()) %>%
  set_args(trees = 100,
           max_depth = 15) %>% 
  set_engine("xgboost", objective = "count:poisson") %>% 
  set_mode("regression")

# Establish hyperparameters
#for glmnet (penalization)
# glmnet_grid <- expand.grid(penalty = seq(0, 1, by = .25),
#                            mixture = seq(0,1,0.25))

rf_grid <- expand.grid(mtry = c(2,5), 
                       min_n = c(1,5))

xgb_grid <- expand.grid(mtry = c(3,5), 
                        min_n = c(1,5))

xp_grid <- expand.grid(mtry = c(8, 10),
                        min_n = c(20, 25, 30))

# Create workflows for each model
# lm_wf <-
#   workflow() %>%
#   add_recipe(model_recipe) %>%
#   add_model(lm_plan)

# glmnet_wf <-
#   workflow() %>%
#   add_recipe(model_recipe) %>%
#   add_model(glmnet_plan)

rf_wf <-
  workflow() %>% 
  add_recipe(model_recipe) %>% 
  add_model(rf_plan)

xgb_wf <-
  workflow() %>% 
  add_recipe(model_recipe) %>% 
  add_model(XGB_plan)

# nnet_wf <- workflow() %>%
#   add_recipe(model_recipe) %>%
#   add_model(nnet_plan)

# svml_wf <- workflow() %>%
#   add_recipe(model_recipe) %>%
#   add_model(svml_plan)

poi_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(poi_plan)

xp_wf <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(xp_plan)


# fit model to workflow and calculate metrics
control <- control_resamples(save_pred = TRUE, verbose = TRUE)
metrics <- metric_set(rmse, rsq, mape, smape)
# rmse - root mean squared error: diff between actual and expected results, squared
# rsq - R-squared: proportion of variance explained
# mape - mean absolute percent error
# smape - symmetric mean absolute percentage error

# lm_tuned <- lm_wf %>%
#   tune::fit_resamples(.,
#                       resamples = spatial_cv,
#                       control   = control,
#                       metrics   = metrics)

# glmnet_tuned <- glmnet_wf %>%
#   tune::tune_grid(.,
#                   resamples = spatial_cv,
#                   grid      = glmnet_grid,
#                   control   = control,
#                   metrics   = metrics)

rf_tuned <- rf_wf %>%
  tune::tune_grid(.,
                  resamples = spatial_cv,
                  grid      = rf_grid,
                  control   = control,
                  metrics   = metrics)

xgb_tuned <- xgb_wf %>%
  tune::tune_grid(.,
                  resamples = spatial_cv,
                  grid      = xgb_grid,
                  control   = control,
                  metrics   = metrics)

# nnet_tuned <- nnet_wf %>%
#   tune::fit_resamples(.,
#                       resamples = spatial_cv,
#                       grid    = nnet_grid,
#                       control = control,
#                       metrics = metrics)

# svml_tuned <- svml_wf %>%
#   tune::fit_resamples(.,
#                       resamples = spatial_cv,
#                       control = control,
#                       metrics = metrics)

poi_tuned <- poi_wf %>%
  tune::fit_resamples(.,
                      resamples = spatial_cv,
                      control = control,
                      metrics = metrics)

xp_tuned <- xp_wf %>%
  tune::tune_grid(.,
                  resamples = spatial_cv,
                  grid      = xp_grid,
                  control   = control,
                  metrics   = metrics)

We built our machine learning models to predict the average number of rides per week for each 1000 sq ft grid cell over the course of a year, where each grid cell has at least one Indego station within it. Our models’ input data is made up of Indego’s 900,000+ rides in 2022 spread across approximately 160 grid cells which host 183 stations.

Each cell contains information on its characteristics that we built in the previous exploratory analysis section. This information includes socio-economic demographics and spatial proximity to different amenities. Our models use this information to predict the average number of weekly rides a station in each cell will see. In order to reduce network bias, we removed Indego-system specific varaibles from our models, such as distance to nearest stations and the spatial lag of ridership. We also removed variables with high multicollinearity to boost model accuracy. To reduce racial and economic bias from our models, which will be trained on Indego’s footprint in Center City - a highly white and wealthy location - we removed racial demographic information from the model. We tested different combinations of the remaining features to find the best combination that provided the most accurate predictions. Ultimately, our model predicts ridership based on the following features:

Population density;
Proximity to: bike path network, parks, Center City, regional rail, and subway stations;
The number of stations present in a grid cell;
The average number of weeks that stations in each grid cell have been active in 2022;
The number of amenities and services that are bikeable within 20 minutes, including: grocery stores, jobs, tourist attractions, and restaurants;
The number of restaurants walkable within 10 minutes.

We built and tested eight different machine learning models to find the best fit for our purpose. The model types we investigated include:

linear regression (lm),
generalized linear model (glmnet),
random forest (rf),
XG Boost (XGB),
neural network (nnet),
support vector machine (svml),
poisson regression (poi), and
XGB-Poisson (xp).

For brevity, our report will focus on comparing just four of these models: random forest (rf), XG Boost (xgb), poisson (poi), and XGB-Poisson (xp). However, code is provided for all eight models.

To support our investigation we split our Indego ridership dataset into training and test sets, keeping 60% of our data (95 grid cells) for training our models, and the remaining 40% (67 grid cells) to test our models’ predictions with actual ridership data. We set up a spatial K-fold cross validation using 5 folds as our cross-validation set. Finally, we established a workflow for each model and set hyperparameter grids when relevant. As part of our model tuning, we tested 320 combinations of hypermarameters to identify the combination with the lowest model error.

3-2. Evaluation

We evaluated our models’ performance as it trained on new data through a spatial k-fold cross validation with 5 folds. As we can see from below, the XGB-Poisson (xp) model appears to predict the most accurately given that its error distribution is clustered around the vertical zero mark. In other words, when predicting ridership, a high number of grid cells have minimal difference between our predictions and the actual ridership. The other models, however, generally perform poorly given that their error distribution appear further spread out from zero.

# Metrics across grid

#autoplot(xgb_tuned)
collect_metrics(xgb_tuned)

#collect_metrics(lm_tuned)
#collect_metrics(glmnet_tuned)
collect_metrics(rf_tuned)
#collect_metrics(nnet_tuned)
#collect_metrics(svml_tuned)
collect_metrics(poi_tuned)
collect_metrics(xp_tuned)

# 'Best' by some metric and margin
#show_best(lm_tuned, metric = "rsq", n = 15)
#show_best(glmnet_tuned, metric = "rsq", n = 15)
show_best(rf_tuned, metric = "rsq", n = 15)
show_best(xgb_tuned, metric = "rsq", n = 15)
#show_best(nnet_tuned, metric = "rsq", n = 15)
#show_best(svml_tuned, metric = "rsq", n = 15)
show_best(poi_tuned, metric = "rsq", n = 15)
show_best(xp_tuned, metric = "rsq", n = 15)

#lm_best_params     <- select_best(lm_tuned, metric = "rmse"    )
#glmnet_best_params <- select_best(glmnet_tuned, metric = "rmse")
rf_best_params     <- select_best(rf_tuned, metric = "rmse"    )
xgb_best_params    <- select_best(xgb_tuned, metric = "rmse"   )
#nnet_best_params   <- select_best(nnet_tuned, metric = "rmse")
#svml_best_params   <- select_best(svml_tuned, metric = "rmse")
poi_best_params   <- select_best(poi_tuned, metric = "rmse")
xp_best_params   <- select_best(xp_tuned, metric = "rmse")

# Final workflow
#lm_best_wf     <- finalize_workflow(lm_wf, lm_best_params)
#glmnet_best_wf <- finalize_workflow(glmnet_wf, glmnet_best_params)
rf_best_wf     <- finalize_workflow(rf_wf, rf_best_params)
xgb_best_wf    <- finalize_workflow(xgb_wf, xgb_best_params)
#nnet_best_wf   <- finalize_workflow(nnet_wf, nnet_best_params)
#svml_best_wf   <- finalize_workflow(svml_wf, svml_best_params)
poi_best_wf   <- finalize_workflow(poi_wf, poi_best_params)
xp_best_wf   <- finalize_workflow(xp_wf, xp_best_params)


# last_fit() emulates the process where, after determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
# lm_val_fit_geo <- lm_best_wf %>%
#   last_fit(split     = data_split,
#            control   = control,
#            metrics   = metrics)
# 
# collect_metrics(lm_val_fit_geo)

# glmnet_val_fit_geo <- glmnet_best_wf %>%
#   last_fit(split     = data_split,
#            control   = control,
#            metrics   = metrics)
# 
# collect_metrics(glmnet_val_fit_geo)

rf_val_fit_geo <- rf_best_wf %>%
  last_fit(split     = data_split,
           control   = control,
           metrics   = metrics)

collect_metrics(rf_val_fit_geo)

xgb_val_fit_geo <- xgb_best_wf %>% 
  last_fit(split     = data_split,
           control   = control,
           metrics   = metrics)

collect_metrics(xgb_val_fit_geo)

# nnet_val_fit_geo <- nnet_best_wf %>%
#    last_fit(split     = data_split,
#            control   = control,
#             metrics   = metrics)
# 
# collect_metrics(nnet_val_fit_geo)

# svml_val_fit_geo <- svml_best_wf %>%
#    last_fit(split     = data_split,
#            control   = control,
#             metrics   = metrics)
# 
# collect_metrics(svml_val_fit_geo)

poi_val_fit_geo <- poi_best_wf %>%
   last_fit(split     = data_split,
           control   = control,
            metrics   = metrics)
 
collect_metrics(poi_val_fit_geo)

xp_val_fit_geo <- xp_best_wf %>%
   last_fit(split     = data_split,
           control   = control,
            metrics   = metrics)
 
collect_metrics(xp_val_fit_geo)

# Pull best hyperparam preds from out-of-fold predictions
#lm_best_OOF_preds <- collect_predictions(lm_tuned)

# glmnet_best_OOF_preds <- collect_predictions(glmnet_tuned) %>%
#   filter(penalty  == glmnet_best_params$penalty[1] & mixture == glmnet_best_params$mixture[1])

rf_best_OOF_preds <- collect_predictions(rf_tuned) %>%
  filter(mtry  == rf_best_params$mtry[1] & min_n == rf_best_params$min_n[1])

xgb_best_OOF_preds <- collect_predictions(xgb_tuned) %>% 
  filter(mtry  == xgb_best_params$mtry[1] & min_n == xgb_best_params$min_n[1])

#nnet_best_OOF_preds <- collect_predictions(nnet_tuned)

#svml_best_OOF_preds <- collect_predictions(svml_tuned)

poi_best_OOF_preds <- collect_predictions(poi_tuned)

xp_best_OOF_preds <- collect_predictions(xp_tuned)%>% 
  filter(mtry  == xp_best_params$mtry[1] & min_n == xp_best_params$min_n[1])


# collect validation set predictions from last_fit model
#lm_val_pred_geo     <- collect_predictions(lm_val_fit_geo)
#glmnet_val_pred_geo <- collect_predictions(glmnet_val_fit_geo)
rf_val_pred_geo     <- collect_predictions(rf_val_fit_geo)
xgb_val_pred_geo    <- collect_predictions(xgb_val_fit_geo)
#nnet_val_pred_geo    <- collect_predictions(nnet_val_fit_geo)
#svml_val_pred_geo    <- collect_predictions(svml_val_fit_geo)
poi_val_pred_geo    <- collect_predictions(poi_val_fit_geo)
xp_val_pred_geo    <- collect_predictions(xp_val_fit_geo)

# Aggregate OOF predictions (they do not overlap with Validation prediction set)
OOF_preds <- rbind(#data.frame(dplyr::select(lm_best_OOF_preds, .pred, ridership_week), model = "lm"),
                   #data.frame(dplyr::select(glmnet_best_OOF_preds, .pred, ridership_week), model = "glmnet"),
                   data.frame(dplyr::select(rf_best_OOF_preds, .pred, ridership_week), model = "rf"),
                   data.frame(dplyr::select(xgb_best_OOF_preds, .pred, ridership_week), model = "xgb"),
                   #data.frame(dplyr::select(nnet_best_OOF_preds, .pred, ridership_week), model ="nnet"),
                   #data.frame(dplyr::select(svml_best_OOF_preds, .pred, ridership_week), model ="svml"),
                   data.frame(dplyr::select(poi_best_OOF_preds, .pred, ridership_week), model ="poi"),
                   data.frame(dplyr::select(xp_best_OOF_preds, .pred, ridership_week), model ="xp")
                              ) %>% 
  group_by(model) %>% 
  mutate(#ridership_week = log(ridership_week),
        error = .pred - ridership_week,
        pct_error = (.pred - ridership_week) / ridership_week,
         RMSE = yardstick::rmse_vec(ridership_week, .pred),
         MAE  = yardstick::mae_vec(ridership_week, .pred),
         MAPE = yardstick::mape_vec(ridership_week, .pred),
        RSQ = yardstick::rsq_vec(ridership_week, .pred),
        MASE = yardstick::mase_vec(ridership_week, .pred)) %>% 
  ungroup() %>% 
  mutate(model = factor(model, levels=c(#"lm",
                                        #"glmnet",
                                        "rf",
                                        "xgb",
                                        #"nnet",
                                        #"svml",
                                        "poi",
                                        "xp"
                                        )))


#Summary of model metrics for OOF predictions
OOF_metrics <- OOF_preds %>%
  group_by(model) %>%
  summarize(avg_RMSE = mean(RMSE),
            avg_MAE = mean(MAE),
            avg_MAPE = mean(MAPE),
            avg_RSQ = mean(RSQ),
            avg_MASE = mean(MASE))

# Plot histogram of error
ggplot()+
  geom_histogram(data = (OOF_preds %>%
         dplyr::select(model, error) %>%
         distinct()),
         aes(x=error, stat = "identity", binwidth = 5), color = "#08519c", fill = "#3182bd", alpha = 0.3)+
  labs(title = "Distribution of Errors by Model")+
  facet_wrap(~model)

To dive deeper, we chose to explore average Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), and Root Mean Square Error (RMSE) of our out-of-fold predictions. Looking at the results, our XGB-Poisson (xp) model has the lowest overall error with a MAPE of 56.8%. Looking at raw numbers, this model also has the lowest MAE, with its average number of incorrectly predicted rides per grid cell at just 45. Given the high variability of ridership across all stations, we consider this level of error to be acceptable. A slightly higher RMSE of 64 does not differ too drastically from that of other models. Generally our poisson (poi) model appears to perform the worst.

# average error for each model
grid.arrange(ncol=1,

ggplot(data = OOF_preds %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = bluePalette5[3]) +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title = "Average MAPE by model")+
  theme_bw(),
  
ggplot(data = OOF_preds %>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = bluePalette5[3]) +
  geom_label(aes(label = paste0(round(MAE)))) +
  labs(title = "Average MAE by model")+
  theme_bw(),
  
ggplot(data = OOF_preds %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = bluePalette5[3]) +
  geom_label(aes(label = paste0(round(RMSE)))) +
  labs(title = "Average RMSE by model")+
  theme_bw())

3-2-1. Accuracy

In this section, we evaluate our trained models’ performance on how well they predict the ridership of our test set. The results of our test set show that our XGB-Poisson model offers the highest accuracy of the other models: its Mean Absolute Percentage Error (MAPE) is 46%, means that our XGB-Poisson model’s predictions over or under predicts, on average, by 46%. Our random forest and XG Boost models (rf and xgb) both have the highest average MAPE scores.

Our XGB-Poisson model also has the lowest Mean Absolute Error (MAE), at 39.9, meaning that on average, its predictions are off by about 40 rides per grid cell per week.

Finally, while our XGB-Poisson model doesn’t have the absolute lowest Root Mean Squared Error (RMSE) rate of all the models, its error rate at 59.1 is comparable to the other models. Given its superior performance on the other metrics, we selected XGB-Poisson as the model for our final predictions.

# Aggregate predictions from Validation set
val_preds <- rbind(#data.frame(lm_val_pred_geo, model = "lm"),
                   #data.frame(glmnet_val_pred_geo, model = "glmnet"),
                   data.frame(rf_val_pred_geo, model = "rf"),
                   data.frame(xgb_val_pred_geo, model = "xgb"),
                   #data.frame(nnet_val_pred_geo, model = "nnet"),
                   #data.frame(svml_val_pred_geo, model = "svml"),
                   data.frame(poi_val_pred_geo, model = "poi"),
                   data.frame(xp_val_pred_geo, model = "xp")
                              ) %>% 
  left_join(., (indego_area %>% dplyr::select(-ridership_week)) %>% 
              rowid_to_column(var = ".row"), 
            by = ".row") %>% 
  group_by(model) %>%
  mutate(error = .pred - ridership_week,
         pct_error = (.pred - ridership_week) / ridership_week,
         RMSE = yardstick::rmse_vec(ridership_week, .pred),
         MAE  = yardstick::mae_vec(ridership_week, .pred),
         MAPE = yardstick::mape_vec(ridership_week, .pred)) %>% 
  ungroup() %>% 
  mutate(model = factor(model, levels=c(#"lm",
                                        #"glmnet",
                                        "rf",
                                        "xgb",
                                        #"nnet",
                                        #"svml",
                                        "poi",
                                        "xp"
                                        )))

# plot MAPE by model type
test_v_train_metrics <- grid.arrange(ncol = 1,

# plot MAPE by model type
ggplot(data = val_preds %>% 
         dplyr::select(model, MAPE) %>% 
         distinct() , 
       aes(x = model, y = MAPE, group = 1)) +
  geom_path(color = bluePalette5[3]) +
  geom_label(aes(label = paste0(round(MAPE,1),"%"))) +
  labs(title = "Training vs Test Set MAPE")+
  theme_bw(),

# plot MAE by model type
ggplot(data = val_preds %>% 
         dplyr::select(model, MAE) %>% 
         distinct() , 
       aes(x = model, y = MAE, group = 1)) +
  geom_path(color = bluePalette5[3]) +
  geom_label(aes(label = paste0(round(MAE,1)))) +
  labs(title = "Training vs Test Set MAE")+
  theme_bw(),

# plot RMSE by model type
ggplot(data = val_preds %>% 
         dplyr::select(model, RMSE) %>% 
         distinct() , 
       aes(x = model, y = RMSE, group = 1)) +
  geom_path(color = bluePalette5[3]) +
  geom_label(aes(label = paste0(round(RMSE,1)))) +
  labs(title = "Train vs Test RMSE")+
  theme_bw()
)

In the plots below the dashed and thick blue lines represent the actual and predicted data respectively. It appears that all our models have difficulty accurately predicting ridership in cells with high ridership. However, of all our models, our XGB-Poisson (xp) model once again appears to offer the best fit: most of the predictions closely hug the true outcome line, reducing its variance and bias.

# Validation Predicted vs. actual
ggplot(val_preds, aes(x = ridership_week, y = .pred, group = model)) +
  geom_point(shape=21, color=greyPalette5[4], alpha=0.6, size=1, stroke=0.5, fill="#AFF708") +
  geom_abline(linetype = "dashed", color = greyPalette5[3], stroke=0.3) +
  geom_smooth(method = "lm", linetype = 1, color = bluePalette5[5], lwd=0.7) +
  coord_equal() +
  facet_wrap(~model, nrow = 2) +
  labs(title = "Training vs Test Set: Predictions vs Actuals",
       x= "Actual ridership",
       y = "Predicted ridership")+
  plotTheme_c()

3-2-2. Generalizability

When looking more closely at how well each model performs spatially across Philadelphia, we see that our XGB-Poisson model has less error across space than other models. However, our models have a difficult time accurately predicting ridership on the edges of the current Indego network, where stations are newer.

# join test data back to make spatial
val_pred_sf <- val_preds %>% 
  group_by(model) %>% 
  rowwise() %>% 
  mutate(RMSE = yardstick::rmse_vec(ridership_week, .pred),
         MAE  = yardstick::mae_vec(ridership_week, .pred),
         MAPE = yardstick::mape_vec(ridership_week, .pred)) %>%
  st_as_sf(., crs = 4326)

# Plot
ggplot()+
  geom_sf(data = model_grid, fill="white", color=greyPalette5[2], size = 0.1)+
  geom_sf(data = val_pred_sf, aes(fill = q5(MAE)))+
  scale_fill_manual(values = bluePalette5,labels=round(as.numeric(qBr(val_pred_sf,"MAE"))),name="MAE (%)\n(quantiles)")+
  facet_wrap(~model, ncol = 2)+
  ylim(39.92, 40.02)+
  xlim(-75.23, -75.12)+
  labs(title = "Test vs Actual MAE by model")+
  mapTheme_a()

ggplot()+
  geom_sf(data = model_grid, fill="white", color=greyPalette5[2], size = 0.1)+
  geom_sf(data = val_pred_sf, aes(fill = MAE))+
  scale_fill_gradient(high = "#08519c", low = "#eff3ff")+
  facet_wrap(~model, ncol = 2)+
  ylim(39.92, 40.02)+
  xlim(-75.23, -75.12)+
  labs(title = "Test vs Actual MAE by model")+
  mapTheme_a()

## Final fit on all data, then extract model
#full_fit_lm     <- lm_best_wf %>% fit(indego_area)
#full_fit_glmnet <- glmnet_best_wf %>% fit(indego_area)
full_fit_rf     <- rf_best_wf %>% fit(indego_area)
full_fit_xgb    <- xgb_best_wf %>% fit(indego_area)
#full_fit_nnet   <- nnet_best_wf %>% fit(indego_area)
#full_fit_svml   <- svml_best_wf %>% fit(indego_area)
full_fit_poi   <- poi_best_wf %>% fit(indego_area)
full_fit_xp   <- xp_best_wf %>% fit(indego_area)

3-3. Predictions

We fed Philadelphia’s entire grid back into our final model to predict ridership across the city. Our results are plotted below. The first shows the percentile of predicted weekly ridership for each grid cell. The second shows the number of rides a grid cell is expected to see, for grid cells in the 80th percentiles and higher. This lets us better understand the variation of predicted ridership within the highest ridership areas.

Each grid cell is assumed to have one more bike station than it otherwise does, and all stations are assumed to be active for one year. Since we have no real-world data to ground-truth these predictions, they should be taken as general estimates. This data is fed into our web app to provide predicted estimates of future Indego ridership.

# Read back in the model
models_folder <- board_folder("models/")
indego_model <- vetiver_pin_read(models_folder, 'Indego-Ridership-Model', version = NULL)

# set up grid with +1 stations
grid_stations_plus1 <- model_grid %>%
  mutate(station_count = station_count + 1,
         weeks_active = 52)

# Predict
prediction_grid_xp <- predict(indego_model, new_data = grid_stations_plus1) %>% 
   rename(predicted_ridership_week = .pred) %>%
  mutate(prediction_percentile =
           paste0(floor(((ecdf(predicted_ridership_week)(predicted_ridership_week))*100)/10)*10,'th Percentile'),
         prediction_percentile = case_when(
           prediction_percentile == "100th Percentile" ~ "99th Percentile",
           TRUE ~ prediction_percentile
         ))

# Join to grid
prediction_entire_grid <- cbind(model_grid, prediction_grid_xp)

# Plot
grid.arrange(ncol = 1,
            
             ggplot()+
  geom_sf(data = prediction_entire_grid, aes(fill = prediction_percentile), alpha = 0.7)+
  scale_fill_manual(values = bluePalette10)+
  labs(title = "Ridership Predictions for all of Philadelphia",
       fill = "Predicted\nRidership\nper Week")+
  mapTheme_a(),
  
  ggplot()+
  geom_sf(data = prediction_entire_grid, fill=greyPalette5[1], color=greyPalette5[2], size = 0.1) +
  geom_sf(data = prediction_entire_grid %>% filter(prediction_percentile %in% c('80th Percentile', '90th Percentile', '99th Percentile')), aes(fill = predicted_ridership_week))+
  scale_fill_gradient(low = "#9ecae1", high = '#081d58') +#"#08519c")+
  labs(title = "Ridership Predictions: 80th Percentiles and Above",
       subtitle = "Focused on Central Philadelphia",
       fill = "Predicted\nRidership\nper Week")+
  ylim(39.897, 39.99)+
  xlim(-75.23, -75.12)+
  mapTheme_a())

3-4. Testing our Model on 2023 Q1 New Station Ridership

The best way to test our model’s performance is with real-world data. We have that opportunity with Indego’s latest ridership data from Q1 of 2023. Indego added 23 stations in the first quarter of 2023. How well did our model predict their performance?

To answer this, we aggregated Q1’s new stations’ ridership data and joined it to our existing analysis grid. This provided us with 23 grid cells that saw a new station in that period. We then predicted weekly ridership for these 23 grid cells with the new number of stations in each. Our model correctly predicted weekly ridership within 19 rides for each grid cell.

A map of the error (number of rides that our prediction was off by) is below. Our model over predicted ridership for grid cells in Center City. We believe this is because our model is primed to predict the average weekly ridership for a grid cell across an entire year. Q1’s ridership only captures ridership during the lull of the winter season. Therefore it is expected that our model would over predict in this instance. As the weather warms through the year and more people begin to bike, we are confident that our model’s estimations at the year end would be more accurate.

# Import data
# Ridership data
q1_2023 <- st_read("./data/Q1_2023/indego-trips-2023-q1.csv") %>%
  mutate(start_station = case_when(
    start_station == "3013" ~ "3286",
    start_station == "3108" ~ "3004",
    start_station == "3202" ~ "3296",
    start_station == "3206" ~ "3294",
    start_station == "3167" ~ "3295",
    start_station == "3011" ~ "3265",
    TRUE ~ start_station
  )) %>% 
  filter(start_station != "3246" & start_station != "3241" & start_station != "3271" &  start_station != "3287")

# stations list
q1_2023_stations_list <- st_read("./data/Q1_2023/indego-stations-2023-04-01.csv") %>% filter(Status == "Active")

# Geojson with stations geography
q1_geojson <- st_read("data/Q1_2023/Q1_2023_statuses.json") %>% 
  filter(kioskPublicStatus == "Active") %>% 
  dplyr::select(-bikes)

# clean Q1 2023 data
q1_2023_ridership_summary <- q1_2023 %>% 
  mutate(week = lubridate::week(as.POSIXct(start_time, tz = "", "%m/%d/%Y %H:%M"))) %>%
  filter(passholder_type %!in% c("NULL", "Walk-up")) %>%
  group_by(start_station, week) %>%
  summarize(stn_lat = median(as.double(start_lat), na.rm = T),
            stn_lon = median(as.double(start_lon), na.rm = T),
            ridership_week_total = length(unique(trip_id)),
            ebike_ridership_week_total = length(unique(trip_id[bike_type == "electric"])), 
            standard_ridership_week_total = length(unique(trip_id[bike_type == "standard"])),
            guest_pass_total = length(unique(trip_id[passholder_type == "Day Pass"])),
            month_pass_total = length(unique(trip_id[passholder_type == "Indego30"])),
            annual_pass_total = length(unique(trip_id[passholder_type == "Indego365"]))) %>%
  group_by(start_station) %>%
  summarize(stn_lat = median(as.double(stn_lat), na.rm = T),
            stn_lon = median(as.double(stn_lon), na.rm = T),
            ridership_week = median(ridership_week_total, na.rm = T),
            ebike_ridership_week = median(ebike_ridership_week_total, na.rm = T),
            standard_ridership_week = median(standard_ridership_week_total, na.rm = T),
            guest_pass_week = median(guest_pass_total, na.rm = T),
            month_pass_week = median(month_pass_total, na.rm = T),
            annual_pass_week = median(annual_pass_total, na.rm = T),
            weeks_active = length(unique(week))) %>%
  left_join(indego_stations, by = c("start_station" = "Station_ID"))


# Join to GeoJSON
q1_2023_ridership_by_station <- q1_geojson %>% 
  mutate(id=as.character(id)) %>% 
  full_join(., q1_2023_ridership_summary, by=c("id"="start_station")) %>% 
  # create geometry
  st_as_sf(coords = c(x="longitude", y="latitude"), crs=st_crs("EPSG:4326")) %>%
  # remove certain stations
  filter(id != 3000 & id != 3121 & id != 3186) %>%
  rename("go_live" = "Day.of.Go_live_date",
         "station" = "id",
         "station_name" = "Station_Name",
         "status" = "Status") %>%
  mutate(go_live = as.POSIXct(go_live, tz="", format = "%m/%d/%Y")) %>% 
  mutate(go_live_year = as.numeric(format(go_live, "%Y"))) %>%
  filter(!is.na(ridership_week)) %>%
  # remove unneccessary columns
  dplyr::select(-docksAvailable,
                -bikesAvailable,
                -classicBikesAvailable,
                -smartBikesAvailable, 
                -electricBikesAvailable, 
                -rewardBikesAvailable,
                -rewardDocksAvailable,
                -kioskStatus,
                -kioskPublicStatus,
                -kioskConnectionStatus, 
                -kioskType,
                -station_name,
                -addressStreet,           
                -addressCity,            
                -addressState,            
                -addressZipCode,         
                -closeTime,               
                -eventEnd,               
                -eventStart,              
                -isEventBased,           
                -isVirtual,               
                -kioskId,                
                -notes,                  
                -openTime,               
                -publicText,              
                -timeZone,               
                -trikesAvailable,         
                -latitude,               
                -longitude,               
                -coordinates,         
                -stn_lat,                 
                -stn_lon,
                -field_5,                 
                -field_6,                
                -field_7,                
                -field_8,               
                -field_9,                 
                -field_10,               
                -field_11,                
                -field_12,               
                -field_13,                
                -field_14,               
                -field_15,               
                -field_16,               
                -field_17,                
                -field_18,               
                -field_19)

# Join to stations df
q1_ridership <- q1_2023_ridership_by_station %>%
  left_join(., q1_2023_stations_list, by = c("station" = "Station_ID")) %>%
  dplyr::select(-field_5,                 
                -field_6,                
                -field_7,                
                -field_8,               
                -field_9,                 
                -field_10,               
                -field_11,                
                -field_12) %>%
  mutate(Day.of.Go_live_date = as.POSIXct(Day.of.Go_live_date, tz="", format = "%m/%d/%Y"),
         go_live_day_year = year(Day.of.Go_live_date))

# identify new stations
new_stations <- q1_ridership %>%
  filter(go_live_day_year == 2023) %>%
  mutate(new_stations = 1) %>%
  rename(actual_ridership_week = ridership_week,
         new_station_weeks_active = weeks_active) %>%
  dplyr::select(station, name, actual_ridership_week, new_station_weeks_active)

new_stations_ids <- unique(new_stations$station)


# Join to grid
new_stations_grid_cells <- st_join(new_stations, model_grid, join = st_intersects) %>%
  group_by(unique_id) %>%
  summarize(new_stations = length(unique(name)),
            new_station_weeks_active = mean(new_station_weeks_active),
            new_station_actual_ridership = sum(actual_ridership_week))

# Find all ridership for all grid cells

# Prepare Q1 total ridership to join with grid
q1_ridership <- q1_ridership %>%
  # Identify new stations
  mutate(new_stations = case_when( 
           station %in% new_stations_ids ~ 1,
           TRUE ~ 0),
         # Identify number of active weeks - different for new and existing stations
         new_stations_weeks_active_q1 = ifelse(new_stations == 1, weeks_active, 0),
         old_stations_weeks_active_q1 = ifelse(new_stations == 0, weeks_active, 0)) %>%
  rename(actual_q1_ridership_week = ridership_week) %>%
  dplyr::select(station, actual_q1_ridership_week, new_stations, new_stations_weeks_active_q1, old_stations_weeks_active_q1)

# Aggregate Q1 station ridership by grid cells
all_q1_for_grid <- st_join(q1_ridership, model_grid, join = st_intersects) %>%
  st_drop_geometry() %>%
  group_by(unique_id) %>%
  summarize(actual_q1_ridership_week = sum(actual_q1_ridership_week), # true weekly ridership in Q1
            new_stations = sum(new_stations),
            # weeks active for Q1: average of old stations (2022 + 2023) and new stations
             q1_weeks_active = mean(c((weeks_active + old_stations_weeks_active_q1), new_stations_weeks_active_q1))) %>%
  filter(!is.na(unique_id))
  

# Join Q1 ridership to grid
q1_new_stations_for_prediction <- left_join(model_grid, all_q1_for_grid, by = "unique_id") %>%
  filter(new_stations > 0) %>%
  mutate(station_count = station_count + new_stations,
         weeks_active = q1_weeks_active)

# Predict ridership

# Predict
q1_prediction <-  predict(indego_model, new_data = q1_new_stations_for_prediction) %>%
 rename(predicted_ridership_week = .pred)

# Join to grid, collect metrics
full_q1_prediction <- cbind(q1_new_stations_for_prediction, q1_prediction) %>%
  mutate(error = actual_q1_ridership_week - predicted_ridership_week,
         abs_error = abs(error),
         pct_error = (actual_q1_ridership_week - predicted_ridership_week)/actual_q1_ridership_week)

mape <- mean(abs(full_q1_prediction$pct_error))
mae <- mean(abs(full_q1_prediction$error))

# Narrow down to important metrics
full_q1_prediction_short <- full_q1_prediction %>%
  dplyr::select(unique_id, station_count, actual_q1_ridership_week, predicted_ridership_week, error, pct_error)

# Plot
grid.arrange(ncol = 1,
             
             ggplot() +
  geom_sf(data = blockgrps, fill=greyPalette5[1], color=greyPalette5[2], size = 0.1)+
  geom_sf(data = fishnet, fill = "transparent", color=bluePalette5[3], size=1)+
  geom_sf(data = stations, shape=21, color = greyPalette5[4], alpha = 0.5, size = 1, stroke = 0.5, fill = bluePalette5[3])+
  geom_sf(data = full_q1_prediction, aes(fill = q5(abs_error)), color="black", stroke = 2)+
  scale_fill_manual(values = bluePalette5,
                    labels=round(as.numeric(qBr(full_q1_prediction,"abs_error"))),
                      name="Absolute Error\n(quantiles)")+
  geom_sf(data = new_stations, shape=21, color=greyPalette5[4], alpha=0.8, size=2, stroke=1, fill="#AFF708")+
  ylim(39.89, 40.01)+
  xlim(-75.23, -75.12)+
  labs(title = "Q1 2023 Prediction Results: Error by Grid Cell",
       subtitle = paste0("New stations in green, existing stations in gray\nMAE: ", round(mae,1)))+
  mapTheme_a(),
  
ggplot()+
  geom_histogram(data = full_q1_prediction, aes(x = error), 
                 color = '#081d58',  fill = "#9ecae1",
                 show.legend = FALSE, binwidth = 5)+
  geom_vline(data = full_q1_prediction, aes(xintercept = mae),linetype="dotted")+
  annotate("text", x = mae, y = 5, label = paste0("\nAverage: ",round(mae, 1)), angle = 90)+
  scale_y_continuous(breaks = seq(-0,6,1))+
  labs(title = "Q1 2023 Prediction Results: Distribution of Error Frequency",
       subtitle = paste0("MAE: ", round(mae,1)),
       fill = "Absolute Error",
       x = "Prediction error - number of rides our prediction was off by, per grid cell",
       y = "Number of Predictions")+
  plotTheme_b())

Ⅳ. Conclusion

OTIS with the support of Bicycle Transit Systems (BTS) is looking to expand Philadelphia’s Indego bike share network. They currently have plans to add new bike stations to infill areas with high ridership, to support the fringe of the network and to expand the network’s coverage by adding stations in farther neighborhoods. In addition to optimizing for ridership and revenue, Indego wants to ensure that the bike share network remains as accessible and equitable as possible. To support OTIS and BTS in their efforts, this project conducts an analysis of Philadelphia’s mobility, socio-economic and infrastructural characteristics using a fishnet grid system. More importantly, this study builds a robust model to predict weekly ridership for each 1,000 sq-ft grid cell across the city and evaluates the level of walkability to the network as well as the racial makeup of those with access to Indego’s network. Our final deliverable is Station Planner, a web application that allows Indego to test station scenarios by manipulating the location of future bike stations, while keeping track of ridership, accessibility, and metrics of racial equity.

Start planning with Station Planner

Evaluating and Forecasting for Philadelphia’s Bike Share Expansion

Rebekah Adams, Minwook Kang, Aidan Rhianne

2023-04-30