6 min read

Exploring Healthy Ride Pittsburgh Data With R

Healthy Ride is a bike rental service in Pittsburgh. The WPRDC hosts data about their service. We can use the data to get an idea of how Pittsburghers use the service.

In my analysis, I use R and tidyverse, lubridate, and some other packages.

First, load the packages we will be using:

library(tidyverse)
library(lubridate)
library(viridis)
library(scales)

Set the theme to format the plots:

theme_set(theme_bw())

Load the data from the WPRDC (hosted on my GitHub page):

data <- read_csv("https://raw.githubusercontent.com/conorotompkins/healthy_ride/master/data/data.csv")
## Parsed with column specification:
## cols(
##   `Trip id` = col_integer(),
##   Starttime = col_character(),
##   Stoptime = col_character(),
##   Bikeid = col_integer(),
##   Tripduration = col_integer(),
##   `From station id` = col_integer(),
##   `From station name` = col_character(),
##   `To station id` = col_integer(),
##   `To station name` = col_character(),
##   Usertype = col_character()
## )

First, we need to format the data and the column names to make them more useful for analysis:

colnames(data) <- tolower(colnames(data))
colnames(data) <- gsub(" ", "_", colnames(data))

Change to snake_case and drop old columns:

data_long <- data %>% 
  rename(start_date_time = starttime,
         stop_date_time = stoptime)

Make the dataframe longer:

data_long <- data_long %>% 
  gather(date_time_type, date_time, c(start_date_time, stop_date_time)) %>% 
  select(date_time_type, date_time, everything())

Format the date and time columns using lubridate:

data_long <- data_long %>% 
  mutate(date_time_2 = date_time) %>% 
  separate(date_time, " ", into = c("date", "time")) %>% 
  mutate(id = row_number(),
         date = mdy(date),
         year = year(date),
         month = month(date, label = TRUE),
         week = week(date),
         time = hm(time),
         hour = hour(time),
         wday = wday(date, label = TRUE),
         is_weekday = ifelse(wday %in% c("Mon", "Tues", "Wed", "Thurs", "Fri"), "weekday", "weekend"),
         yday = yday(date),
         mday = mday(date)) %>% 
  mutate(trip_duration = (tripduration / 60) / 60)

Make the dataframe longer:

data_long <- data_long %>% 
  gather(station_id_type, station_id, c(from_station_id, to_station_id)) %>% 
  gather(station_name_type, station_name, c(from_station_name, to_station_name))

Reorder the columns:

data_long <- data_long %>% 
  select(date_time_type, 
         is_weekday, 
         date, 
         year,
         month,
         hour,
         wday,
         yday,
         mday,
         date_time_2, 
         station_id_type, 
         station_id, 
         station_name_type,
         station_name,
         everything(),
         -time)

Let’s look at how the data look:

data_long
## # A tibble: 1,364,248 x 21
##    date_time_type is_weekday date        year month  hour wday   yday  mday
##    <chr>          <chr>      <date>     <dbl> <ord> <dbl> <ord> <dbl> <int>
##  1 start_date_ti~ weekend    2015-05-31  2015 May       6 Sun     151    31
##  2 start_date_ti~ weekend    2015-05-31  2015 May       6 Sun     151    31
##  3 start_date_ti~ weekend    2015-05-31  2015 May       6 Sun     151    31
##  4 start_date_ti~ weekend    2015-05-31  2015 May       7 Sun     151    31
##  5 start_date_ti~ weekend    2015-05-31  2015 May       7 Sun     151    31
##  6 start_date_ti~ weekend    2015-05-31  2015 May       7 Sun     151    31
##  7 start_date_ti~ weekend    2015-05-31  2015 May       7 Sun     151    31
##  8 start_date_ti~ weekend    2015-05-31  2015 May       7 Sun     151    31
##  9 start_date_ti~ weekend    2015-05-31  2015 May       7 Sun     151    31
## 10 start_date_ti~ weekend    2015-05-31  2015 May       7 Sun     151    31
## # ... with 1,364,238 more rows, and 12 more variables: date_time_2 <chr>,
## #   station_id_type <chr>, station_id <int>, station_name_type <chr>,
## #   station_name <chr>, trip_id <int>, bikeid <int>, tripduration <int>,
## #   usertype <chr>, id <int>, week <dbl>, trip_duration <dbl>

How has the use of the service trended in the long term?

data_long %>% 
  filter(date_time_type == "start_date_time") %>% 
  ggplot(aes(date)) +
  geom_freqpoly(stat = "density") +
  scale_x_date(date_labels = "%b %Y",
               date_breaks = "3 months") +
  labs(title = "Healthy Ride Pittsburgh",
       x = "Date",
       y = "Density of rides",
       caption = "@conor_tompkins")

A cumulative look at the number of rides:

data_long %>% 
  filter(station_name_type == "from_station_name") %>% 
  arrange(date) %>% 
  count(date) %>% 
  mutate(cum_sum = cumsum(n)) %>% 
  ggplot(aes(date, cum_sum)) +
  geom_line() +
    scale_x_date(date_labels = "%b %Y",
               date_breaks = "3 months") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Healthy Ride Pittsburgh",
      x = "Date",
      y = "Cumulative sum of rides",
      caption = "@conor_tompkins")

A yearly look at the data shows that ridership has dropped off in 2017:

data_long %>%
  select(year, yday) %>% 
  mutate(year = as.factor(year)) %>% 
  group_by(year, yday) %>% 
  count() %>% 
  ungroup() %>% 
  group_by(year) %>% 
  mutate(cum_sum = cumsum(n)) %>% 
  ggplot(aes(yday, cum_sum,
             color = year,
             group = year)) +
  geom_line() +
  scale_y_continuous(labels = scales::comma) +
  guides(color = guide_legend(title = "Year")) +
  labs(title = "Healthy Ride Pittsburgh",
      x = "Day of the year",
      y = "Cumulative sum of rides",
      caption = "@conor_tompkins") 

How are the number of rides distributed throughout the day?

data_long %>% 
  filter(date_time_type == "start_date_time") %>% 
  ggplot(aes(hour)) +
  geom_freqpoly(stat = "count") +
  scale_x_continuous(breaks = seq(0, 23, by = 2)) +
  labs(title = "Healthy Ride Pittsburgh",
      x = "Hour",
      y = "Count of rides",
      caption = "@conor_tompkins") 

Ridership peaks during evening rush-hour.

The trend is different for weekdays vs. weekends:

data_long %>% 
  filter(date_time_type == "start_date_time") %>% 
  ggplot(aes(hour, color = is_weekday)) +
  geom_freqpoly(stat = "count") +
  scale_x_continuous(breaks = seq(0, 23, by = 2)) +
  scale_y_continuous(labels = scales::comma) +
  guides(color = guide_legend(title = "Type of day")) +
  labs(title = "Healthy Ride Pittsburgh",
    x = "Hour",
    y = "Count of rides",
    caption = "@conor_tompkins")

Breaking it down by weekday, people appear to take rides during their lunch break on Fridays. Saturdays at around 1PM see peak usage.

data_long %>%
  filter(date_time_type == "start_date_time") %>% 
  ggplot(aes(hour, color = wday)) +
  geom_freqpoly(stat = "count") +
  scale_x_continuous(breaks = seq(0, 23, by = 2)) +
  scale_y_continuous(labels = scales::comma) +
  guides(color = guide_legend(title = "Day of the week")) +
  labs(title = "Healthy Ride Pittsburgh",
    x = "Hour",
    y = "Count of rides",
    caption = "@conor_tompkins")

We can also zoom out to look at the data by year, month, day of the month, and hour. First, create a new dataframe and use complete() and replace_na() to fill out the dataframe:

#this make take a minute to run
df_tile <- data_long %>% 
  select(year, month, mday, hour) %>% 
  group_by(year, month, mday, hour) %>% 
  summarize(n = n()) %>% 
  complete(year, month, mday = 1:31, hour = 0:23) %>% 
  replace_na(replace = list(n = 0))

This code creates the tile plot:

#this make take a minute to run
df_tile %>% 
  ggplot(aes(mday, hour, fill = n)) +
  geom_tile() +
  scale_y_reverse(expand = c(0,0), 
                  breaks = c(0, 8, 16)) +
  scale_x_continuous(expand = c(0,0),
                     breaks = c(1, 14, 28)) +
  facet_grid(year ~ month) +
  coord_equal() +
  scale_fill_viridis() +
  guides(fill = guide_colorbar(title = "Number of rides")) +
  labs(title = "Healthy Ride Pittsburgh",
    x = "Day of month",
    y = "Hour",
    caption = "@conor_tompkins") +
  theme(strip.text.y = element_text(angle = 0),
        axis.text = element_text(size = 6))

We can also look at the the sum of hours biked across the same date and time variables:

#this make take a minute to run
df_tile2 <- data_long %>%
  select(year, month, mday, hour, trip_duration) %>% 
  group_by(year, month, mday, hour) %>% 
  summarize(trip_duration_sum = sum(trip_duration)) %>%
  complete(year, month, mday = 1:31, hour = 0:23) %>% 
  replace_na(replace = list(trip_duration_sum = 0))
#this make take a minute to run
df_tile2 %>% 
  ggplot(aes(mday, hour, fill = trip_duration_sum)) +
  geom_tile() +
  scale_y_reverse(expand = c(0,0), 
                  breaks = c(0, 8, 16)) +
  scale_x_continuous(expand = c(0,0),
                     breaks = c(1, 14, 28)) +
  facet_grid(year ~ month) +
  coord_equal() +
  scale_fill_viridis() +
  guides(fill = guide_colorbar(title = "Hours ridden")) +
  labs(title = "Healthy Ride Pittsburgh",
    x = "Day of month",
    y = "Hour",
    caption = "@conor_tompkins") +
  theme(strip.text.y = element_text(angle = 0),
        axis.text = element_text(size = 6))