Mini Project 1

Introduction

This mini project walks you through how to prepare, clean, and analyze data. The project will be followed by a report on the subject analyzed.

We are going to analyze the fiscal characteristics of major US public transit systems using data from the National Transit Database. The goal is to understand farebox recovery rates, analyze ridership, and examine the financial efficiency of various transit systems.

We will use data from the National Transit Database as our primary source. In particular, since we want to analyze farebox revenues, total number of trips, total number of vehicle miles traveled, and total revenues and expenses by source, we will need to analyze several different tables:

The 2022 Fare Revenue table
The latest Monthly Ridership tables
The 2022 Operating Expenses reports

Let’s start by cleaning the data.

Data Preparation

The following code will download, clean, and join the tables.

# Load required libraries
if(!require("tidyverse")) install.packages("tidyverse")

Loading required package: tidyverse

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

if(!require("readxl")) install.packages("readxl")

Loading required package: readxl

if(!require("DT")) install.packages("DT")

Loading required package: DT

library(tidyverse)
library(readxl)
library(readr)
library(lubridate)
library(DT)

# Let's start with Fare Revenue
library(tidyverse)
if(!file.exists("2022_fare_revenue.xlsx")){
    # This should work _in theory_ but in practice it's still a bit finicky
    # If it doesn't work for you, download this file 'by hand' in your
    # browser and save it as "2022_fare_revenue.xlsx" in your project
    # directory.
    download.file("http://www.transit.dot.gov/sites/fta.dot.gov/files/2024-04/2022%20Fare%20Revenue.xlsx", 
                  destfile="2022_fare_revenue.xlsx", 
                  quiet=FALSE, 
                  method="wget")
}
FARES <- readxl::read_xlsx("2022_fare_revenue.xlsx") |>
    select(-`State/Parent NTD ID`, 
           -`Reporter Type`,
           -`Reporting Module`,
           -`TOS`,
           -`Passenger Paid Fares`,
           -`Organization Paid Fares`) |>
    filter(`Expense Type` == "Funds Earned During Period") |>
    select(-`Expense Type`) |>
    group_by(`NTD ID`,       # Sum over different `TOS` for the same `Mode`
             `Agency Name`,  # These are direct operated and sub-contracted 
             `Mode`) |>      # of the same transit modality
                             # Not a big effect in most munis (significant DO
                             # tends to get rid of sub-contractors), but we'll sum
                             # to unify different passenger experiences
    summarize(`Total Fares` = sum(`Total Fares`)) |>
    ungroup()

`summarise()` has grouped output by 'NTD ID', 'Agency Name'. You can override
using the `.groups` argument.

# Next, expenses
if(!file.exists("2022_expenses.csv")){
    # This should work _in theory_ but in practice it's still a bit finicky
    # If it doesn't work for you, download this file 'by hand' in your
    # browser and save it as "2022_expenses.csv" in your project
    # directory.
    download.file("https://data.transportation.gov/api/views/dkxx-zjd6/rows.csv?date=20231102&accessType=DOWNLOAD&bom=true&format=true", 
                  destfile="2022_expenses.csv", 
                  quiet=FALSE, 
                  method="wget")
}
EXPENSES <- readr::read_csv("2022_expenses.csv") |>
    select(`NTD ID`, 
           `Agency`,
           `Total`, 
           `Mode`) |>
    mutate(`NTD ID` = as.integer(`NTD ID`)) |>
    rename(Expenses = Total) |>
    group_by(`NTD ID`, `Mode`) |>
    summarize(Expenses = sum(Expenses)) |>
    ungroup()

Rows: 3744 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Agency, City, State, NTD ID, Organization Type, Reporter Type, UZA...
dbl  (2): Report Year, UACE Code
num (10): Primary UZA Population, Agency VOMS, Mode VOMS, Vehicle Operations...
lgl  (7): Vehicle Operations Questionable, Vehicle Maintenance Questionable,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'NTD ID'. You can override using the `.groups` argument.

FINANCIALS <- inner_join(FARES, EXPENSES, join_by(`NTD ID`, `Mode`))

Finally, let’s extract monthly transit numbers:

# Monthly Transit Numbers
library(tidyverse)
if(!file.exists("ridership.xlsx")){
    # This should work _in theory_ but in practice it's still a bit finicky
    # If it doesn't work for you, download this file 'by hand' in your
    # browser and save it as "ridership.xlsx" in your project
    # directory.
    download.file("https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-09/July%202024%20Complete%20Monthly%20Ridership%20%28with%20adjustments%20and%20estimates%29_240903.xlsx", 
                  destfile="ridership.xlsx", 
                  quiet=FALSE, 
                  method="wget")
}
TRIPS <- readxl::read_xlsx("ridership.xlsx", sheet="UPT") |>
            filter(`Mode/Type of Service Status` == "Active") |>
            select(-`Legacy NTD ID`, 
                   -`Reporter Type`, 
                   -`Mode/Type of Service Status`, 
                   -`UACE CD`, 
                   -`TOS`) |>
            pivot_longer(-c(`NTD ID`:`3 Mode`), 
                            names_to="month", 
                            values_to="UPT") |>
            drop_na() |>
            mutate(month=my(month)) # Parse _m_onth _y_ear date specs
MILES <- readxl::read_xlsx("ridership.xlsx", sheet="VRM") |>
            filter(`Mode/Type of Service Status` == "Active") |>
            select(-`Legacy NTD ID`, 
                   -`Reporter Type`, 
                   -`Mode/Type of Service Status`, 
                   -`UACE CD`, 
                   -`TOS`) |>
            pivot_longer(-c(`NTD ID`:`3 Mode`), 
                            names_to="month", 
                            values_to="VRM") |>
            drop_na() |>
            group_by(`NTD ID`, `Agency`, `UZA Name`, 
                     `Mode`, `3 Mode`, month) |>
            summarize(VRM = sum(VRM)) |>
            ungroup() |>
            mutate(month=my(month)) # Parse _m_onth _y_ear date specs

`summarise()` has grouped output by 'NTD ID', 'Agency', 'UZA Name', 'Mode', '3
Mode'. You can override using the `.groups` argument.

USAGE <- inner_join(TRIPS, MILES) |>
    mutate(`NTD ID` = as.integer(`NTD ID`))

Joining with `by = join_by(`NTD ID`, Agency, `UZA Name`, Mode, `3 Mode`,
month)`

This creates a table as follows:

if(!require("DT")) install.packages("DT")
library(DT)

sample_n(USAGE, 1000) |> 
    mutate(month=as.character(month)) |> 
    DT::datatable()

Task 1 - Creating Syntatic Names

The following code will rename the column ‘UZA Name’ to ‘metro_area’. Because it has no spaces in it, this name will be easier to manipulate in code.

USAGE <- USAGE |>
  rename(metro_area = `UZA Name`)

Task 2: Recoding the Mode column

The ‘Mode’ column is also helpful, but it uses a set of codes that aren’t interpretable. To make life easier for ourselves, let’s use a case_when statement to transform this into something we can make sense of.

First, let’s find the unique ‘Mode’ codes in our data using the distinct function. After examining the NTD website and finding the interpretations of these codes, we can complete the following snippet to recode the ‘Mode’ column.

# Find unique Mode codes in the USAGE table
distinct(USAGE, Mode)

# A tibble: 18 × 1
   Mode 
   <chr>
 1 DR   
 2 FB   
 3 MB   
 4 SR   
 5 TB   
 6 VP   
 7 CB   
 8 RB   
 9 LR   
10 YR   
11 MG   
12 CR   
13 AR   
14 TR   
15 HR   
16 IP   
17 PB   
18 CC

USAGE <- USAGE |>
  mutate(Mode = case_when(
    Mode == "HR" ~ "Heavy Rail",       # HR: Heavy Rail
    Mode == "LR" ~ "Light Rail",        # LR: Light Rail
    Mode == "MB" ~ "Bus",               # MB: Bus (Motor Bus)
    Mode == "CR" ~ "Commuter Rail",     # CR: Commuter Rail
    Mode == "DR" ~ "Demand Response",   # DR: Demand Response
    Mode == "VP" ~ "Vanpool",           # VP: Vanpool
    Mode == "AR" ~ "Alaska Railroad",   # AR: Alaska Railroad
    Mode == "RB" ~ "Bus Rapid Transit", # RB: Bus Rapid Transit
    Mode == "FB" ~ "Ferryboat",       # FB: Ferryboat
    Mode == "SR" ~ "Streetcar Rail",       # SR: Streetcar Rail
    Mode == "TB" ~ "Trolleybus",       # TB: Trolleybus
    Mode == "CB" ~ "Commuter Bus",       # CB: Commuter Bus
    Mode == "YR" ~ "Hybrid Rail",       # YR: Hybrid Rail
    Mode == "MG" ~ "Monorail and Automated Guideway modes",  # MG: Monorail and Automated Guideway modes
    Mode == "TR" ~ "Aerial Tramway",       # TR: Aerial Tramway
    Mode == "IP" ~ "Inclined Plane",       # IP: Inclined Plane
    Mode == "PB" ~ "Publico",       # PB: Publico
    Mode == "CC" ~ "Cable Car",       # CC: Cable Car
    TRUE ~ "Unknown"                    # Any other mode
  ))

Now that the data is clean, we can create an attractive summary table of the cleaned up USAGE table using the following snippet:

if(!require("DT")) install.packages("DT")
library(DT)

sample_n(USAGE, 1000) |> 
    mutate(month=as.character(month)) |> 
    DT::datatable()

Data Analysis

Now let’s analyse our data.

Task 3: Answering Specified Questions with dplyr

Using functions filter, group_by, summarize, arrange, we are going to answer the following questions for our analysis with the following codes:

What transit agency had the most total VRM in our data set?

# Load necessary libraries
library(dplyr)
library(knitr)
# Calculate total VRM by agency and find the agency with the most total VRM
agency_most_vrm <- USAGE |>
  group_by(Agency) |>
  summarize(total_VRM = sum(VRM, na.rm = TRUE)) |>
  arrange(desc(total_VRM)) |>
  slice(1)  # Get the top agency
# Display the result in a table
kable(agency_most_vrm, caption = "Transit Agency with Most Total VRM")

Transit Agency with Most Total VRM
Agency	total_VRM
MTA New York City Transit	10832855350

The transit agency that had the most total VRM in our data set is the MTA New York City Transit, which reported a total of 10,832,855,350 VRM.

What transit mode had the most total VRM in our data set?

mode_most_vrm <- USAGE |>
  group_by(Mode) |>
  summarize(total_VRM = sum(VRM, na.rm = TRUE)) |>
  arrange(desc(total_VRM)) |>
  slice(1)  # Get the top mode

kable(mode_most_vrm, caption = "Transit Mode with Most Total VRM")

Transit Mode with Most Total VRM
Mode	total_VRM
Bus	49444494088

The transit mode that had the most total VRM in our data set is the Bus with a total of 49,444,494,088 VRM.

How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?

# Filter for Heavy Rail in NYC for May 2024
NYC_Subway_trip_may_24 <- USAGE %>%
  mutate(month = format(ymd(month), "%Y-%m")) %>% 
  filter(Mode == "Heavy Rail", 
         Agency == "MTA New York City Transit",  
         month == "2024-05") %>%  
  summarize(total_trip = sum(UPT, na.rm = TRUE)) 

kable(NYC_Subway_trip_may_24, caption = "Total Heavy Rail Trips in NYC for May 2024")

Total Heavy Rail Trips in NYC for May 2024
total_trip
180458819

In May 2024, 180,458,819 trips were taken on the NYC Subway (Heavy Rail).

How much did NYC subway ridership fall between April 2019 and April 2020?

# NYC Heavy Rail ridership for April 2019
nyc_ridership_2019 <- USAGE |>
  filter(metro_area == "New York--Jersey City--Newark, NY--NJ", Mode == "Heavy Rail", month == "2019-04-01") |>
  summarise(nyc_sub_2019 = sum(UPT, na.rm = TRUE))

# NYC Heavy Rail ridership for April 2020
nyc_ridership_2020 <- USAGE |>
  filter(metro_area == "New York--Jersey City--Newark, NY--NJ", Mode == "Heavy Rail", month == "2020-04-01") |>
  summarise(nyc_subway_ridership_fall = sum(UPT, na.rm = TRUE))

# Calculate the percentage change in ridership
nyc_ridership <- ((nyc_ridership_2020$nyc_subway_ridership_fall - nyc_ridership_2019$nyc_sub_2019) / nyc_ridership_2019$nyc_sub_2019) * 100

# Combine results into a data frame for display
results <- data.frame(
  Year = c("April 2019", "April 2020", "Percentage Change"),
  UPT = c(nyc_ridership_2019$nyc_sub_2019, nyc_ridership_2020$nyc_subway_ridership_fall, nyc_ridership)
)

kable(results, caption = "NYC Heavy Rail Ridership Comparison for April 2019 and April 2020")

NYC Heavy Rail Ridership Comparison for April 2019 and April 2020
Year	UPT
April 2019	2.406554e+08
April 2020	2.070482e+07
Percentage Change	-9.139649e+01

Here our results show us that the NYC Subway ridership went from 240,655,437 in April 2019 to 20,704,824 in April 2020.Our last piece of the code informs us that this represent a 91% ridership fall between April 2019 and April 2020.

Task 4: Explore and Analyze

Find the city with the most trips in a single month.

# Calculate the highest total UPT by metro area and month
highest_upt <- USAGE |> 
  group_by(metro_area, month) |> 
  summarize(total_UPT = sum(UPT, na.rm = TRUE), .groups = 'drop') |> 
  slice_max(total_UPT, n = 1)  # Get the highest total UPT for each metro area and month

kable(highest_upt, caption = "Metro Area and Month with the Highest Total UPT")

Metro Area and Month with the Highest Total UPT
metro_area	month	total_UPT
New York–Jersey City–Newark, NY–NJ	2014-10-01	396079939

Which agency has the most extensive bus network based on VRM?

# Calculate the agency with the highest total VRM for Motor Bus
highest_vrm_motor_bus <- USAGE |> 
  filter(Mode == "Bus") |>
  group_by(Agency) |>
  summarize(total_VRM = sum(VRM, na.rm = TRUE)) |>
  arrange(desc(total_VRM)) |>
  slice(1)

kable(highest_vrm_motor_bus, caption = "Agency with the Highest Total VRM for Motor Bus")

Agency with the Highest Total VRM for Motor Bus
Agency	total_VRM
New Jersey Transit Corporation	3781858802

How did overall ridership change over time from 2022 to 2024?

# Calculate the total UPT by year
annual_upt <- USAGE |> 
  group_by(year = year(month)) |> 
  summarize(total_UPT = sum(UPT, na.rm = TRUE)) |> 
  arrange(year)

kable(annual_upt, caption = "Total UPT by Year")

Total UPT by Year
year	total_UPT
2002	8416364775
2003	8260528905
2004	8631243464
2005	8944272233
2006	9286575204
2007	9473412104
2008	9888947626
2009	9544171711
2010	9541689057
2011	9732100704
2012	10014964538
2013	10148119047
2014	10281027641
2015	10069594700
2016	9884467841
2017	9664591370
2018	9570957943
2019	9630231947
2020	4488461287
2021	4668753846
2022	6052448631
2023	6972821886
2024	4181081901

Task 5: Table Summarization

Create a new table from USAGE that has annual total (sum) UPT and VRM for 2022. This will require use of the group_by, summarize, and filter functions. You will also want to use the year function, to extract a year from the month column.

The resulting table should have the following columns:

NTD ID Agency metro_area Mode UPT VRM Make sure to ungroup your table after creating it.

We will name this table USAGE_2022_ANNUAL.

USAGE_2022_ANNUAL <- USAGE |>
  filter(year(month) == 2022) |>
  group_by(`NTD ID`, Agency, `metro_area`, Mode) |>
  summarize(
    UPT = sum(UPT, na.rm = TRUE),
    VRM = sum(VRM, na.rm = TRUE)
  ) %>%
  ungroup()

`summarise()` has grouped output by 'NTD ID', 'Agency', 'metro_area'. You can
override using the `.groups` argument.

USAGE_2022_ANNUAL

# A tibble: 1,141 × 6
   `NTD ID` Agency                                metro_area Mode     UPT    VRM
      <int> <chr>                                 <chr>      <chr>  <dbl>  <dbl>
 1        1 King County                           Seattle--… Bus   5.40e7 6.16e7
 2        1 King County                           Seattle--… Dema… 6.63e5 1.29e7
 3        1 King County                           Seattle--… Ferr… 4.00e5 5.12e4
 4        1 King County                           Seattle--… Stre… 1.12e6 1.80e5
 5        1 King County                           Seattle--… Trol… 9.58e6 2.64e6
 6        1 King County                           Seattle--… Vanp… 7.03e5 4.41e6
 7        2 Spokane Transit Authority             Spokane, … Bus   6.60e6 6.49e6
 8        2 Spokane Transit Authority             Spokane, … Dema… 3.10e5 4.04e6
 9        2 Spokane Transit Authority             Spokane, … Vanp… 9.06e4 9.06e5
10        3 Pierce County Transportation Benefit… Seattle--… Bus   4.95e6 4.23e6
# ℹ 1,131 more rows

Once we have created this new table, we can merge it to the FINANCIALS data but first we need to make sure they have the same “mode” refrence names with the following code:

FINANCIALS <- FINANCIALS |>
  mutate(Mode = case_when(
    Mode == "HR" ~ "Heavy Rail",       # HR: Heavy Rail
    Mode == "LR" ~ "Light Rail",        # LR: Light Rail
    Mode == "MB" ~ "Bus",               # MB: Bus (Motor Bus)
    Mode == "CR" ~ "Commuter Rail",     # CR: Commuter Rail
    Mode == "DR" ~ "Demand Response",   # DR: Demand Response
    Mode == "VP" ~ "Vanpool",           # VP: Vanpool
    Mode == "AR" ~ "Alaska Railroad",   # AR: Alaska Railroad
    Mode == "RB" ~ "Bus Rapid Transit", # RB: Bus Rapid Transit
    Mode == "FB" ~ "Ferryboat",       # FB: Ferryboat
    Mode == "SR" ~ "Streetcar Rail",       # SR: Streetcar Rail
    Mode == "TB" ~ "Trolleybus",       # TB: Trolleybus
    Mode == "CB" ~ "Commuter Bus",       # CB: Commuter Bus
    Mode == "YR" ~ "Hybrid Rail",       # YR: Hybrid Rail
    Mode == "MG" ~ "Monorail and Automated Guideway modes",  # MG: Monorail and Automated Guideway modes
    Mode == "TR" ~ "Aerial Tramway",       # TR: Aerial Tramway
    Mode == "IP" ~ "Inclined Plane",       # IP: Inclined Plane
    Mode == "PB" ~ "Publico",       # PB: Publico
    Mode == "CC" ~ "Cable Car",       # CC: Cable Car
    TRUE ~ "Unknown"                    # Any other mode
  ))

Once we have made sure that USAGE_2022_ANNUAL and FINANCIALS have the same value for “mode”, we can merge USAGE_2022_ANNUAL to the FINANCIALS data as follows:

USAGE_AND_FINANCIALS <- left_join(USAGE_2022_ANNUAL, 
           FINANCIALS, 
           join_by(`NTD ID`, Mode)) |>
    drop_na()
USAGE_AND_FINANCIALS |>
      DT::datatable()

Task 6: Farebox Recovery Among Major Systems

Using the USAGE_AND_FINANCIALS table, we are going to try to answer the following questions:

Which transit system (agency and mode) had the most UPT in 2022?
Which transit system (agency and mode) had the highest farebox recovery, defined as the highest ratio of - Total Fares to Expenses?
Which transit system (agency and mode) has the lowest expenses per UPT?
Which transit system (agency and mode) has the highest total fares per UPT?
Which transit system (agency and mode) has the lowest expenses per VRM?
Which transit system (agency and mode) has the highest total fares per VRM?

We will restrict our answers to major transit systems by defining them as those with 400,000 UPT per annum.

To answer these questions, we’ll need to perform the following steps using the USAGE_AND_FINANCIALS table:

Which transit system (agency and mode) had the most UPT in 2022?

# Find the transit system with the most UPT in 2022
library(dplyr)
library(knitr)
max_upt <- USAGE_AND_FINANCIALS %>%
  filter(UPT >= 400000) %>%
  arrange(desc(UPT)) %>%
  slice(1) %>%
  select(Agency, Mode, UPT)
kable(max_upt, caption = "Transit System with Most UPT in 2022")

Transit System with Most UPT in 2022
Agency	Mode	UPT
MTA New York City Transit	Heavy Rail	1793073801

The answer shows us that the MTA New York City Transit is the transit system that had the most UTP with more than 1.79 billion of trips.

Which transit system (agency and mode) had the highest farebox recovery (Total Fares to Expenses ratio)?

highest_farebox_recovery <- USAGE_AND_FINANCIALS |>
  group_by(Agency, Mode) |>
  summarize(
    total_fares = sum(`Total Fares`, na.rm = TRUE),
    total_expenses = sum(Expenses, na.rm = TRUE)
  ) |>
  mutate(farebox_recovery_ratio = total_fares / total_expenses) |>
  ungroup() |>
  arrange(desc(farebox_recovery_ratio)) |>
  slice_head(n = 1)

`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.

kable(highest_farebox_recovery, caption = "Transit System with Highest Farebox Recovery Ratio")

Transit System with Highest Farebox Recovery Ratio
Agency	Mode	total_fares	total_expenses	farebox_recovery_ratio
Transit Authority of Central Kentucky	Vanpool	97300	40801	2.384746

The Transit System with the highest recovery ratio is the Transit Authority of Central Kentuchy, with the Vanpool mode and a recovery ratio of $2.38 USD.

3.Which transit system (agency and mode) has the lowest expenses per UPT?

lowest_expenses_per_upt <- USAGE_AND_FINANCIALS |>
  group_by(Agency, Mode) |>
  summarize(
    total_expenses = sum(Expenses, na.rm = TRUE),
    total_UPT = sum(UPT, na.rm = TRUE)
  ) |>
  filter(total_UPT >= 400000) |>
  mutate(expenses_per_UPT = total_expenses / total_UPT) |>
  ungroup() |>
  arrange(expenses_per_UPT) |>
  slice_head(n = 1)

`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.

kable(lowest_expenses_per_upt, caption = "Transit System with Lowest Expenses per UPT")

Transit System with Lowest Expenses per UPT
Agency	Mode	total_expenses	total_UPT	expenses_per_UPT
North Carolina State University	Bus	2727412	2313091	1.17912

North Carolina State University is the agency with the lowest expenses per UPT with the Bus mode. The expenses per UPT is $1.18 USD.

4.Which transit system (agency and mode) has the highest total fares per UPT?

highest_fares_per_upt <- USAGE_AND_FINANCIALS |>
  group_by(Agency, Mode) |>
  summarize(
    total_fares1 = sum(`Total Fares`, na.rm = TRUE),
    total_UPT = sum(UPT, na.rm = TRUE)
  ) |>
  mutate(total_fares_per_UPT = total_fares1 / total_UPT) |>
  ungroup() |>
  arrange(desc(total_fares_per_UPT)) |>
  slice_head(n = 1)

`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.

kable(highest_fares_per_upt, caption = "Transit System with Highest Total Fares per UPT")

Transit System with Highest Total Fares per UPT
Agency	Mode	total_fares1	total_UPT	total_fares_per_UPT
Altoona Metro Transit	Demand Response	17163	26	660.1154

The Altoona Metro Transit with the Demand Response mode had the highest total fares per UPT. With just 26 trips, it collected a total in fares of $17,058 USD with a fare of $660 USD per UPT.

5.Which transit system (agency and mode) has the lowest expenses per VRM?

lowest_expenses_per_vrm <- USAGE_AND_FINANCIALS %>%
  filter(UPT >= 400000) %>%
  mutate(expenses_per_vrm = Expenses / VRM) %>%
  arrange(expenses_per_vrm) %>%
  slice(1) %>%
  select(Agency, Mode, expenses_per_vrm)

kable(lowest_expenses_per_vrm, caption = "Transit System with Lowest Expenses per VRM in 2022")

Transit System with Lowest Expenses per VRM in 2022
Agency	Mode	expenses_per_vrm
Metropolitan Transportation Commission	Vanpool	0.4449998

New Mexico Department of Transportation Vanpool mode has the lowest expense per VRM. Its fare per VRM is around 0.45.

6.Which transit system (agency and mode) has the highest total fares per VRM?

highest_fares_per_VRM <- USAGE_AND_FINANCIALS %>%
  mutate(fares_per_VRM = `Total Fares` / VRM)  %>%  
  arrange(desc(fares_per_VRM)) %>%  
  slice_max(fares_per_VRM, n = 1) 

kable(highest_fares_per_VRM, caption = "Transit System with Highest Fares per VRM in 2022")

Transit System with Highest Fares per VRM in 2022
NTD ID	Agency	metro_area	Mode	UPT	VRM	Agency Name	Total Fares	Expenses	fares_per_VRM
50521	Chicago Water Taxi (Wendella)	Chicago, IL–IN	Ferryboat	16936	600	Chicago Water Taxi (Wendella)	142473	211296	237.455

The transit system that has the highest total fares per VRM is the Ferryboat from Chicago Water Taxi (Wendella) with a total fare per VRM of $237 USD.

Conclusion from the Mini Project 1 Analysis

This mini project provides several key insights into the fiscal characteristics, ridership, and operational efficiency of major U.S. public transit systems in 2022, based on data from the National Transit Database. Below are the major conclusions derived from each analysis:

Most Total Vehicle Revenue Miles (VRM)

Transit Agency: The MTA New York City Transit had the highest total Vehicle Revenue Miles (VRM) with over 10.8 billion VRM. Transit Mode: Buses had the most total VRM, accumulating more than 49.4 billion VRM across agencies. This indicates that the MTA operates the largest transit network in terms of vehicle miles, reflecting its role in serving a massive, densely populated area like New York City. Buses are crucial across transit systems, covering extensive distances.

Ridership Insights

NYC Subway Ridership: In May 2024, there were approximately 180 million trips taken on the NYC Subway (Heavy Rail), showing the continued importance of the subway in daily transportation. Ridership Decline: NYC Subway ridership fell drastically between April 2019 and April 2020 by 91%, primarily due to the COVID-19 pandemic, reflecting the severe impact of external shocks like pandemics on public transit usage. This emphasizes how ridership levels can be highly volatile in response to global events, such as health crises, with significant consequences for transit agencies’ revenue and service planning.

Highest Ridership in a Single Month

Metro Area: New York–Jersey City–Newark, NY–NJ had the highest number of trips in a single month (October 2014), with nearly 396 million unlinked passenger trips (UPT). This reinforces New York City’s dominance as the leading metropolitan area in the U.S. in terms of public transit usage, particularly for heavy rail and bus systems.

Bus Network Extent

Agency with Most Extensive Bus Network: New Jersey Transit Corporation had the most extensive bus network based on VRM, with over 3.78 billion VRM, reflecting its widespread service area and role in connecting commuters across state lines.

Ridership Trends (2022-2024)

Ridership saw a gradual recovery post-pandemic, with total UPT increasing from 6 billion in 2022 to nearly 7 billion in 2023. However, it significantly dropped again in 2024 to just over 4.18 billion, potentially indicating further external factors or a slow post-pandemic recovery.

Annual UPT and VRM for 2022

The analysis of UPT and VRM by transit mode and metro area in 2022 highlights regional differences in transit usage and network extent. For instance, King County (Seattle) had significant bus usage and VRM.

Farebox Recovery and Financial Efficiency

Highest Farebox Recovery: The Transit Authority of Central Kentucky had the highest farebox recovery ratio at 2.38, meaning the system collected more in fares than it spent on expenses (a rare occurrence in public transit). Lowest Expenses per UPT: A yet-to-be-confirmed agency/mode from the final analysis had the lowest expenses per UPT, reflecting operational efficiency in terms of cost per passenger. These insights into farebox recovery and financial efficiency help identify which systems are more sustainable and cost-effective, and which may require additional subsidies or fare increases to maintain financial stability.

Overall Conclusion

The project underscores the critical role that public transit systems like the MTA and New Jersey Transit play in urban mobility, especially in large metropolitan areas like New York City. However, the data also reveals significant vulnerabilities, particularly in ridership declines during major events like the pandemic, which drastically impacted revenue generation.

The findings related to farebox recovery, VRM, and UPT offer insights into which transit systems are the most efficient and financially viable, with a focus on reducing operational expenses and optimizing fare collection to meet expenses.