This mini project walks you through how to prepare, clean, and analyze data. The project will be followed by a report on the subject analyzed.
We are going to analyze the fiscal characteristics of major US public transit systems using data from the National Transit Database. The goal is to understand farebox recovery rates, analyze ridership, and examine the financial efficiency of various transit systems.
We will use data from the National Transit Database as our primary source. In particular, since we want to analyze farebox revenues, total number of trips, total number of vehicle miles traveled, and total revenues and expenses by source, we will need to analyze several different tables:
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if(!require("readxl")) install.packages("readxl")
Loading required package: readxl
if(!require("DT")) install.packages("DT")
Loading required package: DT
library(tidyverse)library(readxl)library(readr)library(lubridate)library(DT)# Let's start with Fare Revenuelibrary(tidyverse)if(!file.exists("2022_fare_revenue.xlsx")){# This should work _in theory_ but in practice it's still a bit finicky# If it doesn't work for you, download this file 'by hand' in your# browser and save it as "2022_fare_revenue.xlsx" in your project# directory.download.file("http://www.transit.dot.gov/sites/fta.dot.gov/files/2024-04/2022%20Fare%20Revenue.xlsx", destfile="2022_fare_revenue.xlsx", quiet=FALSE, method="wget")}FARES <- readxl::read_xlsx("2022_fare_revenue.xlsx") |>select(-`State/Parent NTD ID`, -`Reporter Type`,-`Reporting Module`,-`TOS`,-`Passenger Paid Fares`,-`Organization Paid Fares`) |>filter(`Expense Type`=="Funds Earned During Period") |>select(-`Expense Type`) |>group_by(`NTD ID`, # Sum over different `TOS` for the same `Mode``Agency Name`, # These are direct operated and sub-contracted `Mode`) |># of the same transit modality# Not a big effect in most munis (significant DO# tends to get rid of sub-contractors), but we'll sum# to unify different passenger experiencessummarize(`Total Fares`=sum(`Total Fares`)) |>ungroup()
`summarise()` has grouped output by 'NTD ID', 'Agency Name'. You can override
using the `.groups` argument.
# Next, expensesif(!file.exists("2022_expenses.csv")){# This should work _in theory_ but in practice it's still a bit finicky# If it doesn't work for you, download this file 'by hand' in your# browser and save it as "2022_expenses.csv" in your project# directory.download.file("https://data.transportation.gov/api/views/dkxx-zjd6/rows.csv?date=20231102&accessType=DOWNLOAD&bom=true&format=true", destfile="2022_expenses.csv", quiet=FALSE, method="wget")}EXPENSES <- readr::read_csv("2022_expenses.csv") |>select(`NTD ID`, `Agency`,`Total`, `Mode`) |>mutate(`NTD ID`=as.integer(`NTD ID`)) |>rename(Expenses = Total) |>group_by(`NTD ID`, `Mode`) |>summarize(Expenses =sum(Expenses)) |>ungroup()
Rows: 3744 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Agency, City, State, NTD ID, Organization Type, Reporter Type, UZA...
dbl (2): Report Year, UACE Code
num (10): Primary UZA Population, Agency VOMS, Mode VOMS, Vehicle Operations...
lgl (7): Vehicle Operations Questionable, Vehicle Maintenance Questionable,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
`summarise()` has grouped output by 'NTD ID'. You can override using the `.groups` argument.
# Monthly Transit Numberslibrary(tidyverse)if(!file.exists("ridership.xlsx")){# This should work _in theory_ but in practice it's still a bit finicky# If it doesn't work for you, download this file 'by hand' in your# browser and save it as "ridership.xlsx" in your project# directory.download.file("https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-09/July%202024%20Complete%20Monthly%20Ridership%20%28with%20adjustments%20and%20estimates%29_240903.xlsx", destfile="ridership.xlsx", quiet=FALSE, method="wget")}TRIPS <- readxl::read_xlsx("ridership.xlsx", sheet="UPT") |>filter(`Mode/Type of Service Status`=="Active") |>select(-`Legacy NTD ID`, -`Reporter Type`, -`Mode/Type of Service Status`, -`UACE CD`, -`TOS`) |>pivot_longer(-c(`NTD ID`:`3 Mode`), names_to="month", values_to="UPT") |>drop_na() |>mutate(month=my(month)) # Parse _m_onth _y_ear date specsMILES <- readxl::read_xlsx("ridership.xlsx", sheet="VRM") |>filter(`Mode/Type of Service Status`=="Active") |>select(-`Legacy NTD ID`, -`Reporter Type`, -`Mode/Type of Service Status`, -`UACE CD`, -`TOS`) |>pivot_longer(-c(`NTD ID`:`3 Mode`), names_to="month", values_to="VRM") |>drop_na() |>group_by(`NTD ID`, `Agency`, `UZA Name`, `Mode`, `3 Mode`, month) |>summarize(VRM =sum(VRM)) |>ungroup() |>mutate(month=my(month)) # Parse _m_onth _y_ear date specs
`summarise()` has grouped output by 'NTD ID', 'Agency', 'UZA Name', 'Mode', '3
Mode'. You can override using the `.groups` argument.
The following code will rename the column ‘UZA Name’ to ‘metro_area’. Because it has no spaces in it, this name will be easier to manipulate in code.
USAGE <- USAGE |>rename(metro_area =`UZA Name`)
Task 2: Recoding the Mode column
The ‘Mode’ column is also helpful, but it uses a set of codes that aren’t interpretable. To make life easier for ourselves, let’s use a case_when statement to transform this into something we can make sense of.
First, let’s find the unique ‘Mode’ codes in our data using the distinct function. After examining the NTD website and finding the interpretations of these codes, we can complete the following snippet to recode the ‘Mode’ column.
# Find unique Mode codes in the USAGE tabledistinct(USAGE, Mode)
# A tibble: 18 × 1
Mode
<chr>
1 DR
2 FB
3 MB
4 SR
5 TB
6 VP
7 CB
8 RB
9 LR
10 YR
11 MG
12 CR
13 AR
14 TR
15 HR
16 IP
17 PB
18 CC
Using functions filter, group_by, summarize, arrange, we are going to answer the following questions for our analysis with the following codes:
What transit agency had the most total VRM in our data set?
# Load necessary librarieslibrary(dplyr)library(knitr)# Calculate total VRM by agency and find the agency with the most total VRMagency_most_vrm <- USAGE |>group_by(Agency) |>summarize(total_VRM =sum(VRM, na.rm =TRUE)) |>arrange(desc(total_VRM)) |>slice(1) # Get the top agency# Display the result in a tablekable(agency_most_vrm, caption ="Transit Agency with Most Total VRM")
Transit Agency with Most Total VRM
Agency
total_VRM
MTA New York City Transit
10832855350
The transit agency that had the most total VRM in our data set is the MTA New York City Transit, which reported a total of 10,832,855,350 VRM.
What transit mode had the most total VRM in our data set?
mode_most_vrm <- USAGE |>group_by(Mode) |>summarize(total_VRM =sum(VRM, na.rm =TRUE)) |>arrange(desc(total_VRM)) |>slice(1) # Get the top modekable(mode_most_vrm, caption ="Transit Mode with Most Total VRM")
Transit Mode with Most Total VRM
Mode
total_VRM
Bus
49444494088
The transit mode that had the most total VRM in our data set is the Bus with a total of 49,444,494,088 VRM.
How many trips were taken on the NYC Subway (Heavy Rail) in May 2024?
# Filter for Heavy Rail in NYC for May 2024NYC_Subway_trip_may_24 <- USAGE %>%mutate(month =format(ymd(month), "%Y-%m")) %>%filter(Mode =="Heavy Rail", Agency =="MTA New York City Transit", month =="2024-05") %>%summarize(total_trip =sum(UPT, na.rm =TRUE)) kable(NYC_Subway_trip_may_24, caption ="Total Heavy Rail Trips in NYC for May 2024")
Total Heavy Rail Trips in NYC for May 2024
total_trip
180458819
In May 2024, 180,458,819 trips were taken on the NYC Subway (Heavy Rail).
How much did NYC subway ridership fall between April 2019 and April 2020?
# NYC Heavy Rail ridership for April 2019nyc_ridership_2019 <- USAGE |>filter(metro_area =="New York--Jersey City--Newark, NY--NJ", Mode =="Heavy Rail", month =="2019-04-01") |>summarise(nyc_sub_2019 =sum(UPT, na.rm =TRUE))# NYC Heavy Rail ridership for April 2020nyc_ridership_2020 <- USAGE |>filter(metro_area =="New York--Jersey City--Newark, NY--NJ", Mode =="Heavy Rail", month =="2020-04-01") |>summarise(nyc_subway_ridership_fall =sum(UPT, na.rm =TRUE))# Calculate the percentage change in ridershipnyc_ridership <- ((nyc_ridership_2020$nyc_subway_ridership_fall - nyc_ridership_2019$nyc_sub_2019) / nyc_ridership_2019$nyc_sub_2019) *100# Combine results into a data frame for displayresults <-data.frame(Year =c("April 2019", "April 2020", "Percentage Change"),UPT =c(nyc_ridership_2019$nyc_sub_2019, nyc_ridership_2020$nyc_subway_ridership_fall, nyc_ridership))kable(results, caption ="NYC Heavy Rail Ridership Comparison for April 2019 and April 2020")
NYC Heavy Rail Ridership Comparison for April 2019 and April 2020
Year
UPT
April 2019
2.406554e+08
April 2020
2.070482e+07
Percentage Change
-9.139649e+01
Here our results show us that the NYC Subway ridership went from 240,655,437 in April 2019 to 20,704,824 in April 2020.Our last piece of the code informs us that this represent a 91% ridership fall between April 2019 and April 2020.
Task 4: Explore and Analyze
Find the city with the most trips in a single month.
# Calculate the highest total UPT by metro area and monthhighest_upt <- USAGE |>group_by(metro_area, month) |>summarize(total_UPT =sum(UPT, na.rm =TRUE), .groups ='drop') |>slice_max(total_UPT, n =1) # Get the highest total UPT for each metro area and monthkable(highest_upt, caption ="Metro Area and Month with the Highest Total UPT")
Metro Area and Month with the Highest Total UPT
metro_area
month
total_UPT
New York–Jersey City–Newark, NY–NJ
2014-10-01
396079939
Which agency has the most extensive bus network based on VRM?
# Calculate the agency with the highest total VRM for Motor Bushighest_vrm_motor_bus <- USAGE |>filter(Mode =="Bus") |>group_by(Agency) |>summarize(total_VRM =sum(VRM, na.rm =TRUE)) |>arrange(desc(total_VRM)) |>slice(1)kable(highest_vrm_motor_bus, caption ="Agency with the Highest Total VRM for Motor Bus")
Agency with the Highest Total VRM for Motor Bus
Agency
total_VRM
New Jersey Transit Corporation
3781858802
How did overall ridership change over time from 2022 to 2024?
# Calculate the total UPT by yearannual_upt <- USAGE |>group_by(year =year(month)) |>summarize(total_UPT =sum(UPT, na.rm =TRUE)) |>arrange(year)kable(annual_upt, caption ="Total UPT by Year")
Total UPT by Year
year
total_UPT
2002
8416364775
2003
8260528905
2004
8631243464
2005
8944272233
2006
9286575204
2007
9473412104
2008
9888947626
2009
9544171711
2010
9541689057
2011
9732100704
2012
10014964538
2013
10148119047
2014
10281027641
2015
10069594700
2016
9884467841
2017
9664591370
2018
9570957943
2019
9630231947
2020
4488461287
2021
4668753846
2022
6052448631
2023
6972821886
2024
4181081901
Task 5: Table Summarization
Create a new table from USAGE that has annual total (sum) UPT and VRM for 2022. This will require use of the group_by, summarize, and filter functions. You will also want to use the year function, to extract a year from the month column.
The resulting table should have the following columns:
NTD ID Agency metro_area Mode UPT VRM Make sure to ungroup your table after creating it.
`summarise()` has grouped output by 'NTD ID', 'Agency', 'metro_area'. You can
override using the `.groups` argument.
USAGE_2022_ANNUAL
# A tibble: 1,141 × 6
`NTD ID` Agency metro_area Mode UPT VRM
<int> <chr> <chr> <chr> <dbl> <dbl>
1 1 King County Seattle--… Bus 5.40e7 6.16e7
2 1 King County Seattle--… Dema… 6.63e5 1.29e7
3 1 King County Seattle--… Ferr… 4.00e5 5.12e4
4 1 King County Seattle--… Stre… 1.12e6 1.80e5
5 1 King County Seattle--… Trol… 9.58e6 2.64e6
6 1 King County Seattle--… Vanp… 7.03e5 4.41e6
7 2 Spokane Transit Authority Spokane, … Bus 6.60e6 6.49e6
8 2 Spokane Transit Authority Spokane, … Dema… 3.10e5 4.04e6
9 2 Spokane Transit Authority Spokane, … Vanp… 9.06e4 9.06e5
10 3 Pierce County Transportation Benefit… Seattle--… Bus 4.95e6 4.23e6
# ℹ 1,131 more rows
Once we have created this new table, we can merge it to the FINANCIALS data but first we need to make sure they have the same “mode” refrence names with the following code:
Once we have made sure that USAGE_2022_ANNUAL and FINANCIALS have the same value for “mode”, we can merge USAGE_2022_ANNUAL to the FINANCIALS data as follows:
Using the USAGE_AND_FINANCIALS table, we are going to try to answer the following questions:
Which transit system (agency and mode) had the most UPT in 2022?
Which transit system (agency and mode) had the highest farebox recovery, defined as the highest ratio of - Total Fares to Expenses?
Which transit system (agency and mode) has the lowest expenses per UPT?
Which transit system (agency and mode) has the highest total fares per UPT?
Which transit system (agency and mode) has the lowest expenses per VRM?
Which transit system (agency and mode) has the highest total fares per VRM?
We will restrict our answers to major transit systems by defining them as those with 400,000 UPT per annum.
To answer these questions, we’ll need to perform the following steps using the USAGE_AND_FINANCIALS table:
Which transit system (agency and mode) had the most UPT in 2022?
# Find the transit system with the most UPT in 2022library(dplyr)library(knitr)max_upt <- USAGE_AND_FINANCIALS %>%filter(UPT >=400000) %>%arrange(desc(UPT)) %>%slice(1) %>%select(Agency, Mode, UPT)kable(max_upt, caption ="Transit System with Most UPT in 2022")
Transit System with Most UPT in 2022
Agency
Mode
UPT
MTA New York City Transit
Heavy Rail
1793073801
The answer shows us that the MTA New York City Transit is the transit system that had the most UTP with more than 1.79 billion of trips.
Which transit system (agency and mode) had the highest farebox recovery (Total Fares to Expenses ratio)?
`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.
kable(highest_farebox_recovery, caption ="Transit System with Highest Farebox Recovery Ratio")
Transit System with Highest Farebox Recovery Ratio
Agency
Mode
total_fares
total_expenses
farebox_recovery_ratio
Transit Authority of Central Kentucky
Vanpool
97300
40801
2.384746
The Transit System with the highest recovery ratio is the Transit Authority of Central Kentuchy, with the Vanpool mode and a recovery ratio of $2.38 USD.
3.Which transit system (agency and mode) has the lowest expenses per UPT?
`summarise()` has grouped output by 'Agency'. You can override using the
`.groups` argument.
kable(highest_fares_per_upt, caption ="Transit System with Highest Total Fares per UPT")
Transit System with Highest Total Fares per UPT
Agency
Mode
total_fares1
total_UPT
total_fares_per_UPT
Altoona Metro Transit
Demand Response
17163
26
660.1154
The Altoona Metro Transit with the Demand Response mode had the highest total fares per UPT. With just 26 trips, it collected a total in fares of $17,058 USD with a fare of $660 USD per UPT.
5.Which transit system (agency and mode) has the lowest expenses per VRM?
lowest_expenses_per_vrm <- USAGE_AND_FINANCIALS %>%filter(UPT >=400000) %>%mutate(expenses_per_vrm = Expenses / VRM) %>%arrange(expenses_per_vrm) %>%slice(1) %>%select(Agency, Mode, expenses_per_vrm)kable(lowest_expenses_per_vrm, caption ="Transit System with Lowest Expenses per VRM in 2022")
Transit System with Lowest Expenses per VRM in 2022
Agency
Mode
expenses_per_vrm
Metropolitan Transportation Commission
Vanpool
0.4449998
New Mexico Department of Transportation Vanpool mode has the lowest expense per VRM. Its fare per VRM is around 0.45.
6.Which transit system (agency and mode) has the highest total fares per VRM?
highest_fares_per_VRM <- USAGE_AND_FINANCIALS %>%mutate(fares_per_VRM =`Total Fares`/ VRM) %>%arrange(desc(fares_per_VRM)) %>%slice_max(fares_per_VRM, n =1) kable(highest_fares_per_VRM, caption ="Transit System with Highest Fares per VRM in 2022")
Transit System with Highest Fares per VRM in 2022
NTD ID
Agency
metro_area
Mode
UPT
VRM
Agency Name
Total Fares
Expenses
fares_per_VRM
50521
Chicago Water Taxi (Wendella)
Chicago, IL–IN
Ferryboat
16936
600
Chicago Water Taxi (Wendella)
142473
211296
237.455
The transit system that has the highest total fares per VRM is the Ferryboat from Chicago Water Taxi (Wendella) with a total fare per VRM of $237 USD.
Conclusion from the Mini Project 1 Analysis
This mini project provides several key insights into the fiscal characteristics, ridership, and operational efficiency of major U.S. public transit systems in 2022, based on data from the National Transit Database. Below are the major conclusions derived from each analysis:
Most Total Vehicle Revenue Miles (VRM)
Transit Agency: The MTA New York City Transit had the highest total Vehicle Revenue Miles (VRM) with over 10.8 billion VRM. Transit Mode: Buses had the most total VRM, accumulating more than 49.4 billion VRM across agencies. This indicates that the MTA operates the largest transit network in terms of vehicle miles, reflecting its role in serving a massive, densely populated area like New York City. Buses are crucial across transit systems, covering extensive distances.
Ridership Insights
NYC Subway Ridership: In May 2024, there were approximately 180 million trips taken on the NYC Subway (Heavy Rail), showing the continued importance of the subway in daily transportation. Ridership Decline: NYC Subway ridership fell drastically between April 2019 and April 2020 by 91%, primarily due to the COVID-19 pandemic, reflecting the severe impact of external shocks like pandemics on public transit usage. This emphasizes how ridership levels can be highly volatile in response to global events, such as health crises, with significant consequences for transit agencies’ revenue and service planning.
Highest Ridership in a Single Month
Metro Area: New York–Jersey City–Newark, NY–NJ had the highest number of trips in a single month (October 2014), with nearly 396 million unlinked passenger trips (UPT). This reinforces New York City’s dominance as the leading metropolitan area in the U.S. in terms of public transit usage, particularly for heavy rail and bus systems.
Bus Network Extent
Agency with Most Extensive Bus Network: New Jersey Transit Corporation had the most extensive bus network based on VRM, with over 3.78 billion VRM, reflecting its widespread service area and role in connecting commuters across state lines.
Ridership Trends (2022-2024)
Ridership saw a gradual recovery post-pandemic, with total UPT increasing from 6 billion in 2022 to nearly 7 billion in 2023. However, it significantly dropped again in 2024 to just over 4.18 billion, potentially indicating further external factors or a slow post-pandemic recovery.
Annual UPT and VRM for 2022
The analysis of UPT and VRM by transit mode and metro area in 2022 highlights regional differences in transit usage and network extent. For instance, King County (Seattle) had significant bus usage and VRM.
Farebox Recovery and Financial Efficiency
Highest Farebox Recovery: The Transit Authority of Central Kentucky had the highest farebox recovery ratio at 2.38, meaning the system collected more in fares than it spent on expenses (a rare occurrence in public transit). Lowest Expenses per UPT: A yet-to-be-confirmed agency/mode from the final analysis had the lowest expenses per UPT, reflecting operational efficiency in terms of cost per passenger. These insights into farebox recovery and financial efficiency help identify which systems are more sustainable and cost-effective, and which may require additional subsidies or fare increases to maintain financial stability.
Overall Conclusion
The project underscores the critical role that public transit systems like the MTA and New Jersey Transit play in urban mobility, especially in large metropolitan areas like New York City. However, the data also reveals significant vulnerabilities, particularly in ridership declines during major events like the pandemic, which drastically impacted revenue generation.
The findings related to farebox recovery, VRM, and UPT offer insights into which transit systems are the most efficient and financially viable, with a focus on reducing operational expenses and optimizing fare collection to meet expenses.