suppressPackageStartupMessages({
# Tidyverse and friends
library(tidyverse) # data handling and plotting
library(rvest) # scrape data
library(janitor) # misc cleaning
# Geography and travel
library(sf) # handle geographies
library(osrm) # fetch travel info
# Interactive elements
library(leaflet) # interactive maps
library(DT) # interactive tables
library(plotly) # interactive plots
})
tl;dr
The {osrm} R package can retrieve from the OSRM API the travel duration between points. I looked at these data for NBA basketball-team arenas, whose details I scraped from the web using {rvest} and mapped with {leaflet}.
Note
The original version of this post used the {gmapsdistance} package. I updated it extensively in 2020 to use the {osrm} package, which doesn’t require an API key nor billing details.
On the road
Fans don’t have far to travel in the UK if they want to see their favourite sports team play an away match.
The USA is pretty big, though.
The National Basketball Association (NBA) compensates by separating its teams into Eastern and Western conferences, each with three divisions of five teams. This means that the majority of regular-season games aren’t too far away.
But this varies. Teams are clustered near Lakes Michigan and Erie in the Central division, but the Northwest division stretches from Portland in the Pacific Northwest to Oklahoma City in the centre-south of the country.
What would it take to be a basketball fan who wanted to drive to away games? How long would it take?
R can help
Surprise, this is all a ruse for me to practice with some R packages:
- {rvest} for scraping web pages
- {leaflet} for interactive mapping
- {osrm} for calculating duration of travel between points
There’s four main parts to the post (click to jump):
Let’s start by attaching the packages we need. As always, make sure these are installed first, using install.packages()
.
1. Scrape team data
Use {rvest}
The Wikipedia page for the NBA has a table with each team and its location, including coordinates. We can use the {rvest} web-scraping package to extract that table into a data frame with these steps:
- Read the HTML of the page with
xml2::read_html()
- Extract the HTML node for the table with
rvest::html_nodes()
- Parse the HTML as a table with
rvest::html_table()
Note that you have to provide to html_nodes()
a CSS selector or an XPath that identifies the table’s ‘location’ in the HTML. You can find these using a tool like SelectorGadget, or with your browser’s ‘inspect’ tool (for Chrome, right-click the element on the page, select ‘inspect’, right-click the HTML for that element, go to ‘Copy’, then ’Copy full XPath). Beware: if the Wikipedia page changes, then this path could change in future.
<-
nba_scrape read_html("https://en.wikipedia.org/wiki/National_Basketball_Association") %>%
html_nodes(xpath = "/html/body/div[2]/div/div[3]/main/div[3]/div[3]/div[1]/table[4]") %>%
html_table(fill = TRUE, header = NA) %>%
1]] # list was returned, so extract first list element .[[
Here’s a preview:
glimpse(nba_scrape)
Rows: 32
Columns: 9
$ Division <chr> "Eastern Conference", "Atlantic", "Atlantic", "Atlantic", …
$ Team <chr> "Eastern Conference", "Boston Celtics", "Brooklyn Nets", "…
$ Location <chr> "Eastern Conference", "Boston, Massachusetts", "New York C…
$ Arena <chr> "Eastern Conference", "TD Garden", "Barclays Center", "Mad…
$ Capacity <chr> "Eastern Conference", "19,156", "17,732", "19,812", "20,47…
$ Coordinates <chr> "Eastern Conference", ".mw-parser-output .geo-default,.mw-…
$ Founded <chr> "Eastern Conference", "1946", "1967*", "1946", "1946*", "1…
$ Joined <chr> "Eastern Conference", "1946", "1976", "1946", "1949", "199…
$ `` <chr> "Eastern Conference", NA, NA, NA, NA, NA, NA, NA, NA, NA, …
So, the table has been returned, but it needs to be tidied up.
Wrangle the data
To summarise the main cleaning steps required:
- remove the rogue
NA
-filled column - filter out the spanning headers that identify the conferences
- add a column for each team’s conference
- make numeric the arena capacity
- separate city and state into separate columns
- isolate the latitude and longitude by separating them from the
Coordinates
column - remove the ‘zero width no-break space’ unicode character in the longitude column
- retain only the columns of interest
<- nba_scrape %>%
nba_wrangle select(-length(.)) %>% # remove the last column (NA)
::filter(!str_detect(Division, "Conference")) %>%
dplyrmutate(
Conference = c(rep("Eastern", 15), rep("Western", 15)),
Capacity = as.numeric(str_remove(Capacity, ","))
%>%
) separate(Location, c("City", "State"), sep = ", ") %>%
separate(Coordinates, c("Coords1", "Coords2", "Coords3"), " / ") %>%
separate(Coords3, c("Latitude", "Longitude"), sep = "; ") %>%
separate(Longitude, c("Longitude", "X"), sep = " \\(") %>%
mutate(
Latitude = as.numeric(Latitude),
Longitude = as.numeric(str_remove(Longitude, "\\ufeff")) # rogue unicode
%>%
) select(
everything(),
Team, Conference, -Founded, -Joined, -Coords1, -Coords2, -X
%>%
) as_tibble() # convert to tibble
glimpse(nba_wrangle)
Rows: 30
Columns: 9
$ Team <chr> "Boston Celtics", "Brooklyn Nets", "New York Knicks", "Phil…
$ Conference <chr> "Eastern", "Eastern", "Eastern", "Eastern", "Eastern", "Eas…
$ Division <chr> "Atlantic", "Atlantic", "Atlantic", "Atlantic", "Atlantic",…
$ City <chr> "Boston", "New York City", "New York City", "Philadelphia",…
$ State <chr> "Massachusetts", "New York", "New York", "Pennsylvania", "O…
$ Arena <chr> "TD Garden", "Barclays Center", "Madison Square Garden", "W…
$ Capacity <dbl> 19156, 17732, 19812, 20478, 19800, 20917, 19432, 20332, 179…
$ Latitude <dbl> 42.36630, 40.68265, 40.75056, 39.90111, 43.64333, 41.88056,…
$ Longitude <dbl> -71.06223, -73.97469, -73.99361, -75.17194, -79.37917, -87.…
Add more information
I made a table of three-letter team codes and colours for the markers and icons that will appear in the pins on the interactive map. I got these from teamcolorcodes.com. With {leaflet}. The markers can only take a small set of named colours (see ?awesomeIcons
), whereas the icon can use any CSS-valid colour (like hex codes).
Click for the code that creates a data frame of team codes and colours
<- tribble(
nba_abbr_cols ~Code, ~Franchise, ~colour_marker, ~colour_icon,
"ATL", "Atlanta Hawks", "red", "#C1D32F",
"BKN", "Boston Celtics", "black", "#FFFFFF",
"BOS", "Brooklyn Nets", "green", "#BA9653",
"CHA", "Charlotte Hornets", "darkblue", "#00788C",
"CHI", "Chicago Bulls", "red", "#000000",
"CLE", "Cleveland Cavaliers", "darkred", "#FDBB30",
"DAL", "Dallas Mavericks", "blue", "#B8C4CA",
"DEN", "Denver Nuggets", "darkblue", "#FEC524",
"DET", "Detroit Pistons", "red", "#1D42BA",
"GSW", "Golden State Warriors", "blue", "#FFC72C",
"HOU", "Houston Rockets", "red", "#000000",
"IND", "Indiana Pacers", "darkblue", "#FDBB30",
"LAC", "Los Angeles Clippers", "red", "#1D428A",
"LAL", "Los Angeles Lakers", "blue", "#FDB927",
"MEM", "Memphis Grizzlies", "lightblue", "#12173F",
"MIA", "Miami Heat", "red", "#F9A01B",
"MIL", "Milwaukee Bucks", "darkgreen", "#EEE1C6",
"MIN", "Minnesota Timberwolves", "darkblue", "#9EA2A2",
"NOP", "New Orleans Pelicans", "darkblue", "#C8102E",
"NYK", "New York Knicks", "blue", "#F58426",
"OKC", "Oklahoma City Thunder", "blue", "#EF3B24",
"ORL", "Orlando Magic", "blue", "#C4CED4",
"PHI", "Philadelphia 76ers", "blue", "#ED174C",
"PHX", "Phoenix Suns", "darkblue", "#E56020",
"POR", "Portland Trail Blazers", "red", "#000000",
"SAC", "Sacramento Kings", "purple", "#63727A",
"SAS", "San Antonio Spurs", "black", "#C4CED4",
"TOR", "Toronto Raptors", "red", "#000000",
"UTA", "Utah Jazz", "darkblue", "#F9A01B",
"WAS", "Washington Wizards", "darkblue", "#E31837"
)
head(nba_abbr_cols)
# A tibble: 6 × 4
Code Franchise colour_marker colour_icon
<chr> <chr> <chr> <chr>
1 ATL Atlanta Hawks red #C1D32F
2 BKN Boston Celtics black #FFFFFF
3 BOS Brooklyn Nets green #BA9653
4 CHA Charlotte Hornets darkblue #00788C
5 CHI Chicago Bulls red #000000
6 CLE Cleveland Cavaliers darkred #FDBB30
Now this extra information can be joined to our scraped and wrangled data frame from before.
<- nba_wrangle %>%
nba_table left_join(nba_abbr_cols, by = c("Team" = "Franchise")) %>%
select(Code, everything())
glimpse(nba_table)
Rows: 30
Columns: 12
$ Code <chr> "BKN", "BOS", "NYK", "PHI", "TOR", "CHI", "CLE", "DET", …
$ Team <chr> "Boston Celtics", "Brooklyn Nets", "New York Knicks", "P…
$ Conference <chr> "Eastern", "Eastern", "Eastern", "Eastern", "Eastern", "…
$ Division <chr> "Atlantic", "Atlantic", "Atlantic", "Atlantic", "Atlanti…
$ City <chr> "Boston", "New York City", "New York City", "Philadelphi…
$ State <chr> "Massachusetts", "New York", "New York", "Pennsylvania",…
$ Arena <chr> "TD Garden", "Barclays Center", "Madison Square Garden",…
$ Capacity <dbl> 19156, 17732, 19812, 20478, 19800, 20917, 19432, 20332, …
$ Latitude <dbl> 42.36630, 40.68265, 40.75056, 39.90111, 43.64333, 41.880…
$ Longitude <dbl> -71.06223, -73.97469, -73.99361, -75.17194, -79.37917, -…
$ colour_marker <chr> "black", "green", "blue", "blue", "red", "red", "darkred…
$ colour_icon <chr> "#FFFFFF", "#BA9653", "#F58426", "#ED174C", "#000000", "…
Now we have everything we need to visualise the data and fetch the travel duration times.
2. Map the locations
So where are all the arenas?
We can create a simple interactive map with {leaflet} by plotting the Latitude
and Longitude
columns and creating custom point markers with a basketball icon and each team’s colours, as well as an information box that appears on-click.
leaflet(nba_table) %>%
addProviderTiles(providers$Stamen.TonerLite) %>% # add basemap
addAwesomeMarkers( # add markers
lng = ~Longitude, lat = ~Latitude, # coordinates
popup = ~paste0( # HTML content for popup info
"<b>", nba_table$Team, "</b>", # team name
"<br>", paste0(nba_table$Arena, ", ", nba_table$City), # location
if_else( # division/conference information
$Conference == "Eastern",
nba_tablepaste0("<br><font color='#0000FF'>", nba_table$Division,
" Division (Eastern Conference)</font>"),
paste0("<br><font color='#FF0000'>", nba_table$Division,
" Division (Western Conference)</font>")
)
),icon = awesomeIcons(
library = "ion", icon = "ion-ios-basketball", # add basketball icon
markerColor = nba_table$colour_marker, # colour the marker
iconColor = nba_table$colour_icon # colour the basketball icon
)%>%
) addMeasure() # add straight-line distance-measuring tool
You can drag and zoom and click the points.
3. Get travel duration
So how far between these locations?
The {osrm} R package from Timothée Giraud, Robin Cura and Matthieu Viry lets you fetch shortest paths and travel times from OpenStreetMap via the OSRM API. It defaults to driving, but you can select walking and biking too. Since we’re using the demo server for OSRM, we can only fetch duration.
Duration matrix
The osrm::osrmTable()
function takes a data frame (or spatial object) where the first three columns are an identifier and coordinates. The return object is a list, where the first element is a matrix of durations for each pair of points.
<- nba_table %>%
nba_locs select(Code, Longitude, Latitude) %>%
st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326)
<- osrmTable(loc = nba_locs)
nba_dur
glimpse(nba_dur)
List of 3
$ durations : num [1:30, 1:30] 0 282 276 385 650 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:30] "1" "2" "3" "4" ...
.. ..$ : chr [1:30] "1" "2" "3" "4" ...
$ sources :'data.frame': 30 obs. of 2 variables:
..$ lon: num [1:30] -71.1 -74 -74 -75.2 -79.4 ...
..$ lat: num [1:30] 42.4 40.7 40.7 39.9 43.6 ...
$ destinations:'data.frame': 30 obs. of 2 variables:
..$ lon: num [1:30] -71.1 -74 -74 -75.2 -79.4 ...
..$ lat: num [1:30] 42.4 40.7 40.7 39.9 43.6 ...
Duration: all teams
Let’s take this matrix and tidy it into a data frame so there’s one row per team-pair. We can also round to the nearest minute and calculate the nearest number of hours.
<-
nba_dur_all as.data.frame(nba_dur$durations) %>%
rownames_to_column("Start") %>%
mutate(Start = nba_locs$Code) %>%
rename_with(~c("Start", nba_locs$Code), all_of(names(.))) %>%
pivot_longer(
cols = BKN:SAS,
names_to = "End",
values_to = "Duration (mins)"
%>%
) mutate(
`Duration (mins)` = round_half_up(`Duration (mins)`),
`Duration (hrs)` = round_half_up(`Duration (mins)` / 60)
%>%
) arrange(desc(`Duration (mins)`))
Here’s a {DT} interactive table sorted by duration that you can filter. Click the ‘CSV’ button to download the data.
%>%
nba_dur_all datatable(
filter = "top",
extensions = c("Buttons","Scroller"),
class = "compact", width = "100%",
options = list(
dom = "Blrtip",
scroller = TRUE, scrollY = 300,
buttons = list("csv")
) )
So an incredible 58 hours of driving to get from Miami to Portland.
Duration: by division
We can also narrow this down to get only the team-pairs that play in the same division as each other.
<- nba_dur_all %>%
nba_dur_div left_join(select(nba_table, Code, Division), by = c("Start" = "Code")) %>%
left_join(select(nba_table, Code, Division), by = c("End" = "Code")) %>%
::filter(Division.x == Division.y, `Duration (mins)` != 0) %>%
dplyrselect(Division = Division.x, everything(), -Division.y) %>%
arrange(Division, desc(`Duration (mins)`))
Again, here’s an interactive table that you can use to explore the data. Note that it’s ordered by Division and then duration in minutes. I’ve hidden the code because it’s the same as for the table above.
Click for the {DT} code
%>%
nba_dur_div datatable(
filter = "top",
extensions = c("Buttons","Scroller"),
rownames = FALSE,
class = "compact", width = "100%",
options = list(
dom = "Blrtip",
scroller = TRUE, scrollY = 300,
buttons = list("csv")
) )
This time we can see that there’s a maximum of 33 hours of driving required between two teams in the same division: Portland to Oklahoma City.
A quick diversion: routing
We know from using osrm::osrmTable()
that Miami to Portland has the longest travel duration. What’s the route?
Fortunately, {osrm} has the function osrmRoute()
for fetching the routes between a pair of points.
We can grab a vector of coordinates for each team from our nba_table
object and set these as our origin (src
) and destination (dst
) in osrm::osrmRoute()
. The return object is a ‘linestring’ object that contains detail on the coordinates and coordinate system for the route.
# Function to extract latlong vectors for teams
<- function(data, team_code) {
get_ll <- dplyr::filter(data, Code == team_code)
team_data <- pull(team_data, Longitude)
lng <- pull(team_data, Latitude)
lat <- c(lng, lat)
lnglat return(lnglat)
}
# Get route between latlong pairs
<- osrmRoute(
route src = get_ll(nba_table, "MIA"),
dst = get_ll(nba_table, "POR"),
returnclass = "sf"
)
Warning: "returnclass" is deprecated.
route
Simple feature collection with 1 feature and 4 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: -122.6658 ymin: 25.78202 xmax: -80.15664 ymax: 45.8407
Geodetic CRS: WGS 84
src dst duration distance geometry
src_dst src dst 3461.665 5247.537 LINESTRING (-80.18809 25.78...
Now we can set up the same type of {leaflet} map as earlier, but we’ll include only Portland and OKC. I’ve hidden the map definition because it’s almost the same as before.
Click for the {leaflet} map definition
<- nba_table %>%
mia_por ::filter(Code %in% c("MIA", "POR"))
dplyr
<-
mia_por_map leaflet(mia_por) %>%
addProviderTiles(providers$Stamen.TonerLite) %>% # add basemap
addAwesomeMarkers( # add markers
lng = ~Longitude, lat = ~Latitude, # coordinates
popup = ~paste0( # HTML content for popup info
"<b>", mia_por$Team, "</b>", # team name
"<br>", paste0(mia_por$Arena, ", ", mia_por$City), # location
if_else( # division/conference information
$Conference == "Eastern",
mia_porpaste0("<br><font color='#0000FF'>", mia_por$Division,
" Division (Eastern Conference)</font>"),
paste0("<br><font color='#FF0000'>", mia_por$Division,
" Division (Western Conference)</font>")
)
),icon = awesomeIcons(
library = "ion", icon = "ion-ios-basketball", # add basketball icon
markerColor = mia_por$colour_marker, # colour the marker
iconColor = mia_por$colour_icon # colour the basketball icon
)%>%
) addMeasure() # add straight-line distance-measuring tool
And to that map we can add the line that defines the route
%>% addPolylines(data = st_geometry(route)) mia_por_map
That’s a long way.
4. Make a heatmap
A quick way to visualise the data is to create a heatmap, where we take a matrix of teams in each division and colour by duration. Here, lighter colours indicate greater travel duration.
The plot is interactive; you can hover over squares in each facet to see specific information about that pair, including the exact duration value.
<- nba_dur_div %>%
p ggplot(aes(Start, End)) +
geom_tile(aes(fill = `Duration (hrs)`)) +
xlab("") + ylab("") +
facet_wrap(~Division, scales = "free")
ggplotly(p)
Note the light colours in the Northwest division where teams have to travel far (like the 33 hour trip from Portland and Oklahoma City), while travel durations in the Atlantic and Central divisions are shorter. Of course, the Clippers and Lakers both play in the Staples Center in LA, so their journey time is zero.
Ending the journey
So, this post shows the the power of the {osrm} package for travel distance, duration and routing information.
Of course, it’s never usually as simple as having your geographic data ready to go, so I hope this post also provides a good use-case for {rvest} to help you collect information and {tidyverse} for wrangling it.
The plots here are pretty minimal, but they hopefully give a flavour of how to use {leaflet} for plotting points and the routing between them according to {osrm}.
This post was initially written before the travel restrictions brought about by the 2020 pandemic. Of course, the maps would have been much simpler during for the 2020 playoffs, which all took place in a ‘bubble’ at Disney World, Florida.
Environment
Session info
Last rendered: 2023-08-05 16:58:43 BST
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/London
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plotly_4.10.2 DT_0.28 leaflet_2.1.2 osrm_4.1.1
[5] sf_1.0-14 janitor_2.2.0 rvest_1.0.3 lubridate_1.9.2
[9] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1
[13] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
[17] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.3 bslib_0.5.0 xfun_0.39
[4] htmlwidgets_1.6.2 tzdb_0.4.0 leaflet.providers_1.9.0
[7] vctrs_0.6.3 tools_4.3.1 crosstalk_1.2.0
[10] generics_0.1.3 curl_5.0.1 proxy_0.4-27
[13] fansi_1.0.4 pkgconfig_2.0.3 KernSmooth_2.23-21
[16] data.table_1.14.8 lifecycle_1.0.3 farver_2.1.1
[19] compiler_4.3.1 googlePolylines_0.8.3 munsell_0.5.0
[22] fontawesome_0.5.1 snakecase_0.11.0 sass_0.4.7
[25] htmltools_0.5.5 class_7.3-22 yaml_2.3.7
[28] lazyeval_0.2.2 jquerylib_0.1.4 pillar_1.9.0
[31] ellipsis_0.3.2 classInt_0.4-9 cachem_1.0.8
[34] tidyselect_1.2.0 digest_0.6.33 stringi_1.7.12
[37] labeling_0.4.2 fastmap_1.1.1 grid_4.3.1
[40] colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
[43] utf8_1.2.3 e1071_1.7-13 RcppSimdJson_0.1.10
[46] withr_2.5.0 scales_1.2.1 timechange_0.2.0
[49] rmarkdown_2.23 httr_1.4.6 hms_1.1.3
[52] evaluate_0.21 knitr_1.43.1 viridisLite_0.4.2
[55] rlang_1.1.1 Rcpp_1.0.11 isoband_0.2.7
[58] glue_1.6.2 DBI_1.1.3 xml2_1.3.5
[61] mapiso_0.3.0 rstudioapi_0.15.0 jsonlite_1.8.7
[64] R6_2.5.1 units_0.8-2