23  Geospatial Analysis with sf

Abstract
In this chapter, we introduce mapping and geospatial analysis using the simple features framework in R.

The Tabula Rogeriana was an early world atlas by geographer Al Idrisi. This is a reconstituion by Konrad Miller.

23.1 Motivation

  • Many data are inherently spatial. We need tools to manipulate, explore, communicate, and model spatial data.
  • Point-and-click tools suffer many of the drawbacks outlined in earlier note sets.
  • Proprietary geospatial tools are prohibitively expensive.
  • Everyone loves maps.

23.2 library(sf) and geospatial data

Geospatial data are multidimensional. The outline of a simple rectangular state like Kansas may have a few vertices and a complex state like West Virginia may have many more vertices. Furthermore, geospatial data have auxiliary information about bounding boxes and projections. Fortunately, library(sf) can store all of this information in a rectangular tibble that works well with library(dplyr) and library(ggplot2).

library(sf) stores geospatial data, which are points, linestrings, or polygons in a geometry column. Consider this example:

library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
amtrak_stations_file <- 
  here("data", "amtrak_stations.geojson")

# download Amtrak data
if(!file.exists(amtrak_stations_file)) {
  
  download.file(
    url = "https://opendata.arcgis.com/datasets/628537f4cf774cde8aa9721212226390_0.geojson",
    destfile = amtrak_stations_file
  )
  
}

# read sf data
amtrak_stations <- st_read(amtrak_stations_file, quiet = TRUE)

# print the geometry column
# NOTE: geometry columns are "sticky" -- geometry is returned even though it 
# isn't include in select()
amtrak_stations |>
  select(stationnam)
Simple feature collection with 1096 features and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -124.2883 ymin: 25.84956 xmax: -68.67001 ymax: 49.27376
Geodetic CRS:  WGS 84
First 10 features:
              stationnam                   geometry
1               Alma, MI POINT (-84.64484 43.39173)
2             Albany, NY  POINT (-73.80919 42.7445)
3   Abbotsford-Colby, WI POINT (-90.31467 44.92856)
4           Aberdeen, MD POINT (-76.16326 39.50845)
5            Absecon, NJ POINT (-74.50148 39.42405)
6        Albuquerque, NM  POINT (-106.648 35.08207)
7  Antioch-Pittsburg, CA  POINT (-121.816 38.01771)
8            Arcadia, MO POINT (-90.62441 37.59217)
9      Atlantic City, NJ   POINT (-74.4399 39.3627)
10           Ardmore, OK POINT (-97.12552 34.17247)

23.2.1 Points

Points

Points are zero-dimensional geometries. Points are coordinates and are frequently represented by a single longitude and latitude.

Let’s map the Amtrak stations from the above example:

# create a map
amtrak_stations |>
  ggplot() +
  geom_sf() +
  labs(
    title = "Example of point data",
    caption = "Amtrak stations from opendata.arcgis.com") +
  theme_void()

23.2.2 Linestrings

Linestrings

Linestrings are one-dimensional geometries. Linestrings are a sequence of points connected by lines.

Here is an example of Amtrak train routes from the Bureau of Transportation Statistics:

amtrak_routes_file <- here("data", "Amtrak_Routes.geojson")

# load sf data
amtrak_routes <- st_read(amtrak_routes_file, quiet = TRUE)

# create map with lines data
amtrak_routes |>
  ggplot() +
  geom_sf() +
  labs(
    title = "Example of line data",
    caption = "Amtrak routes from Bureau of Transportation Statistics"
  ) +
  theme_void()

23.2.3 Polygons

Polygons

Polygons are two-dimensional geometries. Polygons are a sequence of points connected by lines that form a closed shape.

Here is a simple example with of the Continental United States:

# load and subset states data
states <- tigris::states(cb = TRUE, progress_bar = FALSE) |>
  filter(!STATEFP %in% c("78", "69", "66", "60", "72", "02", "15"))
Retrieving data for the year 2021
states |>
  ggplot() +
  geom_sf() +
  labs(title = "Example of polygon data") +
  theme_void()

Multipoints, multilinestrings, and multipolygons are additional geometries based on points, linestrings, and polygons. The sf documentation outlines many geometry types.

Simple features

Simple features is a standard for storing and accessing geometric features outlined in the ISO 19125 standard.

library(sf) implements simple features in R and introduces sf data.

23.3 geom_sf()

geom_sf() plots sf data. The function automatically references the geometry column and does not require any aesthetic mappings. geom_sf() works well with layers and it is simple to combine point, line, and polygon data.

geom_sf() works like geom_point() for point data, geom_line() for linestring data, and geom_area() for polygon data (e.g. fill controls the color of shapes and color controls the border colors of shapes for polygons).

amtrak_map <- ggplot() +
  geom_sf(
    data = states,
    fill = NA
  ) +
  geom_sf(
    data = amtrak_routes, 
    color = "blue"
  ) +
  geom_sf(
    data = amtrak_stations, 
    color = "red", 
    size = 0.75,
    alpha = 0.5
  ) +
  labs(title = "Amtrak stations and train routes") +
  theme_void()

amtrak_map

23.4 Getting and loading spatial data

23.4.1 Shapefiles

Shapefiles are a proprietary file format created by ESRI, the company that creates ArcGIS. Shapefiles are popular because ESRI dominated the GIS space for a long time. A shapefile is actually usually three or more binary files.

st_read() reads shapefiles into R in the sf format. Simply point the function to the file ending in .shp. Note that the other binary files associated with the shapefile must also be located in the same directory as the .shp file.

23.4.2 GeoJSON

.geojson is an open source file type for storing geospatial data. It is plain text, which means it plays well with version control.

st_read() also reads GeoJSON data. Point the function at a file ending in .geojson to read the data.

23.4.3 .csv files

Lots of geographic information is stored in .csv files–especially for point data where it is sensible to have a longitude and latitude columns. Loading point data from .csv files requires two steps. First, read the file with read_csv(). Second, use st_as_sf() to convert the tibble into an sf object and specify the columns with longitude and latitude with the coords argument:

st_as_sf(data, coords = c("lon", "lat"))

Note: It is trickier (but far less common) to load line, polygon, and multipolygon data from .csvs.

23.4.4 library(tigris)

library(tigris) is an exceptional package that downloads and provides TIGER/Line shapefiles from the US Census Bureau. TIGER stands for Topologically Integrated Geographic Encoding and Referencing.

The package provides lots of data with simple functions (full list here) like counties() to access counties, tracts() to access census tracts, and roads() to access roads. The state and county arguments accept names and FIPS codes.

library(tigris) has a new shift option that elides Alaska and Hawaii next to the Continental United States.

Exercise 1
  1. Using library(tigris), pull roads data for DC with state = "DC" and county = "District of Columbia".
  2. Create a map with geom_sf() and theme_void().

TIGER line files are high-resolution and follow legal boundaries. Sometimes the maps are counterintuitive. For example, the outline of Michigan will include the Great Lakes, which is uncommon. Cartographic boundary files are quicker to download and are clipped to the coastline, which better aligns with expectations.

library(tigris)
To enable caching of data, set `options(tigris_use_cache = TRUE)`
in your R script or .Rprofile.
library(patchwork)

mi <- states(progress_bar = FALSE) |>
  filter(STUSPS == "MI") |>
  ggplot() +
  geom_sf() +
  labs(title = "TIGER/Line") +
  theme_void() 
Retrieving data for the year 2021
mi_cb <- states(cb = TRUE, progress_bar = FALSE) |>
  filter(STUSPS == "MI") |>
  ggplot() +
  geom_sf() +
  labs(title = "Cartographic Boundary") +  
  theme_void()
Retrieving data for the year 2021
mi + mi_cb

library(rgeoboundaries)

The rgeoboundaries package is a client for the geoBoundaries API, providing country political administrative boundaries for countries around the world. This package can be installed from GitHub using the remotes package as follows

The rgeoboundaries package can provide boundaries for countries at different administrative division levels. For example, here we obtain the adm1 boundaries (the first subnational level) for Bangladesh. The type argument of the geoboundaries() function can be set to obtain a simplified version of the boundaries.

library(rgeoboundaries)

bangladesh <- geoboundaries(
  country = "Bangladesh", 
  adm_lvl = "adm1",
  type = "SSCGS" # Simplified Single Country Globally Standardized
) 

ggplot(data = bangladesh) +
  geom_sf()

Exercise 2

We want to read in and map the locations of World Bank development projects in Bangladesh downloaded from AidData, which includes geographic and other information about development projects.

  1. Copy the below code into your script to read in aiddata_bangladesh.csv with read_csv().
aiddata <- read_csv(
  paste0(
    "https://raw.githubusercontent.com/awunderground/awunderground-data/",
    "main/aiddata/aiddata_bangladesh.csv"
    )
  )
Rows: 2648 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): project_id, project_location_id, place_name, location_type_code, l...
dbl  (6): precision_code, geoname_id, latitude, longitude, location_class, g...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  1. Use st_as_sf() to convert the .csv to sf.
  2. Use st_set_crs(value = 4326) to set the CRS (we will discuss below).
  3. Add a basemap of adm1 boundaries for Bangladesh using the bangladesh object created above.
  4. Map the Bangladesh development project data with color = status.

23.4.5 library(tidycensus)

library(tidycensus), which was created by the creator of library(tigris), is also a valuable source of geographic data. Simply include geometry = TRUE in functions like get_acs() to pull the shapes data as sf. The state and county arguments accept names and FIPS codes.

library(tidycensus) sometimes requires the same Census API Key we used in the API tutorial (sign up here). You should be able to install your API key into your version of R with census_api_key("your-key-string", install = TRUE). To obtain your keystring, you can use library(dotenv) and Sys.getenv(<key name in .env file>).

library(tidycensus) has two big differences from library(tigris): 1. it can pull Census data with the geographic data, and 2. it only provides cartographic boundary files that are smaller and quicker to load and more familiar than TIGER/Line shapefiles by default.

library(tidycensus)
Warning: package 'tidycensus' was built under R version 4.3.3
dc_income <- get_acs(
  geography = "tract", 
  variables = "B19013_001", 
  state = "DC", 
  county = "District of Columbia", 
  geometry = TRUE,
  year = 2019,
  progress = FALSE
)
Getting data from the 2015-2019 5-year ACS
Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

Both library(tigris) and library(tidycensus) have a year parameter that determines the year of the data obtained in functions like tidycensus::get_acs() or tigris::states(). This parameter currently defaults to 2020 for get_acs() and tigris functions. tidycensus::get_acs() also notably defaults to pulling 5-year ACS data. We recommend reading the documentation for these functions to understand the parameter options and their default values.

23.5 Choropleths

Choropleths are maps that use fill to display the variation in a variable across geographies.

Getting data from the 2018-2022 5-year ACS
Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.

Exercise 3
  1. Copy the code that pulls income in DC by census tract under library(tidycensus).
  2. Try to recreate the above choropleth using fill.
  3. Hint: Use scale_fill_gradient() with low = "#cfe8f3" and high = "#062635".

23.6 Spatial concepts

Geospatial work requires making an assumption about the shape of the Earth and a decision about how to project three-dimensions on to a two-dimensional surface. Note: the Earth is an ellipsoid where the diameter from pole-to-pole is smaller than from equator to equator. In other words, it is swollen at the equator.

Geographic coordinate reference systems

Geographic coordinate reference systems represent geographies with a three-dimensional representation of the Earth. (i.e., The data have not been projected from what we think of as a globe to what we think of as a map). Data are typically stored as longitude and latitude.

Projected coordinate reference system

Projected coordinate reference system represent geographies with a two-dimensional representation of the Earth. A projected CRS is the combination of a geographic CRS and a projection. Data are typically stored in feet or meters, which is useful for distance-based spatial operations like calculating distances and creating buffers.

Projection

A projection is a mathematical transformation that converts three-dimensional coordinates for a spheroid/ellipsoid into a two-dimensions.

23.7 Spatial operations

We often want to manipulate spatial data or use spatial data for calculations. This section covers a few common operations.

23.7.1 Aggregation

Sometimes we want to aggregate smaller geographies into larger geographies. This is simple with a group_by() and summarize() workflow. Suppose we want to combine North Dakota and South Dakota into Dakota.

# states to combine
dmv_names <- c("South Dakota", "North Dakota")

# add a projection
states <- states |>
  st_transform(crs = 5070)

# aggregate states and make a map
states |>
  mutate(new_NAME = if_else(NAME %in% dmv_names, "Dakota", NAME)) |>
  group_by(new_NAME) |>
  summarize() |>
  ggplot() +
  geom_sf()

23.7.2 Spatial Joins

Spatial joins are joins like left_join() but the join is based on geography instead of “by” variables. Note: both geographies must have the same CRS. Like with any join, it is important to track the number of rows before and after joins and to note that joins may be one-to-one, one-to-many.

st_join() performs a left spatial join in R. st_intersects means observations will join if the geographies in x touch the geographies in y. The sf package offers a number of different geometric confirmations that can be used for spatial joins, such as st_covered_by (identifies if x is copletely within y), st_within (identifies if x is within a specified distance of y) and many more. The sf cheat sheet provides a good outline of the different options.

Suppose we want to count the number of Amtrak stations in each state.

# set states CRS to 4326 to match the Amtrak data
amtrak_stations <- st_transform(amtrak_stations, crs = 5070)

# dimension before join
dim(amtrak_stations)
[1] 1096   12
# spatial join using intersection
amtrak <- st_join(states, amtrak_stations, join = st_intersects)

# dimension after join -- lose international stations
dim(amtrak)
[1] 1080   21
# count the number of stations per state
amtrak |>
  as_tibble() |> # converting from sf to tibble speeds calculations
  group_by(NAME) |>
  summarize(number_of_stations = n()) |>
  arrange(desc(number_of_stations))
# A tibble: 49 × 2
   NAME         number_of_stations
   <chr>                     <int>
 1 California                  184
 2 Oregon                       87
 3 New York                     80
 4 Pennsylvania                 63
 5 Michigan                     54
 6 Washington                   50
 7 Illinois                     41
 8 Wisconsin                    37
 9 Florida                      31
10 Texas                        31
# ℹ 39 more rows

23.7.3 Buffers

Adding buffers to points, lines, and polygons is useful for counting shapes near other shapes. For instance, we may want to count the number of housing units within 500 meters of a metro station or the number of schools more than 5 miles from a fire station.

Suppose we want to count the number of Amtrak stations within 5 kilometers of each Amtrak station. We can buffer the Amtrak station points, join the unbuffered data to the buffered data, and then count.

# add a buffer of 5 kilometers to each station
# the units package is useful for setting buffers with different units
amtrak_stations_buffered <- st_buffer(
  amtrak_stations, 
  dist = units::set_units(5, "km")
)

# spatial join the unbuffered shapes to the buffer shapes
amtrak_stations_joined <- st_join(
  amtrak_stations_buffered,
  amtrak_stations, 
  join = st_intersects
)

# count the station names
amtrak_stations_joined |>
  as_tibble() |>
  count(stationnam.x, sort = TRUE)
# A tibble: 1,007 × 2
   stationnam.x                   n
   <chr>                      <int>
 1 Monterey, CA                  27
 2 Yosemite National Park, CA    17
 3 Boston, MA                     9
 4 Salem, OR                      9
 5 Seaside, OR                    8
 6 Manchester Center, VT          6
 7 Eugene, OR                     5
 8 Las Vegas, NV                  5
 9 Oakland, CA                    5
10 Palm Springs, CA               5
# ℹ 997 more rows

23.7.4 Distances

Euclidean distance is common for calculating straight line distances but does not make sense for calculating distances on the surface of an ellipsoid like Earth.

\[d(\vec{p}, \vec{q}) = \sqrt{\sum_{i = 1}^n(q_i - p_i)^2}\]

Instead, it is common to use Haversine distance which accounts for the curvature in the globe.

Figure 23.4: Haversine Distance

Suppose we want to find the closest and furthest Amtrak stations from Washington DC’s Union Station.

# create a data frame with just Union Station
union_station <- amtrak_stations |>
  filter(state == "DC")

no_union_station <- amtrak_stations |>
  filter(state != "DC") 

# calculate the distance from Union Station to all other stations
amtrak_distances <- st_distance(union_station, no_union_station)

# find the closest station
amtrak_stations |>
  slice(which.min(amtrak_distances)) |>
  select(stationnam, city)
Simple feature collection with 1 feature and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 1618448 ymin: 1915016 xmax: 1618448 ymax: 1915016
Projected CRS: NAD83 / Conus Albers
      stationnam       city                geometry
1 Alexandria, VA Alexandria POINT (1618448 1915016)
# find the further station
amtrak_stations |>
  slice(which.max(amtrak_distances)) |>
  select(stationnam, city)
Simple feature collection with 1 feature and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -2328052 ymin: 2299661 xmax: -2328052 ymax: 2299661
Projected CRS: NAD83 / Conus Albers
   stationnam    city                 geometry
1 Fortuna, CA Fortuna POINT (-2328052 2299661)

23.8 Geospatial modeling

Cross-sectional regression models assume that the error term is independently and identically distributed, which in turn means the dependent variable is independently and identically distributed. This is often a reasonable assumption.

The assumption of independence often falls apart with spatial data. If number of coal power plants in a state is an independent variable and atmospheric carbon dioxide in a state is the dependent variable, then it doesn’t make much sense to assume that North Carolina and South Carolina are independent. If South Carolina has many coal burning power plants, then the emissions could affect atmospheric carbon dioxide in North Carolina.

Spatial regression methods attempt to account for this dependence between observations.

Spatial autocorrelation: Correlation between observations that are geographically close.

Process:

  1. Estimate a non-spatial regression model.
  2. Use tests of spatial autocorrelation like Moran’s I, Geary’s c, or Getis and Ord’s G-statistic on the residuals to test for spatial autocorrelation.
  3. If spatial autocorrelation exists, the use a spatial error model or a spatial lag model.

23.9 Parting Thoughts

This note set works entirely with vector spatial data. Vector data consists of vertices turned into points, lines, and polygons.

Some spatial data are raster data, which are stored as pixels or grid cells. For example, a raster data set may have an even one square mile grid over the entire United States with data about the amount of soy crops within each pixel. It is common for satellite data to be converted to rasters. This website contains good example of raster data.

23.10 More Resources