May 7, 2019
In this worksheet, we will use the libraries tidyverse and sf:
library(tidyverse)
theme_set(theme_bw(base_size=12)) # set default ggplot2 theme
library(sf) # needed for simple feature manipulation
We will work with two data frames, US_income
and US_counties_income
, which contain the median income and population number of US states or US counties, respectively.
# load all data
load(url("https://wilkelab.org/classes/SDS348/data_sets/US_income.RData"))
# income and population data by state
head(US_income)
## Simple feature collection with 6 features and 7 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -2356114 ymin: -778242.8 xmax: 1986024 ymax: 845925.2
## epsg (SRID): NA
## proj4string: +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## GEOID name median_income median_income_moe population
## 1 01 Alabama 43623 281 4830620
## 2 04 Arizona 50255 211 6641928
## 3 05 Arkansas 41371 247 2958208
## 4 06 California 61818 156 38421464
## 5 08 Colorado 60629 252 5278906
## 6 09 Connecticut 70331 409 3593222
## area popdens geometry
## 1 133958437749 [m^2] 3.606059e-05 [1/m^2] MULTIPOLYGON (((1032679 -63...
## 2 295232708152 [m^2] 2.249726e-05 [1/m^2] MULTIPOLYGON (((-1216674 -4...
## 3 137792577218 [m^2] 2.146856e-05 [1/m^2] MULTIPOLYGON (((462619.4 -3...
## 4 410516610493 [m^2] 9.359296e-05 [1/m^2] MULTIPOLYGON (((-2077630 -2...
## 5 269580118211 [m^2] 1.958196e-05 [1/m^2] MULTIPOLYGON (((-527710.6 3...
## 6 12961831628 [m^2] 2.772156e-04 [1/m^2] MULTIPOLYGON (((1841099 622...
# income and population data by county
head(US_counties_income)
## Simple feature collection with 6 features and 15 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -2284310 ymin: -146995 xmax: 2024652 ymax: 1066541
## epsg (SRID): NA
## proj4string: +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## STATEFP COUNTYFP COUNTYNS AFFGEOID GEOID NAME LSAD
## 1 06 075 00277302 0500000US06075 06075 San Francisco 06
## 2 25 025 00606939 0500000US25025 25025 Suffolk 06
## 3 31 007 00835826 0500000US31007 31007 Banner 06
## 4 37 181 01008591 0500000US37181 37181 Vance 06
## 5 48 421 01383996 0500000US48421 48421 Sherman 06
## 6 50 011 01461762 0500000US50011 50011 Franklin 06
## ALAND AWATER name median_income
## 1 121485107 479107241 San Francisco County, California 81294
## 2 150855462 160479920 Suffolk County, Massachusetts 55044
## 3 1932676697 397069 Banner County, Nebraska 48897
## 4 653705784 42187365 Vance County, North Carolina 33316
## 5 2390651189 428754 Sherman County, Texas 51987
## 6 1641633748 150930318 Franklin County, Vermont 58199
## median_income_moe population area popdens
## 1 1099 840763 113979848 [m^2] 7.376418e-03 [1/m^2]
## 2 992 758919 180163309 [m^2] 4.212395e-03 [1/m^2]
## 3 4107 820 1926477562 [m^2] 4.256473e-07 [1/m^2]
## 4 1974 44829 673640216 [m^2] 6.654739e-05 [1/m^2]
## 5 4386 3066 2387929738 [m^2] 1.283957e-06 [1/m^2]
## 6 2034 48418 1798183349 [m^2] 2.692606e-05 [1/m^2]
## geometry
## 1 MULTIPOLYGON (((-2283315 35...
## 2 MULTIPOLYGON (((2009657 799...
## 3 MULTIPOLYGON (((-664543.3 4...
## 4 MULTIPOLYGON (((1544259 321...
## 5 MULTIPOLYGON (((-546533.1 -...
## 6 MULTIPOLYGON (((1780210 102...
Both data frames contain the boundaries of the lower 48 states as well as Alaska and Hawaii. For easier visualization, Alaska and Hawaii have been moved to lie underneath the lower 48 states. We can plot the geographic boundaries with geom_sf()
(“sf” stands for “simple features”). Note that for a basic plot, we don’t need to specify an aesthetic mapping, because geometry columns are automatically found and mapped by geom_sf()
.
ggplot(US_income) +
geom_sf()
We can map any of the other data values onto the map using standard ggplot2 techniques. For example, we can color states by the logarithm of the population number.
ggplot(US_income, aes(fill = log(population))) +
geom_sf(color = "black", size = 0.2) # draw state boundaries with thin black lines
We can remove states we are not interested in by filtering, just like we normally do when working with the tidyverse.
# remove Alaska and Hawaii
lower48 <- US_income %>%
filter(!GEOID %in% c("02", "15"))
# plot
ggplot(lower48) + geom_sf()
We can change the coordinate system (i.e., reproject the geometric shapes) by adding coord_sf()
with a coordinate reference system (crs). Many coordinate reference systems are specified by EPSG (European Petroleum Survey Group) codes, which can be looked up at https://epsg.io/ or https://spatialreference.org. For example, we can use EPSG 3395, which is an outdated Mercator projection.
ggplot(lower48) +
geom_sf() +
coord_sf(crs = 3395) # World Mercator, not recommended in practice, https://spatialreference.org/ref/epsg/3395/
Or, we could use EPSG 3338, which is a projection that is normally used for Alaska.
ggplot(lower48) +
geom_sf() +
coord_sf(crs = 3338) # Normally used for Alaska, https://spatialreference.org/ref/epsg/nad83-alaska-albers/
Plot the lower 48 states in a coordinate system that represents longitude along the x axis and latitude along the y axis. Hint: This is called the longitude/latitude projection, and it has an EPSG code of 4326.
ggplot(lower48) +
geom_sf() +
coord_sf(crs = 4326) # Cartesian longitude and latitude
Using the data frame US_counties_income
, plot all US counties, coloring each one by median income. Hint: Use scale_fill_viridis_c(option = "B")
to create an appealing color effect.
ggplot(US_counties_income, aes(fill = median_income)) +
geom_sf(color = NA) + # set `color = NA` to hide county boundaries
scale_fill_viridis_c(option = "B")
Now plot only the counties of Texas, coloring each one by median income. Hint: Texas is represented by a code of "48"
, and this code is stored in the STATEFP
column in the data frame with county information.
US_counties_income %>%
filter(STATEFP == "48") %>%
ggplot(aes(fill = median_income)) +
geom_sf(color = NA) +
scale_fill_viridis_c(option = "B")
Make a map of all the counties in the lower 48, with counties with a median income of at least $75,000 highlighted in red.
US_counties_income %>%
filter(!STATEFP %in% c("02", "15")) %>% # remove Alaska and Hawaii
filter(!is.na(median_income)) %>% # remove counties with missing data
mutate( # classify counties by high/low median income
med_income = ifelse(median_income >= 75000, ">= 75K", "< 75K")
) %>%
ggplot(aes(fill = med_income)) +
geom_sf(color = NA) +
scale_fill_manual(values = c("grey75", "red"))