Chapter 9 Spatial Data Visualization
In this chapter, we will learn how to deal with the geospatial dataset with packages
sf. We will also interact it with
ggplot2. Link for the slides of this lecture.
9.1 Why spatial visualization with R?
The common solutions for creating maps include GIS (Geographic Information System) software: ArcGIS, QGIS, etc. They are powerful in spatial data visualization and analysis. However, they do have some disadvantages, such as the workflow is not easy to reproduce, not good at statistical computing, etc. As R has more and more powerful packages that could deal with geospatial data, many researchers and practitioners start to use R in their work. Comparing ArcGIS (or QGIS) with R, is similar with comparing Stata with R. Both of them have pros and cons. Whether or not you use them depends on your personal preference and task requirements.
Then when we should use R to do spatial data visualization, not ArcGIS? Well, one situation is that you do not have access to ArcGIS (ArcGIS is not free), then R is a good choice for you. Another situation could be that you have used R for your data analysis and then you could continue to use R to do spatial visualization. You do not need to change the working environment in this case. Or if you want to make your work more reproducible, R is a good choice since its codes can be recycled for other similar work and you could check your codes to track your data analysis history.
9.2 Some basics about spatial data
Spatial data has some differences with our common data. The largest one is that it has spatial attributes for indicate locations. Those spatial attributes could be points, lines, or/and ploygon. It could also be the combination of these elements. Points could be the positions for a single-family house, an intersection, a bar, etc. Lines could be the segments streets, rivers, etc. Polygons could be the boundaries states, counties, cities, etc. As shown in the figure below Source. While they are different spatial attributes, they are all combined with points.
For a spatial data file in R, it is similar to the common dataset. It is a 2-D table with each column for a variable. The spatial data also has a column called
geometry. This is the variable indicating the spatial attributes. For example, for a spatial file including the points of several cities in Minnesota. Its geometry column will be the longitudes and latitudes of those points. If a spatial file contains the boundaries of the states in the U.S., then its geometry column will be the locations of the polygons. If a spatial data has the information of points, it is called point spatial data. If a spatial data has the information of lines, it is called line spatial data. If a spatial data has the information of polygons, it is called polygon data. A spatial data file can only include one type of spatial attribute.
For all geospatial variables, they should have a very important information called coordinate reference system (CRS). We will not expand the concept in this lecture. The simple idea is that it is the system to locate geographical entities. For example, you could use longitude and latitude to locate the position of everything on earth, which is a 3D sphere. Or you could use another type of CRS to locate the position of something in a 2D flattened map, as shown in the figure below (Source). For the a same position, its indice will be different in different CRSs. There are two formats of CRS. One is
EPSG and the other is
EPSG is a combination of numbers.
proj4string contains variables and values. You could find more information about them in this link.
After reviewing some concepts for geospatial data. Let’s write some codes in R.
Before start, let’s install the package first.
And import the package.
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
9.3 Read files
The file format to store the spatial data is shapefile. We could import the shapefile with
st_read() function into R.
The dataset is from the Minnesota Geospatial Commons. It contains the geospatial information of the census tracts of the Twin Cities metro area.
## Reading layer `Census2010RealignTract' from data source `C:\UMN\Course\R_course\PA5928-Data-management-and-visualization-with-R\Census2010RealignTract.shp' using driver `ESRI Shapefile' ## Simple feature collection with 708 features and 10 fields ## geometry type: POLYGON ## dimension: XY ## bbox: xmin: 419967.5 ymin: 4924224 xmax: 521254.7 ymax: 5029010 ## projected CRS: NAD83 / UTM zone 15N
After running the codes, the console has shown some information about the shapefile we just read.
##  "sf" "data.frame"
class(), we could check the type of the variable. It is both
sf means the dataset is a type of geospatial data and contains some related information.
data.frame means that the dataset is also a data frame and could be managed by many functions working on data frames.
## Simple feature collection with 6 features and 10 fields ## geometry type: POLYGON ## dimension: XY ## bbox: xmin: 427577.5 ymin: 4924224 xmax: 516478.2 ymax: 4949418 ## projected CRS: NAD83 / UTM zone 15N ## TRACT STATE CO TRACTCE10 GEOID ALAND10 AWATER10 Acres Shape_Leng ## 1 61502 27 037 061502 27037061502 149642919 4668565 38134.50 54449.32 ## 2 61402 27 037 061402 27037061402 211963640 3786753 53258.86 66036.81 ## 3 81200 27 139 081200 27139081200 90667239 3288582 23205.13 38782.40 ## 4 61501 27 037 061501 27037061501 168761402 1932931 42132.88 65971.67 ## 5 81100 27 139 081100 27139081100 183756089 3059699 46115.82 58071.16 ## 6 81300 27 139 081300 27139081300 218177258 3688572 55189.04 94116.03 ## Shape_Area geometry ## 1 154324840 POLYGON ((477636.4 4932330,... ## 2 215530975 POLYGON ((496883.5 4929115,... ## 3 93907812 POLYGON ((448700 4932487, 4... ## 4 170505702 POLYGON ((496863.3 4932245,... ## 5 186624108 POLYGON ((458497.9 4932388,... ## 6 223342132 POLYGON ((439064.1 4943284,...
head(), we could see that there is a column called
sf file will have this column, where stores the geospatial information of the observations. Other columns are just normal columns containing information for the observations.
We could use
st_crs() to check the CRS of the variable.
## Coordinate Reference System: ## User input: NAD83 / UTM zone 15N ## wkt: ## PROJCRS["NAD83 / UTM zone 15N", ## BASEGEOGCRS["NAD83", ## DATUM["North American Datum 1983", ## ELLIPSOID["GRS 1980",6378137,298.257222101, ## LENGTHUNIT["metre",1]]], ## PRIMEM["Greenwich",0, ## ANGLEUNIT["degree",0.0174532925199433]], ## ID["EPSG",4269]], ## CONVERSION["UTM zone 15N", ## METHOD["Transverse Mercator", ## ID["EPSG",9807]], ## PARAMETER["Latitude of natural origin",0, ## ANGLEUNIT["Degree",0.0174532925199433], ## ID["EPSG",8801]], ## PARAMETER["Longitude of natural origin",-93, ## ANGLEUNIT["Degree",0.0174532925199433], ## ID["EPSG",8802]], ## PARAMETER["Scale factor at natural origin",0.9996, ## SCALEUNIT["unity",1], ## ID["EPSG",8805]], ## PARAMETER["False easting",500000, ## LENGTHUNIT["metre",1], ## ID["EPSG",8806]], ## PARAMETER["False northing",0, ## LENGTHUNIT["metre",1], ## ID["EPSG",8807]]], ## CS[Cartesian,2], ## AXIS["(E)",east, ## ORDER, ## LENGTHUNIT["metre",1]], ## AXIS["(N)",north, ## ORDER, ## LENGTHUNIT["metre",1]], ## ID["EPSG",26915]]
Luckily, this variable contains the right CRS. For the maps in the area of Minnesota, the CRS is the one showing above, 26915 for
However, sometimes, the file does not have CRS information. You have to check the original source of the file and assign the right CRS to it.
st_crs(map) <- 26915
By doing this, there is a message telling that this function will not do the transformation for us. Sometimes, even if the file contains the right CRS, its CRS might be different from other files. If you want to map them in the same page or want to do geospatial data analysis among them, you have to transfer the CRSs of them to the same one. You could do this by using
We could use
ggplot2 to visualize the map. For simply mapping the polygons, we do not need to specify
aes(). And we use
geom_sf() to plot map.
library(ggplot2) ggplot(map) + geom_sf()
9.4 Deal with spatial data like data frame
As the spatial data is also a data frame in R. We could use
dplyr to deal with it. For example, select the column we want.
The current spatial data
map contains some variables we do not need, so we could select the useful ones by using
select() function in
dplyr. Please pay attention that we do not need to select the geometry column. It will be selected automatically. In this lecture, we only select GEOID, which is the ID for each area/tract.
library(dplyr) <- map %>% new_map select(GEOID)
We could join other data to the current geospatial variable by the join functions in
dplyr. Firstly, we import the data.
library(readr) <- read_csv('data.csv')data
## ## -- Column specification -------------------------------------------------------- ## cols( ## GEOID = col_double(), ## POPTOTAL = col_double(), ## HHTOTAL = col_double(), ## AGEUNDER18 = col_double(), ## AGE18_39 = col_double(), ## AGE40_64 = col_double(), ## AGE65UP = col_double() ## )
This dataset also has a variable called GEOID, which could be used to do the join operation. The list below presents the descriptions of other variables, which are all from 2012-2016 five-year ACS estimates.
|GEOID||Unique identifier used by Census FactFinder website|
|AGEUNDER18||Population age under 18|
|AGE18_39||Population in this range|
|AGE40_64||Population in this range|
##  "factor"
##  "numeric"
We could join this data to the map by using the GEOID column. Before join operation, we have to transfer the GEOID to match the data type of GEOID in the map file. Currently, their types are different, one is factor and other is numeric. They have to be the same data type before joining.
<- data %>% data mutate(GEOID = factor(GEOID))
Then, we do the join operation.
<- new_map %>% new_map left_join(data, by = 'GEOID')
This message tells us that the levels of factors in the GEOIDs of the two files are different. This makes sense, since the data file has more observations, which also will have more levels.
9.5 Data management
We calculate the percentage of old people in each census tract.
<- new_map %>% new_map mutate(old_percent = AGE65UP/POPTOTAL)
9.6 Spatial data visualization
We could use
ggplot2 by its
<- ggplot(new_map, aes(fill = old_percent)) + # use fill to indicate the variable you want to visualize p geom_sf() p
The color is not good, we could add
scale_fill_gradient() to specify the colours we want.
+ scale_fill_gradient(low = 'white', high = 'red')p
Again, we could add more information to make the figure better and more readable, and change the theme a little bit.
+ scale_fill_gradient(low = 'white', high = 'red') + p labs(title = 'Distribution of old population in the Twin Cities area', fill = 'Percentage') + theme_bw()
Finally, we could use
ggsave() to save the plot we want after we run the
9.7 More example
We could visualize the population directly in the map with
ggplot(new_map, aes(fill = POPTOTAL)) + geom_sf(colour = 'White') + scale_fill_gradient(low = 'white', high = 'Orange') + labs(title = 'Distribution of population in the Twin Cities area', fill = 'Population') + theme_bw()
9.8 Add other layers to the current plot
We could add other layers to the current plot. We import the spatial data of the transit routes in the metro area.
## Reading layer `TransitRoutes' from data source `C:\UMN\Course\R_course\PA5928-Data-management-and-visualization-with-R\TransitRoutes.shp' using driver `ESRI Shapefile' ## Simple feature collection with 227 features and 32 fields ## geometry type: MULTILINESTRING ## dimension: XY ## bbox: xmin: 409591 ymin: 4947433 xmax: 515317.7 ymax: 5046090 ## projected CRS: NAD83 / UTM zone 15N
We add the transit route layers to the current plot by adding one more
ggplot(new_map) + geom_sf() + geom_sf(data = transit_route)
Add titles and themes to it.
ggplot(new_map) + geom_sf(fill = NA, colour = 'Grey') + geom_sf(data = transit_route, colour = 'Blue') + labs(title = 'Transit route in the Twin Cities Metro area') + theme_bw()
9.9 Geospatial data analysis
For those who are familiar with geospatial operations in ArcGIS,
sf also provides functions to do geospatial data analysis, such as spatial join, intersect, clip, etc. For example, you could use
st_join() to carry out spatial join, use
st_intersects() to do intersect, and
st_intersection() to do clip. You could find more here.