<< Workshop Home

Mapping Shapefiles

# Set Working Directory
setwd("/Users/Carrie/Desktop/Spatial Workshop") #replace with your directory

Loading Packages

If you want to use an installed package, you will need to load the package using the library() function. Each time you close R and then re-open it, you will need to re-load the packages you want to use.

# Load packages for this working session
library(sp)
library(rgdal)

The ? function is useful for looking up documentation on packages or code. The results will display in the Help tab within R Studio. If you do not find what you are looking for, remember that you can find lots of help by searching online.

# What do these packages do?
?sp #provides classes & methods for spatial data
?rgdal #helps to read shapefiles

Importing Shapefiles

There are many different commands to import data in R, depending on the type of data. We downloaded health facilities and provinces as shapfiles, which means they are spatial vector data. Shapefiles are pretty easy to work with. Use the readOGR() function from the rgdal package to import shapefiles (files ending in .shp). R can easily find the file by name because you set your working directory to the location of the file. Assign your shapefile to an object called facilities_shp, using the <- notation.

# Options
options(stringsAsFactors = FALSE) #import text variables as characters

# Import the health facilities shapefile
facilities_shp <- readOGR("zwe_health_facility_mohcw.shp")

Examining Data

It’s always a good idea to examine your data after you import it. In particular, we need to know what class of data we are working with. Different classes of data require different commands.

# Class of data
class(facilities_shp)

We see that facilities_shp is a SpatialPointsDataFrame, which means it contains point vector data which can be readily mapped. Try it out with R’s built-in plot() function!

# Plot as a map
plot(facilities_shp)
# View the dataset
View(facilities_shp) #View opens a new tab showing the data as a spreadsheet
# Look at the first few rows of data in the Console
head(facilities_shp)

It appears that our health facilities shapefile came with lots of data about each facility! There are variables including its name, year built, type of facility, number of doctors, number of nurses, etc! Unfortunately it appears that many of these values are missing, judging by all of those zeros. It looks like the data have not been updated since 1998 for these facilities.

# Look at the last few rows of data
tail(facilities_shp)

At the end of the dataset, the updated year is more recent (2007), and there are some values filled in for doctors, nurses, beds, etc. Exciting! Now let’s take a look at the data for some of these variables.

TIP! If you have ever used Stata, you should note a key difference in R is that you can easily work with numerous datasets at a time. So, when you want to access a variable, you need to tell R which dataset the variable is in. This is done using the $ symbol.

# Year updated
table(facilities_shp$UPDATED)

We see that there are two years for the UPDATED variable: 1998 and 2007. Most of the facilities (1,212 of them) were updated in 1998, while 258 facilities were updated in 2007.

# Number of doctors
table(facilities_shp$NUMOFDOCTO)

We see that most facilities have 0 recorded for the number of doctors. 15 facilities have 1 doctor, 5 facilities have 2 doctors, and 2 facilities have 3 doctors recorded. It’s likely that many of the 0 values should actually be recorded as missing (NA).

# Number of nurses
table(facilities_shp$NUMOFNURSE)

Again, most of the facilities have 0 recorded for the number of nurses. However, there are more facilities with at least 1 nurse, and 1 facility even has 41 nurses recorded.

# Number of beds
summary(facilities_shp$NUMOFBEDS)

Using the summary command, we can see that at least three-quarters of facilities have 0 beds recorded. The maximum number of beds recorded is 150.

# Types of facilities
table(facilities_shp$TYPEOFFACI)

There are so many facility types that it is difficult to tell which are most common. Why are they not sorted by order of frequency like they were for doctors, nurses, and beds?

# Check class of data
class(facilities_shp$TYPEOFFACI)

After checking the class, we see that facility types are “character” data, meaning they are made up of letter or symbols rather than numbers (called “string” in Stata). So when we use the table() function, R sorts by alphabetical letter. This is helpful for spotting some of the data cleaning needs, but we need to give R special instructions if we want to order by frequency.

#Sort in descending order
sort(table(facilities_shp$TYPEOFFACI),decreasing=TRUE)

What is the most common type of facility?

Great! Now we have a feel for our facilities data. But all of those points mapped in a blank space are not very helpful for getting a sense of where the facilities are located.

# Import the provinces shapefile
provinces_shp <- readOGR("zwe_polbnda_adm1_250k_cso.shp")

# Class of data
class(provinces_shp)

Again, we have spatial vector data, but this time the data are polygons. Let’s see what they look like.

# Plot as a map
plot(provinces_shp)

Neat! But where did our facilities go? We need to add them on top of the provinces.

# Plot facilities on top of provinces
plot(provinces_shp)
plot(facilities_shp, add=TRUE) #add=TRUE adds to previous plot

Selecting a Subset of a Shapefile

Now that we have examined our data and determined that much of the data are out of date, moving forward we just want to use the facilities that were updated more recently. Let’s look at what proportion of the total facilities we will be keeping if we do that.

# Examine the data again, this time with a proportion table
prop.table(table(facilities_shp$UPDATED))

With the prop.table() (proportion table) function, we see that 17.55% of facilities that were updated in 2007. Let’s select those facilities using the subset() function:

  • We specify that within the dataset facilities_shp, we want to keep all records where the variable UPDATED is equal to 2007.
  • We assign this subset to a new object called facilities2007_shp.
# Select facilities updated in 2007
facilities2007_shp <- subset(facilities_shp, UPDATED==2007)

Now let’s plot our new shapefile to look at the spatial distribution of facilities with data updated in 2007.

# Plot
plot(provinces_shp)
plot(facilities2007_shp, add=TRUE)

What do you notice? Which provinces did not have any facilities included in the 2007 data update?

Alright, let’s examine the data in our 2007 facilities shapefile. This time we’ll construct a more nicely formatted output with columns for the frequency and proportion of facilities by type. We do this by creating two new objects we’ll call Freq and Pct, and binding them together with the cbind() function. We’ll also use the head() function again, this time specifying that we want to see the first 15 rows.

# Within the recently updated facilities, select the clinics, RHCs, and hospitals.
# Explore the types of facilities
Freq <- sort(table(facilities_shp$TYPEOFFACI), decreasing=TRUE)
Pct <- prop.table(Freq)*100 #multiply by 100 for percentage format
Pct <- round(Pct, 2) # round the Pct values to two decimal points
head(cbind(Freq, Pct),15) #use head( ,15) to just look at first 15 rows of the table

We can see the most of the facilities are some type of clinic, rural health centre, or hospital. Let’s create three new shapefiles, one for each of these facility types. We’ll use the subset() function again. To indicate multiple values to include in a single subset, use the | symbol (meaning “or”). R is case sensitive, so we need to specify that we want both Clinic and clinic values in our new clinics2007_shp object.

# Create a new shapefile with just selected clinics
clinics2007_shp <- subset(facilities2007_shp,
                          TYPEOFFACI=="Clinic" | #this bar signifies "OR"
                          TYPEOFFACI=="clinic")

# Create a new shapefile for RHCs
RHCs2007_shp <- subset(facilities2007_shp,
                       TYPEOFFACI=="Rural Health Centre" |
                       TYPEOFFACI=="Rural Health centre")

# Create a new shapefile for Hospitals
hospitals2007_shp <- subset(facilities2007_shp,
                            TYPEOFFACI=="Hospital" |
                            TYPEOFFACI=="Mission Hospital" |
                            TYPEOFFACI=="Rural Hospital" |
                            TYPEOFFACI=="District Hospital" |
                            TYPEOFFACI=="Provincial Hospital")

Visualizing Qualitative Data

Now let’s map these three new shapefiles using different shapes and colors. Within the plot command, you can specify the following inputs:

  • col: a color
  • pch: a symbol code (1 to 24)
# Plot facilities by type using different shapes and colors
plot(provinces_shp)
plot(clinics2007_shp, add = T, col = "purple", pch = 19) #19=circle symbol
plot(RHCs2007_shp, add = T, col = "blue", pch = 15) #15=square symbol
plot(hospitals2007_shp, add = T, col = "red", pch = 17) #17=triangle symbol
# Tip: view more symbols and colors
?plot #see more plot options, including pch symbol options
colors() #see list of color options

Now add some more plot specifications.

  • Provinces polygons:
    • col: fill color
    • border: border color
    • main: title for the plot
  • Facilities points:
    • pch: this time, specify filled symbols
    • col: outline color
    • bg: fill color
    • alpha: transparency for the fill color
    • cex: font size (1 is default)
# Make it prettier
plot(provinces_shp, col = "grey93", border = "grey",
      main = "Zimbabwe Health Facilities Updated in 2007") #title of plot
plot(clinics2007_shp, add = T,
      pch = 21, #filled circle symbol
      col = "blue4", #outline color
      bg = adjustcolor("blue1", alpha = 0.6), #fill color, alpha=transparency
      cex = 1.5) # increase font size
plot(RHCs2007_shp, add = T,
      pch = 22, #filled square symbol
      col = "purple4",
      bg = adjustcolor("purple1", alpha = 0.7),
      cex = 1.5)
plot(hospitals2007_shp, add = T,
      pch = 24, #filled triangle symbol
      col = "red4",
      bg = adjustcolor("red1", alpha = 0.8),
      cex = 1.5)

# Assign plot to an object for easy modification!
plot_2007 <- recordPlot()

Adding a Legend

Now let’s create a legend. We’ll need a list of the labels we want to give to each symbol.

# Create a vector of labels for the legend
legendLab <- c("Clinic", "Rural Health Centre", "Hospital")
legendLab

# Plot with legend
plot_2007 # Render the saved plot
legend("bottomleft", legend=legendLab) # Add legend

We can see that just providing the labels was not sufficient. We also need to tell the legend() function the symbols and colors to go with each label.

# Create a vector of symbols for the legend
legendSym <- c(19, 15, 17)
# Create a vector of colors for the legend
legendCol <- c("blue", "purple", "red")

# Re-draw the plot and legend
plot_2007
legend("topleft", legend = legendLab,
       pch = legendSym, 
       col = legendCol,
       cex = 1, # font size
       pt.cex = 1.5, # increase the symbol size
       bty = "n") # turn off the legend border

# Save changes to plot
plot_2007 <- recordPlot()

Adding a Scale Bar and North Arrow

Now let’s add some finishing touches. Every finished map should include a scale bar and north arrow as references.

# Load the prettymapr package
library(prettymapr)

# Try adding these
plot_2007
addscalebar()
addnortharrow()

The default settings are not ideal for our map. The scale bar and legend are important, but should not distract from the map. Let’s investigate how we can customize these so they attract less attention.

# Check how to adjust the scale bar and north arrow
?addscalebar
?addnortharrow
# Try again
plot_2007
addscalebar(style = "ticks")
addnortharrow(cols = c("white", "grey50"), scale = 0.4)
# Save changes to plot
plot_2007 <- recordPlot()

<< Workshop Home

---
title: "Introduction to Mapping Spatial Data in R"
subtitle: "Part 1: Shapefiles"
output: 
  html_document:
    toc: true
    toc_float: 
      collapsed: false
      smooth_scroll: false
    toc_depth: 2
    number_sections: false 
    theme: lumen
    highlight: textmate
  html_notebook:
    toc: true
    toc_float: 
      collapsed: false
      smooth_scroll: false
    toc_depth: 2
    number_sections: false 
    theme: lumen
    highlight: textmate  

---

```{css}
code {
  white-space: pre !important;
  overflow-x: scroll !important;
  word-break: keep-all !important;
  word-wrap: initial !important;
  
}
body{ /* Normal  */
      font-size: 16px;
  }

```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
options(width=500)
```

[**<< Workshop Home**](https://cafahey.github.io/spatial)

#Mapping Shapefiles

```{r, eval=FALSE}
# Set Working Directory
setwd("/Users/Carrie/Desktop/Spatial Workshop") #replace with your directory
```

##Loading Packages

If you want to use an installed package, you will need to load the package using the `library()` function. Each time you close R and then re-open it, you will need to re-load the packages you want to use.

```{r}
# Load packages for this working session
library(sp)
library(rgdal)
```

The `?` function is useful for looking up documentation on packages or code. The results will display in the Help tab within R Studio. If you do not find what you are looking for, remember that you can find lots of help by searching online.

```{r eval=FALSE}
# What do these packages do?
?sp #provides classes & methods for spatial data
?rgdal #helps to read shapefiles
```

##Importing Shapefiles

There are many different commands to import data in R, depending on the type of data. We downloaded health facilities and provinces as *shapfiles*, which means they are spatial *vector* data. Shapefiles are pretty easy to work with.
Use the `readOGR()` function from the rgdal package to import shapefiles (files ending in `.shp`). R can easily find the file by name because you set your working directory to the location of the file. Assign your shapefile to an object called `facilities_shp`, using the `<-` notation.

```{r}
# Options
options(stringsAsFactors = FALSE) #import text variables as characters

# Import the health facilities shapefile
facilities_shp <- readOGR("zwe_health_facility_mohcw.shp")
```

##Examining Data

It's always a good idea to examine your data after you import it. In particular, we need to know what class of data we are working with. Different classes of data require different commands.

```{r}
# Class of data
class(facilities_shp)
```

We see that facilities_shp is a `SpatialPointsDataFrame`, which means it contains point vector data which can be readily mapped. Try it out with R's built-in `plot()` function!

```{r}
# Plot as a map
plot(facilities_shp)
```


```{r eval=FALSE}
# View the dataset
View(facilities_shp) #View opens a new tab showing the data as a spreadsheet
```

```{r}
# Look at the first few rows of data in the Console
head(facilities_shp)
```

It appears that our health facilities shapefile came with lots of data about each facility! There are variables including its name, year built, type of facility, number of doctors, number of nurses, etc! Unfortunately it appears that many of these values are missing, judging by all of those zeros. It looks like the data have not been updated since 1998 for these facilities.

```{r}
# Look at the last few rows of data
tail(facilities_shp)
```

At the end of the dataset, the updated year is more recent (2007), and there are some values filled in for doctors, nurses, beds, etc. Exciting! Now let's take a look at the data for some of these variables.

**TIP!** If you have ever used Stata, you should note a key difference in R is that you can easily work with numerous datasets at a time. So, when you want to access a variable, you need to tell R which dataset the variable is in. This is done using the `$` symbol.

```{r}
# Year updated
table(facilities_shp$UPDATED)
```
We see that there are two years for the `UPDATED` variable: 1998 and 2007. Most of the facilities (1,212 of them) were updated in 1998, while 258 facilities were updated in 2007.

```{r}
# Number of doctors
table(facilities_shp$NUMOFDOCTO)
```
We see that most facilities have `0` recorded for the number of doctors. 15 facilities have 1 doctor, 5 facilities have 2 doctors, and 2 facilities have 3 doctors recorded. It's likely that many of the 0 values should actually be recorded as missing (`NA`).
```{r}
# Number of nurses
table(facilities_shp$NUMOFNURSE)
```
Again, most of the facilities have 0 recorded for the number of nurses. However, there are more facilities with at least 1 nurse, and 1 facility even has 41 nurses recorded.
```{r}
# Number of beds
summary(facilities_shp$NUMOFBEDS)
```
Using the summary command, we can see that at least three-quarters of facilities have 0 beds recorded. The maximum number of beds recorded is 150.
```{r}
# Types of facilities
table(facilities_shp$TYPEOFFACI)
```
There are so many facility types that it is difficult to tell which are most common. Why are they not sorted by order of frequency like they were for doctors, nurses, and beds?
```{r}
# Check class of data
class(facilities_shp$TYPEOFFACI)
```
After checking the class, we see that facility types are "character" data, meaning they are made up of letter or symbols rather than numbers (called "string" in Stata). So when we use the `table()` function, R sorts by alphabetical letter. This is helpful for spotting some of the data cleaning needs, but we need to give R special instructions if we want to order by frequency.
```{r}
#Sort in descending order
sort(table(facilities_shp$TYPEOFFACI),decreasing=TRUE)
```
*What is the most common type of facility?*

Great! Now we have a feel for our facilities data. But all of those points mapped in a blank space are not very helpful for getting a sense of where the facilities are located.

```{r}
# Import the provinces shapefile
provinces_shp <- readOGR("zwe_polbnda_adm1_250k_cso.shp")

# Class of data
class(provinces_shp)
```
Again, we have spatial vector data, but this time the data are polygons. Let's see what they look like.
```{r}
# Plot as a map
plot(provinces_shp)
```

Neat! But where did our facilities go? We need to add them on top of the provinces.
```{r}
# Plot facilities on top of provinces
plot(provinces_shp)
plot(facilities_shp, add=TRUE) #add=TRUE adds to previous plot
```

##Selecting a Subset of a Shapefile
Now that we have examined our data and determined that much of the data are out of date, moving forward we just want to use the facilities that were updated more recently. Let's look at what proportion of the total facilities we will be keeping if we do that.

```{r}
# Examine the data again, this time with a proportion table
prop.table(table(facilities_shp$UPDATED))
```

With the `prop.table()` (proportion table) function, we see that 17.55% of facilities that were updated in 2007. Let's select those facilities using the `subset()` function: 

* We specify that within the dataset `facilities_shp`, we want to keep all records where the variable `UPDATED` is equal to `2007`. 
* We assign this subset to a new object called `facilities2007_shp`. 

```{r}
# Select facilities updated in 2007
facilities2007_shp <- subset(facilities_shp, UPDATED==2007)
```

Now let's plot our new shapefile to look at the spatial distribution of facilities with data updated in 2007.

```{r}
# Plot
plot(provinces_shp)
plot(facilities2007_shp, add=TRUE)
```

*What do you notice? Which provinces did not have any facilities included in the 2007 data update?*

Alright, let's examine the data in our 2007 facilities shapefile. This time we'll construct a more nicely formatted output with columns for the frequency and proportion of facilities by type. We do this by creating two new objects we'll call `Freq` and `Pct`, and binding them together with the `cbind()` function. We'll also use the `head()` function again, this time specifying that we want to see the first 15 rows.

```{r}
# Within the recently updated facilities, select the clinics, RHCs, and hospitals.
# Explore the types of facilities
Freq <- sort(table(facilities_shp$TYPEOFFACI), decreasing=TRUE)
Pct <- prop.table(Freq)*100 #multiply by 100 for percentage format
Pct <- round(Pct, 2) # round the Pct values to two decimal points
head(cbind(Freq, Pct),15) #use head( ,15) to just look at first 15 rows of the table
```

We can see the most of the facilities are some type of clinic, rural health centre, or hospital. Let's create three new shapefiles, one for each of these facility types. We'll use the `subset()` function again. To indicate multiple values to include in a single subset, use the `|` symbol (meaning "or"). R is case sensitive, so we need to specify that we want both `Clinic` and `clinic` values in our new `clinics2007_shp` object.

```{r}
# Create a new shapefile with just selected clinics
clinics2007_shp <- subset(facilities2007_shp,
                          TYPEOFFACI=="Clinic" | #this bar signifies "OR"
                          TYPEOFFACI=="clinic")

# Create a new shapefile for RHCs
RHCs2007_shp <- subset(facilities2007_shp,
                       TYPEOFFACI=="Rural Health Centre" |
                       TYPEOFFACI=="Rural Health centre")

# Create a new shapefile for Hospitals
hospitals2007_shp <- subset(facilities2007_shp,
                            TYPEOFFACI=="Hospital" |
                            TYPEOFFACI=="Mission Hospital" |
                            TYPEOFFACI=="Rural Hospital" |
                            TYPEOFFACI=="District Hospital" |
                            TYPEOFFACI=="Provincial Hospital")
```

##Visualizing Qualitative Data
Now let's map these three new shapefiles using different shapes and colors. Within the plot command, you can specify the following inputs:

* `col`: a color
* `pch`: a symbol code (`1` to `24`)

```{r}
# Plot facilities by type using different shapes and colors
plot(provinces_shp)
plot(clinics2007_shp, add = T, col = "purple", pch = 19) #19=circle symbol
plot(RHCs2007_shp, add = T, col = "blue", pch = 15) #15=square symbol
plot(hospitals2007_shp, add = T, col = "red", pch = 17) #17=triangle symbol
```


```{r eval=FALSE}
# Tip: view more symbols and colors
?plot #see more plot options, including pch symbol options
colors() #see list of color options
```

Now add some more plot specifications.

  * Provinces polygons:
    * `col`: fill color
    * `border`: border color
    * `main`: title for the plot
  
  * Facilities points:
    * `pch`: this time, specify filled symbols 
    * `col`: outline color
    * `bg`: fill color
    * `alpha`: transparency for the fill color
    * `cex`: font size (`1` is default)
  
```{r}
# Make it prettier
plot(provinces_shp, col = "grey93", border = "grey",
      main = "Zimbabwe Health Facilities Updated in 2007") #title of plot
plot(clinics2007_shp, add = T,
      pch = 21, #filled circle symbol
      col = "blue4", #outline color
      bg = adjustcolor("blue1", alpha = 0.6), #fill color, alpha=transparency
      cex = 1.5) # increase font size
plot(RHCs2007_shp, add = T,
      pch = 22, #filled square symbol
      col = "purple4",
      bg = adjustcolor("purple1", alpha = 0.7),
      cex = 1.5)
plot(hospitals2007_shp, add = T,
      pch = 24, #filled triangle symbol
      col = "red4",
      bg = adjustcolor("red1", alpha = 0.8),
      cex = 1.5)

# Assign plot to an object for easy modification!
plot_2007 <- recordPlot()
```

##Adding a Legend

Now let's create a legend. We'll need a list of the labels we want to give to each symbol.

```{r}
# Create a vector of labels for the legend
legendLab <- c("Clinic", "Rural Health Centre", "Hospital")
legendLab

# Plot with legend
plot_2007 # Render the saved plot
legend("bottomleft", legend=legendLab) # Add legend
```

We can see that just providing the labels was not sufficient. We also need to tell the `legend()` function the symbols and colors to go with each label.

```{r}
# Create a vector of symbols for the legend
legendSym <- c(19, 15, 17)
# Create a vector of colors for the legend
legendCol <- c("blue", "purple", "red")

# Re-draw the plot and legend
plot_2007
legend("topleft", legend = legendLab,
       pch = legendSym, 
       col = legendCol,
       cex = 1, # font size
       pt.cex = 1.5, # increase the symbol size
       bty = "n") # turn off the legend border

# Save changes to plot
plot_2007 <- recordPlot()
```

##Adding a Scale Bar and North Arrow

Now let's add some finishing touches. Every finished map should include a scale bar and north arrow as references.

```{r}
# Load the prettymapr package
library(prettymapr)

# Try adding these
plot_2007
addscalebar()
addnortharrow()
```

The default settings are not ideal for our map. The scale bar and legend are important, but should not distract from the map. Let's investigate how we can customize these so they attract less attention.

```{r eval=FALSE}
# Check how to adjust the scale bar and north arrow
?addscalebar
?addnortharrow
```

```{r}
# Try again
plot_2007
addscalebar(style = "ticks")
addnortharrow(cols = c("white", "grey50"), scale = 0.4)
# Save changes to plot
plot_2007 <- recordPlot()
```

[**<< Workshop Home**](https://cafahey.github.io/spatial)
