<< Workshop Home

Geocoding

What is Geocoding?

GPS coordinates can easily be located on a map. However, sometimes our spatial reference is a street address or only the name of a place. It turns out that we can often use these data to find a location on a map. The process of converting an address or place name into map coordinates is called geocoding. This is what Google Maps does when you type in an address and it shows a location on a map. Within R, you can use Google’s mapping database to geocode your data.

Extracting Geographic Data from PDF Reports

You can find lots of useful spatial data online, but rarely will you be lucky enough to find it in shapefile format. More often, it will be embedded in a PDF. You will need to do some work to first get it into a table or spreadsheet that can be read by R. From there, you can geocode.

A PDF report of the National Survey of HIV and Syphilis Prevalence among Women attending Antenatal Clinics in Zimbabwe 2012 is available online. The report includes geographic data on each site, including the name of the site and the district it is located within. However, there are no GPS coordinates or maps included in the report. We’ll see how R can be used to geocode the locations of these facilities and place them on a map.

Of course, first you need to convert the PDF into a spreadsheet. If you have an Adobe license this can be done fairly easily. Otherwise it requires some work: highlight and copy the table, and paste it into an Excel spreadsheet. You’ll see that all the data ends up into a single column. To separate the data, click the Data tab, then click Text to Columns. Choose Delimited. Check Space. Click Finish.

Text to Columns in Excel

Text to Columns in Excel

The data still will not be properly organized, so you’ll have to manually move things around until you get the data sorted in columns. Luckily, I already did this for you! I extracted all of the 2012 data from the report and organized it in ANC2012.xlsx.

Geocoding in R with Google Maps API

Now we are ready to geocode in R. We’ll need the ggmap package to do this. We also need a package for reading Excel files into R. If you’ve closed R since starting this tutorial, you’ll also need to load the rgdal package again.

# Set working directory

# Load packages
library(rgdal) # 'rgdal' automatically loads 'sp' package along with it
library(ggmap) # has a geocoding function
library(readxl) # for reading excel files

# Make sure prettymapr is NOT loaded; it also has a geocode() function
# Our code is written for ggmap's geocode() function
detach(package:prettymapr)

# Options
options(stringsAsFactors = F)

Import the ANC data using the read_excel() function from the readxl package and assign the data to an object called ANC2012.

# Import ANC data
ANC2012 <- read_excel("ANC2012.xlsx")
# Examine the data
class(ANC2012)
## [1] "tbl_df"     "tbl"        "data.frame"
head(ANC2012)
## # A tibble: 6 × 23
##      province    district      site                    name tested.total tested.positive prevalence prev.ci_lower prev.ci_upper res.total_n res.farm_n res.farm_pct res.growthpt_n res.growthpt_pct res.mine_n res.mine_pct res.rural_n res.rural_pct res.urban_n res.urban_pct res.missing_n res.missing_pct classification
##         <chr>       <chr>     <chr>                   <chr>        <dbl>           <dbl>      <dbl>         <dbl>         <dbl>       <dbl>      <dbl>        <dbl>          <dbl>            <dbl>      <dbl>        <dbl>       <dbl>         <dbl>       <dbl>         <dbl>         <dbl>           <dbl>          <chr>
## 1    Bulawayo    Bulawayo Nkulumane        Nkulumane Clinic          596             102       17.1          14.1          20.1         600          1          0.2              2              0.3          0          0.0           7           1.2         581          96.8             9             1.5          Urban
## 2 Chitungwiza Chitungwiza  St Marys         St Marys Clinic          600              67       11.2           8.6          13.7         600          5          0.8              3              0.5          0          0.0           9           1.5         582          97.0             1             0.1          Urban
## 3      Harare      Harare  Glenview                Glenview          333              34       10.2           6.9          13.5         334          0          0.0              1              0.3          0          0.0           0           0.0         328          98.2             5             1.5          Urban
## 4      Harare      Harare Hatcliffe        Hatcliffe Clinic          339              33        9.7           6.5          12.9         340          5          1.5              6              1.8          1          0.3           9           2.7         317          93.2             1             0.3          Urban
## 5      Harare      Harare Kuwadzana        Kuwadzana Clinic          599              70       11.7           9.1          14.3         600          4          0.7              0              0.0          0          0.0           0           0.0         593          98.8             3             0.5          Urban
## 6  Manicaland      Mutasa     Hauna Hauna District Hospital          224              23       10.3           6.2          14.3         224         23         10.3             17              7.6          8          3.6         165          73.7          10           4.5             1             0.5          Rural

Check out the geocode() functionality.

# Read the manual for the geocode function
?geocode

From the Help documentation, we learn that the Google Maps limits to 2500 queries per day. We see that we can use geocodeQueryCheck() to determine how many queries remain today.

#Check how many free geocodes we have 
geocodeQueryCheck() #2500 per day limit
## 2500 geocoding queries remaining.

The Help documentation also informed us that we need to give the geocode() function a single “character vector” for the location input. It will be helpful to include not just the site, but also the province and country. Let’s string those together using the paste0() function.

# Make a new string that has all the place details together
place<- paste0(ANC2012$site, ", ", ANC2012$province, ", ", "Zimbabwe")
head(place)
## [1] "Nkulumane, Bulawayo, Zimbabwe"   "St Marys, Chitungwiza, Zimbabwe" "Glenview, Harare, Zimbabwe"      "Hatcliffe, Harare, Zimbabwe"     "Kuwadzana, Harare, Zimbabwe"     "Hauna, Manicaland, Zimbabwe"

Note that we are using the site variable to locate the place. Alternatively, we could include the name variable to get the more precise location of the health facility. Google knows the locations of many health facilities in Zimbabwe by name. However, Google might not find the location if it is too specific, so we would have to do some cleaning and run a few rounds of geocoding until all locations could be found. For now, let’s just use site to get the approximate location. This will be adequate for a national-level static map.

# Now try geocoding
geocode_result <- geocode(place, output = "more")

You’ll see some red text start to appear in the Console. This is nothing to worry about; it just shows that queries are running. When the red text stops appearing, you’ll know the geocoding is finished. Check the red text results to make sure no errors occurred. View the results.

head(geocode_result)
##   X      lon       lat          type     loctype                                    address     north     south     east     west neighborhood political     locality administrative_area_level_2 administrative_area_level_1  country       route        bus_station establishment
## 1 1 28.50391 -20.18118  neighborhood approximate              nkulumane, bulawayo, zimbabwe -20.16063 -20.20023 28.53330 28.48167    Nkulumane Nkulumane     Bulawayo                    Bulawayo Matabeleland South Province Zimbabwe        <NA>               <NA>          <NA>
## 2 2 31.06667 -17.98333  neighborhood approximate                      saint marys, zimbabwe -17.98198 -17.98468 31.06802 31.06532  Saint Marys      <NA>         <NA>                        <NA>   Mashonaland East Province Zimbabwe        <NA>               <NA>          <NA>
## 3 3 30.94747 -17.90892     political approximate                glen view, harare, zimbabwe -17.88817 -17.93338 30.96954 30.92469         <NA> Glen View       Harare                      Harare             Harare Province Zimbabwe        <NA>               <NA>          <NA>
## 4 4 31.10473 -17.69284 establishment approximate   485 hatcliffe, 20th st, harare, zimbabwe -17.69149 -17.69419 31.10608 31.10339         <NA>      <NA>       Harare                        <NA>                        <NA> Zimbabwe 20th Street               <NA>          <NA>
## 5 5 30.91304 -17.83039     political approximate                kuwadzana, harare, zimbabwe -17.81736 -17.84433 30.94282 30.89076         <NA> Kuwadzana       Harare                      Harare             Harare Province Zimbabwe        <NA>               <NA>          <NA>
## 6 6 32.84917 -18.49885   bus_station approximate hauna growth point, honde valley, zimbabwe -18.49750 -18.50020 32.85052 32.84782         <NA>      <NA> Honde Valley                      Mutasa         Manicaland Province Zimbabwe        <NA> Hauna Growth Point          <NA>

We see that Google reported back numerous details. If we only wanted the longitude and latitude, we could have left out the “more” option in our geocode query. However, details are useful for checking the accuracy of the geocoding results. Let’s check whether Google located these points within the same districts that were reported in the ANC survey report.

# Quality check
cbind(ANC2012$district, geocode_result$administrative_area_level_2)
##       [,1]              [,2]            
##  [1,] "Bulawayo"        "Bulawayo"      
##  [2,] "Chitungwiza"     NA              
##  [3,] "Harare"          "Harare"        
##  [4,] "Harare"          NA              
##  [5,] "Harare"          "Harare"        
##  [6,] "Mutasa"          "Mutasa"        
##  [7,] "Buhera"          "Buhera"        
##  [8,] "Chimanimani"     "Chimanimani"   
##  [9,] "Nyanga"          "Nyanga"        
## [10,] "Makoni"          "Makoni"        
## [11,] "Mutare"          "Mutare"        
## [12,] "Bindura"         "Bindura"       
## [13,] "Rushinga"        "Rushinga"      
## [14,] "Mazowe"          "Mazowe"        
## [15,] "Guruve"          "Guruve"        
## [16,] "Mt Darwin"       "Mount Darwin"  
## [17,] "Shamva"          "Shamva"        
## [18,] "Centenary"       "Centenary"     
## [19,] "Seke"            NA              
## [20,] "Hwedza"          "Wedza"         
## [21,] "Goromonzi"       "Goromonzi"     
## [22,] "Marondera"       "Marondera"     
## [23,] "Murehwa"         "Murehwa"       
## [24,] "Mutoko"          "Mutoko"        
## [25,] "Chikomba"        "Chikomba"      
## [26,] "Zvimba"          "Zvimba"        
## [27,] "Makonde"         "Makonde"       
## [28,] "Kadoma"          "Kadoma"        
## [29,] "Kariba"          "Kariba"        
## [30,] "Hurungwe"        "Hurungwe"      
## [31,] "Chiredzi"        "Chiredzi"      
## [32,] "Chivi"           "Chivi"         
## [33,] "Gutu"            "Gutu"          
## [34,] "Masvingo"        "Masvingo"      
## [35,] "Zaka"            "Zaka"          
## [36,] "Mwenezi"         "Chivi"         
## [37,] "Bikita"          "Bikita"        
## [38,] "Binga"           "Binga"         
## [39,] "Hwange"          "Hwange"        
## [40,] "Bubi"            "Bubi"          
## [41,] "Nkayi"           "Nkayi"         
## [42,] "Umguza"          "Umguza"        
## [43,] "Beitbridge"      "Beitbridge"    
## [44,] "Insiza"          "Insiza"        
## [45,] "Gwanda"          "Gwanda"        
## [46,] "Bulilima/Mangwe" "Bulilimamangwe"
## [47,] "Gokwe"           "Gokwe South"   
## [48,] "Gweru"           "Gweru"         
## [49,] "Kwekwe"          "Kwekwe"        
## [50,] "Mberengwa"       "Mberengwa"     
## [51,] "Chirumanzu"      "Chirumhanzu"   
## [52,] "Shurugwi"        "Shurugwi"      
## [53,] "Zvishavane"      "Zvishavane"

Looks pretty good! There are a few NA results where Google didn’t report the district, but if you further inspect the other fields you’ll see that they seem to be in the right area.

Now that we have latitude and longitude, we can convert to a shapefile and map! First, we need to merge the geocoding results with the ANC data. Let’s also export this data frame to a CSV (comma separated values) spreadsheet file so we can access the geocoded data outside of R.

# Merge ANC data with geocoding results
ANC2012_geocoded <- cbind(ANC2012, geocode_result)
head(ANC2012_geocoded)
##      province    district      site                    name tested.total tested.positive prevalence prev.ci_lower prev.ci_upper res.total_n res.farm_n res.farm_pct res.growthpt_n res.growthpt_pct res.mine_n res.mine_pct res.rural_n res.rural_pct res.urban_n res.urban_pct res.missing_n res.missing_pct classification X      lon       lat          type     loctype                                    address     north     south     east     west neighborhood political     locality
## 1    Bulawayo    Bulawayo Nkulumane        Nkulumane Clinic          596             102       17.1          14.1          20.1         600          1          0.2              2              0.3          0          0.0           7           1.2         581          96.8             9             1.5          Urban 1 28.50391 -20.18118  neighborhood approximate              nkulumane, bulawayo, zimbabwe -20.16063 -20.20023 28.53330 28.48167    Nkulumane Nkulumane     Bulawayo
## 2 Chitungwiza Chitungwiza  St Marys         St Marys Clinic          600              67       11.2           8.6          13.7         600          5          0.8              3              0.5          0          0.0           9           1.5         582          97.0             1             0.1          Urban 2 31.06667 -17.98333  neighborhood approximate                      saint marys, zimbabwe -17.98198 -17.98468 31.06802 31.06532  Saint Marys      <NA>         <NA>
## 3      Harare      Harare  Glenview                Glenview          333              34       10.2           6.9          13.5         334          0          0.0              1              0.3          0          0.0           0           0.0         328          98.2             5             1.5          Urban 3 30.94747 -17.90892     political approximate                glen view, harare, zimbabwe -17.88817 -17.93338 30.96954 30.92469         <NA> Glen View       Harare
## 4      Harare      Harare Hatcliffe        Hatcliffe Clinic          339              33        9.7           6.5          12.9         340          5          1.5              6              1.8          1          0.3           9           2.7         317          93.2             1             0.3          Urban 4 31.10473 -17.69284 establishment approximate   485 hatcliffe, 20th st, harare, zimbabwe -17.69149 -17.69419 31.10608 31.10339         <NA>      <NA>       Harare
## 5      Harare      Harare Kuwadzana        Kuwadzana Clinic          599              70       11.7           9.1          14.3         600          4          0.7              0              0.0          0          0.0           0           0.0         593          98.8             3             0.5          Urban 5 30.91304 -17.83039     political approximate                kuwadzana, harare, zimbabwe -17.81736 -17.84433 30.94282 30.89076         <NA> Kuwadzana       Harare
## 6  Manicaland      Mutasa     Hauna Hauna District Hospital          224              23       10.3           6.2          14.3         224         23         10.3             17              7.6          8          3.6         165          73.7          10           4.5             1             0.5          Rural 6 32.84917 -18.49885   bus_station approximate hauna growth point, honde valley, zimbabwe -18.49750 -18.50020 32.85052 32.84782         <NA>      <NA> Honde Valley
##   administrative_area_level_2 administrative_area_level_1  country       route        bus_station establishment
## 1                    Bulawayo Matabeleland South Province Zimbabwe        <NA>               <NA>          <NA>
## 2                        <NA>   Mashonaland East Province Zimbabwe        <NA>               <NA>          <NA>
## 3                      Harare             Harare Province Zimbabwe        <NA>               <NA>          <NA>
## 4                        <NA>                        <NA> Zimbabwe 20th Street               <NA>          <NA>
## 5                      Harare             Harare Province Zimbabwe        <NA>               <NA>          <NA>
## 6                      Mutasa         Manicaland Province Zimbabwe        <NA> Hauna Growth Point          <NA>
class(ANC2012_geocoded)
## [1] "data.frame"
# Export as a CSV to access outside of R
write.csv(ANC2012_geocoded, file="ANC2012_geocoded.csv")

NOTE: If you were NOT able to successfully geocode, you should download this file to use for the remainder of the workshop. Save it in your Spatial Workshop folder on your Desktop and import it with the following code: ANC2012_geocoded <- read.csv("ANC2012_geocoded.csv")

# Convert to shapefile
ANC_shp <- ANC2012_geocoded
coordinates(ANC_shp) <- ~lon + lat
class(ANC_shp)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
# Import provinces shapefile
provinces_shp <- readOGR("zwe_polbnda_adm1_250k_cso.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "zwe_polbnda_adm1_250k_cso.shp", layer: "zwe_polbnda_adm1_250k_cso"
## with 10 features
## It has 5 fields
# Plot ANC sites with provinces
plot(provinces_shp, col = "azure", main = "Zimbabwe ANC Sentinel Sites 2012")
plot(ANC_shp, add = T, pch = 15)
text(ANC_shp$lon, ANC_shp$lat - 0.1, labels = ANC_shp$site, cex = 0.8)

It’s difficult to make out all of the points clustered in Harare, so let’s create a map of just the Harare province. We can use square brackets to indicate the subset of provinces_shp we want to select. For the labels, subtract a small value (0.01) from the latitude so that the label will appear just below the point.

# Plot only Harare
plot(provinces_shp[provinces_shp$PROVINCE=="Harare", ],
     col = "azure", main = "Harare ANC Sentinel Sites 2012")
plot(ANC_shp, add = T, pch = 15)
text(ANC_shp$lon, ANC_shp$lat - 0.01, labels = ANC_shp$site, cex = 0.8)

<< Workshop Home