Coagulating Samples By Hydrologic Units; or
How I learned to HUC my data…

It is not uncommon in data analysis to need to coagulate samples based upon some spatial feature of your landscape. Here is an example of how I did this for some spotted turtle sampling locations, using hydrologic units.

Published

July 20, 2023

The Watershed Boundary Dataset (WBD) maps the full areal extent of surface water drainage for the U.S. using a hierarchical system of nesting hydrologic units at various scales, each with an assigned hydrologic unit code (HUC). HUCs are delineated and georeferenced to U.S. Geological Survey (USGS) 1:24,000 scale topographic base maps according to compilation criteria monitored by the national Subcommittee on Spatial Water Data.
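
The nesting is encoded in the codes themselves: each level appends two digits to its parent’s code, so the parent HUC2 of any HUC4 is just its first two digits. A quick sketch (the example codes are my reading of the published WBD, e.g. 0208 for the Lower Chesapeake; double-check against the current data set):

```r
# HUC codes nest by prefix: a HUC4 code is its parent HUC2 code plus
# two more digits. So "0208" (Lower Chesapeake) sits inside region
# "02" (Mid-Atlantic), and "0301" (Chowan-Roanoke) inside region "03".
huc4_codes <- c("0207", "0208", "0301")
huc2_codes <- substr(huc4_codes, 1, 2)
huc2_codes   # -> "02" "02" "03"
```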

For this exercise I’m going to use the hydrologic regions that represent “large river basins,” designated HUC4. So, how do we get these spatial extents?

  1. We could go to the WBD website, download the file geodatabase (fileGDB) for each state, and then go from there, or…
  2. We could look and see that someone in the R community has already done this for us and we can use their package.

I’m going to go with the second option.

Getting HUC Representation

So for this, I found a utility package by Mazama Science that has a function getHUCName(longitude, latitude) that seems to be exactly what I need. If you do not have this package installed already, execute the following:

install.packages("MazamaSpatialUtils")

Then load it into your workspace as:

library(MazamaSpatialUtils)
Loading required package: sf
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

The library comes with a few spatial data sets already installed, which you can see:

setSpatialDataDir("./data/")
installedSpatialData()
   SimpleCountries -- Simplified version of the TMWorldBorders 
SimpleCountriesEEZ -- Simplified version of EEZCountries 
   SimpleTimezones -- Simplified version of WorldTimezones 

Unfortunately, this does not include the WBD data set for HUC4, so we’ll have to install it locally. To do this, we designate a location for the files to be stored (n.b., they are semi-large entities); I’ll use a folder I created right under this file named (aptly) data and have them installed there. It took me a bit of snooping around to find the right names for the data sets. As of the writing of this document, the relevant code is in the package source, and you can go look in there to see the coding. The WBD files are not part of the documentation of the package (at least not in the help file for getHUCName() or in the vignettes).

installSpatialData( dataset="WBDHU4" )

Now, we can verify that the HUC4 data is installed.

installedSpatialData()
   SimpleCountries -- Simplified version of the TMWorldBorders 
SimpleCountriesEEZ -- Simplified version of EEZCountries 
   SimpleTimezones -- Simplified version of WorldTimezones 
            WBDHU4 -- Watershed boundary level-4 hydrologic units 
         WBDHU4_01 -- Watershed boundary level-4 hydrologic units (simplified to 1%)
         WBDHU4_02 -- Watershed boundary level-4 hydrologic units (simplified to 2%)
         WBDHU4_05 -- Watershed boundary level-4 hydrologic units (simplified to 5%)

Now we can figure out which HUC4 a specific locale falls in. We first need to load the data set, and then we can query values (here, the designation for Richmond, VA).

loadSpatialData("WBDHU4")
getHUCName( longitude = -77.4360, latitude = 37.5407, dataset = "WBDHU4")
[1] "Lower Chesapeake"

Cool (though it is a bit slow).

Applying to Genetic Data Sets

So let’s apply this to the larger turtle data set. Here I have 1,400 individuals genotyped for 930 SNP loci. The metadata columns for each individual are:

load("all_data.rda")
names(data)[1:6]
[1] "Locale"     "Population" "ID"         "Target"     "Longitude" 
[6] "Latitude"  

The values in Population are the ones that I’m going to collapse samples into, so what I want is the HUC4 designation for each unique Population value.

There are some functions in my gstudio package that will help out. The strata_coordinates() function finds the barycenter of the samples in each stratum and returns a data.frame with those coordinates.

library( gstudio )
coords <- strata_coordinates( data, stratum = "Population")
names(coords)
[1] "Stratum"   "Longitude" "Latitude" 
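
For the curious, the barycenter here is nothing fancier than the mean coordinate of the samples within each stratum. A base-R stand-in (on a made-up toy data frame, not the turtle data) would look something like:

```r
# A minimal stand-in for strata_coordinates(): average the coordinates
# of all samples within each Population (the barycenter of each stratum).
toy <- data.frame(Population = c("A", "A", "B"),
                  Longitude  = c(-77.0, -77.2, -76.5),
                  Latitude   = c(37.5, 37.7, 38.0))
coords_by_hand <- aggregate(cbind(Longitude, Latitude) ~ Population,
                            data = toy, FUN = mean)
coords_by_hand
```

Population A comes back with the midpoint of its two samples (-77.1, 37.6), and B with its single coordinate unchanged.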

The getHUCName() function seems to take a bit of time to run. For example, the single lookup above, on my Mac Studio desktop, takes:

start <- Sys.time()
getHUCName( longitude = -77.4360, latitude = 37.5407, dataset = "WBDHU4")
[1] "Lower Chesapeake"
end <- Sys.time()
print( end - start )
Time difference of 9.182024 secs

It turns out that at the current speed, the full set should take about nrow(coords) * difftime(end, start, units = "mins"), which is about perfect for me to go grab a cup of tea while it is running.
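
To put a rough number on that (using the ~9.2 seconds measured above and a hypothetical count of 80 populations; substitute nrow(coords) for your own data):

```r
per_call_sec <- 9.182   # measured time for one getHUCName() lookup
n_pops       <- 80      # hypothetical; use nrow(coords) for real data
est_minutes  <- per_call_sec * n_pops / 60
round(est_minutes, 1)   # -> 12.2, a solid tea break
```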

coords$HUC4 <- factor( getHUCName( longitude = coords$Longitude,
                           latitude = coords$Latitude,
                           dataset = "WBDHU4") )

which I can now map back onto my original data file.

library( tidyverse )
data |> 
  left_join(  coords |>  select( Population = Stratum, HUC4) ) |>
  select( Locale, HUC4 ) |> 
  summary()
    Locale                                          HUC4    
 Length:1318        Delaware-Mid Atlantic Coastal     :159  
 Class :character   Massachusetts-Rhode Island Coastal:101  
 Mode  :character   Maine Coastal                     : 81  
                    Chowan-Roanoke                    : 76  
                    Lower Chesapeake                  : 76  
                    (Other)                           :805  
                    NA's                              : 20  
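
Note the 20 NA’s in that summary: populations whose barycenter does not fall inside any HUC4 polygon (e.g., a centroid that averages out just offshore) presumably come back without an assignment. If those would trip up downstream analyses, they can be dropped with a simple filter (sketched here on a toy data frame, not the turtle data):

```r
# Toy stand-in for the joined data: one row has no HUC4 assignment.
df <- data.frame(Locale = c("a", "b", "c"),
                 HUC4   = c("Lower Chesapeake", NA, "Chowan-Roanoke"))
kept <- df[!is.na(df$HUC4), ]   # keep only rows with a HUC4 value
nrow(kept)                      # -> 2
```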

Now I can clean up and get rid of the data sets in my blog folder.

unlink("./data/*")

Perfect. Hope this helps.