library(dplyr)
library(exactextractr)
library(landscapemetrics)
library(readr)
library(sf)
library(stringr)
library(terra)
library(tidyr)
library(units)
library(viridis)
# load and inspect the landcover data
<- rast("data-raw/landcover_mcd12q1_umd_us-ga_2014-2022.tif")
landcover print(landcover)
#> class : SpatRaster
#> dimensions : 1174, 1249, 9 (nrow, ncol, nlyr)
#> resolution : 463, 463 (x, y)
#> extent : -8132528, -7553851, 3362724, 3906653 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs
#> source : landcover_mcd12q1_umd_us-ga_2014-2022.tif
#> names : 2014, 2015, 2016, 2017, 2018, 2019, ...
#> min values : 0, 0, 0, 0, 0, 0, ...
#> max values : 15, 15, 15, 15, 15, 15, ...
# map the data for 2022
plot(as.factor(landcover[["2022"]]),
main = "MODIS Landcover 2022",
axes = FALSE)
3 Environmental Variables
3.1 Introduction
Species distribution models work by finding associations between species occurrence or abundance and environmental variables. Using these relationships, it’s possible to predict the distribution in areas that aren’t sampled, provided we know the value of the environmental variables in these areas. Therefore, to proceed with the modeling in the next several chapters, we’ll need a suite of environmental variables to be used as predictors in our models. The particular set of environmental variables that’s most suitable for a given study will depend on the focal species, region, and time period, as well as the availability of data. When species distributions are well defined by the environmental variables, extrapolations to unsurveyed areas will be more accurate. So, it’s worth considering which variables are important for your species and region.
Fortunately, there are an abundance of freely available, satellite-based environmental datasets that are suitable for species distribution modeling. A small subset of possible data sources available globally includes data describing landcover, elevation, topography, surface water, and intertidal habitat. However, we encourage you to search for datasets suitable for their region and species of interest.
Since there are such a wide range of available environmental datasets, and the distribution mechanisms and formats for each are different and often changing, we will not cover the specifics of how to download and pre-processes satellite-derived data products. Instead, we have downloaded and prepared example landcover and elevation datasets and will demonstrate how environmental variables can be extracted from these datasets in the following sections. This will provide examples of assigning environmental variables based on both categorical (landcover) and continuous (elevation) raster data sets.
To gain access to the example raster datasets, download the data package for this guide by following the instructions in the Introduction.
3.2 Landcover
For the examples in this guide, we’ll use land cover variables derived from the MODIS MCD12Q1 v006 land cover product (Friedl and Sulla-Menashe 2015). This product has global coverage at 500m spatial resolution and annual temporal resolution from 2001-2022. These data are available for several different classification schemes. We’ll use the University of Maryland classification system, which provides a globally accurate classification of land cover in our experience. This system classifies pixels into one of 15 different land cover classes. The terra
package includes a nice tutorial for how to download and pre-processing MODIS data like this using the luna
R package.
A subset of the 2014-2022 data for our study region (i.e., Georgia) is in the data package. The file landcover_mcd12q1_umd_us-ga_2014-2022.tif
should be in the data-raw/
subdirectory of your RStudio project. This is a multi-band GeoTIFF in which each band corresponds to a year of landcover data. In R, we’ll use the terra
package to work with raster data.
We’ve also included a lookup table in the data package (data-raw/mcd12q1_fao_classes.csv
) that provides descriptions of each of these classes.
<- read_csv("data-raw/mcd12q1_umd_classes.csv")
lc_classes ::kable(lc_classes) knitr
class | name | label | description |
---|---|---|---|
0 | Water bodies | water | At least 60% of area is covered by permanent water bodies. |
1 | Evergreen Needleleaf Forests | evergreen_needleleaf | Dominated by evergreen conifer trees (canopy >2m). Tree cover >60%. |
2 | Evergreen Broadleaf Forests | evergreen_broadleaf | Dominated by evergreen broadleaf and palmate trees (canopy >2m). Tree cover >60%. |
3 | Deciduous Needleleaf Forests | deciduous_needleleaf | Dominated by deciduous needleleaf (e.g. larch) trees (canopy >2m). Tree cover >60%. |
4 | Deciduous Broadleaf Forests | deciduous_broadleaf | Dominated by deciduous broadleaf trees (canopy >2m). Tree cover >60%. |
5 | Mixed Forests | mixed_forest | Dominated by neither deciduous nor evergreen (40-60% of each) tree type (canopy >2m). Tree cover >60%. |
6 | Closed Shrublands | closed_shrubland | Dominated by woody perennials (1-2m height) >60% cover. |
7 | Open Shrublands | open_shrubland | Dominated by woody perennials (1-2m height) 10-60% cover. |
8 | Woody Savannas | woody_savanna | Tree cover 30-60% (canopy >2m). |
9 | Savannas | savanna | Tree cover 10-30% (canopy >2m). |
10 | Grasslands | grassland | Dominated by herbaceous annuals (<2m). |
12 | Croplands | cropland | At least 60% of area is cultivated cropland. |
13 | Urban and Built-up Lands | urban | At least 30% impervious surface area including building materials, asphalt, and vehicles. |
15 | Non-Vegetated Lands | nonvegetated | At least 60% of area is non-vegetated barren (sand, rock, soil) or permanent snow and ice with less than 10% vegetation. |
255 | Unclassified | unclassified | Has not received a map label because of missing inputs. |
At this point we could use the MODIS land cover data directly, simply extracting the land cover class for each checklist location. However, we instead advocate summarizing the land cover data within a neighborhood around the checklist locations. As discussed in Section 1.1, checklist locations are not precise, so it’s more appropriate to use the habitat in the surrounding area, rather than only at the checklist location. More fundamentally, organisms interact with their environment not at a single point, but at the scale of a landscape, so it’s important to include habitat information characterizing a suitably-sized landscape around the observation location. Based on our experience working with eBird data, a 3 km diameter circular neighborhood centered on each checklist location is sufficient to account for the spatial precision in the data when the maximum distance of travelling counts has been limited to 10km, while also being a relevant ecological scale for many bird species.
There are a variety of landscape metrics that can be used to characterize the composition (what habitat is available) and configuration (how that habitat is arranged spatially) of landscapes. Many of these metrics can be calculated using the R package landscapemetrics
. We’ll use two simple metrics to summarize landcover data in this guide: percent landcover, a measure of composition, and edge density, a measure of configuration.
For example, we can consider a simplified landscape with three cover class: forest, grassland, and water. For each landcover class, percent landcover (abbreviated as pland
) is the percent of the landscape that is composed of that class and edge density (abbreviated as ed
) is the total boundary length of all patches of that class per unit area. For a broad range of scenarios, these two metrics are a reliable choice for calculating environmental covariates in distribution modeling.
We’ll start by finding the full set of unique checklists locations for each year in the eBird data and buffer the points by 1.5km to generate 3 km diameter circular neighborhoods centered on each checklist location. Note that the MODIS landcover data are not available for 2023, so we use the 2022 layer for 2023 checklists.
# ebird checklist locations
<- read_csv("data/checklists-zf_woothr_jun_us-ga.csv") |>
checklists # landcover data not availble for the full period, so we use the closest year
mutate(year_lc = as.character(pmin(year, 2022)))
# generate circular neighborhoods for all checklists
<- checklists |>
checklists_sf # identify unique location/year combinations
distinct(locality_id, year_lc, latitude, longitude) |>
# generate a 3 km neighborhoods
st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
<- st_buffer(checklists_sf, dist = set_units(1.5, "km")) buffers
Next, for each location, we crop and mask the landcover layer corresponding to the checklist year to the circular neighborhood around that checklist. Then we use calculate_lsm()
from landscapemetrics
to calculate pland
and ed
metrics within each neighborhood. This step may take 30 minute or longer to run.
<- NULL
lsm for (i in seq_len(nrow(buffers))) {
<- st_transform(buffers[i, ], crs = crs(landcover))
buffer_i <- as.character(buffer_i$year_lc)
year
# crop and mask landcover raster so all values outside buffer are missing
<- crop(landcover[[year]], buffer_i) |>
lsm[[i]] mask(buffer_i) |>
# calcualte landscape metrics
calculate_lsm(level = "class", metric = c("pland", "ed")) |>
# add variables to uniquely identify each point
mutate(locality_id = buffer_i$locality_id,
year_lc = buffer_i$year_lc) |>
select(locality_id, year_lc, class, metric, value)
}<- bind_rows(lsm) lsm
Finally, we’ll transform the data frame so there’s one row per location and all the environmental variables appear as columns. For each location, any landcover classes that don’t appear within the buffer will not have associated pland
and ed
metrics; at this stage, we replace these implicit missing values with zeros using the complete list of classes in lc_classes
. We also replace the opaque class numbers with more meaningful names from the class description file data-raw/mcd12q1_umd_classes.csv
.
<- lsm |>
lsm_wide # fill missing classes with zeros
complete(nesting(locality_id, year_lc),
class = lc_classes$class,
metric = c("ed", "pland"),
fill = list(value = 0)) |>
# bring in more descriptive names
inner_join(select(lc_classes, class, label), by = "class") |>
# transform from long to wide format
pivot_wider(values_from = value,
names_from = c(class, label, metric),
names_glue = "{metric}_c{str_pad(class, 2, pad = '0')}_{label}",
names_sort = TRUE) |>
arrange(locality_id, year_lc)
3.3 Elevation
In this section we’ll demonstrate how to assign elevation variables, which frequently play an important role in shaping species distributions. Amatulli et al. (2018) provide a suite of global, 1 km resolution topographic variables designed for use in distribution modeling. A range of variables are available, including elevation, slope, roughness, and many others; we’ll focus on elevation here, but the approach can easily be applied to other variables.
All the elevation and topography variables, at various resolutions, are available for download via the website. However, we’ve provided a subset of the 1 km resolution median elevation product covering our study extent data package for this guide. The file elevation_gmted_1km_us-ga.tif
should be in the data-raw/
subdirectory of your RStudio project.
Analogous to how we assigned landcover variables, we’ll calculate the mean and standard deviation of the elevation within 3 km diameter circular neighborhoods centered on each checklist location.
# elevation raster
<- rast("data-raw/elevation_gmted_1km_us-ga.tif")
elevation
# mean and standard deviation within each circular neighborhood
<- exact_extract(elevation, buffers, fun = c("mean", "stdev"),
elev_buffer progress = FALSE) |>
# add variables to uniquely identify each point
mutate(locality_id = buffers$locality_id, year_lc = buffers$year_lc) |>
select(locality_id, year_lc,
elevation_mean = mean,
elevation_sd = stdev)
Now, let’s combine the landcover and elevation variables together, join them back to the checklist data, and save them to be used as model predictors in the upcoming chapters.
# combine elevation and landcover
<- inner_join(elev_buffer, lsm_wide,
env_variables by = c("locality_id", "year_lc"))
# attach and expand to checklists
<- checklists |>
env_variables select(checklist_id, locality_id, year_lc) |>
inner_join(env_variables, by = c("locality_id", "year_lc")) |>
select(-locality_id, -year_lc)
# save to csv, dropping any rows with missing variables
write_csv(drop_na(env_variables),
"data/environmental-variables_checklists_jun_us-ga.csv",
na = "")
3.4 Prediction grid
After training a species distribution model, the goal is typically to make predictions throughout the study area. To do this, we’ll need a prediction grid: a regular grid of habitat variables over which to make predictions. In this section, we’ll create such a prediction grid for our study region (Georgia) using the MODIS landcover data from the most recent year for which they’re available in addition to elevation data. To start, we’ll create a template raster with cells equal in dimension to the diameter of the circular neighborhoods we used above. It’s important to use an equal area coordinate reference system for the prediction grid; we’ll use a Lambert’s azimuthal equal area projection centered on our study region.
Lambert’s azimuthal equal area projection is a good coordinate reference system to use for regional analysis and mapping. It can be tailored to a region of interest by setting the reference latitude and longitude in the projection string to the center of your study region. You can find the coordinates of a region using a mapping tool like Google Maps. Alternatively, you can calculate the centroid of your data using the sf
packagage function st_centroid()
.
|>
checklists_sf st_union() |>
st_centroid() |>
st_coordinates() |>
round(1)
#> X Y
#> [1,] -83.7 33.2
# lambert's azimuthal equal area projection for georgia
<- st_crs("+proj=laea +lat_0=33.2 +lon_0=-83.7")
laea_crs
# study region: georgia
<- read_sf("data/gis-data.gpkg", layer = "ne_states") |>
study_region filter(state_code == "US-GA") |>
st_transform(crs = laea_crs)
# create a raster template covering the region with 3 km resolution
<- rast(study_region, res = c(3000, 3000))
r
# fill the raster with 1s inside the study region
<- rasterize(study_region, r, values = 1) |>
r setNames("study_region")
# save for later use
<- writeRaster(r, "data/prediction-grid_us-ga.tif",
r overwrite = TRUE,
gdal = "COMPRESS=DEFLATE")
Next, we extract the coordinates of the cell centers from the raster we just created, convert these to sf
point features, and buffer these to generate 3 km circular neighborhoods.
# generate neighborhoods for the prediction grid cell centers
<- as.data.frame(r, cells = TRUE, xy = TRUE) |>
buffers_pg select(cell_id = cell, x, y) |>
st_as_sf(coords = c("x", "y"), crs = laea_crs, remove = FALSE) |>
st_transform(crs = 4326) |>
st_buffer(set_units(3, "km"))
Now we can calculate the landcover and elevation variables exactly as we did for the eBird checklists in the previous two sections. First, the landscape metrics pland
and ed
from the landcover data. Note that we use the most recent year of landcover data (i.e. 2022) in this case.
# estimate landscape metrics for each cell in the prediction grid
<- NULL
lsm_pg for (i in seq_len(nrow(buffers_pg))) {
<- st_transform(buffers_pg[i, ], crs = crs(landcover))
buffer_i
# crop and mask landcover raster so all values outside buffer are missing
<- crop(landcover[["2022"]], buffer_i) |>
lsm_pg[[i]] mask(buffer_i) |>
# calcualte landscape metrics
calculate_lsm(level = "class", metric = c("pland", "ed")) |>
# add variable to uniquely identify each point
mutate(cell_id = buffer_i$cell_id) |>
select(cell_id, class, metric, value)
}<- bind_rows(lsm_pg)
lsm_pg
# transform to wide format
<- lsm_pg |>
lsm_wide_pg # fill missing classes with zeros
complete(cell_id,
class = lc_classes$class,
metric = c("ed", "pland"),
fill = list(value = 0)) |>
# bring in more descriptive names
inner_join(select(lc_classes, class, label), by = "class") |>
# transform from long to wide format
pivot_wider(values_from = value,
names_from = c(class, label, metric),
names_glue = "{metric}_c{str_pad(class, 2, pad = '0')}_{label}",
names_sort = TRUE,
values_fill = 0) |>
arrange(cell_id)
And now the mean and standard deviation of elevation.
<- exact_extract(elevation, buffers_pg,
elev_buffer_pg fun = c("mean", "stdev"),
progress = FALSE) |>
# add variables to uniquely identify each point
mutate(cell_id = buffers_pg$cell_id) |>
select(cell_id, elevation_mean = mean, elevation_sd = stdev)
Finally, we combine the landcover and elevation variables together and save to CSV.
# combine landcover and elevation
<- inner_join(elev_buffer_pg, lsm_wide_pg, by = "cell_id")
env_variables_pg
# attach the xy coordinates of the cell centers
<- buffers_pg |>
env_variables_pg st_drop_geometry() |>
select(cell_id, x, y) |>
inner_join(env_variables_pg, by = "cell_id")
# save as csv, dropping any rows with missing variables
write_csv(drop_na(env_variables_pg),
"data/environmental-variables_prediction-grid_us-ga.csv",
na = "")
Keeping these data in a data frame is a compact way to store them and will be required once we make model predictions in later chapters. However, we can always use the raster template to convert these environmental variables into a spatial format, for example, if we want to map them. Let’s look at how this works for percent cover of deciduous broadleaf forest (class 4).
<- env_variables_pg |>
forest_cover # convert to spatial features
st_as_sf(coords = c("x", "y"), crs = laea_crs) |>
# rasterize points
rasterize(r, field = "pland_c04_deciduous_broadleaf")
# make a map
par(mar = c(0.25, 0.25, 2, 0.25))
plot(forest_cover,
axes = FALSE, box = FALSE, col = viridis(10),
main = "Deciduous Broadleaf Forest (% cover)")