1 Introduction and Setup
1.1 Introduction
Citizen science data are increasingly making important contributions to ecological research and conservation. One of the most common forms of citizen science data is derived from members of the public recording species observations. eBird (Sullivan et al. 2014) is the largest of these biological citizen science programs. The eBird database contains well over one billion bird observations from every country in the world, with observations of nearly every bird species on Earth. The eBird database is valuable to researchers across the globe, due to its year-round, broad spatial coverage, high volumes of open access data, and applications to many ecological questions. These data have been widely used in scientific research to study phenology, species distributions, population trends, evolution, behavior, global change, and conservation. However, robust inference with eBird data requires careful processing of the data to address the challenges associated with citizen science datasets. This guide, and the associated paper (Johnston et al. 2021), outlines a set of best practices for addressing these challenges and making reliable estimates of species distributions from eBird data.
The next chapter provides an introduction to eBird data, then demonstrates how to access and prepare the data for modeling. The following chapter covers preparing environmental variables to be used as model predictors. The remaining two chapters demonstrate different species distribution models that can be fit using these data: encounter rate models and relative abundance models. Although the examples used throughout this guide focus on eBird data, in many cases the techniques they illustrate also apply to similar citizen science datasets.
1.2 Background knowledge
To understand the code examples used throughout this guide, some knowledge of the programming language R is required. If you don’t meet this requirement, or begin to feel lost trying to understand the code used in this guide, we suggest consulting one of the excellent free resources available online for learning R. For those with little or no prior programming experience, Hands-On Programming with R is an excellent introduction. For those with some familiarity with the basics of R who want to take their skills to the next level, we suggest R for Data Science as the best resource for learning how to work with data within R.
1.2.1 Tidyverse
Throughout this guide, we use packages from the Tidyverse, an opinionated collection of R packages designed for data science. Packages such as ggplot2, for data visualization, and dplyr, for data manipulation, are two of the most well known Tidyverse packages; however, there are many more. In the following chapters, we often use Tidyverse functions without explanation. If you encounter a function you’re unfamiliar with, consult the documentation for help (e.g. ?mutate to see help for the dplyr function mutate()). More generally, the free online guide R for Data Science by Hadley Wickham is the best introduction to working with data in R using the Tidyverse.
1.3 The pipe operator
The one specific piece of syntax we cover here, because it is ubiquitous throughout this guide and unfamiliar to some, is the pipe operator |>. The pipe operator takes the expression to the left of it and “pipes” it into the first argument of the expression on the right, i.e. one can replace f(x) with x |> f(). The pipe makes code significantly more readable by avoiding nested function calls, reducing the need for intermediate variables, and making sequential operations read left-to-right. For example, to add a new variable to a data frame, then summarize using a grouping variable, the following are equivalent:
library(dplyr)
# pipes
mtcars |>
  mutate(wt_kg = 454 * wt) |>
  group_by(cyl) |>
  summarize(wt_kg = mean(wt_kg))
#> # A tibble: 3 × 2
#>     cyl wt_kg
#>   <dbl> <dbl>
#> 1     4 1038.
#> 2     6 1415.
#> 3     8 1816.
# intermediate variables
mtcars_kg <- mutate(mtcars, wt_kg = 454 * wt)
mtcars_grouped <- group_by(mtcars_kg, cyl)
summarize(mtcars_grouped, wt_kg = mean(wt_kg))
#> # A tibble: 3 × 2
#>     cyl wt_kg
#>   <dbl> <dbl>
#> 1     4 1038.
#> 2     6 1415.
#> 3     8 1816.
# nested function calls
summarize(
  group_by(
    mutate(mtcars, wt_kg = 454 * wt),
    cyl
  ),
  wt_kg = mean(wt_kg)
)
#> # A tibble: 3 × 2
#>     cyl wt_kg
#>   <dbl> <dbl>
#> 1     4 1038.
#> 2     6 1415.
#> 3     8 1816.
Once you become familiar with the pipe operator, we believe you’ll find the above example using the pipe the easiest of the three to read and interpret.
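As a smaller illustration of the rule that x |> f() is equivalent to f(x), the following sketch uses only base R. The vector x and the variable names are made up for this example, and the native pipe assumes R 4.1.0 or later.

```r
# Base R only; the native pipe |> requires R 4.1.0 or later.
x <- c(1, 4, 9, 16)

# x |> f() is the same as f(x)
nested <- sqrt(x)
piped <- x |> sqrt()
identical(nested, piped)

# the left-hand side fills the FIRST argument of the call on the right;
# any other arguments are supplied as usual, so the chain below is
# another way of writing round(mean(x), digits = 1)
chained <- x |> mean() |> round(digits = 1)
identical(chained, round(mean(x), digits = 1))
```

Both calls to identical() compare the piped form against the conventional form of the same computation.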
1.3.1 Working with spatial data in R
Some familiarity with the main spatial R packages sf and terra will be necessary to follow along with this guide. The free online book Geocomputation with R is a good resource on working with spatial data in R.
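For readers who have not used these packages before, the following is a minimal sketch of the kind of operations involved, assuming sf and terra are installed. The point coordinates, raster extent, and cell values are invented for illustration and are not eBird data.

```r
library(sf)
library(terra)

# two made-up points (longitude, latitude); not real eBird data
pts <- data.frame(id = c("a", "b"),
                  lon = c(-76.45, -76.55),
                  lat = c(42.48, 42.44))

# sf: convert a data frame to a spatial object in WGS84 (EPSG:4326)
pts_sf <- st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)

# terra: build a small 2 x 2 raster covering the points and fill it
r <- rast(xmin = -77, xmax = -76, ymin = 42, ymax = 43,
          resolution = 0.5, crs = "EPSG:4326")
values(r) <- 1:ncell(r)

# look up the raster value under each point
terra::extract(r, vect(pts_sf))
```

Here sf handles the vector data (points) and terra handles the raster data, which is broadly the division of labor between the two packages.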
1.4 Setup
1.4.1 Data package
The next two chapters of this guide focus on obtaining and preparing eBird data and environmental variables for the modeling that will occur in the remaining chapters. These steps can be time consuming and laborious. If you’d like to skip straight to the analysis, download this package of prepared data. Unzipping this file should produce two directories: data/ and data-raw/. Move both of these directories so they are subdirectories of your RStudio project folder. This will allow you to jump right into the modeling and ensure that you’re using exactly the same data as was used when creating this guide. This is a good option if you don’t have a fast enough internet connection to download the eBird data.
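One way to confirm that the two directories ended up in the right place is a quick check from the R console. This is only a sketch: it assumes your working directory is the project folder, and the variable names are arbitrary.

```r
# directories expected after unzipping the data package into the project folder
expected_dirs <- c("data", "data-raw")

# check which of them exist relative to the current working directory
found <- dir.exists(expected_dirs)
names(found) <- expected_dirs

if (all(found)) {
  message("Data package found.")
} else {
  message("Missing: ", paste(expected_dirs[!found], collapse = ", "),
          " (working directory: ", getwd(), ")")
}
```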
1.4.2 Software
The examples throughout this website use the programming language R (R Core Team 2023) to work with eBird data. If you don’t have R installed, download it now. If you already have R, there’s a good chance you have an outdated version, so update it to the latest version now. R is updated regularly, and it is important that you have the most recent version of R to avoid headaches when installing packages. We suggest checking every couple of months to see if a new version has been released.
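To see which version of R you are running, a short check such as the following can help. The 4.1.0 threshold is an assumption added here, not a requirement stated by this guide: it is the release in which the native pipe |> was introduced, so older versions cannot run the code as written.

```r
# version of R currently running
R.version.string

# the native pipe |> used throughout this guide needs R 4.1.0 or later
r_ok <- getRversion() >= "4.1.0"
if (!r_ok) {
  warning("R ", as.character(getRversion()),
          " does not support |>; please update R.")
}
```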
We strongly encourage R users to use RStudio. RStudio is not required to follow along with this guide; however, it will make your R experience significantly better. If you don’t have RStudio, download it now. If you already have it, update it, because new versions with useful additional features are regularly released.
Due to the large size of the eBird dataset, working with it requires the Unix command-line utility AWK. You won’t need to use AWK directly, since the R package auk does this hard work for you, but you do need AWK to be installed on your computer. Linux and Mac users should already have AWK installed on their machines; however, Windows users will need to install Cygwin to gain access to AWK. Cygwin is free software that allows Windows users to use Unix tools. Cygwin should be installed in the default location (C:/cygwin/bin/gawk.exe or C:/cygwin64/bin/gawk.exe) in order for everything to work correctly. Note: there’s no need to do anything at the “Select Utilities” screen; AWK will be installed by default.
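If you are unsure whether AWK is available, the following base R sketch looks for it on the system search path and in the two default Cygwin locations listed above. It is a rough check under those assumptions, not the detection logic used by auk itself.

```r
# look for awk or gawk on the system search path (typical for Linux and Mac)
awk_on_path <- Sys.which(c("awk", "gawk"))

# default Cygwin locations mentioned above (Windows)
cygwin_paths <- c("C:/cygwin/bin/gawk.exe", "C:/cygwin64/bin/gawk.exe")

# keep only the candidates that were actually found
candidates <- c(unname(awk_on_path[nzchar(awk_on_path)]),
                cygwin_paths[file.exists(cygwin_paths)])

awk_found <- length(candidates) > 0
if (awk_found) {
  message("AWK found at: ", candidates[1])
} else {
  message("AWK not found; see the installation notes above.")
}
```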
1.4.3 R packages
The examples in this guide use a variety of R packages for accessing eBird data, working with spatial data, data processing and manipulation, and model training. To install all the packages necessary to work through this guide, run the following code:
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("ebird/ebird-best-practices")
Note that several of the spatial packages require external system dependencies. If installing these packages fails, consult the instructions for installing dependencies on the sf package website. Finally, ensure all R packages are updated to their most recent versions by clicking on the Update button on the Packages tab in RStudio.
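After installation, a quick way to confirm that key packages are available is sketched below. The vector pkgs lists only a handful of packages named in this chapter and is not the complete set of dependencies; the commented-out update.packages() call is a console alternative to the Update button.

```r
# a few packages named in this chapter; NOT the complete dependency list
pkgs <- c("auk", "dplyr", "ggplot2", "sf", "terra", "rnaturalearth")

# TRUE for each package that can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
installed

# packages that still need to be installed
missing_pkgs <- names(installed)[!installed]
missing_pkgs

# console alternative to the Update button in RStudio:
# update.packages(ask = FALSE)
```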
1.4.4 eBird data access
Access to the eBird database is provided via the eBird Basic Dataset (EBD) as tab-separated text files. To access the EBD, begin by creating an eBird account and signing in. Then visit the eBird Data Access page and fill out the data access request form. eBird data access is free for most uses; however, you will need to request access in order to download the EBD. Filling out the access request form allows eBird to keep track of the number of people using the data and obtain information on the applications for which the data are used.
Once you’ve been granted access to the EBD, you will be able to download either the entire eBird dataset or subsets for specific species, regions, or time periods. This is covered in more detail in the next chapter.
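Because the EBD is distributed as tab-separated text, it can in principle be read with standard tools. The sketch below reads a tiny tab-separated file with base R; the file, column names, and values are invented for illustration and do not reflect the actual EBD layout, and in practice the full dataset is far too large to read this way.

```r
# write a tiny tab-separated file; the column names and values are made up
tsv <- tempfile(fileext = ".txt")
writeLines(c("species\tcount\tlocality",
             "Wood Thrush\t2\tSapsucker Woods",
             "Wood Thrush\t5\t\"The Point\" trail"),
           tsv)

# quote = "" stops quotation marks inside text fields from being
# interpreted as string delimiters
obs <- read.delim(tsv, sep = "\t", quote = "")
obs

# remove the temporary file
unlink(tsv)
```

Disabling quote handling is a common precaution for free-text fields such as locality names, which may contain quotation marks or apostrophes.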
1.4.5 GIS data
Throughout this guide, we’ll be producing maps of species distributions. To provide context for these distributions, we’ll need GIS data for political boundaries. Natural Earth is the best source for a range of tightly integrated vector and raster GIS data for producing professional cartographic maps. The R package rnaturalearth provides a convenient method for accessing these data from within R.
The data package mentioned in Section 1.4.1 contains a GeoPackage with all the necessary GIS data. However, for reference, the following code was used to generate the GIS dataset. Running this code will create a GeoPackage containing the necessary spatial layers in data/gis-data.gpkg.
library(dplyr)
library(rnaturalearth)
library(sf)
# file to save spatial data
gpkg_file <- "data/gis-data.gpkg"
dir.create(dirname(gpkg_file), showWarnings = FALSE, recursive = TRUE)
# political boundaries
# land border with lakes removed
ne_land <- ne_download(scale = 50, category = "cultural",
                       type = "admin_0_countries_lakes",
                       returnclass = "sf") |>
  filter(CONTINENT %in% c("North America", "South America")) |>
  st_set_precision(1e6) |>
  st_union()
# country boundaries
ne_countries <- ne_download(scale = 50, category = "cultural",
                            type = "admin_0_countries_lakes",
                            returnclass = "sf") |>
  select(country = ADMIN, country_code = ISO_A2)
# state boundaries for united states
ne_states <- ne_download(scale = 50, category = "cultural",
                         type = "admin_1_states_provinces",
                         returnclass = "sf") |>
  filter(iso_a2 == "US") |>
  select(state = name, state_code = iso_3166_2)
# country lines
# downloaded globally then filtered to north america with st_intersects()
ne_country_lines <- ne_download(scale = 50, category = "cultural",
                                type = "admin_0_boundary_lines_land",
                                returnclass = "sf") |>
  st_geometry()
lines_on_land <- st_intersects(ne_country_lines, ne_land, sparse = FALSE) |>
  as.logical()
ne_country_lines <- ne_country_lines[lines_on_land]
# states, north america
ne_state_lines <- ne_download(scale = 50, category = "cultural",
                              type = "admin_1_states_provinces_lines",
                              returnclass = "sf") |>
  filter(ADM0_A3 %in% c("USA", "CAN")) |>
  mutate(iso_a2 = recode(ADM0_A3, USA = "US", CAN = "CAN")) |>
  select(country = ADM0_NAME, country_code = iso_a2)
# save all layers to a geopackage
unlink(gpkg_file)
write_sf(ne_land, gpkg_file, "ne_land")
write_sf(ne_countries, gpkg_file, "ne_countries")
write_sf(ne_states, gpkg_file, "ne_states")
write_sf(ne_country_lines, gpkg_file, "ne_country_lines")
write_sf(ne_state_lines, gpkg_file, "ne_state_lines")
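Once the GeoPackage exists, whether generated by the code above or taken from the data package, its contents can be inspected and read back in. The following is a minimal sketch assuming sf is installed and the file is at data/gis-data.gpkg relative to the working directory; the choice of the ne_states layer is arbitrary.

```r
library(sf)

gpkg_file <- "data/gis-data.gpkg"

if (file.exists(gpkg_file)) {
  # list the layers stored in the GeoPackage
  print(st_layers(gpkg_file))
  # read one layer back in as an sf object
  ne_states <- read_sf(gpkg_file, layer = "ne_states")
  print(ne_states)
} else {
  message("GeoPackage not found at ", gpkg_file)
}
```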
1.5 Session info
This guide was compiled using the latest version of R and all R packages at the time of compilation. If you encounter errors while running code in this guide, it is likely that they are caused by differences in package versions between your R session and the one used to compile this guide. To help diagnose this issue, we use devtools::session_info() to list the versions of all R packages used to compile this guide.
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.1 (2024-06-14)
#> os macOS Sonoma 14.7.1
#> system x86_64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Santiago
#> date 2024-12-12
#> pandoc NA (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> auk * 0.7.0 2023-11-14 [1] CRAN (R 4.4.0)
#> cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.1)
#> class 7.3-22 2023-05-03 [2] CRAN (R 4.4.1)
#> classInt 0.4-10 2023-09-05 [1] CRAN (R 4.4.0)
#> cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.0)
#> codetools 0.2-20 2024-03-31 [2] CRAN (R 4.4.1)
#> colorspace 2.1-1 2024-07-26 [1] CRAN (R 4.4.1)
#> data.table 1.16.2 2024-10-10 [1] CRAN (R 4.4.1)
#> DBI 1.2.3 2024-06-02 [1] CRAN (R 4.4.0)
#> devtools 2.4.5 2022-10-11 [2] CRAN (R 4.4.0)
#> digest 0.6.35 2024-03-11 [1] CRAN (R 4.4.0)
#> dotCall64 1.1-1 2023-11-28 [1] CRAN (R 4.4.0)
#> dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
#> e1071 1.7-16 2024-09-16 [1] CRAN (R 4.4.1)
#> ebirdst * 3.2022.4 2024-11-20 [1] Github (ebird/ebirdst@62a7597)
#> ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.4.0)
#> evaluate 0.23 2023-11-01 [1] CRAN (R 4.4.0)
#> exactextractr * 0.10.0 2023-09-20 [1] CRAN (R 4.4.0)
#> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
#> fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.1)
#> fields * 16.2 2024-06-27 [1] CRAN (R 4.4.0)
#> fs 1.6.4 2024-04-25 [2] CRAN (R 4.4.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
#> ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
#> glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)
#> gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.4.0)
#> gtable 0.3.6 2024-10-25 [1] CRAN (R 4.4.1)
#> hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
#> htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.4.0)
#> htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.4.0)
#> httpuv 1.6.15 2024-03-26 [2] CRAN (R 4.4.0)
#> jsonlite 1.8.9 2024-09-20 [1] CRAN (R 4.4.1)
#> KernSmooth 2.23-24 2024-05-17 [2] CRAN (R 4.4.1)
#> knitr 1.46 2024-04-06 [1] CRAN (R 4.4.0)
#> landscapemetrics * 2.1.4 2024-07-22 [1] CRAN (R 4.4.0)
#> later 1.3.2 2023-12-06 [2] CRAN (R 4.4.0)
#> lattice 0.22-6 2024-03-20 [2] CRAN (R 4.4.1)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
#> lobstr 1.1.2 2022-06-22 [2] CRAN (R 4.4.0)
#> lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
#> maps 3.4.2 2023-12-15 [1] CRAN (R 4.4.0)
#> Matrix 1.7-0 2024-04-26 [2] CRAN (R 4.4.1)
#> mccf1 * 1.1 2019-11-11 [2] CRAN (R 4.4.0)
#> memoise 2.0.1 2021-11-26 [2] CRAN (R 4.4.0)
#> mgcv * 1.9-1 2023-12-21 [2] CRAN (R 4.4.1)
#> mime 0.12 2021-09-28 [1] CRAN (R 4.4.0)
#> miniUI 0.1.1.1 2018-05-18 [2] CRAN (R 4.4.0)
#> munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
#> nlme * 3.1-164 2023-11-27 [2] CRAN (R 4.4.1)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
#> pkgbuild 1.4.4 2024-03-17 [1] CRAN (R 4.4.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
#> pkgload 1.3.4 2024-01-16 [2] CRAN (R 4.4.0)
#> precrec * 0.14.4 2023-10-11 [1] CRAN (R 4.4.0)
#> PresenceAbsence * 1.1.11 2023-01-07 [1] CRAN (R 4.4.0)
#> profvis 0.3.8 2023-05-02 [2] CRAN (R 4.4.0)
#> promises 1.3.0 2024-04-05 [2] CRAN (R 4.4.0)
#> proxy 0.4-27 2022-06-09 [1] CRAN (R 4.4.0)
#> pryr 0.1.6 2023-01-17 [2] CRAN (R 4.4.0)
#> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
#> ranger * 0.16.3 2024-08-27 [1] Github (imbs-hl/ranger@974d9b6)
#> raster 3.6-30 2024-10-02 [1] CRAN (R 4.4.1)
#> Rcpp 1.0.13-1 2024-11-02 [1] CRAN (R 4.4.1)
#> readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
#> remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.0)
#> rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
#> rmarkdown 2.26 2024-03-05 [2] CRAN (R 4.4.0)
#> rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
#> scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
#> scam * 1.2-17 2024-06-19 [1] CRAN (R 4.4.0)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.4.0)
#> sf * 1.0-19 2024-11-05 [1] CRAN (R 4.4.1)
#> shiny 1.8.1.1 2024-04-02 [2] CRAN (R 4.4.0)
#> sp 2.1-4 2024-04-30 [1] CRAN (R 4.4.0)
#> spam * 2.10-0 2023-10-23 [1] CRAN (R 4.4.0)
#> stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
#> stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
#> terra * 1.7-78 2024-05-22 [1] CRAN (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
#> tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
#> timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
#> tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
#> units * 0.8-5 2023-11-28 [1] CRAN (R 4.4.0)
#> urlchecker 1.0.1 2021-11-30 [2] CRAN (R 4.4.0)
#> usethis 2.2.3 2024-02-19 [2] CRAN (R 4.4.0)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
#> viridis * 0.6.5 2024-01-29 [2] CRAN (R 4.4.0)
#> viridisLite * 0.4.2 2023-05-02 [1] CRAN (R 4.4.0)
#> withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.1)
#> xfun 0.44 2024-05-15 [1] CRAN (R 4.4.0)
#> xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.0)
#> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.4.0)
#>
#> [1] /Users/mes335/Library/R/x86_64/4.4/library
#> [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
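To compare your own installation against the listing above, something like the following sketch can be used. The versions in guide_versions are copied from the session info for a small subset of packages, and compare_version() is a hypothetical helper written for this example, not part of any package.

```r
# versions copied from the session info above (a small subset of packages)
guide_versions <- c(auk = "0.7.0", dplyr = "1.1.4",
                    sf = "1.0-19", terra = "1.7-78")

# hypothetical helper: report how the installed version of a package
# relates to the version used to compile the guide
compare_version <- function(pkg, guide) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return("not installed")
  }
  installed <- as.character(packageVersion(pkg))
  # compareVersion() returns -1, 0, or 1
  cmp <- compareVersion(installed, guide)
  c("older than guide", "same as guide", "newer than guide")[cmp + 2]
}

status <- mapply(compare_version, names(guide_versions), guide_versions)
status
```

Packages reported as older than the guide are the most likely source of version-related errors and are good candidates for updating first.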