1  Introduction and Setup

1.1 Introduction

Citizen science data are increasingly making important contributions to ecological research and conservation. One of the most common forms of citizen science data is derived from members of the public recording species observations. eBird (Sullivan et al. 2014) is the largest of these biological citizen science programs. The eBird database contains well over one billion bird observations from every country in the world, with observations of nearly every bird species on Earth. The eBird database is valuable to researchers across the globe, due to its year-round, broad spatial coverage, high volumes of open access data, and applications to many ecological questions. These data have been widely used in scientific research to study phenology, species distributions, population trends, evolution, behavior, global change, and conservation. However, robust inference with eBird data requires careful processing of the data to address the challenges associated with citizen science datasets. This book, and the associated paper (Johnston et al. 2021), outlines a set of best practices for addressing these challenges and making reliable estimates of species distributions from eBird data.

There are three key characteristics that distinguish eBird from many other citizen science projects and facilitate robust ecological analyses: the checklist structure enables non-detection to be inferreda, a flexible set of observation protocols is available to users, and the effort information associated with a checklist facilitates robust analyses by accounting for variation in the observation process (La Sorte et al. 2018; Kelling et al. 2018). When a participant submits data to eBird, sightings of multiple species from the same observation period are grouped together into a single checklist. Complete checklists are those for which the participant reported all birds that they were able to detect and identify. Importantly, this enables scientists to infer counts of zero individuals for the species that were not reported. If checklists are not complete, it’s not possible to ascertain whether the absence of a species on a list was a non-detection or the result of a participant not recording the species. In addition, citizen science projects occur on a spectrum from those with predefined sampling structures that resemble more traditional survey designs, to those that are unstructured and collect observations opportunistically. eBird is a semi-structured project, having flexible, easy to follow protocols that attract many participants, but also collecting data on the observation process (e.g. amount of time spent birding, number of observers, etc.), which can be used in subsequent analyses (Kelling et al. 2018).

Despite the strengths of eBird data, species observations collected through citizen science projects present a number of challenges that are not found in conventional scientific data. The following are some of the primary challenges associated these data; challenges that will be addressed throughout this book:

  • Taxonomic bias: participants often have preferences for certain species, which may lead to preferential recording of some species over others (Greenwood 2007; Tulloch and Szabo 2012). Restricting analyses to complete checklists largely mitigates this issue.
  • Spatial bias: most participants in citizen science surveys sample near their homes (Luck et al. 2004), in easily accessible areas such as roadsides (Kadmon, Farber, and Danin 2004), or in areas and habitats of known high biodiversity (Prendergast et al. 1993). A simple method to reduce the spatial bias that we describe is to create an equal area grid over the region of interest, and sample a given number of checklists from within each grid cell.
  • Temporal bias: participants preferentially sample when they are available, such as weekends (Courter et al. 2013), and at times of year when they expect to observe more birds, notably during spring migration (Sullivan et al. 2014). To address the weekend bias, we recommend using a temporal scale of a week or multiple weeks for most analyses.
  • Spatial precision: the spatial location of an eBird checklist is given as a single latitude-longitude point; however, this may not be precise for two main reasons. First, for traveling checklists, this location represents just one point on the journey. Second, eBird checklists are often assigned to a hotspot (a common location for all birders visiting a popular birding site) rather than their true location. For these reasons, it’s not appropriate to align the eBird locations with very precise habitat variables, and we recommend summarizing variables within a neighborhood around the checklist location.
  • Class imbalance: bird species that are rare or hard to detect may have data with high class imbalance, with many more checklists with non-detections than detections. For these species, a distribution model predicting that the species is absent everywhere will have high accuracy, but no ecological value. We’ll follow the methods for addressing class imbalance proposed by Robinson et al. (2018).
  • Variation in detectability: detectability describes the probability of a species that is present in an area being detected and identified. Detectability varies by season, habitat, and species (Johnston et al. 2014, 2018). Furthermore, eBird data are collected with high variation in effort, time of day, number of observers, and external conditions such as weather, all of which can affect the detectability of species (Ellis and Taylor 2018; Oliveira et al. 2018). Therefore, detectability is particularly important to consider when comparing between seasons, habitats or species. Since eBird uses a semi-structured protocol, that collects variables associated with variation in detectability, we’ll be able to account for a larger proportion of this variation in our analyses.

The remainder of this book will demonstrate how to address these challenges using real data from eBird to produce reliable estimates of species distributions. In general, we’ll take a two-pronged approach to dealing with unstructured data and maximizing the value of citizen science data: imposing more structure onto the data via data filtering and including predictor variables in models to account for the remaining variation.

The next chapter demonstrates how to access and prepare eBird data for modeling. The following chapter covers preparing environmental variables to be used as model predictors. The remaining three chapters provide examples of different species distribution models that can be fit using these data: encounter rate models, relative abundance models, and occupancy models. Although these examples focus on the use of eBird data, in many cases the techniques they illustrate also apply to similar citizen science datasets.

1.2 Prerequisites

To understand the code examples used throughout this book, some knowledge of the programming language R is required. If you don’t meet this requirement, or begin to feel lost trying to understand the code used in this book, we suggest consulting one of the excellent free resources available online for learning R. For those with little or no prior programming experience, Hands-On Programming with R is an excellent introduction. For those with some familiarity with the basics of R that want to take their skills to the next level, we suggest R for Data Science as the best resource for learning how to work with data within R.

1.2.1 Tidyverse

Throughout this book, we use packages from the Tidyverse, an opinionated collection of R packages designed for data science. Packages such as ggplot2, for data visualization, and dplyr, for data manipulation, are two of the most well known Tidyverse packages; however, there are many more. In the following chapters, we often use Tidyverse functions without explanation. If you encounter a function you’re unfamiliar with, consult the documentation for help (e.g. ?mutate to see help for the dplyr function mutate()). More generally, the free online book R for Data Science by Hadley Wickham is the best introduction to working with data in R using the Tidyverse.

The one piece of the Tidyverse that we will cover here, because it is ubiquitous throughout this book and unfamiliar to many, is the pipe operator %>%. The pipe operator takes the expression to the left of it and “pipes” it into the first argument of the expression on the right, i.e. one can replace f(x) with x %>% f(). The pipe makes code significantly more readable by avoiding nested function calls, reducing the need for intermediate variables, and making sequential operations read left-to-right. For example, to add a new variable to a data frame, then summarize using a grouping variable, the following are equivalent:


# pipes
mtcars %>% 
  mutate(wt_kg = 454 * wt) %>% 
  group_by(cyl) %>% 
  summarize(wt_kg = mean(wt_kg))
#> # A tibble: 3 × 2
#>     cyl wt_kg
#>   <dbl> <dbl>
#> 1     4 1038.
#> 2     6 1415.
#> 3     8 1816.

# intermediate variables
mtcars_kg <- mutate(mtcars, wt_kg = 454 * wt)
mtcars_grouped <- group_by(mtcars_kg, cyl)
summarize(mtcars_grouped, wt_kg = mean(wt_kg))
#> # A tibble: 3 × 2
#>     cyl wt_kg
#>   <dbl> <dbl>
#> 1     4 1038.
#> 2     6 1415.
#> 3     8 1816.

# nested function calls
    mutate(mtcars, wt_kg = 454 * wt),
  wt_kg = mean(wt_kg)
#> # A tibble: 3 × 2
#>     cyl wt_kg
#>   <dbl> <dbl>
#> 1     4 1038.
#> 2     6 1415.
#> 3     8 1816.

Once you become familiar with the pipe operator, we believe you’ll find the the above example using the pipe the easiest of the three to read and interpret.


Rewrite the following code using pipes:

round(log(runif(10, min = 0.5)), 1)
#>  [1] -0.5 -0.4 -0.2  0.0 -0.5 -0.1  0.0 -0.2 -0.2 -0.6
runif(10, min = 0.5) %>% 
  log() %>% 
  round(digits = 1)
#>  [1] -0.5 -0.4 -0.2  0.0 -0.5 -0.1  0.0 -0.2 -0.2 -0.6

1.3 Setup

1.3.1 Data package

The next two chapters of this book focus on obtaining and preparing eBird data and environmental variables for the modeling that will occur in the remaining chapters. These steps can be time consuming and laborious. If you’d like to skip straight to the analysis, download this package of prepared data. Unzip this file so that the contents are in the data/ subdirectory of your RStudio project folder. This will allow you to jump right in to the modeling and ensure that you’re using exactly the same data as was used when creating this book. This is a good option if you don’t have a fast enough internet connection to download the eBird data.

1.3.2 Software

The examples throughout this website use the programming language R (R Core Team 2023) to work with eBird data. If you don’t have R installed, download it now, if you already have R, there’s a good chance you have an outdated version, so update it to the latest version now. R is updated regularly, and it is important that you have the most recent version of R to avoid headaches when installing packages. We suggest checking every couple months to see if a new version has been released.

We strongly encourage R users to use RStudio. RStudio is not required to follow along with this book; however, it will make your R experience significantly better. If you don’t have RStudio, download it now, if you already have it, update it because new versions with useful additional features are regularly released.

Due to the large size of the eBird dataset, working with it requires the Unix command-line utility AWK. You won’t need to use AWK directly, since the R package auk does this hard work for you, but you do need AWK to be installed on your computer. Linux and Mac users should already have AWK installed on their machines; however, Windows user will need to install Cygwin to gain access to AWK. Cygwin is free software that allows Windows users to use Unix tools. Cygwin should be installed in the default location (C:/cygwin/bin/gawk.exe or C:/cygwin64/bin/gawk.exe) in order for everything to work correctly. Note: there’s no need to do anything at the “Select Utilities” screen, AWK will be installed by default.

1.3.3 R packages

The examples in this book use a variety of R packages for accessing eBird data, working with spatial data, data processing and manipulation, and model training. To install all the packages necessary to work through this book, run the following code:

if (!requireNamespace("remotes", quietly = TRUE)) {

Note that several of the spatial packages require dependencies. If installing these packages fails, consult the instructions for installing dependencies on the sf package website. Finally, ensure all R packages are updated to their most recent versions by clicking on the Update button on the Packages tab in RStudio.

1.3.4 eBird data access

Access to the eBird database is provided via the eBird Basic Dataset (EBD) as tab-separated text files. To access the EBD, begin by creating an eBird account and signing in. Then visit the eBird Data Access page and fill out the data access request form. eBird data access is free for most uses; however, you will need to request access in order to download the EBD. Filling out the access request form allows eBird to keep track of the number of people using the data and obtain information on the applications for which the data are used.

Once you’ve been granted access to the EBD, you will be able to download either the entire eBird dataset or subsets for specific species, regions, or time periods. This is covered in more detail in the next chapter.

1.3.5 GIS data

Throughout this book, we’ll be producing maps of species distributions. To provide context for these distributions, we’ll need GIS data for political boundaries. Natural Earth is the best source for a range of tightly integrated vector and raster GIS data for producing professional cartographic maps. The R package, rnaturalearth provides a convenient method for accessing these data from within R.

The data package mentioned in Section 1.3.1 contains a GeoPackage with all the necessary GIS data. However, for reference, the following code was used to generate the GIS dataset. Running this code will create a GeoPackage containing the necessary spatial layers in data/gis-data.gpkg.


# file to save spatial data
gpkg_file <- "data/gis-data.gpkg"
dir.create(dirname(gpkg_file), showWarnings = FALSE, recursive = TRUE)

# political boundaries
# land border with lakes removed
ne_land <- ne_download(scale = 50, category = "cultural",
                       type = "admin_0_countries_lakes",
                       returnclass = "sf") %>%
  filter(CONTINENT == "North America") %>%
  st_set_precision(1e6) %>%
# state boundaries for united states
ne_states <- ne_download(scale = 50, category = "cultural",
                       type = "admin_1_states_provinces",
                       returnclass = "sf") %>% 
  filter(iso_a2 == "US") %>% 
  select(state = name, state_code = iso_3166_2)
# country lines
# downloaded globally then filtered to north america with st_intersect()
ne_country_lines <- ne_download(scale = 50, category = "cultural",
                                type = "admin_0_boundary_lines_land",
                                returnclass = "sf") %>% 
ne_country_lines <- st_intersects(ne_country_lines, ne_land, sparse = FALSE) %>%
  as.logical() %>%
# states, north america
ne_state_lines <- ne_download(scale = 50, category = "cultural",
                              type = "admin_1_states_provinces_lines",
                              returnclass = "sf") %>%
  filter(ADM0_A3 %in% c("USA", "CAN")) %>%
  mutate(iso_a2 = recode(ADM0_A3, USA = "US", CAN = "CAN")) %>% 
  select(country = ADM0_NAME, country_code = iso_a2)

# save all layers to a geopackage
write_sf(ne_land, gpkg_file, "ne_land")
write_sf(ne_states, gpkg_file, "ne_states")
write_sf(ne_country_lines, gpkg_file, "ne_country_lines")
write_sf(ne_state_lines, gpkg_file, "ne_state_lines")