Background

The study and conservation of the natural world relies on detailed information about the distributions, abundances, and population trends of species over time. For many taxa, this information is challenging to obtain at relevant geographic scales. The goal of the eBird Status and Trends project is to use data from eBird, the global community science bird monitoring program administered by The Cornell Lab of Ornithology, to generate a reliable, standardized source of biodiversity information for the world’s bird populations. To translate the eBird observations into robust data products, we use machine learning to fill spatiotemporal gaps, using local land cover descriptions derived from remote sensing data, while controlling for biases inherent in species observations collected by community scientists. See Fink et al. (2019) for more information about the analysis used to generate these data.

This vignette gives an overview of the eBird Status Data Products, which estimate the full annual cycle distributions, relative abundances, and habitat associations for 1,117 species for the year 2022. For each species, distribution and abundance estimates are available for all 52 weeks of the year across a regular 3 km by 3 km square grid of cells covering the globe. Variation in detectability associated with search effort is controlled by standardizing the estimates as the expected occurrence rate and count of the species on a 1 hour, 2 km checklist by an expert eBird observer at the optimal time of day and with optimal weather conditions for detecting the species.

Data access

Data access is granted through an Access Request Form at: https://ebird.org/st/request. Filling out this form generates a key to be used with this R package. Our terms of use have been designed to be quite permissive in many cases, particularly academic and research use. When requesting data access, please be sure to carefully read the terms of use and ensure that your intended use is not restricted.

After completing the Access Request Form, you will be provided an eBird Status and Trends Data Products access key, which you will need when downloading data. To store the key so this package can access it when downloading data, use the function set_ebirdst_access_key("XXXXX"), where "XXXXX" is the access key provided to you.

A wide variety of data products are available for download with ebirdst via the function ebirdst_download_status(). The first argument to this function defines the species (as a common name, scientific name, or species code) to download data for, and the remaining arguments define the specific data products to download. Throughout this vignette, we’ll use a simplified example dataset consisting of estimates for Yellow-bellied Sapsucker in Michigan. This dataset is designed to be small for faster download and is accessible without a key. By default, ebirdst_download_status() only downloads the most commonly used data products; however, since this vignette covers all the available data products, we’ll use download_all = TRUE. Note that data for any species other than the example dataset requires a key to access.

library(dplyr)
library(sf)
library(terra)
library(ebirdst)

# download a simplified example dataset for Yellow-bellied Sapsucker in Michigan
ebirdst_download_status(species = "yebsap-example", download_all = TRUE)

By default, ebirdst_download_status() downloads data to a centralized directory on your computer. You can see what that directory is with the function ebirdst_data_dir(), and you can change the default download directory by setting the environment variable EBIRDST_DATA_DIR, for example by calling usethis::edit_r_environ() and adding a line such as EBIRDST_DATA_DIR=/custom/download/directory/.
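As an alternative to editing .Renviron, the variable can be set for the current R session only with base R's Sys.setenv(). The sketch below uses a temporary directory as a stand-in for a real download location (the path is hypothetical). It assumes that ebirdst_data_dir() reads EBIRDST_DATA_DIR each time it is called, which is worth confirming on your own installation.

```r
# hypothetical download location; a temporary directory stands in for a real path
custom_dir <- file.path(tempdir(), "ebirdst-data")
dir.create(custom_dir, showWarnings = FALSE, recursive = TRUE)

# set the variable for this session only (it is not persisted across restarts)
Sys.setenv(EBIRDST_DATA_DIR = custom_dir)

# confirm the value that R now sees
Sys.getenv("EBIRDST_DATA_DIR")
```

A value set this way is lost when R restarts, so the .Renviron approach is the better choice for a permanent change.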

IMPORTANT: eBird Status and Trends Data Products are designed to be downloaded and accessed using the ebirdst R package. Data downloaded using this R package have a specific file structure and changing file names or locations will disrupt the ability of functions in this package to access the data. If you prefer to access data for use outside of R, consider downloading data via the eBird Status and Trends website.

Species list

The data frame ebirdst_runs lists all species with eBird Status Data Products available for download.

glimpse(ebirdst_runs)
#> Rows: 1,118
#> Columns: 28
#> $ species_code                   <chr> "abetow", "acafly", "acowoo", "affeag1"…
#> $ scientific_name                <chr> "Melozone aberti", "Empidonax virescens…
#> $ common_name                    <chr> "Abert's Towhee", "Acadian Flycatcher",…
#> $ is_resident                    <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, F…
#> $ breeding_quality               <chr> NA, "3", NA, NA, "3", NA, "1", "3", NA,…
#> $ breeding_start                 <date> NA, 2022-05-24, NA, NA, 2022-06-21, NA
#> $ breeding_end                   <date> NA, 2022-08-02, NA, NA, 2022-07-12, NA
#> $ nonbreeding_quality            <chr> NA, "3", NA, NA, "1", NA, "1", "3", NA,…
#> $ nonbreeding_start              <date> NA, 2022-12-06, NA, NA, 2022-11-15, NA
#> $ nonbreeding_end                <date> NA, 2022-02-15, NA, NA, 2022-03-29, NA
#> $ postbreeding_migration_quality <chr> NA, "3", NA, NA, "3", NA, "1", "3", NA,…
#> $ postbreeding_migration_start   <date> NA, 2022-08-09, NA, NA, 2022-07-19, NA
#> $ postbreeding_migration_end     <date> NA, 2022-11-29, NA, NA, 2022-11-08, NA
#> $ prebreeding_migration_quality  <chr> NA, "3", NA, NA, "3", NA, "2", "3", NA,…
#> $ prebreeding_migration_start    <date> NA, 2022-02-22, NA, NA, 2022-04-05, NA
#> $ prebreeding_migration_end      <date> NA, 2022-05-17, NA, NA, 2022-06-14, NA
#> $ resident_quality               <chr> "3", NA, "3", "2", NA, "2", NA, NA, "3"…
#> $ resident_start                 <date> 2022-01-04, NA, 2022-01-04, 2022-01-04…
#> $ resident_end                   <date> 2022-12-27, NA, 2022-12-27, 2022-12-27…
#> $ has_trends                     <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FA…
#> $ trends_season                  <chr> "resident", "breeding", "resident", NA,…
#> $ trends_region                  <chr> "north_america", "north_america", "nort…
#> $ trends_start_year              <dbl> 2012, 2012, 2011, NA, 2012, 2015, NA, 2…
#> $ trends_end_year                <dbl> 2022, 2022, 2021, NA, 2022, 2022, NA, 2…
#> $ trends_start_date              <chr> "01-25", "05-24", "11-01", NA, "06-21",…
#> $ trends_end_date                <chr> "05-10", "08-02", "05-03", NA, "07-12",…
#> $ rsquared                       <dbl> 0.9231821, 0.8570363, 0.8805367, NA, 0.…
#> $ beta0                          <dbl> -0.013923012, 0.689424792, -0.092670714…

If you’re working in RStudio, you can use View() to interactively explore this data frame.

All species go through a process of review by an expert on that species prior to being released. The ebirdst_runs data frame contains information from this review process. For migrants, reviewers assess the model estimates for each of the four seasons: breeding, non-breeding, pre-breeding migration, and post-breeding migration. Resident (i.e., non-migratory) species are identified by having TRUE in the is_resident column of ebirdst_runs, and these species are assessed across the whole year rather than seasonally. ebirdst_runs contains two important pieces of information for each season: a quality rating and seasonal dates.

The seasonal dates define the weeks that fall within each season. Breeding and non-breeding season dates are defined for each species as the weeks during those seasons when the species’ population does not move. For this reason, these seasons are also described as stationary periods. Migration periods are defined as the periods of movement between the stationary non-breeding and breeding seasons. Note that for many species these migratory periods include not only movement from breeding grounds to non-breeding grounds, but also post-breeding dispersal, molt migration, and other movements.

Reviewers also examine the model estimates for each season to assess the amount of extrapolation or omission present in the model, and assign an associated quality rating ranging from 0 (lowest quality) to 3 (highest quality). Extrapolation refers to cases where the model predicts occurrence where the species is known to be absent, while omission refers to the model failing to predict occurrence where a species is known to be present.

A rating of 0 implies this season failed review and model results should not be used at all for this period. Ratings of 1-3 correspond to a gradient of more to less extrapolation and/or omission, and we often use a traffic light analogy when referring to them:

  1. Red light (1): low quality, extensive extrapolation and/or omission and noise, but at least some regions have estimates that are accurate; can be used with caution in certain regions.
  2. Yellow light (2): medium quality, some extrapolation and/or omission; use with caution.
  3. Green light (3): high quality, very little or no extrapolation and/or omission; these seasons can be safely used.
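Note that the quality columns in ebirdst_runs are stored as character strings (shown as <chr> in the glimpse() output above), so they need converting before any numeric comparison. The following is a minimal sketch using a hand-built stand-in for the first four rows of ebirdst_runs, with values copied from that output; the object runs_toy is hypothetical and only includes the columns needed here.

```r
# stand-in for the first four rows of ebirdst_runs (values from the output above)
runs_toy <- data.frame(
  species_code = c("abetow", "acafly", "acowoo", "affeag1"),
  is_resident = c(TRUE, FALSE, TRUE, TRUE),
  breeding_quality = c(NA, "3", NA, NA),
  resident_quality = c("3", NA, "3", "2"),
  stringsAsFactors = FALSE
)

# pick the rating that applies: resident rating for residents, breeding otherwise
rating <- ifelse(runs_toy$is_resident,
                 runs_toy$resident_quality,
                 runs_toy$breeding_quality)

# ratings are character, so convert before comparing numerically
runs_toy$rating <- as.integer(rating)

# keep only "green light" (3) species
green <- runs_toy[!is.na(runs_toy$rating) & runs_toy$rating == 3, ]
green$species_code
```

The same pattern should carry over to the full ebirdst_runs data frame, with the threshold adjusted to suit how much extrapolation or omission your application can tolerate.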

Let’s look at the results of the review for our example dataset.

ebirdst_runs |> 
  filter(species_code == "yebsap-example") |> 
  glimpse()
#> Rows: 1
#> Columns: 28
#> $ species_code                   <chr> "yebsap-example"
#> $ scientific_name                <chr> "Sphyrapicus varius"
#> $ common_name                    <chr> "Yellow-bellied Sapsucker"
#> $ is_resident                    <lgl> FALSE
#> $ breeding_quality               <chr> "3"
#> $ breeding_start                 <date> 2022-05-24
#> $ breeding_end                   <date> 2022-08-16
#> $ nonbreeding_quality            <chr> "3"
#> $ nonbreeding_start              <date> 2022-11-15
#> $ nonbreeding_end                <date> 2022-03-08
#> $ postbreeding_migration_quality <chr> "3"
#> $ postbreeding_migration_start   <date> 2022-08-23
#> $ postbreeding_migration_end     <date> 2022-11-08
#> $ prebreeding_migration_quality  <chr> "3"
#> $ prebreeding_migration_start    <date> 2022-03-15
#> $ prebreeding_migration_end      <date> 2022-05-17
#> $ resident_quality               <chr> NA
#> $ resident_start                 <date> NA
#> $ resident_end                   <date> NA
#> $ has_trends                     <lgl> TRUE
#> $ trends_season                  <chr> "breeding"
#> $ trends_region                  <chr> "north_america"
#> $ trends_start_year              <dbl> 2012
#> $ trends_end_year                <dbl> 2022
#> $ trends_start_date              <chr> "05-24"
#> $ trends_end_date                <chr> "08-16"
#> $ rsquared                       <dbl> 0.8572896
#> $ beta0                          <dbl> 0.2270008

From this, we can see that Yellow-bellied Sapsucker was modeled as a migrant and all four seasons received a quality rating of 3, the highest rating. Note that there are a variety of trends-specific columns at the end of this data frame that we’ll ignore for now; these columns will be covered in the trends vignette.
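One detail to watch for when working with the seasonal dates: the non-breeding season spans the turn of the year, so its start date (2022-11-15) falls later in the calendar than its end date (2022-03-08), and a simple "between" comparison would select no weeks at all. The base R sketch below shows one way to handle this. The helper in_season() is hypothetical and not part of ebirdst, and the weekly dates are reconstructed on the assumption that estimates fall every 7 days starting from 2022-01-04.

```r
# 52 weekly dates, assumed to be spaced 7 days apart starting 2022-01-04
week_dates <- seq(as.Date("2022-01-04"), by = "week", length.out = 52)

# hypothetical helper: TRUE for dates inside a season, allowing for seasons
# that wrap around the end of the year (start date later than end date)
in_season <- function(dates, start, end) {
  start <- as.Date(start)
  end <- as.Date(end)
  if (start <= end) {
    dates >= start & dates <= end
  } else {
    dates >= start | dates <= end
  }
}

# seasonal dates for the example dataset, copied from ebirdst_runs above
n_breeding <- sum(in_season(week_dates, "2022-05-24", "2022-08-16"))
n_nonbreeding <- sum(in_season(week_dates, "2022-11-15", "2022-03-08"))
n_prebreeding <- sum(in_season(week_dates, "2022-03-15", "2022-05-17"))
n_postbreeding <- sum(in_season(week_dates, "2022-08-23", "2022-11-08"))

c(breeding = n_breeding, nonbreeding = n_nonbreeding,
  prebreeding = n_prebreeding, postbreeding = n_postbreeding)
```

Under these assumptions the four seasons contain 13, 17, 10, and 12 weeks respectively, which together account for all 52 weeks of the year.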

Data types

For each species, there are a variety of data products available, which can be categorized into the following broad types:

  • Weekly raster estimates: weekly estimates of occurrence, count, relative abundance, and proportion of population on a regular grid in GeoTIFF format at three resolutions. These are the core products from which the other products are derived.
  • Seasonal raster estimates: seasonal estimates of occurrence, count, relative abundance, and proportion of population on a regular grid in GeoTIFF format at three resolutions. These are derived from the corresponding weekly raster data by summarizing across the weeks falling within each season based on the dates defined in the ebirdst_runs data frame. Only seasons that passed the expert review process are included.
  • Seasonal range boundaries: seasonal range boundary polygons in GeoPackage format.
  • Regional summary statistics: a variety of summary statistics for countries and states/provinces (e.g. proportion of total population in the region) in CSV format.

Each of these data products will be covered in more detail in the following sections, including details on how to load the data into R. All of the loading functions take a species (given as common name, scientific name, or species code) as their first argument. If you have used a non-default path argument to ebirdst_download_status() then you will also need to provide the same path argument to the loading functions.

Weekly raster estimates

The core raster data products are the weekly estimates of occurrence, count, relative abundance, and proportion of population. These are all stored in the widely used GeoTIFF raster format, and we refer to them as “weekly cubes” (e.g. the “weekly abundance cube”). All cubes have 52 weeks and cover the entire globe, even for species with ranges only covering a small region. They come with areas of predicted and assumed zeroes, such that any cells that are NA represent areas where we didn’t produce model estimates.

All estimates are the median expected value for a 2 km, 1 hour eBird Traveling Count by an expert eBird observer at the optimal time of day and for optimal weather conditions to observe the given species.

  • Occurrence: the expected probability of encountering a species.
  • Count: the expected count of a species, conditional on its occurrence at the given location.
  • Relative abundance: the expected relative abundance of a species, computed as the product of the probability of occurrence and the count conditional on occurrence. In addition to the median relative abundance, the lower and upper bounds of a confidence interval (CI) are provided, defined as the 10th and 90th quantiles of relative abundance, respectively.
  • Proportion of population: the proportion of the total relative abundance within each cell. This is a derived product calculated by dividing each cell value in the relative abundance raster by the sum of all cell values.
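To make the relationships between these products concrete, here is a toy calculation in base R for four grid cells in a single week. All values are invented for illustration. Keep in mind that the published values are medians, so multiplying the published occurrence and count rasters together won't necessarily reproduce the published relative abundance raster exactly.

```r
# invented values for four grid cells; NA marks a cell with no model estimate
occurrence <- c(0.10, 0.50, 0.00, NA)  # probability of encountering the species
count <- c(2, 4, 0, NA)                # expected count, conditional on occurrence

# relative abundance = probability of occurrence x count conditional on occurrence
abundance <- occurrence * count

# proportion of population = each cell's share of the total relative abundance
prop_pop <- abundance / sum(abundance, na.rm = TRUE)

data.frame(occurrence, count, abundance, prop_pop)
```

Cells without estimates stay NA throughout, while the proportions across the remaining cells sum to 1.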

All predictions are made on a standard 3 km by 3 km global grid; for convenience, lower resolution GeoTIFFs are also provided, which are typically much faster to work with. Note that, to keep file sizes small, the example dataset only contains the lowest resolution (27 km) data. The three resolutions are:

  • High resolution (3km): the native 3 km resolution data.
  • Medium resolution (9km): the 3 km resolution data aggregated by a factor of 3 in each direction resulting in a resolution of 9 km.
  • Low resolution (27km): the 3 km resolution data aggregated by a factor of 9 in each direction resulting in a resolution of 27 km.
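Aggregating "by a factor of 3 in each direction" means that each 9 km cell summarizes a 3 by 3 block of nine 3 km cells. The base R sketch below illustrates the idea on a small matrix. The helper aggregate_blocks() is hypothetical, and the use of the mean as the summary statistic is an assumption made for illustration, since the text above doesn't state which statistic is used to produce the lower resolution files.

```r
# hypothetical helper: summarize non-overlapping fact x fact blocks of a matrix
aggregate_blocks <- function(m, fact, fun = mean) {
  stopifnot(nrow(m) %% fact == 0, ncol(m) %% fact == 0)
  n_row_out <- nrow(m) / fact
  n_col_out <- ncol(m) / fact
  # zero-based block index for every cell, numbered in column-major order
  row_block <- (row(m) - 1) %/% fact
  col_block <- (col(m) - 1) %/% fact
  block_id <- row_block + col_block * n_row_out
  # apply the summary function within each block, ignoring missing cells
  out <- tapply(as.vector(m), as.vector(block_id), fun, na.rm = TRUE)
  matrix(as.vector(out), nrow = n_row_out, ncol = n_col_out)
}

# a 6 x 6 "high resolution" grid aggregated by a factor of 3 gives a 2 x 2 grid
fine <- matrix(1:36, nrow = 6, ncol = 6)
coarse <- aggregate_blocks(fine, fact = 3)
coarse
```

With real data there is no need to aggregate by hand, since the 9 km and 27 km files are provided ready-made.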

The function load_raster() is used to load these data into R and takes arguments for product and resolution. The metric argument can also be used to access the relative abundance CIs. All raster products are loaded into R as SpatRaster objects for use with the terra R package. For example,

# weekly, 27km res, median relative abundance
abd_lr <- load_raster("yebsap-example", product = "abundance", 
                      resolution = "27km")

# weekly, 27km res, median proportion of population
prop_pop_lr <- load_raster("yebsap-example", product = "proportion-population", 
                           resolution = "27km")

# weekly, 27km res, abundance confidence intervals
abd_lower <- load_raster("yebsap-example", product = "abundance", metric = "lower", 
                         resolution = "27km")
abd_upper <- load_raster("yebsap-example", product = "abundance", metric = "upper", 
                         resolution = "27km")

Each object has 52 layers, one for each week of the year, and layer names store the dates corresponding to the midpoints of each week.

as.Date(names(abd_lr))
#>  [1] "2022-01-04" "2022-01-11" "2022-01-18" "2022-01-25" "2022-02-01"
#>  [6] "2022-02-08" "2022-02-15" "2022-02-22" "2022-03-01" "2022-03-08"
#> [11] "2022-03-15" "2022-03-22" "2022-03-29" "2022-04-05" "2022-04-12"
#> [16] "2022-04-19" "2022-04-26" "2022-05-03" "2022-05-10" "2022-05-17"
#> [21] "2022-05-24" "2022-05-31" "2022-06-07" "2022-06-14" "2022-06-21"
#> [26] "2022-06-28" "2022-07-05" "2022-07-12" "2022-07-19" "2022-07-26"
#> [31] "2022-08-02" "2022-08-09" "2022-08-16" "2022-08-23" "2022-08-30"
#> [36] "2022-09-06" "2022-09-13" "2022-09-20" "2022-09-27" "2022-10-04"
#> [41] "2022-10-11" "2022-10-18" "2022-10-25" "2022-11-01" "2022-11-08"
#> [46] "2022-11-15" "2022-11-22" "2022-11-29" "2022-12-06" "2022-12-13"
#> [51] "2022-12-20" "2022-12-27"

The GeoTIFFs use the same Sinusoidal projection as NASA MODIS data. This projection is ideal for analysis, as it is an equal area projection, but is not ideal for mapping since it introduces significant distortion.

Seasonal raster estimates

The seasonal raster estimates are provided for the same set of products and at the same three resolutions as the weekly estimates. They’re derived from the weekly data by taking the cell-wise mean or max across the weeks within each season. The seasonal boundary dates are defined through a process of expert review of each species, and are available in the data frame ebirdst_runs. Each season is also given a quality score from 0 (fail) to 3 (high quality), and seasons with a score of 0 are not provided.
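The cell-wise summary can be pictured with a small base R example in which rows are grid cells and columns are weeks. The values are invented and the object names are hypothetical; with real data the same operation would be applied across the layers of a weekly cube.

```r
# invented weekly relative abundance: 3 cells (rows) x 6 weeks (columns)
weekly <- matrix(
  c(0.0, 0.2, 0.4, 0.6, 0.2, 0.0,
    0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
    0.0, 0.0, 0.9, 0.0, 0.0, 0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3), paste0("week", 1:6))
)

# suppose the season of interest covers weeks 2 to 5
season_weeks <- 2:5

# cell-wise mean and max across the weeks in the season
seasonal_mean <- rowMeans(weekly[, season_weeks])
seasonal_max <- apply(weekly[, season_weeks], 1, max)

cbind(seasonal_mean, seasonal_max)
```

The third cell shows why both summaries are useful: a single-week peak is preserved by the max but diluted by the mean.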

Calling load_raster() with period = "seasonal" loads these data into R; the function also takes arguments for product, metric, and resolution. The data are loaded into R as SpatRaster objects for use with the terra package. For example,

# seasonal, 27km res, mean relative abundance
abd_seasonal_mean <- load_raster("yebsap-example", product = "abundance", 
                                 period = "seasonal", metric = "mean", 
                                 resolution = "27km")
# season that each layer corresponds to
names(abd_seasonal_mean)
#> [1] "breeding"               "nonbreeding"            "prebreeding_migration" 
#> [4] "postbreeding_migration"
# just the breeding season layer
abd_seasonal_mean[["breeding"]]
#> class       : SpatRaster 
#> dimensions  : 626, 1502, 1  (nrow, ncol, nlyr)
#> resolution  : 26665.26, 26665.28  (x, y)
#> extent      : -20015109, 20036111, -6684911, 10007555  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs 
#> source      : yebsap-example_abundance_seasonal_mean_27km_2022.tif 
#> name        :  breeding 
#> min value   : 0.0000000 
#> max value   : 0.7763873

# seasonal, 27km res, max occurrence
occ_seasonal_max <- load_raster("yebsap-example", product = "occurrence", 
                                period = "seasonal", metric = "max", 
                                resolution = "27km")

Finally, as a convenience, the data products include year-round rasters summarizing the mean or max across all weeks that fall within a season that passed the expert review process. These can be accessed similarly to the seasonal products, but with period = "full-year" instead. For example, these layers can be used in conservation planning to assess the most important sites across the full range and full annual cycle of a species.

# full year, 27km res, maximum relative abundance
abd_fy_max <- load_raster("yebsap-example", product = "abundance", 
                          period = "full-year", metric = "max", 
                          resolution = "27km")

Range boundaries

Seasonal range polygons are defined as the boundaries of non-zero seasonal relative abundance estimates, which are then (optionally) smoothed using the smoothr package to produce more aesthetically pleasing polygons. They are provided in the widely used GeoPackage format and can be loaded into R with load_ranges(), which returns a set of spatial features for use with the sf R package. By default the smoothed ranges are returned, but using smoothed = FALSE will return the raw, unsmoothed range polygons. Note that only low and medium resolution ranges are provided. For example,

# seasonal, 27km res, smoothed ranges
ranges <- load_ranges("yebsap-example", resolution = "27km")
ranges
#> Simple feature collection with 4 features and 8 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -90.41254 ymin: 41.69681 xmax: -82.4146 ymax: 48.19076
#> Geodetic CRS:  WGS 84
#> # A tibble: 4 × 9
#>   species_code scientific_name    common_name       prediction_year type  season
#>   <chr>        <chr>              <chr>                       <int> <chr> <chr> 
#> 1 yebsap       Sphyrapicus varius Yellow-bellied S…            2022 range breed…
#> 2 yebsap       Sphyrapicus varius Yellow-bellied S…            2022 range nonbr…
#> 3 yebsap       Sphyrapicus varius Yellow-bellied S…            2022 range postb…
#> 4 yebsap       Sphyrapicus varius Yellow-bellied S…            2022 range prebr…
#> # ℹ 3 more variables: start_date <date>, end_date <date>,
#> #   geom <MULTIPOLYGON [°]>

# subset to just the breeding season range using dplyr
range_breeding <- filter(ranges, season == "breeding")

Regional summary statistics

Regional summaries of the seasonal raster estimates are also provided for a standard set of regions (countries and states/provinces). These summary statistics can be loaded with load_regional_stats():

regional <- load_regional_stats("yebsap-example")
glimpse(regional)
#> Rows: 8
#> Columns: 10
#> $ species_code           <chr> "yebsap-example", "yebsap-example", "yebsap-exa…
#> $ region_type            <chr> "country", "country", "country", "country", "st…
#> $ region_code            <chr> "USA", "USA", "USA", "USA", "USA-MI", "USA-MI",…
#> $ region_name            <chr> "United States", "United States", "United State…
#> $ season                 <chr> "breeding", "nonbreeding", "postbreeding_migrat…
#> $ abundance_mean         <dbl> 0.0246, 0.0571, 0.0381, 0.0435, 0.2061, 0.0003,…
#> $ total_pop_percent      <dbl> 0.2314, 0.9002, 0.6583, 0.3826, 0.0332, 0.0001,…
#> $ range_percent_occupied <dbl> 0.0776, 0.2686, 0.3952, 0.3235, 0.5621, 0.0290,…
#> $ range_total_percent    <dbl> 0.1861, 0.7017, 0.6036, 0.4728, 0.0232, 0.0013,…
#> $ range_days_occupation  <int> 91, 119, 84, 70, 91, 98, 84, 49

The five summary statistics are defined as:

  • abundance_mean: mean relative abundance in the region.
  • total_pop_percent: proportion of the seasonal modeled population falling within the region.
  • range_percent_occupied: the proportion of the region occupied by the species during the given season.
  • range_total_percent: the proportion of the species’ seasonal range falling within the region.
  • range_days_occupation: number of days of the season that the region was occupied by this species.
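Despite the _percent suffix in some column names, the values in the output above lie between 0 and 1, consistent with their definitions as proportions. The sketch below converts them to percentages using the first two rows of the glimpse() output, re-entered by hand; regional_toy is a hypothetical stand-in for the full data frame returned by load_regional_stats().

```r
# first two rows of the regional statistics shown above, entered by hand
regional_toy <- data.frame(
  region_code = c("USA", "USA"),
  season = c("breeding", "nonbreeding"),
  total_pop_percent = c(0.2314, 0.9002),
  range_days_occupation = c(91L, 119L)
)

# express the proportion of the modeled population as a percentage
regional_toy$total_pop_pct <- 100 * regional_toy$total_pop_percent

# express days of occupation as whole weeks
regional_toy$weeks_occupied <- regional_toy$range_days_occupation / 7

regional_toy
```

The days of occupation here are whole multiples of 7, which fits with the estimates being produced at a weekly time step.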

References

Fink, D., T. Auer, A. Johnston, V. Ruiz‐Gutierrez, W.M. Hochachka, S. Kelling. 2019. Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications, 00(00):e02056. doi: 10.1002/eap.2056