Background
The study and conservation of the natural world relies on detailed information about the distributions, abundances, and population trends of species over time. For many taxa, this information is challenging to obtain at relevant geographic scales. The goal of the eBird Status and Trends project is to use data from eBird, the global community science bird monitoring program administered by The Cornell Lab of Ornithology, to generate a reliable, standardized source of biodiversity information for the world’s bird populations. To translate the eBird observations into robust data products, we use machine learning to fill spatiotemporal gaps, using local land cover descriptions derived from remote sensing data, while controlling for biases inherent in species observations collected by community scientists. See Fink et al. (2019) for more information about the analysis used to generate these data.
This vignette gives an overview of the eBird Status Data Products,
which estimate the full annual cycle distributions, relative abundances,
and habitat associations for 1,117 species
for the year 2022. For each species, distribution and abundance
estimates are available for all 52 weeks of the year across a regular 3
km by 3 km square grid of cells covering the globe. Variation in
detectability associated with the search effort is controlled by
standardizing the estimates as the expected occurrence rate and count of
the species on a 1 hour, 2 km checklist by an expert eBird observer at
the optimal time of day and with optimal weather conditions for
detecting the species.
Data access
Data access is granted through an Access Request Form at: https://ebird.org/st/request. Filling out this form generates a key to be used with this R package. Our terms of use have been designed to be quite permissive in many cases, particularly academic and research use. When requesting data access, please be sure to carefully read the terms of use and ensure that your intended use is not restricted.
After completing the Access Request Form, you will be provided an
eBird Status and Trends Data Products access key, which you will need
when downloading data. To store the key so that this package can access
it, use the function set_ebirdst_access_key("XXXXX"), where "XXXXX" is
the access key provided to you.
A wide variety of data products are available for download with ebirdst
via the function ebirdst_download_status(). The first argument to this
function defines the species (as a common name, scientific name, or
species code) to download data for, and the remaining arguments define
the specific data products to download. Throughout this vignette, we’ll
use a simplified example dataset consisting of estimates for
Yellow-bellied Sapsucker in Michigan. This dataset is designed to be
small for faster download and is accessible without a key. By default,
ebirdst_download_status() only downloads the most commonly used data
products; however, since this vignette covers all the available data
products, we’ll use download_all = TRUE. Note that data for any species
other than the example dataset require a key to access.
library(dplyr)
library(sf)
library(terra)
library(ebirdst)
# download a simplified example dataset for Yellow-bellied Sapsucker in Michigan
ebirdst_download_status(species = "yebsap-example", download_all = TRUE)
By default, ebirdst_download_status() downloads data to a centralized
directory on your computer. You can see what that directory is with the
function ebirdst_data_dir(), and you can change the default download
directory by setting the environment variable EBIRDST_DATA_DIR, for
example by calling usethis::edit_r_environ() and adding a line such as
EBIRDST_DATA_DIR=/custom/download/directory/.
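As a minimal sketch, the same variable can also be set for the current R session only, using base R's Sys.setenv(). The temporary directory below is purely illustrative, and whether the package picks up a value changed mid-session is an assumption that is worth confirming with ebirdst_data_dir().

```r
# illustrative only: a temporary directory stands in for a real download location
custom_dir <- file.path(tempdir(), "ebirdst-data")
dir.create(custom_dir, showWarnings = FALSE, recursive = TRUE)

# set the environment variable for this R session only
Sys.setenv(EBIRDST_DATA_DIR = custom_dir)

# confirm the value that R now sees
Sys.getenv("EBIRDST_DATA_DIR")
```

Unlike editing .Renviron, a value set this way does not persist after the session ends.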
IMPORTANT: eBird Status and Trends Data Products are designed to be
downloaded and accessed using the ebirdst R package. Data downloaded
using this R package have a specific file structure and changing file
names or locations will disrupt the ability of functions in this package
to access the data. If you prefer to access data for use outside of R,
consider downloading data via the eBird Status and Trends website.
Species list
The data frame ebirdst_runs lists all species with eBird Status Data
Products available for download.
glimpse(ebirdst_runs)
#> Rows: 1,118
#> Columns: 28
#> $ species_code <chr> "abetow", "acafly", "acowoo", "affeag1"…
#> $ scientific_name <chr> "Melozone aberti", "Empidonax virescens…
#> $ common_name <chr> "Abert's Towhee", "Acadian Flycatcher",…
#> $ is_resident <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, F…
#> $ breeding_quality <chr> NA, "3", NA, NA, "3", NA, "1", "3", NA,…
#> $ breeding_start <date> NA, 2022-05-24, NA, NA, 2022-06-21, NA…
#> $ breeding_end <date> NA, 2022-08-02, NA, NA, 2022-07-12, NA…
#> $ nonbreeding_quality <chr> NA, "3", NA, NA, "1", NA, "1", "3", NA,…
#> $ nonbreeding_start <date> NA, 2022-12-06, NA, NA, 2022-11-15, NA…
#> $ nonbreeding_end <date> NA, 2022-02-15, NA, NA, 2022-03-29, NA…
#> $ postbreeding_migration_quality <chr> NA, "3", NA, NA, "3", NA, "1", "3", NA,…
#> $ postbreeding_migration_start <date> NA, 2022-08-09, NA, NA, 2022-07-19, NA…
#> $ postbreeding_migration_end <date> NA, 2022-11-29, NA, NA, 2022-11-08, NA…
#> $ prebreeding_migration_quality <chr> NA, "3", NA, NA, "3", NA, "2", "3", NA,…
#> $ prebreeding_migration_start <date> NA, 2022-02-22, NA, NA, 2022-04-05, NA…
#> $ prebreeding_migration_end <date> NA, 2022-05-17, NA, NA, 2022-06-14, NA…
#> $ resident_quality <chr> "3", NA, "3", "2", NA, "2", NA, NA, "3"…
#> $ resident_start <date> 2022-01-04, NA, 2022-01-04, 2022-01-04…
#> $ resident_end <date> 2022-12-27, NA, 2022-12-27, 2022-12-27…
#> $ has_trends <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FA…
#> $ trends_season <chr> "resident", "breeding", "resident", NA,…
#> $ trends_region <chr> "north_america", "north_america", "nort…
#> $ trends_start_year <dbl> 2012, 2012, 2011, NA, 2012, 2015, NA, 2…
#> $ trends_end_year <dbl> 2022, 2022, 2021, NA, 2022, 2022, NA, 2…
#> $ trends_start_date <chr> "01-25", "05-24", "11-01", NA, "06-21",…
#> $ trends_end_date <chr> "05-10", "08-02", "05-03", NA, "07-12",…
#> $ rsquared <dbl> 0.9231821, 0.8570363, 0.8805367, NA, 0.…
#> $ beta0 <dbl> -0.013923012, 0.689424792, -0.092670714…
If you’re working in RStudio, you can use View() to interactively
explore this data frame.
All species go through a process of review by an expert on that species
prior to being released. The ebirdst_runs data frame contains
information from this review process. For migrants, reviewers assess the
model estimates for each of the four seasons: breeding, non-breeding,
pre-breeding migration, and post-breeding migration. Resident (i.e.,
non-migratory) species are identified by having TRUE in the is_resident
column of ebirdst_runs, and these species are assessed across the whole
year rather than seasonally. ebirdst_runs contains two important pieces
of information for each season: a quality rating and seasonal dates.
The seasonal dates define the weeks that fall within each season. Breeding and non-breeding season dates are defined for each species as the weeks during those seasons when the species’ population does not move. For this reason, these seasons are also described as stationary periods. Migration periods are defined as the periods of movement between the stationary non-breeding and breeding seasons. Note that for many species these migratory periods include not only movement from breeding grounds to non-breeding grounds, but also post-breeding dispersal, molt migration, and other movements.
Reviewers also examine the model estimates for each season to assess the amount of extrapolation or omission present in the model, and assign an associated quality rating ranging from 0 (lowest quality) to 3 (highest quality). Extrapolation refers to cases where the model predicts occurrence where the species is known to be absent, while omission refers to the model failing to predict occurrence where a species is known to be present.
A rating of 0 implies this season failed review and model results should not be used at all for this period. Ratings of 1-3 correspond to a gradient of more to less extrapolation and/or omission, and we often use a traffic light analogy when referring to them:
- Red light (1): low quality, extensive extrapolation and/or omission and noise, but at least some regions have estimates that are accurate; can be used with caution in certain regions.
- Yellow light (2): medium quality, some extrapolation and/or omission; use with caution.
- Green light (3): high quality, very little or no extrapolation and/or omission; these seasons can be safely used.
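The ratings above can be used to screen species before analysis. The following is a minimal sketch using a made-up stand-in for a few columns of ebirdst_runs; the species codes, the values, and the choice of a cutoff of 2 are all assumptions for illustration. The one detail taken from the real data frame is that quality ratings are stored as character strings.

```r
# toy stand-in for a few columns of ebirdst_runs (all values are made up)
runs <- data.frame(
  species_code = c("sp_a", "sp_b", "sp_c"),
  breeding_quality = c("3", "1", NA),
  nonbreeding_quality = c("2", "0", NA),
  stringsAsFactors = FALSE
)

# ratings are stored as character, so convert before comparing
breeding_rating <- as.integer(runs$breeding_quality)

# keep species whose breeding season is rated yellow (2) or green (3);
# NA ratings (e.g. residents, which have no breeding season) are dropped
usable <- runs[!is.na(breeding_rating) & breeding_rating >= 2, ]
usable$species_code
```

The appropriate cutoff depends on the application; red-light seasons may still be usable in certain regions.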
Let’s look at the results of the review for our example dataset.
ebirdst_runs |>
  filter(species_code == "yebsap-example") |>
  glimpse()
#> Rows: 1
#> Columns: 28
#> $ species_code <chr> "yebsap-example"
#> $ scientific_name <chr> "Sphyrapicus varius"
#> $ common_name <chr> "Yellow-bellied Sapsucker"
#> $ is_resident <lgl> FALSE
#> $ breeding_quality <chr> "3"
#> $ breeding_start <date> 2022-05-24
#> $ breeding_end <date> 2022-08-16
#> $ nonbreeding_quality <chr> "3"
#> $ nonbreeding_start <date> 2022-11-15
#> $ nonbreeding_end <date> 2022-03-08
#> $ postbreeding_migration_quality <chr> "3"
#> $ postbreeding_migration_start <date> 2022-08-23
#> $ postbreeding_migration_end <date> 2022-11-08
#> $ prebreeding_migration_quality <chr> "3"
#> $ prebreeding_migration_start <date> 2022-03-15
#> $ prebreeding_migration_end <date> 2022-05-17
#> $ resident_quality <chr> NA
#> $ resident_start <date> NA
#> $ resident_end <date> NA
#> $ has_trends <lgl> TRUE
#> $ trends_season <chr> "breeding"
#> $ trends_region <chr> "north_america"
#> $ trends_start_year <dbl> 2012
#> $ trends_end_year <dbl> 2022
#> $ trends_start_date <chr> "05-24"
#> $ trends_end_date <chr> "08-16"
#> $ rsquared <dbl> 0.8572896
#> $ beta0 <dbl> 0.2270008
From this, we can see that Yellow-bellied Sapsucker was modeled as a migrant and all four seasons received a quality rating of 3, the highest rating. Note that there are a variety of trends-specific columns at the end of this data frame that we’ll ignore for now; these columns are covered in the trends vignette.
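Note that all seasonal dates carry the year 2022, so a season that spans the new year, such as the non-breeding season here, has a start date later than its end date. The sketch below shows one way to work out which of the 52 weekly estimates fall within a season while handling that wrap-around. The helper in_season() is hypothetical (not part of ebirdst), and treating both boundary dates as inclusive is an assumption.

```r
# midpoints of the 52 weeks of 2022, matching the weekly layer dates
week_dates <- seq(as.Date("2022-01-04"), by = "week", length.out = 52)

# hypothetical helper: TRUE for weeks inside a season, allowing the season
# to wrap past the new year (start later in the calendar than end)
in_season <- function(dates, start, end) {
  if (start <= end) {
    dates >= start & dates <= end
  } else {
    dates >= start | dates <= end
  }
}

breeding <- in_season(week_dates, as.Date("2022-05-24"), as.Date("2022-08-16"))
nonbreeding <- in_season(week_dates, as.Date("2022-11-15"), as.Date("2022-03-08"))

# number of weeks in each season
sum(breeding)
sum(nonbreeding)
```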
Data types
For each species, there are a variety of data products available, which can be categorized into the following broad types:
- Weekly raster estimates: weekly estimates of occurrence, count, relative abundance, and proportion of population on a regular grid in GeoTIFF format at three resolutions. These are the core products from which the other products are derived.
- Seasonal raster estimates: seasonal estimates of occurrence, count, relative abundance, and proportion of population on a regular grid in GeoTIFF format at three resolutions. These are derived from the corresponding weekly raster data by summarizing across the weeks falling within each season based on the dates defined in the ebirdst_runs data frame. Only seasons that passed the expert review process are included.
- Seasonal range boundaries: seasonal range boundary polygons in GeoPackage format.
- Regional summary statistics: a variety of summary statistics for countries and states/provinces (e.g. proportion of total population in the region) in CSV format.
Each of these data products will be covered in more detail in the
following sections, including details on how to load the data into R.
All of the loading functions take a species (given as common name,
scientific name, or species code) as their first argument. If you have
used a non-default path argument to ebirdst_download_status(), then you
will also need to provide the same path argument to the loading
functions.
Weekly raster estimates
The core raster data products are the weekly estimates of occurrence,
count, relative abundance, and proportion of population. These are all
stored in the widely used GeoTIFF raster format, and we refer to them as
“weekly cubes” (e.g. the “weekly abundance cube”). All cubes have 52
weeks and cover the entire globe, even for species with ranges only
covering a small region. They come with areas of predicted and assumed
zeroes, such that any cells that are NA represent areas where we didn’t
produce model estimates.
All estimates are the median expected value for a 2 km, 1 hour eBird Traveling Count by an expert eBird observer at the optimal time of day and for optimal weather conditions to observe the given species.
- Occurrence: the expected probability of encountering a species.
- Count: the expected count of a species, conditional on its occurrence at the given location.
- Relative abundance: the expected relative abundance of a species, computed as the product of the probability of occurrence and the count conditional on occurrence. In addition to the median relative abundance, lower and upper confidence limits are provided, defined as the 10th and 90th quantiles of relative abundance, respectively; together these bound the confidence interval (CI).
- Proportion of population: the proportion of the total relative abundance within each cell. This is a derived product calculated by dividing each cell value in the relative abundance raster by the sum of all cell values.
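The relationships among these four products can be sketched with a few toy grid cells. The values below are made up for illustration, and plain vectors stand in for raster cells.

```r
# toy values for five grid cells (made up for illustration)
occurrence <- c(0.00, 0.10, 0.50, 0.80, 0.25)  # probability of occurrence
count <- c(0.0, 1.5, 2.0, 3.0, 1.0)            # count given occurrence

# relative abundance = probability of occurrence x count given occurrence
abundance <- occurrence * count

# proportion of population = each cell's share of total relative abundance
prop_pop <- abundance / sum(abundance, na.rm = TRUE)

abundance
prop_pop

# the proportions sum to 1 across all cells
sum(prop_pop)
```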
All predictions are made on a standard 3 km by 3 km global grid; however, for convenience, lower resolution GeoTIFFs are also provided, which are typically much faster to work with. Note that to keep file sizes small, the example dataset only contains the lowest resolution (27 km) data. The three resolutions are:
- High resolution (3km): the native 3 km resolution data.
- Medium resolution (9km): the 3 km resolution data aggregated by a factor of 3 in each direction resulting in a resolution of 9 km.
- Low resolution (27km): the 3 km resolution data aggregated by a factor of 9 in each direction resulting in a resolution of 27 km.
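To illustrate what aggregating by a factor of 3 or 9 means, the sketch below averages non-overlapping blocks of a small matrix in base R. The helper aggregate_matrix() is hypothetical, and using the mean as the summary function is an assumption; the text above does not state which function is used to produce the lower resolution products.

```r
# hypothetical helper: aggregate a matrix by averaging fact x fact blocks
aggregate_matrix <- function(m, fact) {
  stopifnot(nrow(m) %% fact == 0, ncol(m) %% fact == 0)
  # block index of each cell in the row and column directions
  row_block <- (row(m) - 1) %/% fact
  col_block <- (col(m) - 1) %/% fact
  # mean of each block (the choice of mean is an assumption)
  out <- tapply(m, list(row_block, col_block), mean, na.rm = TRUE)
  unname(out)
}

# a toy 9 x 9 grid of "3 km" cells
fine <- matrix(1:81, nrow = 9, ncol = 9)

# factor of 3: each output cell summarizes 9 input cells ("9 km")
medium <- aggregate_matrix(fine, fact = 3)
dim(medium)

# factor of 9: each output cell summarizes 81 input cells ("27 km")
coarse <- aggregate_matrix(fine, fact = 9)
dim(coarse)
```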
The function load_raster() is used to load these data into R and takes
arguments for product and resolution. The metric argument can also be
used to access the relative abundance CIs. All raster products are
loaded into R as SpatRaster objects for use with the terra R package.
For example,
# weekly, 27km res, median relative abundance
abd_lr <- load_raster("yebsap-example", product = "abundance",
                      resolution = "27km")
# weekly, 27km res, median proportion of population
prop_pop_lr <- load_raster("yebsap-example", product = "proportion-population",
                           resolution = "27km")
# weekly, 27km res, abundance confidence intervals
abd_lower <- load_raster("yebsap-example", product = "abundance", metric = "lower",
                         resolution = "27km")
abd_upper <- load_raster("yebsap-example", product = "abundance", metric = "upper",
                         resolution = "27km")
Each object has 52 layers, one for each week of the year, and layer names store the dates corresponding to the midpoints of each week.
as.Date(names(abd_lr))
#> [1] "2022-01-04" "2022-01-11" "2022-01-18" "2022-01-25" "2022-02-01"
#> [6] "2022-02-08" "2022-02-15" "2022-02-22" "2022-03-01" "2022-03-08"
#> [11] "2022-03-15" "2022-03-22" "2022-03-29" "2022-04-05" "2022-04-12"
#> [16] "2022-04-19" "2022-04-26" "2022-05-03" "2022-05-10" "2022-05-17"
#> [21] "2022-05-24" "2022-05-31" "2022-06-07" "2022-06-14" "2022-06-21"
#> [26] "2022-06-28" "2022-07-05" "2022-07-12" "2022-07-19" "2022-07-26"
#> [31] "2022-08-02" "2022-08-09" "2022-08-16" "2022-08-23" "2022-08-30"
#> [36] "2022-09-06" "2022-09-13" "2022-09-20" "2022-09-27" "2022-10-04"
#> [41] "2022-10-11" "2022-10-18" "2022-10-25" "2022-11-01" "2022-11-08"
#> [46] "2022-11-15" "2022-11-22" "2022-11-29" "2022-12-06" "2022-12-13"
#> [51] "2022-12-20" "2022-12-27"
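Because the layer names are dates, a particular week can be looked up by date. The sketch below uses a character vector as a stand-in for names(abd_lr) so that it runs without the downloaded data; the target date is arbitrary.

```r
# stand-in for names(abd_lr): week midpoints stored as character strings
layer_names <- as.character(
  seq(as.Date("2022-01-04"), by = "week", length.out = 52)
)

# index of the week whose midpoint is closest to a target date
target <- as.Date("2022-06-10")
days_away <- abs(as.numeric(as.Date(layer_names) - target))
week_index <- which.min(days_away)

week_index
layer_names[week_index]
```

With the real data, the resulting index could then be used to subset the raster, e.g. abd_lr[[week_index]].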
The GeoTIFFs use the same Sinusoidal projection as NASA MODIS data. This projection is ideal for analysis, as it is an equal-area projection, but is not ideal for mapping since it introduces significant distortion.
Seasonal raster estimates
The seasonal raster estimates are provided for the same set of
products and at the same three resolutions as the weekly estimates.
They’re derived from the weekly data by taking the cell-wise mean or max
across the weeks within each season. The seasonal boundary dates are
defined through a process of expert review of each species, and are
available in the data frame ebirdst_runs. Each season is also given a
quality score from 0 (fail) to 3 (high quality), and seasons with a
score of 0 are not provided.
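The cell-wise summary described above can be sketched with a small matrix standing in for a weekly cube, with one row per grid cell and one column per week. The values and the set of in-season weeks are made up, and dropping NA values with na.rm = TRUE is an assumption about how missing cells are handled.

```r
# toy weekly relative abundance: 3 grid cells (rows) x 6 weeks (columns)
weekly <- matrix(
  c(0.0, 0.2, 0.4, 0.6, 0.2, 0.0,
    0.1, 0.1, 0.3, 0.5, 0.3, 0.1,
    0.0, 0.0, 0.0, 0.2, 0.0, 0.0),
  nrow = 3, byrow = TRUE
)

# weeks assumed to fall within the season of interest
season_weeks <- 3:5

# cell-wise mean and max across the in-season weeks
seasonal_mean <- rowMeans(weekly[, season_weeks], na.rm = TRUE)
seasonal_max <- apply(weekly[, season_weeks], 1, max, na.rm = TRUE)

seasonal_mean
seasonal_max
```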
The function load_raster(period = "seasonal") is used to load these
data into R and takes arguments for product, metric and resolution. The
data are loaded into R as SpatRaster objects for use with the terra
package. For example,
# seasonal, 27km res, mean relative abundance
abd_seasonal_mean <- load_raster("yebsap-example", product = "abundance",
                                 period = "seasonal", metric = "mean",
                                 resolution = "27km")
# season that each layer corresponds to
names(abd_seasonal_mean)
#> [1] "breeding" "nonbreeding" "prebreeding_migration"
#> [4] "postbreeding_migration"
# just the breeding season layer
abd_seasonal_mean[["breeding"]]
#> class : SpatRaster
#> dimensions : 626, 1502, 1 (nrow, ncol, nlyr)
#> resolution : 26665.26, 26665.28 (x, y)
#> extent : -20015109, 20036111, -6684911, 10007555 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs
#> source : yebsap-example_abundance_seasonal_mean_27km_2022.tif
#> name : breeding
#> min value : 0.0000000
#> max value : 0.7763873
# seasonal, 27km res, max occurrence
occ_seasonal_max <- load_raster("yebsap-example", product = "occurrence",
                                period = "seasonal", metric = "max",
                                resolution = "27km")
Finally, as a convenience, the data products include year-round
rasters summarizing the mean or max across all weeks that fall within a
season that passed the expert review process. These can be accessed
similarly to the seasonal products, but with period = "full-year"
instead. For example, these layers can be used in conservation planning
to assess the most important sites across the full range and full annual
cycle of a species.
# full year, 27km res, maximum relative abundance
abd_fy_max <- load_raster("yebsap-example", product = "abundance",
                          period = "full-year", metric = "max",
                          resolution = "27km")
Range boundaries
Seasonal range polygons are defined as the boundaries of non-zero
seasonal relative abundance estimates, which are then (optionally)
smoothed to produce more aesthetically pleasing polygons using the
smoothr package. They are provided in the widely used GeoPackage format
and can be loaded into R with load_ranges(), which returns a set of
spatial features for use with the sf R package. By default, the smoothed
ranges are returned, but using smoothed = FALSE will return the raw,
unsmoothed range polygons. Note that only low and medium resolution
ranges are provided. For example,
# seasonal, 27km res, smoothed ranges
ranges <- load_ranges("yebsap-example", resolution = "27km")
ranges
#> Simple feature collection with 4 features and 8 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -90.41254 ymin: 41.69681 xmax: -82.4146 ymax: 48.19076
#> Geodetic CRS: WGS 84
#> # A tibble: 4 × 9
#> species_code scientific_name common_name prediction_year type season
#> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 yebsap Sphyrapicus varius Yellow-bellied S… 2022 range breed…
#> 2 yebsap Sphyrapicus varius Yellow-bellied S… 2022 range nonbr…
#> 3 yebsap Sphyrapicus varius Yellow-bellied S… 2022 range postb…
#> 4 yebsap Sphyrapicus varius Yellow-bellied S… 2022 range prebr…
#> # ℹ 3 more variables: start_date <date>, end_date <date>,
#> # geom <MULTIPOLYGON [°]>
# subset to just the breeding season range using dplyr
range_breeding <- filter(ranges, season == "breeding")
Regional summary statistics
Regional summaries of the seasonal raster estimates are also provided
for a standard set of regions (countries and states/provinces). These
summary statistics can be loaded with load_regional_stats():
regional <- load_regional_stats("yebsap-example")
glimpse(regional)
#> Rows: 8
#> Columns: 10
#> $ species_code <chr> "yebsap-example", "yebsap-example", "yebsap-exa…
#> $ region_type <chr> "country", "country", "country", "country", "st…
#> $ region_code <chr> "USA", "USA", "USA", "USA", "USA-MI", "USA-MI",…
#> $ region_name <chr> "United States", "United States", "United State…
#> $ season <chr> "breeding", "nonbreeding", "postbreeding_migrat…
#> $ abundance_mean <dbl> 0.0246, 0.0571, 0.0381, 0.0435, 0.2061, 0.0003,…
#> $ total_pop_percent <dbl> 0.2314, 0.9002, 0.6583, 0.3826, 0.0332, 0.0001,…
#> $ range_percent_occupied <dbl> 0.0776, 0.2686, 0.3952, 0.3235, 0.5621, 0.0290,…
#> $ range_total_percent <dbl> 0.1861, 0.7017, 0.6036, 0.4728, 0.0232, 0.0013,…
#> $ range_days_occupation <int> 91, 119, 84, 70, 91, 98, 84, 49
The five summary statistics are defined as:
- abundance_mean: mean relative abundance in the region.
- total_pop_percent: proportion of the seasonal modeled population falling within the region.
- range_percent_occupied: the proportion of the region occupied by the species during the given season.
- range_total_percent: the proportion of the species’ seasonal range falling within the region.
- range_days_occupation: number of days of the season that the region was occupied by this species.