1  eBird Data

eBird data are collected and organized around the concept of a checklist, representing observations from a single birding event, such as a 1 km walk through a park or 15 minutes observing bird feeders in your backyard. All eBird checklists contains a list of species observed and the location and time of the observations. For a subset of checklists, the observer will also provide counts of the number of individuals seen of each species, specify the amount of effort expended while collecting these data, and confirm that they are submitting a complete checklist of all the birds they were able to identify. The data provided by the checklists such as these is often referred to as semi-structured citizen science data.

Let’s compare two eBird checklists: an incidental observation with missing counts and a complete traveling count. Both checklists can be useful, but only the second checklist provides the type of semi-structured data required for more rigorous applications.

The first part of this lesson will work with the eBird Basic Dataset (EBD). This will allow users to prepare eBird data for relative abundance modeling up to the point of assigning environmental variables as predictors and will be useful for users after the workshop. For the sake of brevity, the second part of this lesson will provide data from the eBird Reference Dataset (ERD), the collection of semi-structured eBird data used by the eBird Status and Trends team for modeling species distributions and relative abundance. A subset of this dataset will be provided in the data package for this workshop.

1.1 raw eBird Data (EBD)

eBird data are released as two tab-separated text files: the eBird Basic Dataset (EBD) containing observation data and the Sampling Event Data (SED) containing checklist data. These files are released monthly and contain all validated bird sightings in the eBird database at the time of release. In the EBD, each row corresponds to the sighting of a single species on a checklist, including the count and any other species-level information (e.g. age, sex, species comments, etc.). In the SED, each row corresponds to a checklist, including the date, time, location, effort (e.g. distance traveled, time spent, etc.), and any additional checklist-level information (e.g. whether this is a complete checklist or not).

1.1.1 Downloading data

In this workshop, we’ll use Shining Bronze-Cuckoo observations from Queensland, Australia as an example. We’ll start by downloading the corresponding eBird observation (EBD) and checklist (SED) data by visiting the eBird Basic Dataset download page and filling out the Custom Download form to request Shining Bronze-Cuckoo observations from Queensland. Make sure you check the box “Include sampling event data”, which will include the SED in the data download in addition to the EBD.

Tip

The eBird database contains a massive amount of data! When requesting eBird data to download it’s important to narrow the request to as small a subset of the data as possible. For example, if we request all Shining Bronze-Cuckoo observations globally, the dataset may be too large to work with in R. Instead, we’ve only requested data for a single state in Australia.

Once the data are ready, you will receive an email with a download link. The downloaded data will be in a compressed .zip format, and should be unarchived. The resulting directory will contain a two text files: one for the EBD (e.g. ebd_AU-QLD_shbcuc1_smp_relSep-2023.txt) containing all the Shining Bronze-Cuckoo observations from Queensland, and one for the SED (e.g. ebd_AU-QLD_shbcuc1_smp_relSep-2023_sampking.txt) containing all checklists from Queensland, The relSep-2023 component of the file name describes which version of the EBD this dataset came from; in this case it’s the September 2023 release.

If you would prefer to directly download the exact dataset used in this workshop, download the data package for this workshop.

1.1.2 Importing eBird data into R

The previous step left us with two tab separated text files, one for the EBD (i.e. observation data) and one for the SED (i.e. checklist data). Start a new RStudio project and put the downloaded text files in the data/ sub-directory of the project directory.

The auk R package is specifically designed for working with eBird data. It includes the functions read_ebd() and read_sampling() for importing the EBD and SED, respectively, into R. First let’s import the checklist data (SED).

library(auk)
library(dplyr)
library(ggplot2)
library(lubridate)
library(sf)

f_sed <- "data/ebd_AU-QLD_shbcuc1_smp_relSep-2023_sampling.txt"
checklists <- read_sampling(f_sed, unique = FALSE)
glimpse(checklists)
#> Rows: 755,445
#> Columns: 30
#> $ last_edited_date          <chr> "2021-04-19 02:40:19.157994", "2021-10-22 20…
#> $ country                   <chr> "Australia", "Australia", "Australia", "Aust…
#> $ country_code              <chr> "AU", "AU", "AU", "AU", "AU", "AU", "AU", "A…
#> $ state                     <chr> "Queensland", "Queensland", "Queensland", "Q…
#> $ state_code                <chr> "AU-QLD", "AU-QLD", "AU-QLD", "AU-QLD", "AU-…
#> $ county                    <chr> "Brisbane", "Brisbane", "Brisbane", "Brisban…
#> $ county_code               <chr> "AU-QLD-BRI", "AU-QLD-BRI", "AU-QLD-BRI", "A…
#> $ iba_code                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ bcr_code                  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ usfws_code                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ atlas_block               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ locality                  <chr> "Dowse Lagoon (Sandgate)", "Dowse Lagoon (Sa…
#> $ locality_id               <chr> "L1896576", "L1896576", "L1896576", "L189657…
#> $ locality_type             <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H",…
#> $ latitude                  <dbl> -27.3, -27.3, -27.3, -27.3, -27.3, -27.3, -2…
#> $ longitude                 <dbl> 153, 153, 153, 153, 153, 153, 153, 153, 153,…
#> $ observation_date          <date> 2021-04-19, 2021-10-23, 2021-04-25, 2021-11…
#> $ time_observations_started <chr> "15:40:00", "06:04:00", "12:25:00", "13:53:0…
#> $ observer_id               <chr> "obs427309", "obs427309", "obs535737", "obs9…
#> $ sampling_event_identifier <chr> "S85875959", "S96568064", "S86299492", "S978…
#> $ protocol_type             <chr> "Traveling", "Traveling", "Stationary", "Tra…
#> $ protocol_code             <chr> "P22", "P22", "P21", "P22", "P22", "P21", "P…
#> $ project_code              <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_A…
#> $ duration_minutes          <int> 59, 138, 33, 46, 75, 16, 67, 104, 28, 45, NA…
#> $ effort_distance_km        <dbl> 1.954, 3.437, NA, 1.610, 1.030, NA, 1.310, 2…
#> $ effort_area_ha            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ number_observers          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ all_species_reported      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ group_identifier          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ trip_comments             <chr> NA, NA, NA, "Clear skies light breeze. Targe…
Checkpoint

Take some time to explore the variables in the checklist dataset. If you’re unsure about any of the variables, consult the metadata document that came with the data download (eBird_Basic_Dataset_Metadata_v1.14.pdf).

For some applications, only the checklist data are required. For example, the checklist data can be used to investigate the spatial and temporal distribution of eBird data within a region. This dataset can also be used to explore how much variation there is in the observation effort variables and identify checklists that have low spatial or temporal precision.

Exercise

Make a histogram of the distribution of distance traveling for traveling protocol checklists.

Nearly 90% of checklists are less than 10km in length; however, some checklists are as long as 80km in length. Long traveling checklists have lower spatial precision so they should generally be removed prior to analysis.

checklists_traveling <- filter(checklists, protocol_type == "Traveling")
ggplot(checklists_traveling) +
  aes(x = effort_distance_km) +
  geom_histogram(binwidth = 5) +
  scale_y_continuous(limits = c(0, NA), labels = scales::comma) +
  labs(x = "Distance traveled [km]",
       y = "# of eBird checklists",
       title = "Distribution of distance traveled on eBird checklists")

Next, let’s load the observation data.

f_ebd <- "data/ebd_AU-QLD_shbcuc1_smp_relSep-2023.txt"
observations <- read_ebd(f_ebd, unique = FALSE, rollup = FALSE)
glimpse(observations)
#> Rows: 23,250
#> Columns: 49
#> $ global_unique_identifier   <chr> "URN:CornellLabOfOrnithology:EBIRD:OBS90068…
#> $ last_edited_date           <chr> "2020-04-23 01:19:06", "2021-03-24 06:05:52…
#> $ taxonomic_order            <dbl> 3281, 3281, 3281, 3284, 3281, 3281, 3281, 3…
#> $ category                   <chr> "species", "species", "species", "issf", "s…
#> $ taxon_concept_id           <chr> "avibase-19268395", "avibase-19268395", "av…
#> $ common_name                <chr> "Shining Bronze-Cuckoo", "Shining Bronze-Cu…
#> $ scientific_name            <chr> "Chrysococcyx lucidus", "Chrysococcyx lucid…
#> $ subspecies_common_name     <chr> NA, NA, NA, "Shining Bronze-Cuckoo (Shining…
#> $ subspecies_scientific_name <chr> NA, NA, NA, "Chrysococcyx lucidus lucidus",…
#> $ exotic_code                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ observation_count          <chr> "1", "1", "1", "1", "1", "3", "2", "1", "1"…
#> $ breeding_code              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ breeding_category          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ behavior_code              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ age_sex                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ country                    <chr> "Australia", "Australia", "Australia", "Aus…
#> $ country_code               <chr> "AU", "AU", "AU", "AU", "AU", "AU", "AU", "…
#> $ state                      <chr> "Queensland", "Queensland", "Queensland", "…
#> $ state_code                 <chr> "AU-QLD", "AU-QLD", "AU-QLD", "AU-QLD", "AU…
#> $ county                     <chr> "Mackay", "Moreton Bay", "Ipswich", "Sunshi…
#> $ county_code                <chr> "AU-QLD-MAC", "AU-QLD-MRB", "AU-QLD-IPS", "…
#> $ iba_code                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ bcr_code                   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ usfws_code                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ atlas_block                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ locality                   <chr> "Finch Hatton Gorge", "Lake Samsonvale--Gol…
#> $ locality_id                <chr> "L921597", "L692632", "L4784405", "L1827741…
#> $ locality_type              <chr> "H", "H", "P", "H", "H", "H", "P", "H", "H"…
#> $ latitude                   <dbl> -21.1, -27.3, -27.5, -26.6, -27.7, -27.3, -…
#> $ longitude                  <dbl> 149, 153, 153, 153, 153, 153, 153, 153, 153…
#> $ observation_date           <date> 2019-08-30, 2019-10-23, 2019-09-21, 2019-0…
#> $ time_observations_started  <chr> "05:51:00", "05:09:00", "06:30:00", "09:41:…
#> $ observer_id                <chr> "obsr1490162", "obsr277790", "obsr186524", …
#> $ sampling_event_identifier  <chr> "S67655665", "S61014875", "S59994086", "S53…
#> $ protocol_type              <chr> "Stationary", "Traveling", "Traveling", "Tr…
#> $ protocol_code              <chr> "P21", "P22", "P22", "P22", "P22", "P22", "…
#> $ project_code               <chr> "EBIRD_AU", "EBIRD_AU", "EBIRD", "EBIRD", "…
#> $ duration_minutes           <int> 540, 361, 95, 61, 240, 115, 70, 85, 60, 141…
#> $ effort_distance_km         <dbl> NA, 5.240, 0.701, 2.000, 4.500, 1.000, 1.27…
#> $ effort_area_ha             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ number_observers           <int> 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 3, 1, 1, 4, 1…
#> $ all_species_reported       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
#> $ group_identifier           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "G43416…
#> $ has_media                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
#> $ approved                   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
#> $ reviewed                   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
#> $ reason                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ trip_comments              <chr> "Transferred from Daily Checklist.", NA, NA…
#> $ species_comments           <chr> "Heard.", NA, NA, NA, NA, NA, "1 immature."…
Checkpoint

Take some time to explore the variables in the observation dataset. Notice that the EBD duplicates many of the checklist-level variables from the SED.

When we read the data into R, we used unique = FALSE and rollup = FALSE. By default the read functions in auk perform two important pre-processing steps: combining duplicate shared checklists and taxonomic rollup. We intentionally turned off this functionality for the purposes of demonstration.

1.1.2.1 Shared checklists

eBird allows users to share checklists with other eBirders in their group, for example this checklist is shared by 47 observers! These checklists can be identified by looking at the group_identifier variable, which assigns an ID connecting all checklists in the group.

checklists %>% 
  filter(!is.na(group_identifier)) %>% 
  arrange(group_identifier) %>% 
  select(sampling_event_identifier, group_identifier)
#> # A tibble: 162,127 × 2
#>   sampling_event_identifier group_identifier
#>   <chr>                     <chr>           
#> 1 S133890080                G10005231       
#> 2 S133784377                G10005231       
#> 3 S133893592                G10005466       
#> 4 S133893590                G10005466       
#> 5 S133897212                G10005766       
#> 6 S133883873                G10005766       
#> # ℹ 162,121 more rows

Checklists with the same group_identifier provide duplicate information on the same birding event in the eBird database. For most analyses, it’s important to collapse these shared checklists down into a single checklist. This can be accomplished with the function auk_unique(), which retains only one independent copy of each checklist.

checklists_unique <- auk_unique(checklists, checklists_only = TRUE)
nrow(checklists)
#> [1] 755445
nrow(checklists_unique)
#> [1] 662253

Notice that a new variable, checklist_id, was created that is set to group_identifier for shared checklists and sampling_event_identifier for non-shared checklists.

head(checklists_unique$checklist_id)
#> [1] "S85875959" "S96568064" "S86299492" "S97805740" "S94758500" "S83599027"
tail(checklists_unique$checklist_id)
#> [1] "G7637197" "G7638838" "G7638782" "G7638869" "G7641275" "G7641274"
Tip

Curious what checklists and observers contributed to a shared checklist after it has been collapsed? The sampling_event_identifier and observer_id contain comma-separated lists of all checklists and observers that went into the shared checklists.

checklists_unique %>% 
  filter(checklist_id == "G10638158") %>%
  select(checklist_id, group_identifier, sampling_event_identifier, observer_id)
#> # A tibble: 1 × 4
#>   checklist_id group_identifier sampling_event_identifier            observer_id
#>   <chr>        <chr>            <chr>                                <chr>      
#> 1 G10638158    G10638158        S145972310,S145972882,S145973235,S1… obs1013108…

1.1.2.2 Taxonomic rollup

eBird observations can be made at levels below species (e.g. subspecies) or above species (e.g. a bird that was identified as a duck, but the species could not be determined); however, for most uses we’ll want observations at the species level. This is especially true if we want to produce detection/non-detection data from complete checklists because “complete” only applies at the species level.

Tip

In the example dataset used for this workshop, these taxonomic issues don’t apply. We have requested Shining Bronze-Cuckoo observations, so we haven’t received any observations for taxa above species. However, there are records of Shining Bronze-Cuckoo (Shining) and Shining Bronze-Cuckoo (Golden) in Queensland during this time period (e.g., this checklist with photos. Accordingly, to use all records, we need to rollup these two subspecies into one set of species-level information. This can happen in many situations. For example, this checklist has 10 Yellow-rumped Warblers, 5 each of two Yellow-rumped Warbler subspecies, and one hybrid between the two subspecies. auk_rollup() will combine all four of these observations into a single Yellow-rumped Warbler observation.

The function auk_rollup() drops all observations not identifiable to a species and rolls up all observations reported below species to the species level.

observations_rollup <- auk_rollup(observations)

# only checklist example with both subspecies and one species-level entry
observations %>% 
  filter(sampling_event_identifier == "S102653162") %>%
  select(sampling_event_identifier, common_name, subspecies_common_name, 
         observation_count)
#> # A tibble: 3 × 4
#>   sampling_event_identifier common_name subspecies_common_name observation_count
#>   <chr>                     <chr>       <chr>                  <chr>            
#> 1 S102653162                Shining Br… Shining Bronze-Cuckoo… 1                
#> 2 S102653162                Shining Br… <NA>                   2                
#> 3 S102653162                Shining Br… Shining Bronze-Cuckoo… 1
observations_rollup %>% 
  filter(sampling_event_identifier == "S102653162") %>%
  select(sampling_event_identifier, common_name,
         observation_count)
#> # A tibble: 1 × 3
#>   sampling_event_identifier common_name           observation_count
#>   <chr>                     <chr>                 <chr>            
#> 1 S102653162                Shining Bronze-Cuckoo 4
Tip

If multiple taxa on a single checklist roll up to the same species, auk_rollup() attempts to combine them intelligently. If each observation has a count, those counts are added together, but if any of the observations is missing a count (i.e. the count is “X”) the combined observation is also assigned an “X”. In the example checklist from the previous tip, with four taxa all rolling up to Yellow-rumped Warbler, auk_rollup() will add the four counts together to get 21 Yellow-rumped Warblers (10 + 5 + 5 + 1).

1.1.3 Generating detection/non-detection data

Complete eBird checklists are extremely valuable because, for all species that weren’t reported, we can infer counts of 0. This allows us to convert eBird from presence only data to detection/non-detection data, which allows for much more robust analyses. Note that we don’t use the term presence/absence data here because a non-detection doesn’t necessarily imply the species was absent, only that the observer didn’t detect and identify it.

We refer to the process of producing detection/non-detection data as “zero-filling” the eBird data because we’re filling in the missing zeros. We’ll read the eBird data into R again, filter to only complete checklists, then use the function auk_zerofill() to generate detection/non-detection data. Note that shared checklists are combined and taxonomic rollup is performed by default when using the read_*() functions from auk.

# import checklist data
checklists <- read_sampling(f_sed) %>% 
  # subset to complete checklists
  filter(all_species_reported)
# import observation data
observations <- read_ebd(f_ebd) %>% 
  # subset to complete checklists
  filter(all_species_reported)
# zero-fill to produce detection/non-detection data
zf <- auk_zerofill(observations, checklists, collapse = TRUE)
glimpse(zf)
#> Rows: 509,294
#> Columns: 38
#> $ checklist_id              <chr> "S85875959", "S96568064", "S86299492", "S978…
#> $ last_edited_date          <chr> "2021-04-19 02:40:19.157994", "2021-10-22 20…
#> $ country                   <chr> "Australia", "Australia", "Australia", "Aust…
#> $ country_code              <chr> "AU", "AU", "AU", "AU", "AU", "AU", "AU", "A…
#> $ state                     <chr> "Queensland", "Queensland", "Queensland", "Q…
#> $ state_code                <chr> "AU-QLD", "AU-QLD", "AU-QLD", "AU-QLD", "AU-…
#> $ county                    <chr> "Brisbane", "Brisbane", "Brisbane", "Brisban…
#> $ county_code               <chr> "AU-QLD-BRI", "AU-QLD-BRI", "AU-QLD-BRI", "A…
#> $ iba_code                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ bcr_code                  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ usfws_code                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ atlas_block               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ locality                  <chr> "Dowse Lagoon (Sandgate)", "Dowse Lagoon (Sa…
#> $ locality_id               <chr> "L1896576", "L1896576", "L1896576", "L189657…
#> $ locality_type             <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H",…
#> $ latitude                  <dbl> -27.3, -27.3, -27.3, -27.3, -27.3, -27.3, -2…
#> $ longitude                 <dbl> 153, 153, 153, 153, 153, 153, 153, 153, 153,…
#> $ observation_date          <date> 2021-04-19, 2021-10-23, 2021-04-25, 2021-11…
#> $ time_observations_started <chr> "15:40:00", "06:04:00", "12:25:00", "13:53:0…
#> $ observer_id               <chr> "obs427309", "obs427309", "obs535737", "obs9…
#> $ sampling_event_identifier <chr> "S85875959", "S96568064", "S86299492", "S978…
#> $ protocol_type             <chr> "Traveling", "Traveling", "Stationary", "Tra…
#> $ protocol_code             <chr> "P22", "P22", "P21", "P22", "P22", "P21", "P…
#> $ project_code              <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_A…
#> $ duration_minutes          <int> 59, 138, 33, 46, 75, 16, 67, 104, 28, 45, 80…
#> $ effort_distance_km        <dbl> 1.954, 3.437, NA, 1.610, 1.030, NA, 1.310, 2…
#> $ effort_area_ha            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ number_observers          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ all_species_reported      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ group_identifier          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ trip_comments             <chr> NA, NA, NA, "Clear skies light breeze. Targe…
#> $ scientific_name           <chr> "Chrysococcyx lucidus", "Chrysococcyx lucidu…
#> $ breeding_code             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ breeding_category         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ behavior_code             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ age_sex                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ observation_count         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
#> $ species_observed          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…

The observation_count variable has true counts as well as “X”s, which indicate that the species was detected but the number of individuals was not counted. auk_zerofill() adds a new binary column, species_observed, indicating whether or not the species was detected.

select(zf, observation_count, species_observed) %>% 
  head(10)
#> # A tibble: 10 × 2
#>   observation_count species_observed
#>   <chr>             <lgl>           
#> 1 0                 FALSE           
#> 2 0                 FALSE           
#> 3 0                 FALSE           
#> 4 0                 FALSE           
#> 5 0                 FALSE           
#> 6 0                 FALSE           
#> # ℹ 4 more rows

Let’s convert the “X”s to NAs and transform observation_count to an integer variable.

zf$observation_count <- if_else(zf$observation_count == "X", 
                                NA_character_, zf$observation_count) %>% 
  as.integer()
select(zf, observation_count, species_observed) %>% 
  head(10)
#> # A tibble: 10 × 2
#>   observation_count species_observed
#>               <int> <lgl>           
#> 1                 0 FALSE           
#> 2                 0 FALSE           
#> 3                 0 FALSE           
#> 4                 0 FALSE           
#> 5                 0 FALSE           
#> 6                 0 FALSE           
#> # ℹ 4 more rows

1.1.4 Filtering data

Now that you have a detection/non-detection dataset, it’s likely that you want to do something with it. For example, you may want to make a map, identify priority areas for a species, or train a species distribution model. Regardless of the specific application, it’s likely that some amount of filtering of the data is required first. Some of the ways you may want to filter eBird data include:

  • Temporal filtering: filter the data to a specific range of years or to a specific time of year.
  • Spatial filtering: filter the data to focus on a specific region, e.g. a protected area.
  • Increasing precision: some eBird checklists are quite long in distance or duration leading to spatial or temporal imprecision. By removing longer checklists we can increase the spatial precision of the dataset.
  • Reducing variation in effort: unlike structured scientific surveys, data can be submitted to eBird using a variety of protocols and there is significant variation in effort between checklists in the eBird dataset. Variation in protocol and effort leads to variation in detectability (more effort generally leads to higher detectability). We can choose to impose more structure on the eBird dataset by filtering to reduce variation in protocol and effort.

The specific filtering you apply will depend on how you intend to use the eBird data. However, for the sake of this example, let’s filter the eBird data to only traveling and stationary checklists from 2013-2022 that are less than 6 hours in duration and 10 km in length.

zf_filtered <- zf %>% 
  filter(year(observation_date) >= 2013, year(observation_date) <= 2022,
         protocol_type %in% c("Traveling", "Stationary"),
         duration_minutes < 6 * 60,
         effort_distance_km < 10 | protocol_type == "Stationary")
nrow(zf)
#> [1] 509294
nrow(zf_filtered)
#> [1] 342881

We reduced the number of checklists by 166,413, but the checklists remaining are of higher quality.

1.1.5 Environmental Covariate Assignment

At this point, if you were planning to run a species distribution model with this data, you’d want some environmental variables as predictors. However, adding environmental variables can be onerous, computationally expensive, and varies based on use case. We provide guidance in our “Best Practices for Using eBird Data” document on extracting environmental variables to use with eBird data. For continuing this work on your own, please use that reference. For the remainder of this workshop, we’re going to skip this process and use the eBird Reference Dataset (ERD) that already contains the environmental variables used in for eBird Status and Trends modeling.

1.2 eBird Reference Dataset (ERD)

The eBird Reference Dataset (ERD) is a subset of the full eBird database created annually for eBird Status and Trends modeling. Only semi-structured (complete checklists with effort information) traveling and stationary counts from the last 15 years are included in the ERD and we assign a set of environmental variables assigned to checklist. In the following sections we’ll provide an introduction to the ERD, describe the associated prediction grid used to make predictions across space, and highlight some of the challenges associated with using eBird data for analysis.

The ERD is distributed in two parts: observation data and checklist data. In the observation dataset, each row corresponds to the sighting of a single species on a checklist, including the count and any other species-level information. In the checklist dataset, each row corresponds to a checklist, including the date, time, location, effort (e.g. distance traveled, time spent, etc.), and any additional checklist-level information.

For this workshop, and extract of the ERD is provided in the workshop data package. The observations and checklsits datasets are provided in parquet format, an open source standard for efficient storage and retrieval of tabular data. If you haven’t already done so, following the instructions in the Introduction to create an RStudio project and download the workshop data package. The parquet files should be located at:

data/ebird_observations_AU-QLD.parquet
data/ebird_checklists_AU-QLD.parquet

Let’s start by reading these two datasets into R using the arrow package and exploring them. We’ll start with the checklist dataset.

library(arrow)
library(auk)
library(dplyr)
library(ebirdst)
library(ggplot2)
library(sf)
library(terra)

checklists <- read_parquet("data/ebird_checklists_AU-QLD_2022.parquet")
glimpse(checklists)
#> Rows: 357,213
#> Columns: 151
#> $ checklist_id                       <int> 10864617, 11307779, 6971840, 102635…
#> $ observer_id                        <int> 267451, 113983, 162908, 113983, 166…
#> $ loc_id                             <chr> "L1560273", "L1650808", "H2561832",…
#> $ longitude                          <dbl> 153, 150, 139, 153, 145, 145, 152, …
#> $ latitude                           <dbl> -26.7, -28.5, -19.0, -27.2, -16.7, …
#> $ year                               <dbl> 2012, 2012, 2008, 2012, 2011, 2011,…
#> $ day_of_year                        <dbl> 128, 212, 180, 85, 284, 283, 34, 31…
#> $ hours_of_day                       <dbl> 11.65, 10.00, 6.75, 11.42, 15.42, 1…
#> $ solar_noon_diff                    <dbl> -0.0376, -2.0229, -2.0756, -0.3446,…
#> $ is_stationary                      <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FA…
#> $ effort_hrs                         <dbl> 0.167, 0.167, 8.000, 0.333, 0.583, …
#> $ effort_distance_km                 <dbl> 0.000, 2.000, 5.000, 0.000, 1.500, …
#> $ effort_speed_kmph                  <dbl> 0.000, 11.976, 0.625, 0.000, 2.573,…
#> $ num_observers                      <int> 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1,…
#> $ moon_fraction                      <dbl> 0.9874, 0.8842, 0.3158, 0.0594, 0.9…
#> $ moon_altitude                      <dbl> -0.7028, -0.6658, 0.6386, 0.6711, -…
#> $ cds_u10                            <dbl> -0.1342, 1.9947, -5.1339, -3.9891, …
#> $ cds_v10                            <dbl> -0.7690, 2.7611, 0.5347, 2.1257, -0…
#> $ cds_d2m                            <dbl> 8.580, 4.629, 8.287, 15.854, 11.582…
#> $ cds_t2m                            <dbl> 23.02, 8.29, 24.84, 23.38, 34.93, 3…
#> $ cds_hcc                            <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ cds_i10fg                          <dbl> 4.03, 8.04, 9.83, 10.01, 9.44, 8.83…
#> $ cds_mcc                            <dbl> 0.000000, 0.000000, 0.000000, 0.695…
#> $ cds_lcc                            <dbl> 0.0000, 0.0000, 0.0000, 0.7405, 0.0…
#> $ cds_sf                             <dbl> 8.67e-19, 0.00e+00, 0.00e+00, 0.00e…
#> $ cds_rf                             <dbl> 6.07e-18, 0.00e+00, 0.00e+00, 2.95e…
#> $ cds_slc                            <dbl> -133.21, -11.49, -12.35, -25.82, -1…
#> $ cds_msl                            <dbl> 101650, 102426, 101786, 101845, 100…
#> $ eastness_1km_median                <dbl> 3.61e-02, 8.04e-08, 4.35e-02, 1.12e…
#> $ eastness_1km_sd                    <dbl> 0.23578, 0.03816, 0.20141, 0.25066,…
#> $ eastness_90m_median                <dbl> 7.59e-02, -2.76e-04, 1.51e-03, 2.74…
#> $ eastness_90m_sd                    <dbl> 0.16464, 0.00216, 0.02724, 0.06283,…
#> $ northness_1km_median               <dbl> -7.84e-03, 7.70e-02, -6.40e-02, 5.0…
#> $ northness_1km_sd                   <dbl> 0.0950, 0.0705, 0.2609, 0.1821, 0.3…
#> $ northness_90m_median               <dbl> -0.007993, 0.000385, -0.001088, -0.…
#> $ northness_90m_sd                   <dbl> 0.13644, 0.00182, 0.01958, 0.05712,…
#> $ elevation_250m_median              <dbl> 317.450, 218.982, 143.838, 64.140, …
#> $ elevation_250m_sd                  <dbl> 47.518, 0.878, 6.444, 18.788, 10.73…
#> $ elevation_30m_median               <dbl> 314.7, 217.5, 140.8, 54.8, 394.3, 3…
#> $ elevation_30m_sd                   <dbl> 55.56, 5.03, 9.73, 24.47, 13.24, 13…
#> $ island                             <dbl> 30005, 30005, 30005, 30005, 30005, …
#> $ astwbd_c1_ed                       <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ astwbd_c1_pland                    <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ astwbd_c2_ed                       <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ astwbd_c2_pland                    <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ astwbd_c3_ed                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ astwbd_c3_pland                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ gsw_c2_pland                       <dbl> 0.00000, 0.95312, 0.31391, 0.02483,…
#> $ gsw_c2_ed                          <dbl> 0.000, 10.504, 3.197, 0.550, 0.184,…
#> $ gsw_c3_pland                       <dbl> 0.00000, 0.19215, 0.12675, 0.00993,…
#> $ gsw_c3_ed                          <dbl> 0.00, 3.01, 1.58, 0.22, 0.00, 0.00,…
#> $ intertidal_c1_ed                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ intertidal_c1_pland                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ntl_mean                           <dbl> 0.0000, 7.1068, 0.0000, 0.2118, 0.2…
#> $ ntl_sd                             <dbl> 0.0000, 5.4641, 0.0000, 0.3992, 0.4…
#> $ road_density_c1                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c2                    <dbl> 0, 681, 0, 545, 422, 422, 0, 0, 0, …
#> $ road_density_c3                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c4                    <dbl> 410, 321, 631, 0, 80, 80, 287, 287,…
#> $ road_density_c5                    <dbl> 0, 0, 0, 766, 0, 0, 0, 0, 0, 0, 0, …
#> $ mcd12q1_lccs1_c1_ed                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c1_pland             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c2_ed                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c2_pland             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c11_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c11_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c12_ed               <dbl> 9.50, 0.00, 0.00, 1.69, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c12_pland            <dbl> 53.60, 0.00, 0.00, 1.00, 0.00, 0.00…
#> $ mcd12q1_lccs1_c13_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c13_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c14_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c14_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c15_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c15_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c16_ed               <dbl> 0.00, 0.00, 0.00, 0.00, 1.76, 1.76,…
#> $ mcd12q1_lccs1_c16_pland            <dbl> 0.000, 0.000, 0.000, 0.000, 3.001, …
#> $ mcd12q1_lccs1_c21_ed               <dbl> 11.66, 0.00, 0.00, 5.08, 0.00, 0.00…
#> $ mcd12q1_lccs1_c21_pland            <dbl> 43.72, 0.00, 0.00, 9.40, 0.00, 0.00…
#> $ mcd12q1_lccs1_c22_ed               <dbl> 2.158, 6.516, 0.899, 9.734, 4.405, …
#> $ mcd12q1_lccs1_c22_pland            <dbl> 2.678, 41.444, 0.776, 76.168, 90.99…
#> $ mcd12q1_lccs1_c31_ed               <dbl> 0.00, 6.52, 7.64, 4.66, 2.64, 2.64,…
#> $ mcd12q1_lccs1_c31_pland            <dbl> 0.00, 58.56, 32.62, 13.43, 6.00, 6.…
#> $ mcd12q1_lccs1_c32_ed               <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c32_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c41_ed               <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c41_pland            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 24.9,…
#> $ mcd12q1_lccs1_c42_ed               <dbl> 0.0, 0.0, 4.5, 0.0, 0.0, 0.0, 14.7,…
#> $ mcd12q1_lccs1_c42_pland            <dbl> 0.00, 0.00, 9.01, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c43_ed               <dbl> 0.0, 0.0, 10.3, 0.0, 0.0, 0.0, 0.0,…
#> $ mcd12q1_lccs1_c43_pland            <dbl> 0.0, 0.0, 57.6, 0.0, 0.0, 0.0, 0.0,…
#> $ mcd12q1_lccs1_c255_ed              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c255_pland           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c25_ed               <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ mcd12q1_lccs2_c25_pland            <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ mcd12q1_lccs2_c35_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c35_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c36_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c36_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c27_ed               <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs3_c27_pland            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs3_c50_ed               <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs3_c50_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c51_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c51_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ has_shoreline                      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ shoreline_waveheight_mean          <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ shoreline_waveheight_sd            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_tidal_range_mean         <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_tidal_range_sd           <dbl> 0.00000, 0.00000, 0.00000, 0.00000,…
#> $ shoreline_chlorophyll_mean         <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_chlorophyll_sd           <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
#> $ shoreline_turbidity_mean           <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ shoreline_turbidity_sd             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_sinuosity_mean           <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_sinuosity_sd             <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_slope_mean               <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_slope_sd                 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_outflow_density_mean     <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e…
#> $ shoreline_outflow_density_sd       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_n            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c1_density   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c2_density   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c3_density   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c4_density   <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ shoreline_emu_physical_n           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c1_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c2_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c3_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c4_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c5_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c6_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c7_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c8_density  <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ shoreline_emu_physical_c9_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c10_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c11_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c12_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c13_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c14_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c15_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c16_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c17_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c18_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c19_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c20_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c21_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c22_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c23_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ has_evi                            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
#> $ evi_median                         <dbl> 56375768, 26577095, 16683584, 49761…
#> $ evi_sd                             <dbl> 4978114, 6436081, 5444980, 6107690,…

There are a huge number of columns in this data frame. The first set of variables provide standard information about the checklist: where and when did the observation occur, what type of search was conducted, and how much search effort was expended. Two important differences exist between these variables and what you will see if you look at the raw eBird dataset: when a GPS track is available we replace the checklist or hotspot location (latitude/longitude) with the centroid of the track and the time of the checklist is expressed as the difference between the checklist midpoint and solar noon, a more ecologically meaningful quantity.

All the remaining variables are not collected in eBird, they’re calculated and added by the Status and Trends team based on external data sets. First, those variables beginning with cds_, provides information about the weather at the time of the observation, which can impact detectibility. This is followed by a large suite of environmental variables summarized over a 3km diameter circular neighborhood around the checklist location, including variables describing: elevation and topography, land and water cover, roads, and night time lights (a proxy for urban development). Most variables are summarized as two quantities expressing composition (what habitat is available) and configuration (how that habitat is arranged spatially). For continuous variables, such as elevation, we use the median and standard deviation. For categorical variables, such as land cover class, we use percent landcover (pland) and edge density (ed).

Example of calculating percent land cover and edge density for a 3km diamter circular neighborhood centered on a checklist location. pland for each class is the percent of the circle covered by that class. To calculate ed for each class, we add up the perimeter lengths of all patches of that class, then divide by the area of the circle.

The land and water cover variables can be challenging to interpret based on their names alone (e.g. mcd12q1_lccs1_c12_pland); however, these names can be looked up in the ebirdst_predictors data frame from the ebirdst package. For example, let’s look up what mcd12q1_lccs1_c12_pland corresponds to.

filter(ebirdst_predictors, predictor == "mcd12q1_lccs1_c12_pland") %>% 
  select(predictor, label)
#> # A tibble: 1 × 2
#>   predictor               label                                
#>   <chr>                   <chr>                                
#> 1 mcd12q1_lccs1_c12_pland Evergreen Broadleaf Forests (% cover)
Checkpoint

Take some time to explore the variables in the checklist dataset. Try looking up a variable in ebirdst_predictors. Ask for help if you need clarification on the meaning of any of the variables.

Now let’s look at the observation dataset.

observations <- read_parquet("data/ebird_observations_AU-QLD_2022.parquet")
glimpse(observations)
#> Rows: 7,292,576
#> Columns: 5
#> $ checklist_id <dbl> 10928575, 10928575, 10928575, 10928575, 10928575, 1092857…
#> $ species_code <chr> "rensti", "faecur", "grsplo", "strher", "cursan", "batgod…
#> $ valid        <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
#> $ obs_detected <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ obs_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

This is a much simpler dataset with only five columns:

  • checklist_id: unique identifier for the checklist that this observation belongs to. Allows joining the observation data to the checklist data.
  • species_code: unique identifier for the species that this observation was made for.
  • valid: a binary variable indicating is the observation was determined to be valid (TRUE) or invalid (FALSE) by the eBird reviewers.
  • obs_detected: a binary variable indicating if the species was detected (1) or not detected (0). Since this is a dataset of observations only, obs_detected is always 1; however, having this variable will become useful when we join to the checklist dataset in the next section.
  • obs_count: count of the number of individuals or an NA if no count was provided (if an “X” was entered for count on the eBird checklist).
Tip

To look up the common name or scientific name of a species try appending the species code to the URL https://ebird.org/species/. For example, visit https://ebird.org/species/maslap1 to look up the species code maslap1. This information is also available in the ebird_taxonomy data frame in the auk package.

filter(ebird_taxonomy, species_code == "maslap1") %>% 
  select(species_code, common_name, scientific_name, family)
#>   species_code    common_name scientific_name       family
#> 1      maslap1 Masked Lapwing  Vanellus miles Charadriidae

1.2.1 Zero-filling eBird data

Complete eBird checklists are extremely valuable because, for all species that weren’t reported, we can infer counts of 0. This allows us to convert eBird from presence only data to detection/non-detection data, which allows for much more robust analyses. Note that we don’t use the term presence/absence data here because a non-detection doesn’t necessarily imply the species was absent, only that observer wasn’t able to detect and identify it.

We refer to the process of producing detection/non-detection data as “zero-filling” the eBird data because we’re filling in the missing zeros. Let’s consider observations of Shining Bronze-Cuckoo (species code shbcuc1).

bird_detections <- observations %>% 
  filter(species_code == "shbcuc1") %>% 
  select(checklist_id, valid, obs_detected, obs_count)

Next join this set of detections to the complete set of checklists, including detections and non-detections.

bird_all <- left_join(checklists, bird_detections, by = "checklist_id") %>%
  select(checklist_id, latitude, longitude, year, day_of_year,
         valid, obs_detected, obs_count)
head(bird_all)
#> # A tibble: 6 × 8
#>   checklist_id latitude longitude  year day_of_year valid obs_detected obs_count
#>          <dbl>    <dbl>     <dbl> <dbl>       <dbl> <lgl>        <int>     <int>
#> 1     10864617    -26.7      153.  2012         128 NA              NA        NA
#> 2     11307779    -28.5      150.  2012         212 NA              NA        NA
#> 3      6971840    -19.0      139.  2008         180 NA              NA        NA
#> 4     10263519    -27.2      153.  2012          85 TRUE             1         1
#> 5      9020044    -16.7      145.  2011         284 NA              NA        NA
#> 6      9009843    -16.7      145.  2011         283 NA              NA        NA

Finally, for rows where the bird was not detected we can replace the missing counts with 0. At this time, we recommend removing any checklists with valid == 0 because there is uncertainty about whether or not the species was detected. Let’s also filter to a subset of months, to keep the data smaller and relevant to a particular season.

bird_zf <- bird_all %>% 
  filter(is.na(valid) | valid == 1) %>% 
  mutate(
    # checklist not in the observations dataset are non-detections
    obs_detected = coalesce(obs_detected, 0L),
    # non-detections correspond to a count of 0
    obs_count = if_else(obs_detected == 1, obs_count, 0)
  ) %>%
  # approximately november through january
  filter(day_of_year >= 305 | day_of_year < 32)

We can now, for example, make a map of observations. We’ll use spatial data that was prepared in advance and provided in the data package.

# load and project gis data
map_proj <- "+proj=laea +lon_0=146.95 +lat_0=-19.15 +datum=WGS84 +units=m +no_defs"
ne_land <- read_sf("data/gis-data.gpkg", "ne_land") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
ne_country_lines <- read_sf("data/gis-data.gpkg", "ne_country_lines") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
ne_state_lines <- read_sf("data/gis-data.gpkg", "ne_state_lines") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()
target_state <- read_sf("data/gis-data.gpkg", "regions") %>% 
  filter(state_code == "AU-QLD") %>% 
  st_transform(crs = map_proj) %>% 
  st_geometry()

# prepare ebird data for mapping
bird_sf <- bird_zf %>% 
  # convert to spatial points
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>% 
  st_transform(crs = map_proj)

# map
par(mar = c(0.25, 0.25, 0.25, 0.25))
# set up plot area
plot(st_geometry(target_state), col = NA, border = NA)
# contextual gis data
plot(ne_land, col = "#cfcfcf", border = "#888888", lwd = 0.5, add = TRUE)
plot(target_state, col = "#e6e6e6", border = NA, add = TRUE)
plot(ne_state_lines, col = "#ffffff", lwd = 0.75, add = TRUE)
plot(ne_country_lines, col = "#ffffff", lwd = 1.5, add = TRUE)
# ebird observations
# all
plot(bird_sf,
     pch = 19, cex = 0.1, col = scales::alpha("#555555", 4),
     add = TRUE)
# detection
plot(filter(bird_sf, obs_detected == 1),
     pch = 19, cex = 0.3, col = scales::alpha("#4daf4a", 1),
     add = TRUE)
# legend
legend("bottomright", bty = "n",
       col = c("#555555", "#4daf4a"),
       legend = c("eBird checklists", "Bird sightings"),
       pch = 19)
box()
par(new = TRUE, mar = c(0, 0, 3, 0))
title("Shining Bronze-Cuckoo eBird Observations\nNovember-January 2007-2022")

Exercise

Try producing zero-filled, detection/non-detection data for another species.

For example, to produce detection/non-detection data for Masked Lapwing use:

sp_zf <- observations %>% 
  filter(species_code == "maslap1") %>% 
  left_join(checklists, ., by = "checklist_id") %>% 
  filter(is.na(valid) | valid == 1) %>% 
  mutate(obs_detected = coalesce(obs_detected, 0),
         obs_count = if_else(obs_detected == 1, obs_count, 0)) %>% 
  select(checklist_id, obs_detected, obs_count)
head(sp_zf)
#> # A tibble: 6 × 3
#>   checklist_id obs_detected obs_count
#>          <dbl>        <dbl>     <dbl>
#> 1     10864617            0         0
#> 2     11307779            1         2
#> 3      6971840            0         0
#> 4     10263519            0         0
#> 5      9020044            0         0
#> 6      9009843            0         0

1.2.2 Prediction grid

The ultimate goal of modeling the occurrence or abundance of a species is frequently to produce a map showing the distribution of that species in space. To do so, we need to know the values of our predictor variables over the region that we intend to make predictions. To make this possible, the ERD is distributed with a prediction grid: a regular grid of points covering the entire globe spaced 3km apart for which all the environmental variables have been calculated for the year 2022.

The data package for this course contains an example subset of the prediction grid. The file data/prediction-grid_year_AU-QLD.parquet contains the environmental variables for each point on the grid and the file data/prediction-grid_template.tif is a 3km by 3km raster template where each each cell center is a point on the prediction grid. Let’s start by examining the environmental variables.

prediction_grid <- read_parquet("data/prediction-grid_year_AU-QLD_2022.parquet")
glimpse(prediction_grid)
#> Rows: 230,422
#> Columns: 123
#> $ srd_id                             <int> 50299954, 50299955, 50299956, 50299…
#> $ longitude                          <dbl> 142, 142, 142, 142, 142, 142, 142, …
#> $ latitude                           <dbl> -9.19, -9.19, -9.19, -9.19, -9.19, …
#> $ eastness_1km_median                <dbl> -8.11e-03, -9.89e-03, 4.66e-02, 3.1…
#> $ eastness_1km_sd                    <dbl> 0.0875, 0.1937, 0.1585, 0.1373, 0.1…
#> $ eastness_90m_median                <dbl> 6.06e-04, 5.23e-05, 1.53e-04, -2.61…
#> $ eastness_90m_sd                    <dbl> 0.00474, 0.00407, 0.00657, 0.01536,…
#> $ northness_1km_median               <dbl> -6.06e-02, 1.53e-01, 7.75e-02, -5.6…
#> $ northness_1km_sd                   <dbl> 0.3412, 0.1793, 0.2380, 0.1677, 0.3…
#> $ northness_90m_median               <dbl> -0.001320, -0.000886, 0.000655, 0.0…
#> $ northness_90m_sd                   <dbl> 0.01242, 0.01134, 0.01204, 0.01721,…
#> $ elevation_250m_median              <dbl> 14.513, 13.969, 14.086, 0.174, 13.9…
#> $ elevation_250m_sd                  <dbl> 2.67, 3.22, 3.96, 7.31, 5.04, 5.78,…
#> $ elevation_30m_median               <dbl> 14.1, 13.3, 12.2, 13.1, 14.4, 17.5,…
#> $ elevation_30m_sd                   <dbl> 5.06, 5.09, 4.50, 4.40, 5.30, 5.22,…
#> $ island                             <dbl> 392, 392, 392, 392, 1067, 1067, 106…
#> $ astwbd_c1_ed                       <dbl> 2.26, 2.26, 5.95, 5.95, 9.26, 14.02…
#> $ astwbd_c1_pland                    <dbl> 2.20, 1.21, 11.17, 81.91, 42.20, 37…
#> $ astwbd_c2_ed                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ astwbd_c2_pland                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ astwbd_c3_ed                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ astwbd_c3_pland                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ gsw_c2_pland                       <dbl> 0.823, 6.787, 8.823, 7.259, 1.704, …
#> $ gsw_c2_ed                          <dbl> 5.27, 18.01, 16.18, 11.68, 14.06, 2…
#> $ gsw_c3_pland                       <dbl> 1.8037, 0.5269, 11.5355, 81.5255, 3…
#> $ gsw_c3_ed                          <dbl> 2.228, 1.398, 5.633, 5.833, 9.295, …
#> $ intertidal_c1_ed                   <dbl> 2.24, 2.16, 6.65, 16.81, 13.13, 7.9…
#> $ intertidal_c1_pland                <dbl> 2.37, 1.06, 12.20, 19.01, 10.81, 18…
#> $ ntl_mean                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ntl_sd                             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c1                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c2                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c3                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c4                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c5                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c1_ed                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c1_pland             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c2_ed                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c2_pland             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c11_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c11_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c12_ed               <dbl> 7.338, 10.131, 8.094, 3.083, 8.544,…
#> $ mcd12q1_lccs1_c12_pland            <dbl> 17.695, 36.156, 44.618, 4.919, 26.6…
#> $ mcd12q1_lccs1_c13_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c13_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c14_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c14_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c15_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c15_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c16_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c16_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c21_ed               <dbl> 12.52, 13.21, 16.19, 7.49, 13.94, 1…
#> $ mcd12q1_lccs1_c21_pland            <dbl> 68.50, 28.15, 40.24, 19.06, 33.23, …
#> $ mcd12q1_lccs1_c22_ed               <dbl> 2.16, 7.05, 3.15, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c22_pland            <dbl> 3.10, 18.50, 6.22, 0.00, 0.00, 0.00…
#> $ mcd12q1_lccs1_c31_ed               <dbl> 4.75, 6.61, 2.70, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c31_pland            <dbl> 10.64, 17.13, 1.36, 0.00, 0.00, 0.0…
#> $ mcd12q1_lccs1_c32_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c32_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c41_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c41_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c42_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c42_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c43_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c43_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c255_ed              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c255_pland           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c25_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c25_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c35_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c35_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c36_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c36_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c27_ed               <dbl> 3.89, 4.85, 7.19, 4.40, 7.19, 8.37,…
#> $ mcd12q1_lccs3_c27_pland            <dbl> 45.60, 52.36, 90.17, 23.97, 59.84, …
#> $ mcd12q1_lccs3_c50_ed               <dbl> 1.295, 6.607, 0.899, 0.000, 0.000, …
#> $ mcd12q1_lccs3_c50_pland            <dbl> 2.573, 13.835, 0.626, 0.000, 0.000,…
#> $ mcd12q1_lccs3_c51_ed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c51_pland            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ has_shoreline                      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
#> $ shoreline_waveheight_mean          <dbl> 0.380, 0.380, 0.380, 0.380, 0.380, …
#> $ shoreline_waveheight_sd            <dbl> 0.00000, 0.00000, 0.00000, 0.00000,…
#> $ shoreline_tidal_range_mean         <dbl> 4.24, 4.13, 4.09, 4.06, 4.02, 3.93,…
#> $ shoreline_tidal_range_sd           <dbl> 0.0000, 0.0462, 0.0191, 0.0000, 0.0…
#> $ shoreline_chlorophyll_mean         <dbl> 3.70, 3.76, 3.76, 3.84, 3.83, 3.79,…
#> $ shoreline_chlorophyll_sd           <dbl> 0.037448, 0.000638, 0.045479, 0.044…
#> $ shoreline_turbidity_mean           <dbl> 0.1450, 0.1450, 0.1884, 0.2029, 0.0…
#> $ shoreline_turbidity_sd             <dbl> 0.000000, 0.000000, 0.028934, 0.000…
#> $ shoreline_sinuosity_mean           <dbl> 1.04, 1.04, 1.10, 1.10, 1.32, 1.34,…
#> $ shoreline_sinuosity_sd             <dbl> 0.000, 0.000, 0.000, 0.000, 0.405, …
#> $ shoreline_slope_mean               <dbl> 120.0, 13.3, 85.0, 56.0, 89.3, 74.3…
#> $ shoreline_slope_sd                 <dbl> 50.22, 5.74, 48.43, 10.09, 61.16, 4…
#> $ shoreline_outflow_density_mean     <dbl> 1.06e-04, 4.18e-05, 4.18e-05, 4.18e…
#> $ shoreline_outflow_density_sd       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_n            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,…
#> $ shoreline_erodibility_c1_density   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c2_density   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c3_density   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c4_density   <dbl> 181, 149, 367, 366, 677, 1088, 1240…
#> $ shoreline_emu_physical_n           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,…
#> $ shoreline_emu_physical_c1_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c2_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c3_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c4_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c5_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c6_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c7_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c8_density  <dbl> 181, 149, 367, 366, 677, 1088, 1240…
#> $ shoreline_emu_physical_c9_density  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c10_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c11_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c12_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c13_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c14_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c15_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c16_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c17_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c18_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c19_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c20_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c21_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c22_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c23_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

These variables should be mostly familiar from the ERD, except for srd_id which is a unique identifier for each point on the grid. Next let’s load the raster template using the terra package.

raster_template <- rast("data/prediction-grid_template.tif")
raster_template
#> class       : SpatRaster 
#> dimensions  : 5630, 13511, 1  (nrow, ncol, nlyr)
#> resolution  : 2963, 2963  (x, y)
#> extent      : -2e+07, 2e+07, -6673060, 1e+07  (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs 
#> source      : prediction-grid_template.tif 
#> name        : mask 
#> min value   :    1 
#> max value   :    1

This is a global 2.96km by 2.96km square grid in an equal area projection. We can use the terra function rasterize to insert values from the prediction grid into the template for mapping. For example, let’s make a raster dataset of percent cover of evergreen broadleaf forest (mcd12q1_lccs1_c12_pland).

forest_cover <- prediction_grid %>% 
  # convert to spatial object using sf
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>% 
  # transform to the coordinate reference system of the raster
  st_transform(crs = crs(raster_template)) %>% 
  # rasterize the points using the raster template
  rasterize(raster_template, field = "mcd12q1_lccs1_c12_pland")

Now we can make a simple map of evergreen broadleaf forest. Note that the raster template is global, but we can use trim() to remove all areas that have missing values.

plot(trim(forest_cover), axes = FALSE)

The map looks distorted because the prediction grid uses a sinusoidal projection, which works well for analysis but not for mapping. In the next lesson, we’ll demonstrate how to project data into a coordinate reference system more suitable for mapping.

1.2.3 Spatial and temporal bias

Despite the strengths of eBird data, species observations collected through citizen science projects exhibit both spatial and temporal bias requiring special care when using them for rigorous analyses. Spatial bias occurs because eBird participants are more likely to be collect data near their homes, in easily accessible areas such as roadsides, or in areas known to be good for birding. Looking at the above map of bird observations it’s clear that the eBird checklists are clustered around cities and roads. Temporal bias occurs because participants preferentially collect data when they are available, such as weekends, and at times of year when they expect to observe more birds, notably during the breeding season. We can plot the distribution of checklists over the days of the year to see this bias:

checklist_per_day <- checklists %>% 
  filter(day_of_year < 366) %>% 
  count(day_of_year)
ggplot(checklist_per_day) +
  aes(x = day_of_year, y = n) +
  geom_line() +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Day of Year", y = "# checklists",
       title = "Daily eBird checklists submission") +
  theme_bw()

Three is a clear seasonal pattern to the number of eBird checklists submitted (fewer checklists in March) as well as daily and weekly variation within seasons. In addition, there are spikes in checklists submissions. What do you think could be causing these sudden increases?

Finally, for most species, there is strong class imbalance in the data, meaning there are usually many more non-detections than detections. As a results, a distribution model predicting that the species is absent everywhere will have high accuracy, but no ecological value. For example, the prevalence rate of this species is only 3%.

mean(bird_zf$obs_detected)
#> [1] 0.0275

To address these three issues (spatial bias, temporal bias, and class imbalance) we recommend subsampling the data using a technique called case controlled grid sampling. We overlay an equal area 3km by 3km grid over the checklists, then sample one detection and one non-detection from each grid cell for each week of each year. Let’s look at a simple example of how spatial grid sampling works.”

1. Take one week of eBird observations. Detections are show in green and non-detections are shown in gray.

2. Separate the detections and non-detections. In this example, there is a higher density of observations in the lower right corner of the region and the prevalence of detections is 2%.

3. Overlay an equal area grid on top of the points, For Status and Trends we use a 3km by 3km grid.

4. Sample one checklist from each grid cell.

5. Recombine the detections and non-detections. The observations are much more evenly distributed in space and the prevalence of detections has increased from 2% to 20%.

The function grid_sample_stratified() from the ebirdst package is specifically designed to perform case controlled grid sampling on eBird data. For example, let’s apply this technique to the bird observations.

# perform case controlled grid sampling
bird_sampled <- grid_sample_stratified(bird_zf, obs_column = "obs_detected")

# how many checklists were removed?
nrow(bird_zf)
#> [1] 90620
nrow(bird_sampled)
#> [1] 33265

# how has prevalence changed
mean(bird_zf$obs_detected)
#> [1] 0.0275
mean(bird_sampled$obs_detected)
#> [1] 0.0435

So, after sampling, we’re left with 37% of the observations we started with, but the spatial and temporal bias has been significantly reduced.

We now have the data and tools necessary to model relative abundance using eBird data, which will be the focus of Lesson 2.