library(auk)
library(dplyr)
library(ggplot2)
library(lubridate)
library(sf)
<- "data/ebd_AU-QLD_shbcuc1_smp_relSep-2023_sampling.txt"
f_sed <- read_sampling(f_sed, unique = FALSE)
checklists glimpse(checklists)
#> Rows: 755,445
#> Columns: 30
#> $ last_edited_date <chr> "2021-04-19 02:40:19.157994", "2021-10-22 20…
#> $ country <chr> "Australia", "Australia", "Australia", "Aust…
#> $ country_code <chr> "AU", "AU", "AU", "AU", "AU", "AU", "AU", "A…
#> $ state <chr> "Queensland", "Queensland", "Queensland", "Q…
#> $ state_code <chr> "AU-QLD", "AU-QLD", "AU-QLD", "AU-QLD", "AU-…
#> $ county <chr> "Brisbane", "Brisbane", "Brisbane", "Brisban…
#> $ county_code <chr> "AU-QLD-BRI", "AU-QLD-BRI", "AU-QLD-BRI", "A…
#> $ iba_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ bcr_code <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ usfws_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ atlas_block <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ locality <chr> "Dowse Lagoon (Sandgate)", "Dowse Lagoon (Sa…
#> $ locality_id <chr> "L1896576", "L1896576", "L1896576", "L189657…
#> $ locality_type <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H",…
#> $ latitude <dbl> -27.3, -27.3, -27.3, -27.3, -27.3, -27.3, -2…
#> $ longitude <dbl> 153, 153, 153, 153, 153, 153, 153, 153, 153,…
#> $ observation_date <date> 2021-04-19, 2021-10-23, 2021-04-25, 2021-11…
#> $ time_observations_started <chr> "15:40:00", "06:04:00", "12:25:00", "13:53:0…
#> $ observer_id <chr> "obs427309", "obs427309", "obs535737", "obs9…
#> $ sampling_event_identifier <chr> "S85875959", "S96568064", "S86299492", "S978…
#> $ protocol_type <chr> "Traveling", "Traveling", "Stationary", "Tra…
#> $ protocol_code <chr> "P22", "P22", "P21", "P22", "P22", "P21", "P…
#> $ project_code <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_A…
#> $ duration_minutes <int> 59, 138, 33, 46, 75, 16, 67, 104, 28, 45, NA…
#> $ effort_distance_km <dbl> 1.954, 3.437, NA, 1.610, 1.030, NA, 1.310, 2…
#> $ effort_area_ha <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ number_observers <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ all_species_reported <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ group_identifier <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ trip_comments <chr> NA, NA, NA, "Clear skies light breeze. Targe…
1 eBird Data
eBird data are collected and organized around the concept of a checklist, representing observations from a single birding event, such as a 1 km walk through a park or 15 minutes observing bird feeders in your backyard. All eBird checklists contains a list of species observed and the location and time of the observations. For a subset of checklists, the observer will also provide counts of the number of individuals seen of each species, specify the amount of effort expended while collecting these data, and confirm that they are submitting a complete checklist of all the birds they were able to identify. The data provided by the checklists such as these is often referred to as semi-structured citizen science data.
Let’s compare two eBird checklists: an incidental observation with missing counts and a complete traveling count. Both checklists can be useful, but only the second checklist provides the type of semi-structured data required for more rigorous applications.
The first part of this lesson will work with the eBird Basic Dataset (EBD). This will allow users to prepare eBird data for relative abundance modeling up to the point of assigning environmental variables as predictors and will be useful for users after the workshop. For the sake of brevity, the second part of this lesson will provide data from the eBird Reference Dataset (ERD), the collection of semi-structured eBird data used by the eBird Status and Trends team for modeling species distributions and relative abundance. A subset of this dataset will be provided in the data package for this workshop.
1.1 raw eBird Data (EBD)
eBird data are released as two tab-separated text files: the eBird Basic Dataset (EBD) containing observation data and the Sampling Event Data (SED) containing checklist data. These files are released monthly and contain all validated bird sightings in the eBird database at the time of release. In the EBD, each row corresponds to the sighting of a single species on a checklist, including the count and any other species-level information (e.g. age, sex, species comments, etc.). In the SED, each row corresponds to a checklist, including the date, time, location, effort (e.g. distance traveled, time spent, etc.), and any additional checklist-level information (e.g. whether this is a complete checklist or not).
1.1.1 Downloading data
In this workshop, we’ll use Shining Bronze-Cuckoo observations from Queensland, Australia as an example. We’ll start by downloading the corresponding eBird observation (EBD) and checklist (SED) data by visiting the eBird Basic Dataset download page and filling out the Custom Download form to request Shining Bronze-Cuckoo observations from Queensland. Make sure you check the box “Include sampling event data”, which will include the SED in the data download in addition to the EBD.
The eBird database contains a massive amount of data! When requesting eBird data to download it’s important to narrow the request to as small a subset of the data as possible. For example, if we request all Shining Bronze-Cuckoo observations globally, the dataset may be too large to work with in R. Instead, we’ve only requested data for a single state in Australia.
Once the data are ready, you will receive an email with a download link. The downloaded data will be in a compressed .zip format, and should be unarchived. The resulting directory will contain a two text files: one for the EBD (e.g. ebd_AU-QLD_shbcuc1_smp_relSep-2023.txt
) containing all the Shining Bronze-Cuckoo observations from Queensland, and one for the SED (e.g. ebd_AU-QLD_shbcuc1_smp_relSep-2023_sampking.txt
) containing all checklists from Queensland, The relSep-2023
component of the file name describes which version of the EBD this dataset came from; in this case it’s the September 2023 release.
If you would prefer to directly download the exact dataset used in this workshop, download the data package for this workshop.
1.1.2 Importing eBird data into R
The previous step left us with two tab separated text files, one for the EBD (i.e. observation data) and one for the SED (i.e. checklist data). Start a new RStudio project and put the downloaded text files in the data/
sub-directory of the project directory.
The auk
R package is specifically designed for working with eBird data. It includes the functions read_ebd()
and read_sampling()
for importing the EBD and SED, respectively, into R. First let’s import the checklist data (SED).
Take some time to explore the variables in the checklist dataset. If you’re unsure about any of the variables, consult the metadata document that came with the data download (eBird_Basic_Dataset_Metadata_v1.14.pdf
).
For some applications, only the checklist data are required. For example, the checklist data can be used to investigate the spatial and temporal distribution of eBird data within a region. This dataset can also be used to explore how much variation there is in the observation effort variables and identify checklists that have low spatial or temporal precision.
Next, let’s load the observation data.
<- "data/ebd_AU-QLD_shbcuc1_smp_relSep-2023.txt"
f_ebd <- read_ebd(f_ebd, unique = FALSE, rollup = FALSE)
observations glimpse(observations)
#> Rows: 23,250
#> Columns: 49
#> $ global_unique_identifier <chr> "URN:CornellLabOfOrnithology:EBIRD:OBS90068…
#> $ last_edited_date <chr> "2020-04-23 01:19:06", "2021-03-24 06:05:52…
#> $ taxonomic_order <dbl> 3281, 3281, 3281, 3284, 3281, 3281, 3281, 3…
#> $ category <chr> "species", "species", "species", "issf", "s…
#> $ taxon_concept_id <chr> "avibase-19268395", "avibase-19268395", "av…
#> $ common_name <chr> "Shining Bronze-Cuckoo", "Shining Bronze-Cu…
#> $ scientific_name <chr> "Chrysococcyx lucidus", "Chrysococcyx lucid…
#> $ subspecies_common_name <chr> NA, NA, NA, "Shining Bronze-Cuckoo (Shining…
#> $ subspecies_scientific_name <chr> NA, NA, NA, "Chrysococcyx lucidus lucidus",…
#> $ exotic_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ observation_count <chr> "1", "1", "1", "1", "1", "3", "2", "1", "1"…
#> $ breeding_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ breeding_category <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ behavior_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ age_sex <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ country <chr> "Australia", "Australia", "Australia", "Aus…
#> $ country_code <chr> "AU", "AU", "AU", "AU", "AU", "AU", "AU", "…
#> $ state <chr> "Queensland", "Queensland", "Queensland", "…
#> $ state_code <chr> "AU-QLD", "AU-QLD", "AU-QLD", "AU-QLD", "AU…
#> $ county <chr> "Mackay", "Moreton Bay", "Ipswich", "Sunshi…
#> $ county_code <chr> "AU-QLD-MAC", "AU-QLD-MRB", "AU-QLD-IPS", "…
#> $ iba_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ bcr_code <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ usfws_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ atlas_block <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ locality <chr> "Finch Hatton Gorge", "Lake Samsonvale--Gol…
#> $ locality_id <chr> "L921597", "L692632", "L4784405", "L1827741…
#> $ locality_type <chr> "H", "H", "P", "H", "H", "H", "P", "H", "H"…
#> $ latitude <dbl> -21.1, -27.3, -27.5, -26.6, -27.7, -27.3, -…
#> $ longitude <dbl> 149, 153, 153, 153, 153, 153, 153, 153, 153…
#> $ observation_date <date> 2019-08-30, 2019-10-23, 2019-09-21, 2019-0…
#> $ time_observations_started <chr> "05:51:00", "05:09:00", "06:30:00", "09:41:…
#> $ observer_id <chr> "obsr1490162", "obsr277790", "obsr186524", …
#> $ sampling_event_identifier <chr> "S67655665", "S61014875", "S59994086", "S53…
#> $ protocol_type <chr> "Stationary", "Traveling", "Traveling", "Tr…
#> $ protocol_code <chr> "P21", "P22", "P22", "P22", "P22", "P22", "…
#> $ project_code <chr> "EBIRD_AU", "EBIRD_AU", "EBIRD", "EBIRD", "…
#> $ duration_minutes <int> 540, 361, 95, 61, 240, 115, 70, 85, 60, 141…
#> $ effort_distance_km <dbl> NA, 5.240, 0.701, 2.000, 4.500, 1.000, 1.27…
#> $ effort_area_ha <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ number_observers <int> 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 3, 1, 1, 4, 1…
#> $ all_species_reported <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
#> $ group_identifier <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "G43416…
#> $ has_media <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
#> $ approved <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
#> $ reviewed <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
#> $ reason <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ trip_comments <chr> "Transferred from Daily Checklist.", NA, NA…
#> $ species_comments <chr> "Heard.", NA, NA, NA, NA, NA, "1 immature."…
Take some time to explore the variables in the observation dataset. Notice that the EBD duplicates many of the checklist-level variables from the SED.
When we read the data into R, we used unique = FALSE
and rollup = FALSE
. By default the read functions in auk
perform two important pre-processing steps: combining duplicate shared checklists and taxonomic rollup. We intentionally turned off this functionality for the purposes of demonstration.
1.1.2.2 Taxonomic rollup
eBird observations can be made at levels below species (e.g. subspecies) or above species (e.g. a bird that was identified as a duck, but the species could not be determined); however, for most uses we’ll want observations at the species level. This is especially true if we want to produce detection/non-detection data from complete checklists because “complete” only applies at the species level.
In the example dataset used for this workshop, these taxonomic issues don’t apply. We have requested Shining Bronze-Cuckoo observations, so we haven’t received any observations for taxa above species. However, there are records of Shining Bronze-Cuckoo (Shining) and Shining Bronze-Cuckoo (Golden) in Queensland during this time period (e.g., this checklist with photos. Accordingly, to use all records, we need to rollup these two subspecies into one set of species-level information. This can happen in many situations. For example, this checklist has 10 Yellow-rumped Warblers, 5 each of two Yellow-rumped Warbler subspecies, and one hybrid between the two subspecies. auk_rollup()
will combine all four of these observations into a single Yellow-rumped Warbler observation.
The function auk_rollup()
drops all observations not identifiable to a species and rolls up all observations reported below species to the species level.
<- auk_rollup(observations)
observations_rollup
# only checklist example with both subspecies and one species-level entry
%>%
observations filter(sampling_event_identifier == "S102653162") %>%
select(sampling_event_identifier, common_name, subspecies_common_name,
observation_count)#> # A tibble: 3 × 4
#> sampling_event_identifier common_name subspecies_common_name observation_count
#> <chr> <chr> <chr> <chr>
#> 1 S102653162 Shining Br… Shining Bronze-Cuckoo… 1
#> 2 S102653162 Shining Br… <NA> 2
#> 3 S102653162 Shining Br… Shining Bronze-Cuckoo… 1
%>%
observations_rollup filter(sampling_event_identifier == "S102653162") %>%
select(sampling_event_identifier, common_name,
observation_count)#> # A tibble: 1 × 3
#> sampling_event_identifier common_name observation_count
#> <chr> <chr> <chr>
#> 1 S102653162 Shining Bronze-Cuckoo 4
If multiple taxa on a single checklist roll up to the same species, auk_rollup()
attempts to combine them intelligently. If each observation has a count, those counts are added together, but if any of the observations is missing a count (i.e. the count is “X”) the combined observation is also assigned an “X”. In the example checklist from the previous tip, with four taxa all rolling up to Yellow-rumped Warbler, auk_rollup()
will add the four counts together to get 21 Yellow-rumped Warblers (10 + 5 + 5 + 1).
1.1.3 Generating detection/non-detection data
Complete eBird checklists are extremely valuable because, for all species that weren’t reported, we can infer counts of 0. This allows us to convert eBird from presence only data to detection/non-detection data, which allows for much more robust analyses. Note that we don’t use the term presence/absence data here because a non-detection doesn’t necessarily imply the species was absent, only that the observer didn’t detect and identify it.
We refer to the process of producing detection/non-detection data as “zero-filling” the eBird data because we’re filling in the missing zeros. We’ll read the eBird data into R again, filter to only complete checklists, then use the function auk_zerofill()
to generate detection/non-detection data. Note that shared checklists are combined and taxonomic rollup is performed by default when using the read_*()
functions from auk
.
# import checklist data
<- read_sampling(f_sed) %>%
checklists # subset to complete checklists
filter(all_species_reported)
# import observation data
<- read_ebd(f_ebd) %>%
observations # subset to complete checklists
filter(all_species_reported)
# zero-fill to produce detection/non-detection data
<- auk_zerofill(observations, checklists, collapse = TRUE)
zf glimpse(zf)
#> Rows: 509,294
#> Columns: 38
#> $ checklist_id <chr> "S85875959", "S96568064", "S86299492", "S978…
#> $ last_edited_date <chr> "2021-04-19 02:40:19.157994", "2021-10-22 20…
#> $ country <chr> "Australia", "Australia", "Australia", "Aust…
#> $ country_code <chr> "AU", "AU", "AU", "AU", "AU", "AU", "AU", "A…
#> $ state <chr> "Queensland", "Queensland", "Queensland", "Q…
#> $ state_code <chr> "AU-QLD", "AU-QLD", "AU-QLD", "AU-QLD", "AU-…
#> $ county <chr> "Brisbane", "Brisbane", "Brisbane", "Brisban…
#> $ county_code <chr> "AU-QLD-BRI", "AU-QLD-BRI", "AU-QLD-BRI", "A…
#> $ iba_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ bcr_code <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ usfws_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ atlas_block <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ locality <chr> "Dowse Lagoon (Sandgate)", "Dowse Lagoon (Sa…
#> $ locality_id <chr> "L1896576", "L1896576", "L1896576", "L189657…
#> $ locality_type <chr> "H", "H", "H", "H", "H", "H", "H", "H", "H",…
#> $ latitude <dbl> -27.3, -27.3, -27.3, -27.3, -27.3, -27.3, -2…
#> $ longitude <dbl> 153, 153, 153, 153, 153, 153, 153, 153, 153,…
#> $ observation_date <date> 2021-04-19, 2021-10-23, 2021-04-25, 2021-11…
#> $ time_observations_started <chr> "15:40:00", "06:04:00", "12:25:00", "13:53:0…
#> $ observer_id <chr> "obs427309", "obs427309", "obs535737", "obs9…
#> $ sampling_event_identifier <chr> "S85875959", "S96568064", "S86299492", "S978…
#> $ protocol_type <chr> "Traveling", "Traveling", "Stationary", "Tra…
#> $ protocol_code <chr> "P22", "P22", "P21", "P22", "P22", "P21", "P…
#> $ project_code <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_A…
#> $ duration_minutes <int> 59, 138, 33, 46, 75, 16, 67, 104, 28, 45, 80…
#> $ effort_distance_km <dbl> 1.954, 3.437, NA, 1.610, 1.030, NA, 1.310, 2…
#> $ effort_area_ha <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ number_observers <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ all_species_reported <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ group_identifier <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ trip_comments <chr> NA, NA, NA, "Clear skies light breeze. Targe…
#> $ scientific_name <chr> "Chrysococcyx lucidus", "Chrysococcyx lucidu…
#> $ breeding_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ breeding_category <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ behavior_code <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ age_sex <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ observation_count <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0",…
#> $ species_observed <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
The observation_count
variable has true counts as well as “X”s, which indicate that the species was detected but the number of individuals was not counted. auk_zerofill()
adds a new binary column, species_observed
, indicating whether or not the species was detected.
select(zf, observation_count, species_observed) %>%
head(10)
#> # A tibble: 10 × 2
#> observation_count species_observed
#> <chr> <lgl>
#> 1 0 FALSE
#> 2 0 FALSE
#> 3 0 FALSE
#> 4 0 FALSE
#> 5 0 FALSE
#> 6 0 FALSE
#> # ℹ 4 more rows
Let’s convert the “X”s to NAs and transform observation_count
to an integer variable.
$observation_count <- if_else(zf$observation_count == "X",
zfNA_character_, zf$observation_count) %>%
as.integer()
select(zf, observation_count, species_observed) %>%
head(10)
#> # A tibble: 10 × 2
#> observation_count species_observed
#> <int> <lgl>
#> 1 0 FALSE
#> 2 0 FALSE
#> 3 0 FALSE
#> 4 0 FALSE
#> 5 0 FALSE
#> 6 0 FALSE
#> # ℹ 4 more rows
1.1.4 Filtering data
Now that you have a detection/non-detection dataset, it’s likely that you want to do something with it. For example, you may want to make a map, identify priority areas for a species, or train a species distribution model. Regardless of the specific application, it’s likely that some amount of filtering of the data is required first. Some of the ways you may want to filter eBird data include:
- Temporal filtering: filter the data to a specific range of years or to a specific time of year.
- Spatial filtering: filter the data to focus on a specific region, e.g. a protected area.
- Increasing precision: some eBird checklists are quite long in distance or duration leading to spatial or temporal imprecision. By removing longer checklists we can increase the spatial precision of the dataset.
- Reducing variation in effort: unlike structured scientific surveys, data can be submitted to eBird using a variety of protocols and there is significant variation in effort between checklists in the eBird dataset. Variation in protocol and effort leads to variation in detectability (more effort generally leads to higher detectability). We can choose to impose more structure on the eBird dataset by filtering to reduce variation in protocol and effort.
The specific filtering you apply will depend on how you intend to use the eBird data. However, for the sake of this example, let’s filter the eBird data to only traveling and stationary checklists from 2013-2022 that are less than 6 hours in duration and 10 km in length.
<- zf %>%
zf_filtered filter(year(observation_date) >= 2013, year(observation_date) <= 2022,
%in% c("Traveling", "Stationary"),
protocol_type < 6 * 60,
duration_minutes < 10 | protocol_type == "Stationary")
effort_distance_km nrow(zf)
#> [1] 509294
nrow(zf_filtered)
#> [1] 342881
We reduced the number of checklists by 166,413, but the checklists remaining are of higher quality.
1.1.5 Environmental Covariate Assignment
At this point, if you were planning to run a species distribution model with this data, you’d want some environmental variables as predictors. However, adding environmental variables can be onerous, computationally expensive, and varies based on use case. We provide guidance in our “Best Practices for Using eBird Data” document on extracting environmental variables to use with eBird data. For continuing this work on your own, please use that reference. For the remainder of this workshop, we’re going to skip this process and use the eBird Reference Dataset (ERD) that already contains the environmental variables used in for eBird Status and Trends modeling.
1.2 eBird Reference Dataset (ERD)
The eBird Reference Dataset (ERD) is a subset of the full eBird database created annually for eBird Status and Trends modeling. Only semi-structured (complete checklists with effort information) traveling and stationary counts from the last 15 years are included in the ERD and we assign a set of environmental variables assigned to checklist. In the following sections we’ll provide an introduction to the ERD, describe the associated prediction grid used to make predictions across space, and highlight some of the challenges associated with using eBird data for analysis.
The ERD is distributed in two parts: observation data and checklist data. In the observation dataset, each row corresponds to the sighting of a single species on a checklist, including the count and any other species-level information. In the checklist dataset, each row corresponds to a checklist, including the date, time, location, effort (e.g. distance traveled, time spent, etc.), and any additional checklist-level information.
For this workshop, and extract of the ERD is provided in the workshop data package. The observations and checklsits datasets are provided in parquet format, an open source standard for efficient storage and retrieval of tabular data. If you haven’t already done so, following the instructions in the Introduction to create an RStudio project and download the workshop data package. The parquet files should be located at:
data/ebird_observations_AU-QLD.parquet
data/ebird_checklists_AU-QLD.parquet
Let’s start by reading these two datasets into R using the arrow
package and exploring them. We’ll start with the checklist dataset.
library(arrow)
library(auk)
library(dplyr)
library(ebirdst)
library(ggplot2)
library(sf)
library(terra)
<- read_parquet("data/ebird_checklists_AU-QLD_2022.parquet")
checklists glimpse(checklists)
#> Rows: 357,213
#> Columns: 151
#> $ checklist_id <int> 10864617, 11307779, 6971840, 102635…
#> $ observer_id <int> 267451, 113983, 162908, 113983, 166…
#> $ loc_id <chr> "L1560273", "L1650808", "H2561832",…
#> $ longitude <dbl> 153, 150, 139, 153, 145, 145, 152, …
#> $ latitude <dbl> -26.7, -28.5, -19.0, -27.2, -16.7, …
#> $ year <dbl> 2012, 2012, 2008, 2012, 2011, 2011,…
#> $ day_of_year <dbl> 128, 212, 180, 85, 284, 283, 34, 31…
#> $ hours_of_day <dbl> 11.65, 10.00, 6.75, 11.42, 15.42, 1…
#> $ solar_noon_diff <dbl> -0.0376, -2.0229, -2.0756, -0.3446,…
#> $ is_stationary <lgl> TRUE, FALSE, FALSE, TRUE, FALSE, FA…
#> $ effort_hrs <dbl> 0.167, 0.167, 8.000, 0.333, 0.583, …
#> $ effort_distance_km <dbl> 0.000, 2.000, 5.000, 0.000, 1.500, …
#> $ effort_speed_kmph <dbl> 0.000, 11.976, 0.625, 0.000, 2.573,…
#> $ num_observers <int> 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1,…
#> $ moon_fraction <dbl> 0.9874, 0.8842, 0.3158, 0.0594, 0.9…
#> $ moon_altitude <dbl> -0.7028, -0.6658, 0.6386, 0.6711, -…
#> $ cds_u10 <dbl> -0.1342, 1.9947, -5.1339, -3.9891, …
#> $ cds_v10 <dbl> -0.7690, 2.7611, 0.5347, 2.1257, -0…
#> $ cds_d2m <dbl> 8.580, 4.629, 8.287, 15.854, 11.582…
#> $ cds_t2m <dbl> 23.02, 8.29, 24.84, 23.38, 34.93, 3…
#> $ cds_hcc <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ cds_i10fg <dbl> 4.03, 8.04, 9.83, 10.01, 9.44, 8.83…
#> $ cds_mcc <dbl> 0.000000, 0.000000, 0.000000, 0.695…
#> $ cds_lcc <dbl> 0.0000, 0.0000, 0.0000, 0.7405, 0.0…
#> $ cds_sf <dbl> 8.67e-19, 0.00e+00, 0.00e+00, 0.00e…
#> $ cds_rf <dbl> 6.07e-18, 0.00e+00, 0.00e+00, 2.95e…
#> $ cds_slc <dbl> -133.21, -11.49, -12.35, -25.82, -1…
#> $ cds_msl <dbl> 101650, 102426, 101786, 101845, 100…
#> $ eastness_1km_median <dbl> 3.61e-02, 8.04e-08, 4.35e-02, 1.12e…
#> $ eastness_1km_sd <dbl> 0.23578, 0.03816, 0.20141, 0.25066,…
#> $ eastness_90m_median <dbl> 7.59e-02, -2.76e-04, 1.51e-03, 2.74…
#> $ eastness_90m_sd <dbl> 0.16464, 0.00216, 0.02724, 0.06283,…
#> $ northness_1km_median <dbl> -7.84e-03, 7.70e-02, -6.40e-02, 5.0…
#> $ northness_1km_sd <dbl> 0.0950, 0.0705, 0.2609, 0.1821, 0.3…
#> $ northness_90m_median <dbl> -0.007993, 0.000385, -0.001088, -0.…
#> $ northness_90m_sd <dbl> 0.13644, 0.00182, 0.01958, 0.05712,…
#> $ elevation_250m_median <dbl> 317.450, 218.982, 143.838, 64.140, …
#> $ elevation_250m_sd <dbl> 47.518, 0.878, 6.444, 18.788, 10.73…
#> $ elevation_30m_median <dbl> 314.7, 217.5, 140.8, 54.8, 394.3, 3…
#> $ elevation_30m_sd <dbl> 55.56, 5.03, 9.73, 24.47, 13.24, 13…
#> $ island <dbl> 30005, 30005, 30005, 30005, 30005, …
#> $ astwbd_c1_ed <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ astwbd_c1_pland <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ astwbd_c2_ed <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ astwbd_c2_pland <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ astwbd_c3_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ astwbd_c3_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ gsw_c2_pland <dbl> 0.00000, 0.95312, 0.31391, 0.02483,…
#> $ gsw_c2_ed <dbl> 0.000, 10.504, 3.197, 0.550, 0.184,…
#> $ gsw_c3_pland <dbl> 0.00000, 0.19215, 0.12675, 0.00993,…
#> $ gsw_c3_ed <dbl> 0.00, 3.01, 1.58, 0.22, 0.00, 0.00,…
#> $ intertidal_c1_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ intertidal_c1_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ntl_mean <dbl> 0.0000, 7.1068, 0.0000, 0.2118, 0.2…
#> $ ntl_sd <dbl> 0.0000, 5.4641, 0.0000, 0.3992, 0.4…
#> $ road_density_c1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c2 <dbl> 0, 681, 0, 545, 422, 422, 0, 0, 0, …
#> $ road_density_c3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c4 <dbl> 410, 321, 631, 0, 80, 80, 287, 287,…
#> $ road_density_c5 <dbl> 0, 0, 0, 766, 0, 0, 0, 0, 0, 0, 0, …
#> $ mcd12q1_lccs1_c1_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c1_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c2_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c2_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c11_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c11_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c12_ed <dbl> 9.50, 0.00, 0.00, 1.69, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c12_pland <dbl> 53.60, 0.00, 0.00, 1.00, 0.00, 0.00…
#> $ mcd12q1_lccs1_c13_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c13_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c14_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c14_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c15_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c15_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c16_ed <dbl> 0.00, 0.00, 0.00, 0.00, 1.76, 1.76,…
#> $ mcd12q1_lccs1_c16_pland <dbl> 0.000, 0.000, 0.000, 0.000, 3.001, …
#> $ mcd12q1_lccs1_c21_ed <dbl> 11.66, 0.00, 0.00, 5.08, 0.00, 0.00…
#> $ mcd12q1_lccs1_c21_pland <dbl> 43.72, 0.00, 0.00, 9.40, 0.00, 0.00…
#> $ mcd12q1_lccs1_c22_ed <dbl> 2.158, 6.516, 0.899, 9.734, 4.405, …
#> $ mcd12q1_lccs1_c22_pland <dbl> 2.678, 41.444, 0.776, 76.168, 90.99…
#> $ mcd12q1_lccs1_c31_ed <dbl> 0.00, 6.52, 7.64, 4.66, 2.64, 2.64,…
#> $ mcd12q1_lccs1_c31_pland <dbl> 0.00, 58.56, 32.62, 13.43, 6.00, 6.…
#> $ mcd12q1_lccs1_c32_ed <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c32_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c41_ed <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c41_pland <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 24.9,…
#> $ mcd12q1_lccs1_c42_ed <dbl> 0.0, 0.0, 4.5, 0.0, 0.0, 0.0, 14.7,…
#> $ mcd12q1_lccs1_c42_pland <dbl> 0.00, 0.00, 9.01, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c43_ed <dbl> 0.0, 0.0, 10.3, 0.0, 0.0, 0.0, 0.0,…
#> $ mcd12q1_lccs1_c43_pland <dbl> 0.0, 0.0, 57.6, 0.0, 0.0, 0.0, 0.0,…
#> $ mcd12q1_lccs1_c255_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c255_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c25_ed <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ mcd12q1_lccs2_c25_pland <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ mcd12q1_lccs2_c35_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c35_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c36_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c36_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c27_ed <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs3_c27_pland <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs3_c50_ed <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs3_c50_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c51_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c51_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ has_shoreline <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, …
#> $ shoreline_waveheight_mean <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ shoreline_waveheight_sd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_tidal_range_mean <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_tidal_range_sd <dbl> 0.00000, 0.00000, 0.00000, 0.00000,…
#> $ shoreline_chlorophyll_mean <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_chlorophyll_sd <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
#> $ shoreline_turbidity_mean <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, …
#> $ shoreline_turbidity_sd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_sinuosity_mean <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_sinuosity_sd <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_slope_mean <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_slope_sd <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
#> $ shoreline_outflow_density_mean <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e…
#> $ shoreline_outflow_density_sd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_n <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c1_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c2_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c3_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c4_density <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ shoreline_emu_physical_n <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c1_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c2_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c3_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c4_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c5_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c6_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c7_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c8_density <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, …
#> $ shoreline_emu_physical_c9_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c10_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c11_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c12_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c13_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c14_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c15_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c16_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c17_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c18_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c19_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c20_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c21_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c22_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c23_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ has_evi <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
#> $ evi_median <dbl> 56375768, 26577095, 16683584, 49761…
#> $ evi_sd <dbl> 4978114, 6436081, 5444980, 6107690,…
There are a huge number of columns in this data frame. The first set of variables provide standard information about the checklist: where and when did the observation occur, what type of search was conducted, and how much search effort was expended. Two important differences exist between these variables and what you will see if you look at the raw eBird dataset: when a GPS track is available we replace the checklist or hotspot location (latitude/longitude) with the centroid of the track and the time of the checklist is expressed as the difference between the checklist midpoint and solar noon, a more ecologically meaningful quantity.
All the remaining variables are not collected in eBird, they’re calculated and added by the Status and Trends team based on external data sets. First, those variables beginning with cds_
, provides information about the weather at the time of the observation, which can impact detectibility. This is followed by a large suite of environmental variables summarized over a 3km diameter circular neighborhood around the checklist location, including variables describing: elevation and topography, land and water cover, roads, and night time lights (a proxy for urban development). Most variables are summarized as two quantities expressing composition (what habitat is available) and configuration (how that habitat is arranged spatially). For continuous variables, such as elevation, we use the median and standard deviation. For categorical variables, such as land cover class, we use percent landcover (pland
) and edge density (ed
).
The land and water cover variables can be challenging to interpret based on their names alone (e.g. mcd12q1_lccs1_c12_pland
); however, these names can be looked up in the ebirdst_predictors
data frame from the ebirdst
package. For example, let’s look up what mcd12q1_lccs1_c12_pland
corresponds to.
filter(ebirdst_predictors, predictor == "mcd12q1_lccs1_c12_pland") %>%
select(predictor, label)
#> # A tibble: 1 × 2
#> predictor label
#> <chr> <chr>
#> 1 mcd12q1_lccs1_c12_pland Evergreen Broadleaf Forests (% cover)
Take some time to explore the variables in the checklist dataset. Try looking up a variable in ebirdst_predictors
. Ask for help if you need clarification on the meaning of any of the variables.
Now let’s look at the observation dataset.
<- read_parquet("data/ebird_observations_AU-QLD_2022.parquet")
observations glimpse(observations)
#> Rows: 7,292,576
#> Columns: 5
#> $ checklist_id <dbl> 10928575, 10928575, 10928575, 10928575, 10928575, 1092857…
#> $ species_code <chr> "rensti", "faecur", "grsplo", "strher", "cursan", "batgod…
#> $ valid <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
#> $ obs_detected <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ obs_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
This is a much simpler dataset with only five columns:
checklist_id
: unique identifier for the checklist that this observation belongs to. Allows joining the observation data to the checklist data.species_code
: unique identifier for the species that this observation was made for.valid
: a binary variable indicating is the observation was determined to be valid (TRUE
) or invalid (FALSE
) by the eBird reviewers.obs_detected
: a binary variable indicating if the species was detected (1) or not detected (0). Since this is a dataset of observations only,obs_detected
is always 1; however, having this variable will become useful when we join to the checklist dataset in the next section.obs_count
: count of the number of individuals or anNA
if no count was provided (if an “X” was entered for count on the eBird checklist).
To look up the common name or scientific name of a species try appending the species code to the URL https://ebird.org/species/
. For example, visit https://ebird.org/species/maslap1 to look up the species code maslap1
. This information is also available in the ebird_taxonomy
data frame in the auk
package.
filter(ebird_taxonomy, species_code == "maslap1") %>%
select(species_code, common_name, scientific_name, family)
#> species_code common_name scientific_name family
#> 1 maslap1 Masked Lapwing Vanellus miles Charadriidae
1.2.1 Zero-filling eBird data
Complete eBird checklists are extremely valuable because, for all species that weren’t reported, we can infer counts of 0. This allows us to convert eBird from presence only data to detection/non-detection data, which allows for much more robust analyses. Note that we don’t use the term presence/absence data here because a non-detection doesn’t necessarily imply the species was absent, only that observer wasn’t able to detect and identify it.
We refer to the process of producing detection/non-detection data as “zero-filling” the eBird data because we’re filling in the missing zeros. Let’s consider observations of Shining Bronze-Cuckoo (species code shbcuc1
).
<- observations %>%
bird_detections filter(species_code == "shbcuc1") %>%
select(checklist_id, valid, obs_detected, obs_count)
Next join this set of detections to the complete set of checklists, including detections and non-detections.
<- left_join(checklists, bird_detections, by = "checklist_id") %>%
bird_all select(checklist_id, latitude, longitude, year, day_of_year,
valid, obs_detected, obs_count)head(bird_all)
#> # A tibble: 6 × 8
#> checklist_id latitude longitude year day_of_year valid obs_detected obs_count
#> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <int> <int>
#> 1 10864617 -26.7 153. 2012 128 NA NA NA
#> 2 11307779 -28.5 150. 2012 212 NA NA NA
#> 3 6971840 -19.0 139. 2008 180 NA NA NA
#> 4 10263519 -27.2 153. 2012 85 TRUE 1 1
#> 5 9020044 -16.7 145. 2011 284 NA NA NA
#> 6 9009843 -16.7 145. 2011 283 NA NA NA
Finally, for rows where the bird was not detected we can replace the missing counts with 0. At this time, we recommend removing any checklists with valid == 0
because there is uncertainty about whether or not the species was detected. Let’s also filter to a subset of months, to keep the data smaller and relevant to a particular season.
<- bird_all %>%
bird_zf filter(is.na(valid) | valid == 1) %>%
mutate(
# checklist not in the observations dataset are non-detections
obs_detected = coalesce(obs_detected, 0L),
# non-detections correspond to a count of 0
obs_count = if_else(obs_detected == 1, obs_count, 0)
%>%
) # approximately november through january
filter(day_of_year >= 305 | day_of_year < 32)
We can now, for example, make a map of observations. We’ll use spatial data that was prepared in advance and provided in the data package.
# load and project gis data
<- "+proj=laea +lon_0=146.95 +lat_0=-19.15 +datum=WGS84 +units=m +no_defs"
map_proj <- read_sf("data/gis-data.gpkg", "ne_land") %>%
ne_land st_transform(crs = map_proj) %>%
st_geometry()
<- read_sf("data/gis-data.gpkg", "ne_country_lines") %>%
ne_country_lines st_transform(crs = map_proj) %>%
st_geometry()
<- read_sf("data/gis-data.gpkg", "ne_state_lines") %>%
ne_state_lines st_transform(crs = map_proj) %>%
st_geometry()
<- read_sf("data/gis-data.gpkg", "regions") %>%
target_state filter(state_code == "AU-QLD") %>%
st_transform(crs = map_proj) %>%
st_geometry()
# prepare ebird data for mapping
<- bird_zf %>%
bird_sf # convert to spatial points
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
st_transform(crs = map_proj)
# map
par(mar = c(0.25, 0.25, 0.25, 0.25))
# set up plot area
plot(st_geometry(target_state), col = NA, border = NA)
# contextual gis data
plot(ne_land, col = "#cfcfcf", border = "#888888", lwd = 0.5, add = TRUE)
plot(target_state, col = "#e6e6e6", border = NA, add = TRUE)
plot(ne_state_lines, col = "#ffffff", lwd = 0.75, add = TRUE)
plot(ne_country_lines, col = "#ffffff", lwd = 1.5, add = TRUE)
# ebird observations
# all
plot(bird_sf,
pch = 19, cex = 0.1, col = scales::alpha("#555555", 4),
add = TRUE)
# detection
plot(filter(bird_sf, obs_detected == 1),
pch = 19, cex = 0.3, col = scales::alpha("#4daf4a", 1),
add = TRUE)
# legend
legend("bottomright", bty = "n",
col = c("#555555", "#4daf4a"),
legend = c("eBird checklists", "Bird sightings"),
pch = 19)
box()
par(new = TRUE, mar = c(0, 0, 3, 0))
title("Shining Bronze-Cuckoo eBird Observations\nNovember-January 2007-2022")
1.2.2 Prediction grid
The ultimate goal of modeling the occurrence or abundance of a species is frequently to produce a map showing the distribution of that species in space. To do so, we need to know the values of our predictor variables over the region that we intend to make predictions. To make this possible, the ERD is distributed with a prediction grid: a regular grid of points covering the entire globe spaced 3km apart for which all the environmental variables have been calculated for the year 2022.
The data package for this course contains an example subset of the prediction grid. The file data/prediction-grid_year_AU-QLD.parquet
contains the environmental variables for each point on the grid and the file data/prediction-grid_template.tif
is a 3km by 3km raster template where each each cell center is a point on the prediction grid. Let’s start by examining the environmental variables.
<- read_parquet("data/prediction-grid_year_AU-QLD_2022.parquet")
prediction_grid glimpse(prediction_grid)
#> Rows: 230,422
#> Columns: 123
#> $ srd_id <int> 50299954, 50299955, 50299956, 50299…
#> $ longitude <dbl> 142, 142, 142, 142, 142, 142, 142, …
#> $ latitude <dbl> -9.19, -9.19, -9.19, -9.19, -9.19, …
#> $ eastness_1km_median <dbl> -8.11e-03, -9.89e-03, 4.66e-02, 3.1…
#> $ eastness_1km_sd <dbl> 0.0875, 0.1937, 0.1585, 0.1373, 0.1…
#> $ eastness_90m_median <dbl> 6.06e-04, 5.23e-05, 1.53e-04, -2.61…
#> $ eastness_90m_sd <dbl> 0.00474, 0.00407, 0.00657, 0.01536,…
#> $ northness_1km_median <dbl> -6.06e-02, 1.53e-01, 7.75e-02, -5.6…
#> $ northness_1km_sd <dbl> 0.3412, 0.1793, 0.2380, 0.1677, 0.3…
#> $ northness_90m_median <dbl> -0.001320, -0.000886, 0.000655, 0.0…
#> $ northness_90m_sd <dbl> 0.01242, 0.01134, 0.01204, 0.01721,…
#> $ elevation_250m_median <dbl> 14.513, 13.969, 14.086, 0.174, 13.9…
#> $ elevation_250m_sd <dbl> 2.67, 3.22, 3.96, 7.31, 5.04, 5.78,…
#> $ elevation_30m_median <dbl> 14.1, 13.3, 12.2, 13.1, 14.4, 17.5,…
#> $ elevation_30m_sd <dbl> 5.06, 5.09, 4.50, 4.40, 5.30, 5.22,…
#> $ island <dbl> 392, 392, 392, 392, 1067, 1067, 106…
#> $ astwbd_c1_ed <dbl> 2.26, 2.26, 5.95, 5.95, 9.26, 14.02…
#> $ astwbd_c1_pland <dbl> 2.20, 1.21, 11.17, 81.91, 42.20, 37…
#> $ astwbd_c2_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ astwbd_c2_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ astwbd_c3_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ astwbd_c3_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ gsw_c2_pland <dbl> 0.823, 6.787, 8.823, 7.259, 1.704, …
#> $ gsw_c2_ed <dbl> 5.27, 18.01, 16.18, 11.68, 14.06, 2…
#> $ gsw_c3_pland <dbl> 1.8037, 0.5269, 11.5355, 81.5255, 3…
#> $ gsw_c3_ed <dbl> 2.228, 1.398, 5.633, 5.833, 9.295, …
#> $ intertidal_c1_ed <dbl> 2.24, 2.16, 6.65, 16.81, 13.13, 7.9…
#> $ intertidal_c1_pland <dbl> 2.37, 1.06, 12.20, 19.01, 10.81, 18…
#> $ ntl_mean <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ ntl_sd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ road_density_c5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c1_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c1_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c2_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c2_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c11_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c11_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c12_ed <dbl> 7.338, 10.131, 8.094, 3.083, 8.544,…
#> $ mcd12q1_lccs1_c12_pland <dbl> 17.695, 36.156, 44.618, 4.919, 26.6…
#> $ mcd12q1_lccs1_c13_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c13_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c14_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c14_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c15_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c15_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c16_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c16_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c21_ed <dbl> 12.52, 13.21, 16.19, 7.49, 13.94, 1…
#> $ mcd12q1_lccs1_c21_pland <dbl> 68.50, 28.15, 40.24, 19.06, 33.23, …
#> $ mcd12q1_lccs1_c22_ed <dbl> 2.16, 7.05, 3.15, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c22_pland <dbl> 3.10, 18.50, 6.22, 0.00, 0.00, 0.00…
#> $ mcd12q1_lccs1_c31_ed <dbl> 4.75, 6.61, 2.70, 0.00, 0.00, 0.00,…
#> $ mcd12q1_lccs1_c31_pland <dbl> 10.64, 17.13, 1.36, 0.00, 0.00, 0.0…
#> $ mcd12q1_lccs1_c32_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c32_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c41_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c41_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c42_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c42_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c43_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c43_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c255_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs1_c255_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c25_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c25_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c35_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c35_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c36_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs2_c36_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c27_ed <dbl> 3.89, 4.85, 7.19, 4.40, 7.19, 8.37,…
#> $ mcd12q1_lccs3_c27_pland <dbl> 45.60, 52.36, 90.17, 23.97, 59.84, …
#> $ mcd12q1_lccs3_c50_ed <dbl> 1.295, 6.607, 0.899, 0.000, 0.000, …
#> $ mcd12q1_lccs3_c50_pland <dbl> 2.573, 13.835, 0.626, 0.000, 0.000,…
#> $ mcd12q1_lccs3_c51_ed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ mcd12q1_lccs3_c51_pland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ has_shoreline <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
#> $ shoreline_waveheight_mean <dbl> 0.380, 0.380, 0.380, 0.380, 0.380, …
#> $ shoreline_waveheight_sd <dbl> 0.00000, 0.00000, 0.00000, 0.00000,…
#> $ shoreline_tidal_range_mean <dbl> 4.24, 4.13, 4.09, 4.06, 4.02, 3.93,…
#> $ shoreline_tidal_range_sd <dbl> 0.0000, 0.0462, 0.0191, 0.0000, 0.0…
#> $ shoreline_chlorophyll_mean <dbl> 3.70, 3.76, 3.76, 3.84, 3.83, 3.79,…
#> $ shoreline_chlorophyll_sd <dbl> 0.037448, 0.000638, 0.045479, 0.044…
#> $ shoreline_turbidity_mean <dbl> 0.1450, 0.1450, 0.1884, 0.2029, 0.0…
#> $ shoreline_turbidity_sd <dbl> 0.000000, 0.000000, 0.028934, 0.000…
#> $ shoreline_sinuosity_mean <dbl> 1.04, 1.04, 1.10, 1.10, 1.32, 1.34,…
#> $ shoreline_sinuosity_sd <dbl> 0.000, 0.000, 0.000, 0.000, 0.405, …
#> $ shoreline_slope_mean <dbl> 120.0, 13.3, 85.0, 56.0, 89.3, 74.3…
#> $ shoreline_slope_sd <dbl> 50.22, 5.74, 48.43, 10.09, 61.16, 4…
#> $ shoreline_outflow_density_mean <dbl> 1.06e-04, 4.18e-05, 4.18e-05, 4.18e…
#> $ shoreline_outflow_density_sd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_n <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,…
#> $ shoreline_erodibility_c1_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c2_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c3_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_erodibility_c4_density <dbl> 181, 149, 367, 366, 677, 1088, 1240…
#> $ shoreline_emu_physical_n <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,…
#> $ shoreline_emu_physical_c1_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c2_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c3_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c4_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c5_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c6_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c7_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c8_density <dbl> 181, 149, 367, 366, 677, 1088, 1240…
#> $ shoreline_emu_physical_c9_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c10_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c11_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c12_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c13_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c14_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c15_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c16_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c17_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c18_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c19_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c20_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c21_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c22_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shoreline_emu_physical_c23_density <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
These variables should be mostly familiar from the ERD, except for srd_id
which is a unique identifier for each point on the grid. Next let’s load the raster template using the terra
package.
<- rast("data/prediction-grid_template.tif")
raster_template
raster_template#> class : SpatRaster
#> dimensions : 5630, 13511, 1 (nrow, ncol, nlyr)
#> resolution : 2963, 2963 (x, y)
#> extent : -2e+07, 2e+07, -6673060, 1e+07 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=sinu +lon_0=0 +x_0=0 +y_0=0 +R=6371007.181 +units=m +no_defs
#> source : prediction-grid_template.tif
#> name : mask
#> min value : 1
#> max value : 1
This is a global 2.96km by 2.96km square grid in an equal area projection. We can use the terra
function rasterize
to insert values from the prediction grid into the template for mapping. For example, let’s make a raster dataset of percent cover of evergreen broadleaf forest (mcd12q1_lccs1_c12_pland
).
<- prediction_grid %>%
forest_cover # convert to spatial object using sf
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
# transform to the coordinate reference system of the raster
st_transform(crs = crs(raster_template)) %>%
# rasterize the points using the raster template
rasterize(raster_template, field = "mcd12q1_lccs1_c12_pland")
Now we can make a simple map of evergreen broadleaf forest. Note that the raster template is global, but we can use trim()
to remove all areas that have missing values.
plot(trim(forest_cover), axes = FALSE)
The map looks distorted because the prediction grid uses a sinusoidal projection, which works well for analysis but not for mapping. In the next lesson, we’ll demonstrate how to project data into a coordinate reference system more suitable for mapping.
1.2.3 Spatial and temporal bias
Despite the strengths of eBird data, species observations collected through citizen science projects exhibit both spatial and temporal bias requiring special care when using them for rigorous analyses. Spatial bias occurs because eBird participants are more likely to be collect data near their homes, in easily accessible areas such as roadsides, or in areas known to be good for birding. Looking at the above map of bird observations it’s clear that the eBird checklists are clustered around cities and roads. Temporal bias occurs because participants preferentially collect data when they are available, such as weekends, and at times of year when they expect to observe more birds, notably during the breeding season. We can plot the distribution of checklists over the days of the year to see this bias:
<- checklists %>%
checklist_per_day filter(day_of_year < 366) %>%
count(day_of_year)
ggplot(checklist_per_day) +
aes(x = day_of_year, y = n) +
geom_line() +
scale_y_continuous(labels = scales::comma) +
labs(x = "Day of Year", y = "# checklists",
title = "Daily eBird checklists submission") +
theme_bw()
Three is a clear seasonal pattern to the number of eBird checklists submitted (fewer checklists in March) as well as daily and weekly variation within seasons. In addition, there are spikes in checklists submissions. What do you think could be causing these sudden increases?
Finally, for most species, there is strong class imbalance in the data, meaning there are usually many more non-detections than detections. As a results, a distribution model predicting that the species is absent everywhere will have high accuracy, but no ecological value. For example, the prevalence rate of this species is only 3%.
mean(bird_zf$obs_detected)
#> [1] 0.0275
To address these three issues (spatial bias, temporal bias, and class imbalance) we recommend subsampling the data using a technique called case controlled grid sampling. We overlay an equal area 3km by 3km grid over the checklists, then sample one detection and one non-detection from each grid cell for each week of each year. Let’s look at a simple example of how spatial grid sampling works.”
The function grid_sample_stratified()
from the ebirdst
package is specifically designed to perform case controlled grid sampling on eBird data. For example, let’s apply this technique to the bird observations.
# perform case controlled grid sampling
<- grid_sample_stratified(bird_zf, obs_column = "obs_detected")
bird_sampled
# how many checklists were removed?
nrow(bird_zf)
#> [1] 90620
nrow(bird_sampled)
#> [1] 33265
# how has prevalence changed
mean(bird_zf$obs_detected)
#> [1] 0.0275
mean(bird_sampled$obs_detected)
#> [1] 0.0435
So, after sampling, we’re left with 37% of the observations we started with, but the spatial and temporal bias has been significantly reduced.
We now have the data and tools necessary to model relative abundance using eBird data, which will be the focus of Lesson 2.