library(tidyverse)
library(datamodelr)
library(DiagrammeR)
library(DT)
library(sf)
Download and process the dataset from GBIF
Introduction
This document outlines the process of downloading, processing, and analyzing data from the GBIF dataset Monitoring data on the effect of domestic livestock and rabbits on Androcymbium europaeum pastures. The dataset contains information about the annual monitoring of the effect of herbivorism on the conservation status of endangered species Androcymbium europaeum. It has been generated from the SERPAM Department (Evaluation, Restoration and Protection of Mediterranean Agrosystems Service) of the Zaidín Experimental Station belonging to Spanish National Research Council (CSIC-EEZ).
Dataset features
The dataset Monitoring data on the effect of domestic livestock and rabbits on Androcymbium europaeum pastures (Pérez-Luque et al., 2023) is deposited in GBIF. It can be access at doi:10.15470/jpjhuu and here.
Since 2010, the SERPAM Department has been carrying out annual sampling to evaluate the effect of domestic and wild livestock (e.g. rabbits) on the pastures inhabited by Androcymbium europaeum.
A randomized block design was followed, and consisted of 6 blocks separated between 300 and 400 m, with three different treatments or management types (one plot of 2.5 x 2.5 m by treatment and block) (Figure 1):
- G+R+: herbivory by sheep and rabbits
- G-R+: excluding only sheep (fenced with hunting netting)
- G-R-: excluding rabbits and sheep (fenced with rhomboidal netting of 4 cm of light)
Abundance of Androcymbium europaeum
The density of A. europaeum in each plot was yearly evaluated by counting the number of individuals in 50 cm x 50 cm fixed squares, taking four squares per plot, distributed according to the four cardinal points (N, S, E and W): 24 squares per treatment (six blocks by four squares)(Figure 2). The exclusion plots were installed in May 2010, after the first density and biodiversity samplings (March and April), thus this year should be considered as year zero, without exclusion treatments.
Assesment of plant communities
The evaluation of plant communities where A. europaeum lives was conducted annually during the spring season to ensure that most annual species had grown and flowered, allowing for accurate identification. In each plot, the point-intercept (non-destructive) method was applied, following a modification proposed by Daget & Poissonet (1971) of the point-quadrat method originally described by Levy & Madden (1933). Two fixed crossed transects measuring 2 meters in length were established (Figure 2), with 50 points surveyed per transect. These points were spaced 4 cm apart, and the plant species in contact with a 2 mm needle were recorded. For each transect the following variables were determined:
Coverages:
- specific plant cover: percentage of soil covered by each plant species
- vegetation cover: percentage of soil covered by vegetation
- total specific plant cover: sum of the specific vegetation cover of each species
- musk/lichen cover: percentage of soil covered by musk or lichen
- bare soil cover: percentage of bare soil
Floristic composition:
- plant species richness: number of species
- Shannon diversity index (H’): \(\text{H'} = -\sum p_i \cdot \ln(p_i)\) where \(p_i\) is the proportion of individuals of one particular species found divided by the total number of individuals found.
GBIF Dataset Structure
Before diving into data processing, it’s important to understand the structure of the dataset. Our dataset was accomodated to the Darwin Core Archive (DwC-A) using the sampling-event data type (GBIF, 2018), following the structure of files of the Figure 3.
In the Figure 4 we shown the structure of the visits within our experimental design and the different events that could be generated.
Abundance data
First we have created 18 events with the spatial characteristics of the plots. Each of this event has an unique id (eventID
field in the table event
). For the abundance data of A. europaeum, in each plot, we have 4 quadrats, which are visited annually. For each of these quadrats, events with their spatial characteristics were created. These events are related to the parent events (plots) by means of the parentEventID
field in the event
table. The name of each quadrats is a combination of the parent’s plot name plus the replicate’s name. For instance the A10_QS
name of a quadrats correspond to the plot A10
and the south quadrat (QS
stands for replicate south). The four quadrats of each plot are named: QN
: northern, QS
: southern, QW
: western and QE
eastern one.
Then, each quadrat is visited annually and it generates an event with the following eventID
: A00_QX_00001122
, where:
A00
is the name of the plot (e.g.A10
)QX
is the name of the quadrat (e.g.RS
)00001122
is the date of the visit as year, month and day (e.g.20200204
)
Each quadrat data visit event is related to a quadrat event by means of the parentEventID
field in the event
table.
For each quadrat data visit event we have two related data type:
measurement: in each visit we recorded the total count (abundance) of the quadrat for the taxa A. europaeum. So, in the
extendedMeasurementOrFacts
table there are ameasurmentID
field with the following structure:A00_QX_00001122 A00
. For instance, the measurementIDA10_QE_20100323 A01
corresponds to the mesurement typeA01
(abundance) of the eventA10_QE_20100323
.occurrences: in each visit we recorded data of the taxa Androcymbium europaeum that is stored as an entry in the
occurrence
table. The fieldoccurenceID
in this table has the structure:A10_QE_20100323_01
, that correspond to the occurence01
of theeventID
A10_QE_20100323
. Theoccurrence
table has also theeventID
field that is related to theevent
table.
Assesment of plant communities
We have also the same 18 event plots that created in the previous section. For each plot, we sampled two linear transects called T1
and T2
. For each transect a spatial event was created following similar approach as the abundance data. So, an event called A10_T1
corresponds to the transect T1
of the plot A10
. Each transect were visited annually and we created an event for each visit combined the transect event and the date. For instance, the event A11_T1_20210420
corresponds to a visit performed on April-20 of 2021 (20210420
) in the transect T1
of the plot A11
.
For each transect visit we have several types of data.
Occurrence data. We recorded the presence of A. europaeum or Filago pyramidata. Those occurrence records were stored at
occurrence
table. For instance, a record with thisoccurenceID
:A10_T1_20210420_02
correspond to the taxa02
of theA10_T1_20210420
event, i.e. the transect T1 visited on April-20 of 2021.Specific measurements. For each taxa, we also have different measurement in each transect. For instance, for the taxa Stipa capensis we recorded a specific plant cover in a transect in a visit. Those measurements are stored in the
extendedMeasurementOrFacts
table with ameasurmentID
. So, the measurementIDA21_T1_20210420_05_V09
corresponds to the measurementV09
of the taxa05
obtained in the transectT1
of the plotA21
on April-20 of 2021 (20210420
).Transects measurements. We computed also some measurements at transect-level, such us, the diversity, plant species richness, total cover, among others. Those variables are also stored in the
extendedMeasurementOrFacts
table. Therefore, a record with themeasurmentID
A10_T1_20230411_V20
corresponds to the measurement V20 (in this case plant species richness) computed for the transectT1
of the plotA10
on April-1 of 2023 (20230411
).
Downloading and Extracting Data
First, we download and extract the dataset from the GBIF server. For this we used the url included in the GBIF’s dataset page, specifically we download the source archive (in Darwin Core Archive format). The url of this dataset is https://ipt.gbif.es/archive.do?r=amoladeras. After the donwload the file is unnzipped and stored in a temporary directory.
<- 'https://ipt.gbif.es/archive.do?r=amoladeras'
gbif_url
# Create a temporary directory for storing files
<- tempdir()
temp <- paste0(temp, "/amoladeras.zip")
f <- paste0(temp, "/amoladeras")
extract_dir
# Download the dataset
download.file(gbif_url, destfile = f, method = "auto")
# Unzip the downloaded file
unzip(f, exdir = extract_dir)
# Display the list of extracted files
list.files(extract_dir)
[1] "eml.xml" "event.txt"
[3] "extendedmeasurementorfact.txt" "meta.xml"
[5] "occurrence.txt"
# Read the occurrence, event, and extended measurementOrFact files
<- read_delim(file = paste0(extract_dir, "/occurrence.txt"))
occ <- read_delim(file = paste0(extract_dir, "/event.txt"))
event <- read_delim(file = paste0(extract_dir, "/extendedmeasurementorfact.txt")) emof
We have 5 files:
- Event file (
event.txt
, thereafterevent
) - Occurrence file (
occurrence.txt
, thereafterocc
) - extended Measurement Or Facts file (
extendedmeasurementorfact.txt
, thereafteremof
) - A file with the name of the darwin core terms included (
meta.xml
) - A file with the dataset metadata using the Ecological Metadata Language(Jones et al., 2019) (
eml.xml
)
The structure of the three data tables (files) used is shown in the Figure 5.
Getting Data for Plots
Next, we extract and process data related to plots from the dataset
# Filter events to obtain only the top-level plots
<- event |>
raw_plots filter(is.na(parentEventID)) |>
mutate(id_plot = str_remove(eventID, "CSIC-EEZ:SERPAM:AMOLADERAS_")) |>
::select(id_plot, eventID, fieldNumber, treatment_desc = fieldNotes, footprintWKT) |>
dplyrseparate(fieldNumber, into = c("block", "treatment"), sep = " \\| ")
# Convert to spatial data if needed
<- raw_plots |>
geo_plots st_as_sf(wkt = "footprintWKT", crs = 25830)
# Create a dataframe without spatial information
<- raw_plots |> dplyr::select(-footprintWKT) plots
Getting Data for Quadrats
We proceed to extract and process data related to quadrats.
# Filter events to obtain only quadrat-related data
<- event |>
raw_quadrats filter(samplingProtocol == "Quadrat count") |>
filter(parentEventID %in% plots$eventID) |>
mutate(id_quadrat = str_sub(eventID, -2)) |>
mutate(id_plot = str_remove(parentEventID, "CSIC-EEZ:SERPAM:AMOLADERAS_")) |>
::select(id_quadrat, id_plot, eventID, fieldNumber, treatment_desc = fieldNotes, footprintWKT) |>
dplyrseparate(fieldNumber, into = c("block", "treatment"), sep = " \\| ")
# Convert to spatial data if needed
<- raw_quadrats |>
geo_quadrats st_as_sf(wkt = "footprintWKT", crs = 25830)
# Create a dataframe without spatial information
<- raw_quadrats |> dplyr::select(-footprintWKT) quadrats
Getting Data for Transects
We now extract and process data related to transects.
# Filter events to obtain only transect-related data
<- event |>
raw_transects filter(samplingProtocol == "Point Quadrat Transect") |>
filter(parentEventID %in% plots$eventID) |>
mutate(id_transect = str_sub(eventID, -2)) |>
mutate(id_plot = str_remove(parentEventID, "CSIC-EEZ:SERPAM:AMOLADERAS_")) |>
::select(id_transect, id_plot, eventID, fieldNumber, treatment_desc = fieldNotes, footprintWKT) |>
dplyrseparate(fieldNumber, into = c("block", "treatment"), sep = " \\| ")
# Convert to spatial data if needed
<- raw_transects |>
geo_transects st_as_sf(wkt = "footprintWKT", crs = 25830)
# Create a dataframe without spatial information
<- raw_transects |> dplyr::select(-footprintWKT) transects
Data Processing for Quadrat Count and Measurements
With the data extracted, we proceed to process the events and measurements related to quadrat counts.
# Process events for quadrat counts
<- event |>
quadrats_event filter(samplingProtocol == "Quadrat count") |>
filter(parentEventID %in% quadrats$eventID) |>
mutate(date = lubridate::ymd(eventDate)) |>
::select(id, parentEventID, date)
dplyr
# Process measurements for quadrat counts
<-
quadrats_measurements |>
emof filter(id %in% quadrats_event$id) |>
::select(id, abundance = measurementValue)
dplyr
# Occurrence data for quadrat counts
<-
quadrats_occ |>
occ filter(id %in% quadrats_event$id) |>
::select(id, species = scientificName)
dplyr
# Combine event, measurement, and occurrence data for quadrat counts
<-
df_abundances |>
quadrats_event inner_join(quadrats_measurements, by = "id") |>
inner_join(quadrats_occ, by = "id") |>
inner_join(quadrats, by = c("parentEventID" = "eventID")) |>
::select(
dplyr
id_plot, id_quadrat, block, treatment, treatment_desc, date, abundance, species )
The results could be consulted here:
Data Processing for Transect Measurements
Similarly, we process the events and measurements related to transects.
# Process events for transects
<- event |>
transects_event filter(samplingProtocol == "Point Quadrat Transect") |>
filter(parentEventID %in% transects$eventID) |>
mutate(date = lubridate::ymd(eventDate)) |>
::select(id, parentEventID, date)
dplyr
# Process measurements for transects
<-
transects_measurements |>
emof filter(id %in% transects_event$id) |>
filter(measurementType != "specific vegetation cover") |>
::select(id, measurementType, measurementValue) |>
dplyrpivot_wider(names_from = measurementType, values_from = measurementValue)
# Combine event, measurement, and occurrence data for transects
<-
df_transects |>
transects_event inner_join(transects_measurements, by = "id") |>
inner_join(transects, by = c("parentEventID" = "eventID")) |>
::select(
dplyr:treatment_desc,
id_plot, id_transect, block
date, `bare soil cover`:`diversity index`
)
The results could be consulted here:
Conclusion
At this point, we have successfully extracted, processed, and organized the data from the GBIF dataset. Further analysis and visualization can be performed using the df_abundances
and df_transects
dataframes.
Software used
We used R version 4.2.1 (R Core Team, 2022) and the following R packages: datamodelr v. 0.2.2.9002 (Bergant, 2017), DiagrammeR v. 1.0.10 (Iannone, 2023), DT v. 0.29 (Xie et al., 2023), knitr v. 1.44 (Xie, 2014, 2015, 2023), rmarkdown v. 2.25 (Allaire et al., 2023; Xie et al., 2018, 2020), sf v. 1.0.14 (Pebesma, 2018; Pebesma & Bivand, 2023), tidyverse v. 2.0.0 (Wickham et al., 2019).
References
Citation
@report{a.j.2023,
author = {A.J., Pérez-Luque},
title = {Download and Process the Dataset {Monitoring} Data on the
Effect of Domestic Livestock and Rabbits on {Androcymbium} Europaeum
Pastures from {GBIF}},
version = {v1},
date = {2023-09-01},
url = {https://ajpelu.github.io/dp_androcymbium_lab/},
langid = {en}
}