2018
DOI: 10.3897/bdj.6.e26826

Data Leakage and Loss in Biodiversity Informatics

Abstract: The field of biodiversity informatics is in a massive, “grow-out” phase of creating and enabling large-scale biodiversity data resources. Because perhaps 90% of existing biodiversity data nonetheless remains unavailable for science and policy applications, the question arises as to how these existing and available data records can be mobilized most efficiently and effectively. This situation led to our analysis of several large-scale biodiversity datasets regarding birds and plants, detecting information gaps …

Cited by 34 publications (34 citation statements)
References: 29 publications
“…We followed the cleaning pipeline outlined by Zizka et al (2019) and first filtered the data as downloaded from GBIF (“raw”, hereafter) using meta-data for those records for which they were available (although meta-data were often missing, Peterson et al, 2018), removing: (1) records with a coordinate precision below 100 km (as this represents the grain size of many macro-ecological analyses); (2) fossil records and records of unknown source; (3) records collected before 1945 (before the end of the Second World War, since coordinates of old records are often imprecise); and (4) records with an individual count of less than one and more than 99. Furthermore, we rounded the geographic coordinates to four decimal places and retained only one record per species per location (i.e., test for duplicated records).…”
Section: Methods (mentioning)
confidence: 99%
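
The filtering steps quoted above can be sketched roughly as follows. This is a minimal illustration, not the citing authors' code or the Zizka et al. pipeline itself. The `Occurrence` record, its field names, and the basis-of-record labels are hypothetical. The sketch also assumes that filter (1) means dropping records whose coordinate uncertainty exceeds 100 km, and that a record with a missing metadata field passes the corresponding filter, since the statement says filters were applied only where metadata were available.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Occurrence:
    """Hypothetical occurrence record; None marks missing metadata."""
    species: str
    lat: float
    lon: float
    uncertainty_km: Optional[float] = None
    basis: Optional[str] = None   # assumed labels, e.g. "FOSSIL_SPECIMEN"
    year: Optional[int] = None
    count: Optional[int] = None


# Assumed labels for fossil records and records of unknown source.
EXCLUDED_BASES = {"FOSSIL_SPECIMEN", "UNKNOWN"}


def clean_occurrences(records: Iterable[Occurrence]) -> List[Occurrence]:
    """Apply filters (1)-(4), then keep one record per species per location."""
    seen = set()
    kept: List[Occurrence] = []
    for r in records:
        # (1) coordinate uncertainty above 100 km (assumed interpretation)
        if r.uncertainty_km is not None and r.uncertainty_km > 100:
            continue
        # (2) fossil records and records of unknown source
        if r.basis is not None and r.basis in EXCLUDED_BASES:
            continue
        # (3) records collected before 1945
        if r.year is not None and r.year < 1945:
            continue
        # (4) individual count below 1 or above 99
        if r.count is not None and not (1 <= r.count <= 99):
            continue
        # Round coordinates to four decimals and drop duplicates.
        key = (r.species, round(r.lat, 4), round(r.lon, 4))
        if key in seen:
            continue
        seen.add(key)
        kept.append(r)
    return kept
```

Letting records with missing metadata pass each filter is a design choice made here for illustration; a stricter pipeline could discard them instead, which would increase the number of records lost.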
“…On the other hand, raw data collected from the national atlases yielded less than half as many records, of which 22.23% were fit for our purpose. Although the leakage rate [40] was thus higher in the case of the GBIF, overall, we observed comparable retention of records in both datasets as well as similar factors leading to attrition. The main reason for discarding records was data quality.…”
Section: Discussion (mentioning)
confidence: 65%
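
To make the retention and attrition figures in the statement above concrete, the following sketch computes per-step losses and overall retention from record counts. It assumes, for illustration only, that the leakage at a step is the fraction of records discarded at that step; the function name and this definition are not taken from the cited papers.

```python
from typing import List, Sequence, Tuple


def attrition(counts: Sequence[int]) -> Tuple[List[float], float]:
    """Return (per-step loss fractions, overall retention).

    counts[0] is the number of raw records; each later entry is the
    number of records remaining after one more filtering step.
    """
    if not counts or counts[0] <= 0:
        raise ValueError("counts must start with a positive raw record count")
    losses: List[float] = []
    for prev, cur in zip(counts, counts[1:]):
        # Fraction of the records entering this step that were discarded.
        losses.append((prev - cur) / prev if prev else 0.0)
    retention = counts[-1] / counts[0]
    return losses, retention
```

Under this definition, a dataset in which 22.23% of raw records are fit for purpose has an overall retention of 0.2223, so the remainder is the share of records lost across all steps combined.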
“…However, this type of detailed intervention may be difficult to accomplish when working with very large datasets, where these time-consuming checks for errors and inconsistencies [46] can easily become cost-ineffective. Thus, it is of utmost importance that data publishers provide accurate data and comprehensive metadata to inform data users about the true limitations of the data [40].…”
Section: Discussion (mentioning)
confidence: 99%
“…These fields are essential to validate the original coordinates at finer spatial scales or to retrieve missing coordinates from gazetteers. Updating or completing these fields in the original herbarium labels should be straightforward and would avoid major data leakage (sensu Townsend Peterson et al 2018). Missing information on the identifier name (28%) had a similar impact on the taxonomic validation as occurrences not identified by family specialists (32%).…”
Section: Descriptive Results Of the Validation Process Of The Occurre… (mentioning)
confidence: 99%