A major challenge in understanding and cataloging plant diversity is devising novel approaches that speed up the process of discovery. It has taken more than 250 years to name the 400,000 known plant species using a laborious manual process that relies on a shrinking group of experts examining individual specimens in detail. An estimated one million herbarium specimens remain unidentified, and as many as 70,000 flowering plant species alone are yet to be described (Joppa et al., 2011). It is very likely that new species are waiting to be discovered in the unidentified specimen backlog. In fact, an estimated 84% of undescribed plant diversity may already be present in herbarium collections; therefore, "herbaria are a major frontier for species discovery" (Bebber et al., 2010). Herbaria and the massive repository of data they contain provide snapshots of plant diversity through time. The integrity of the plant is maintained in herbaria as a pressed, dried specimen; a specimen collected hundreds of years ago looks much the same as one collected a month ago. Although some specimens may not initially be fully identified, all contain morphological features, and usually
State-of-the-art multilingual models depend on vocabularies that cover all of the languages the model expects to see at inference time, but the standard methods for generating those vocabularies are not ideal for massively multilingual applications. In this work, we introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies. Our experiments show improvements across languages on key multilingual benchmark tasks: TYDI QA (+2.9 F1), XNLI (+2.1%), and WikiAnn NER (+2.8 F1), as well as a factor-of-8 reduction in out-of-vocabulary rate, all without increasing the size of the model or data.
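The clustered-vocabulary idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the toy corpora, the hand-picked clusters, and the frequency-based `train_subwords` stand-in for a real subword trainer (e.g. BPE or unigram LM) are all assumptions for illustration only.

```python
from collections import Counter

def train_subwords(texts, vocab_size):
    """Toy stand-in for a subword trainer: keep the most frequent
    character n-grams (n = 1..3) as the cluster's vocabulary."""
    counts = Counter()
    for text in texts:
        for word in text.split():
            for n in range(1, 4):
                for i in range(len(word) - n + 1):
                    counts[word[i:i + n]] += 1
    return {tok for tok, _ in counts.most_common(vocab_size)}

def clustered_vocabulary(corpora_by_lang, clusters, size_per_cluster):
    """Train one vocabulary per language cluster, then take the union:
    related languages share subwords within a cluster, while distant
    languages keep their own tokens via separate clusters."""
    vocab = set()
    for cluster in clusters:
        texts = [t for lang in cluster for t in corpora_by_lang[lang]]
        vocab |= train_subwords(texts, size_per_cluster)
    return vocab

# Illustrative corpora and clusters: two related languages grouped
# together, one unrelated language in its own cluster.
corpora = {
    "es": ["el gato come"],
    "pt": ["o gato come"],
    "fi": ["kissa syo"],
}
vocab = clustered_vocabulary(corpora, [["es", "pt"], ["fi"]], 20)
```

Because each cluster contributes its own budget of subwords, tokens like "ga" (shared by the Romance cluster) and "ki" (specific to the other cluster) both survive into the combined vocabulary.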
Abstract. In emergency planning, consideration of emergency priorities is a necessity. This paper presents new formulations of the facility location problem (FLP) and the vehicle routing problem with time windows (VRPTW) that incorporate priority. Our models ensure that higher-priority locations are considered before lower-priority ones, for both facility and routing decisions. The FLP is solved using an MIP solver, while a tabu search-based metaheuristic is developed to solve the VRPTW. Under a set of possible emergency scenarios with limited emergency resources, our models were able to serve higher-priority locations better than the widely used Maximal Coverage Location Problem (MCLP) model. We also present preliminary work and results for an integrated location-routing analysis, which improves service results further.
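The priority principle can be shown with a small greedy sketch. This is only an illustration of "higher-priority locations are served first under limited resources", not the paper's MIP formulation or tabu search metaheuristic; the location names, priorities, and demands are invented.

```python
def serve_by_priority(locations, capacity):
    """Greedy illustration of priority-first service: sort locations
    by priority (higher first) and serve each one fully while supply
    remains. locations is a list of (name, priority, demand) tuples."""
    served = []
    remaining = capacity
    for name, priority, demand in sorted(locations, key=lambda loc: -loc[1]):
        if demand <= remaining:
            served.append(name)
            remaining -= demand
    return served, remaining

# Three demand points competing for 10 units of emergency supply.
locations = [("clinic", 3, 4), ("school", 1, 5), ("hospital", 5, 6)]
served, leftover = serve_by_priority(locations, 10)
# The hospital (priority 5) is served before the clinic (priority 3);
# the low-priority school goes unserved once supply runs out.
```

A real formulation would encode this as hard precedence constraints (a lower-priority location may only be covered if all higher-priority ones are), but the greedy pass captures the intended service ordering.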
Advances in machine vision technology are rapidly enabling new and innovative uses within the field of biodiversity. Computers are now able to use images to identify tens of thousands of species across a wide range of taxonomic groups in real time, notably demonstrated by iNaturalist.org, which suggests species IDs to users (https://www.inaturalist.org/pages/computer_vision_demo) as they create observation records. Soon it will be commonplace to detect species in video feeds or use the camera in a mobile device to search for species-related content on the Internet. The Global Biodiversity Information Facility (GBIF) has an important role to play in advancing and improving this technology, whether in terms of data, collaboration across teams, or citation practice. But in the short term, the most important role may relate to initiating a cultural shift in accepted practices for the use of GBIF-mediated data for training of artificial intelligence (AI). “Training datasets” play a critical role in achieving species recognition capability in any machine vision system. These datasets compile representative images containing the explicit, verifiable identifications of the species they include. High-powered computers run algorithms on these training datasets, analysing the imagery and building complex models that characterize defining features for each species or taxonomic group. Researchers can, in turn, apply the resulting models to new images, determining what species or group they likely contain. Current research in machine vision is exploring (a) the use of location and date information to further improve model results, (b) identification methods beyond species-level into attribute, character, trait, or part-level ID, with an eye toward human interpretability, and (c) expertise modeling for improved determination of “research grade” images and metadata. 
The GBIF community has amassed one of the largest datasets of labelled species images available on the internet: more than 33 million species occurrence records on GBIF.org have one or more images (https://www.gbif.org/occurrence/gallery). Machine vision models, when integrated into the data collection tools in use across the GBIF network, can improve the user experience. For example, in citizen science applications like iNaturalist, automated species suggestion helps even novice users contribute occurrence records to GBIF. Perhaps most importantly, GBIF has implemented uniform (and open) data licensing, established guidelines on citation, and provided consistent methods for tracking data use through the Digital Object Identifier (DOI) citation chain. GBIF would like to build on the lessons learned in these activities while assisting this research and increasing the technology's power and availability. We envisage an approach as follows: To assist in developing and refining machine vision models, GBIF plans to provide training datasets, taking care to ensure license and citation practices are respected. The training datasets will be issued with a DOI, and the contributing datasets will be linked through the DOI citation graph. To assist application developers, Google and Visipedia plan to build and publish openly-licensed models and tutorials for how to adapt them for localized use. Together we will strive to ensure that data is being used responsibly and transparently, to close the gap between machine vision scientists, application developers, and users, and to share taxonomic trees capturing the taxon rank at which machine vision models can identify with confidence based on an image's visual characteristics.
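The licensing and citation requirements described above can be made concrete as a per-image record schema. The field names and example values below are purely illustrative, not an official GBIF standard: the point is that each training image carries its label together with the provenance (license and source-dataset DOI) needed for citation tracking.

```python
from dataclasses import dataclass

@dataclass
class TrainingImageRecord:
    """One labelled image in a machine-vision training dataset,
    carrying the provenance needed for license and citation tracking.
    Illustrative schema only, not an official GBIF data model."""
    occurrence_id: str  # occurrence record the image belongs to
    species: str        # verified scientific name (the training label)
    image_url: str      # where the media file is hosted
    license: str        # e.g. "CC-BY-4.0"; must permit reuse
    source_doi: str     # contributing dataset's DOI, for the citation graph

# Hypothetical example record.
record = TrainingImageRecord(
    occurrence_id="123456789",
    species="Quercus robur",
    image_url="https://example.org/img/1.jpg",
    license="CC-BY-4.0",
    source_doi="10.15468/example",
)
```

Aggregating the distinct `source_doi` values across a training dataset is what makes it possible to link contributing datasets into the DOI citation graph when the dataset itself is issued a DOI.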