Image‐based machine learning tools are an ascendant ‘big data’ research avenue. Citizen science platforms, like iNaturalist, and museum‐led initiatives provide researchers with an abundance of data from which knowledge can be extracted, including metadata, species identifications, and phenomic data. Ecological and evolutionary biologists increasingly analyse these data with complex, multi‐step processes. These processes often include machine learning techniques, frequently built by others, that are difficult for other members of a collaboration to reuse. We present a conceptual workflow model for machine learning applications that use image data to extract biological knowledge in the emerging field of imageomics. For a specific imageomics application, we derive an implementation of this conceptual workflow that adheres to FAIR principles: a formal workflow definition that allows fully automated and reproducible execution and consists of reusable workflow components. We outline technologies and best practices for creating an automated, reusable and modular workflow, and we show how they promote the reuse of machine learning models and their adaptation to new research questions. This conceptual workflow can be adapted: it can be semi‐automated, contain different components than those presented here, or have parallel components for comparative studies. We encourage researchers, both computer scientists and biologists, to build upon this conceptual workflow, which combines machine learning tools on image data to answer novel scientific questions in their respective fields.
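To illustrate the kind of modular structure the abstract describes, the sketch below composes hypothetical metadata-extraction, species-identification and trait-extraction steps into a single reusable Python pipeline. The function and component names are placeholders, not the components of the published workflow; the point is that swapping one step (for example, a different species classifier) leaves the rest of the pipeline unchanged.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical workflow sketch: each step is a function that reads an image
# record (a dict) and returns it enriched with new fields. Real steps would
# wrap trained ML models; these bodies are placeholders.

@dataclass
class WorkflowStep:
    name: str
    run: Callable[[Dict], Dict]

def extract_metadata(record: Dict) -> Dict:
    record["metadata"] = {"source_path": record["path"]}  # e.g. EXIF or collection metadata
    return record

def identify_species(record: Dict) -> Dict:
    record["species"] = "unidentified"  # placeholder for a classifier prediction
    return record

def extract_traits(record: Dict) -> Dict:
    record["traits"] = {}  # placeholder for phenomic trait extraction
    return record

def run_workflow(image_paths: List[str], steps: List[WorkflowStep]) -> List[Dict]:
    results = []
    for path in image_paths:
        record = {"path": path}
        for step in steps:  # steps execute in order and are interchangeable
            record = step.run(record)
        results.append(record)
    return results

pipeline = [
    WorkflowStep("metadata-extraction", extract_metadata),
    WorkflowStep("species-identification", identify_species),
    WorkflowStep("trait-extraction", extract_traits),
]
outputs = run_workflow(["img_0001.jpg"], pipeline)
```

In practice such a pipeline would be encoded in a dedicated workflow engine so that execution is fully automated and reproducible; the sketch only illustrates the modular, component-based structure.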
A new scientific discipline has emerged within the last decade at the intersection of informatics, computer science and biology: Imageomics. Like most other -omics fields, Imageomics uses emerging technologies to analyze biological data, but it derives those data from images. One of the most widely applied data analysis methods for image datasets is Machine Learning (ML). In 2019, we started working on a United States National Science Foundation (NSF) funded project, known as Biology Guided Neural Networks (BGNN), with the purpose of extracting biological information using neural networks and biological guidance such as species descriptions, identifications, phylogenetic trees and morphological annotations (Bart et al. 2021).

Even though the variety and abundance of biological data are sufficient for some ML analyses and the data are openly accessible, researchers still spend up to 80% of their time preparing data into a usable, AI-ready format, leaving only 20% for exploration and modeling (Long and Romanoff 2023). For this reason, we have built a dataset composed of digitized fish specimens, taken either directly from collections or from specialized repositories. The range of digital representations we cover is broad and growing, from photographs and radiographs to CT scans and even illustrations. We have added new groups of vocabularies to the dataset management system, including image quality metadata, extended image metadata and batch metadata. With the image quality metadata and extended image metadata, we aimed to extract information from the digital objects that can potentially help ML scientists in their research with filtering, image processing and object recognition routines. Image quality metadata provides information about the objects contained in the image, the features and condition of the specimen, and some basic visual properties of the image, while extended image metadata provides information about technical properties of the digital file and the digital multimedia object (Bakış et al. 2021, Karnani et al. 2022, Leipzig et al. 2021, Pepper et al. 2021, Wang et al. 2021) (see details on the Fish-AIR vocabulary web page). Batch metadata is used to separate different datasets and facilitates downloading and uploading data in batches with additional batch information and supplementary files. Additional flexibility, built into the database infrastructure using an RDF framework, will enable the system to host different taxonomic groups, which might require new metadata features (Jebbia et al. 2023).

By combining these features with FAIR (Findable, Accessible, Interoperable, Reusable) principles and reproducibility, we provide Artificial Intelligence Readiness (AIR; Long and Romanoff 2023) for the dataset. Fish-AIR provides an easy-to-access, filtered, annotated and cleaned biological dataset for researchers from different backgrounds and facilitates the integration of biological knowledge based on digitized preserved specimens into ML pipelines. Because of the flexible database infrastructure and the addition of new datasets, researchers will also be able to access additional types of data, such as landmarks, specimen outlines, annotated parts, and quality scores, in the near future. The dataset is already the largest and most detailed AI-ready fish image dataset with an integrated Image Quality Management System (Jebbia et al. 2023, Wang et al. 2021).
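As a concrete illustration of how image quality metadata and extended image metadata can support the filtering step described above, the sketch below selects a training subset from hypothetical metadata tables using pandas. The file names, column names and thresholds are illustrative assumptions, not the actual Fish-AIR vocabulary or export format.

```python
import pandas as pd

# Hypothetical metadata tables exported from an AI-ready dataset; the real
# Fish-AIR schema and column names may differ.
quality = pd.read_csv("image_quality_metadata.csv")     # per-image quality fields
extended = pd.read_csv("extended_image_metadata.csv")   # technical file properties

# Join on a shared image identifier and keep only images suitable for training:
# specimen fully visible, acceptable quality score, sufficient resolution.
merged = quality.merge(extended, on="image_id")
training_subset = merged[
    (merged["specimen_visible"])
    & (merged["quality_score"] >= 0.8)
    & (merged["pixel_width"] >= 1024)
]
training_subset.to_csv("ai_ready_training_subset.csv", index=False)
```

Filtering of this kind is what the quality metadata is intended to make routine: unusable images can be excluded, and subsets tailored to a specific research question can be assembled, before any model training begins.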