Anna Yannakopoulos scite author profile

Anna Yannakopoulos

5Publications

86Citation Statements Received

228Citation Statements Given

How they've been cited

How they cite others

262

228

Affiliations

Michigan State University, University of Notre Dame, Computational Sciences (United States)

Publications

Order By: Most citations

Supervised learning is an accurate method for network-based gene classification

Liu

Mancuso

Yannakopoulos

et al. 2020

View full text Add to dashboard Cite

Background Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. Results In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene’s full network connectivity outperforms label propagaton and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation’s appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows. Availability and implementation The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available. Contact arjun@msu.edu Supplementary information Supplementary data are available at Bioinformatics online.

show abstract

Supervised-learning is an accurate method for network-based gene classification

Liu

Mancuso

Yannakopoulos

et al. 2019

Preprint

View full text Add to dashboard Cite

Background: Assigning every human gene to specific functions, diseases, and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods such as supervised-learning and label-propagation that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine learning technique across fields, supervised-learning has been applied only in a few network-based studies for predicting pathway-, phenotype-, or disease-associated genes. It is unknown how supervised-learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label-propagation, the widely-benchmarked canonical approach for this problem. Results:In this study, we present a comprehensive benchmarking of supervised-learning for network-based gene classification, evaluating this approach and a state-of-the-art label-propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised-learning on a gene's full network connectivity outperforms label-propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label-propagation's appeal for naturally using network topology. We further show that supervised-learning on the full network is also superior to learning on node-embeddings (derived using node2vec ), an increasingly popular approach for concisely representing network connectivity. Conclusion:These results show that supervised-learning is an accurate approach for prioritizing genes associated with diverse functions, diseases, and traits and should be considered a staple of network-based gene classification workflows. The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available.

show abstract

Systematic tissue annotations of –omics samples by modeling unstructured metadata

Maldaver

Yannakopoulos

et al. 2021

Preprint

View full text Add to dashboard Cite

There are currently >1.3 million human –omics samples that are publicly available. However, this valuable resource remains acutely underused because discovering samples, say from a particular tissue of interest, from this ever-growing data collection is still a significant challenge. The major impediment is that sample attributes such as tissue/cell-type of origin are routinely described using non-standard, varied terminologies written in unstructured natural language. Here, we propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for –omics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample text descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms in a structured ontology. Our approach significantly and substantially outperforms an advanced text annotation method (MetaSRA) that uses graph-based reasoning and a baseline method (TAGGER) that annotates text based on exact string matching. Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text, as does the ability of these models to classify tissue-associated biological processes and diseases based on their descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the –omics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto.

show abstract

Systematic tissue annotations of genomics samples by modeling unstructured metadata

et al. 2022

View full text Add to dashboard Cite

There are currently >1.3 million human –omics samples that are publicly available. This valuable resource remains acutely underused because discovering particular samples from this ever-growing data collection remains a significant challenge. The major impediment is that sample attributes are routinely described using varied terminologies written in unstructured natural language. We propose a natural-language-processing-based machine learning approach (NLP-ML) to infer tissue and cell-type annotations for genomics samples based only on their free-text metadata. NLP-ML works by creating numerical representations of sample descriptions and using these representations as features in a supervised learning classifier that predicts tissue/cell-type terms. Our approach significantly outperforms an advanced graph-based reasoning annotation method (MetaSRA) and a baseline exact string matching method (TAGGER). Model similarities between related tissues demonstrate that NLP-ML models capture biologically-meaningful signals in text. Additionally, these models correctly classify tissue-associated biological processes and diseases based on their text descriptions alone. NLP-ML models are nearly as accurate as models based on gene-expression profiles in predicting sample tissue annotations but have the distinct capability to classify samples irrespective of the genomics experiment type based on their text metadata. Python NLP-ML prediction code and trained tissue models are available at https://github.com/krishnanlab/txt2onto.

show abstract

Opportunistic Computing with Lobster: Lessons Learned from Scaling up to 25k Non-Dedicated Cores

Wolf

Woodard

et al. 2017

J. Phys.: Conf. Ser.

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.