Labelling strategies for hierarchical multi-label classification techniques

Triguero, Isaac; Vens, Celine

doi:10.1016/j.patcog.2016.02.017

Cited by 41 publications

(18 citation statements)

References 34 publications

“…Almeida and Borges [26] proposed an adaptation of K-Nearest Neighbours to address quantification learning in HMC. Similarly, Triguero and Vens [27] investigated how different thresholds can increase the performance of Predictive Clustering Trees in this context.…”

Section: Related Work (mentioning)

confidence: 99%

Machine learning for discovering missing or wrong protein function annotations

Nakano

Lietaert

Vens

2019

BMC Bioinformatics

Self Cite


Background: A massive amount of proteomic data is generated on a daily basis; nonetheless, annotating all sequences is costly and often unfeasible. As a countermeasure, machine learning methods have been used to automatically annotate new protein functions. More specifically, many studies have investigated hierarchical multi-label classification (HMC) methods to predict annotations, using the Functional Catalogue (FunCat) or Gene Ontology (GO) label hierarchies. Most of these studies employed benchmark datasets created more than a decade ago, and thus train their models on outdated information. In this work, we provide an updated version of these datasets. By querying recent versions of FunCat and GO yeast annotations, we provide 24 new datasets in total. We compare four HMC methods, providing baseline results for the new datasets. Furthermore, we also evaluate whether the predictive models are able to discover new or wrong annotations, by training them on the old data and evaluating their results against the most recent information. Results: The results demonstrated that the method based on predictive clustering trees, Clus-Ensemble, proposed in 2008, achieved superior results compared to more recent methods on the standard evaluation task. For the discovery of new knowledge, Clus-Ensemble performed better when discovering new annotations in the FunCat taxonomy, whereas hierarchical multi-label classification with genetic algorithm (HMC-GA), a method based on genetic algorithms, was overall superior when detecting annotations that were removed. In the GO datasets, Clus-Ensemble once again had the upper hand when discovering new annotations, whereas HMC-GA performed better for detecting removed annotations. However, in this evaluation, there were fewer significant differences among the methods. Conclusions: The experiments have shown that protein function prediction is a very challenging task which should be further investigated. We believe that the baseline results associated with the updated datasets provided in this work should be considered as guidelines for future studies; nonetheless, the old versions of the datasets should not be disregarded, since other tasks in machine learning could benefit from them.



“…The parameter k is often set to a fixed value in other research, or only iterated over a small set of possible values (e.g. 5,10,15). However, optimizing k can have a significant effect on reported evaluation metric values.…”

Section: Discussion (mentioning)

confidence: 99%

“…Finally we apply a single threshold to obtain the bipartition as different authors have experimentally verified this is as efficient as the more complex methods [3] [15]. We determine the threshold t_min automatically by selecting the value of t_min that minimizes the difference in label cardinality between the actual and predicted label set over all training instances.…”

Section: A Instance Based Knn (mentioning)

confidence: 99%
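
The threshold selection described in this statement can be sketched as follows. This is only an illustration under stated assumptions, not the citing authors' implementation: the function names, the candidate grid of thresholds from 0.00 to 1.00, the `>=` comparison, and the tie-breaking rule (lowest qualifying threshold wins) are all hypothetical choices made for the example.

```python
from typing import List, Optional, Sequence


def label_cardinality(label_sets: Sequence[Sequence[int]]) -> float:
    """Average number of positive labels per instance (rows are 0/1 vectors)."""
    return sum(sum(row) for row in label_sets) / len(label_sets)


def predicted_cardinality(scores: Sequence[Sequence[float]], t: float) -> float:
    """Average number of labels per instance whose score reaches threshold t."""
    return sum(sum(1 for s in row if s >= t) for row in scores) / len(scores)


def select_threshold(
    scores: Sequence[Sequence[float]],
    true_labels: Sequence[Sequence[int]],
    candidates: Optional[Sequence[float]] = None,
) -> float:
    """Pick the single threshold whose predicted label cardinality is
    closest to the actual label cardinality on the training instances."""
    if candidates is None:
        # Assumption: a fixed grid of 101 candidate thresholds in [0, 1].
        candidates = [i / 100 for i in range(101)]
    target = label_cardinality(true_labels)
    # min() returns the first candidate with the smallest difference,
    # so ties are resolved in favour of the lowest threshold.
    return min(
        candidates,
        key=lambda t: abs(predicted_cardinality(scores, t) - target),
    )


def bipartition(scores: Sequence[Sequence[float]], t: float) -> List[List[int]]:
    """Turn per-label scores into a 0/1 label assignment using threshold t."""
    return [[1 if s >= t else 0 for s in row] for row in scores]


# Toy training data: three instances, three labels.
train_scores = [[0.9, 0.4, 0.1], [0.8, 0.7, 0.2], [0.3, 0.6, 0.05]]
train_labels = [[1, 0, 0], [1, 1, 0], [0, 1, 0]]

t_min = select_threshold(train_scores, train_labels)
predicted = bipartition(train_scores, t_min)
```

The same threshold is applied to every label, which is the single-threshold setting the statement refers to; a per-label variant would run the selection once per label column instead.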

Combining Instance and Feature Neighbors for Efficient Multi-label Classification

Čule

Vens

Goethals

2017

2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

Self Cite


Multi-label classification problems occur naturally in different domains. For example, within text categorization the goal is to predict a set of topics for a document, and within image scene classification the goal is to assign labels to different objects in an image. In this work we propose a combination of two variations of k nearest neighborhoods (kNN) where the first neighborhood is computed instance (or row) based and the second neighborhood is feature (or column) based. Instance based kNN is inspired by user-based collaborative filtering, while feature kNN is inspired by item-based collaborative filtering. Finally we apply a linear combination of instance and feature neighbors scores and apply a single threshold to predict the set of labels. Experiments on various multi-label datasets show that our algorithm outperforms other state-of-the-art methods such as ML-kNN, IBLR and Binary Relevance with SVM, on different evaluation metrics. Finally our algorithm uses an inverted index during neighborhood search and scales to extreme datasets that have millions of instances, features and labels.


“…Exploring whether a single threshold is appropriate for all of the labels, or whether multiple thresholds, one per label, should be used, is a promising line of future work. Specifically, examining the thresholding strategies of Tsoumakas and Katakis (2007) and Largeron et al (2012) as well as the work of Triguero and Vens (2016) and determining if and how their results can be applied in the streaming setting will be our first step along this avenue.…”

Section: Discussion (mentioning)

confidence: 99%

Multi-label classification via multi-target regression on data streams

2016


Multi-label classification (MLC) tasks are encountered more and more frequently in machine learning applications. While MLC methods exist for the classical batch setting, only a few methods are available for streaming setting. In this paper, we propose a new methodology for MLC via multi-target regression in a streaming setting. Moreover, we develop a streaming multi-target regressor iSOUP-Tree that uses this approach. We experimentally compare two variants of the iSOUP-Tree method (building regression and model trees), as well as ensembles of iSOUP-Trees with state-of-the-art tree and ensemble methods for MLC on data streams. We evaluate these methods on a variety of measures of predictive performance (appropriate for the MLC task). The ensembles of iSOUP-Trees perform significantly better on some of these measures, especially the ones based on label ranking, and are not significantly worse than the competitors on any of the remaining measures. We identify the thresholding problem for the task of MLC on data streams as a key issue that needs to be addressed in order to obtain even better results in terms of predictive performance.
