Machine learning models deployed in real-world applications are often evaluated with precision-based metrics such as the F1-score or AUC-PR (Area Under the Precision-Recall Curve). Because these metrics depend heavily on the class prior, it is difficult to interpret variations in a model's performance across different subpopulations or subperiods of a dataset. In this paper, we propose a way to calibrate these metrics so that they become invariant to the prior. We conduct a large number of experiments on balanced and imbalanced data to assess the behavior of calibrated metrics and show that they improve interpretability and provide better control over what is really measured. We describe specific real-world use cases where calibration is beneficial, such as model monitoring in production, reporting, and fairness evaluation.
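To illustrate why precision depends on the class prior, and what a prior-invariant variant could look like, here is a minimal sketch. It is not taken from the paper: it assumes calibration means rewriting precision in terms of the true positive rate (TPR) and false positive rate (FPR) and then substituting a fixed reference prior. The function names and the parameter `pi0` are hypothetical.

```python
def precision(tp, fp):
    """Standard precision: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def calibrated_precision(tp, fp, fn, tn, pi0=0.5):
    """Precision recomputed as if the positive-class prior were pi0.

    Assumption (not from the source): precision can be written as
    pi * TPR / (pi * TPR + (1 - pi) * FPR), where pi is the observed
    positive prior; replacing pi with a fixed reference pi0 removes
    the dependence on the observed class ratio.
    """
    pos, neg = tp + fn, fp + tn
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    tpr = tp / pos
    fpr = fp / neg
    denom = pi0 * tpr + (1 - pi0) * fpr
    return pi0 * tpr / denom if denom > 0 else 0.0


# Same classifier behavior (TPR = 0.8, FPR = 0.1) on two subpopulations
# with very different class ratios.
balanced = dict(tp=80, fp=10, fn=20, tn=90)      # 100 positives, 100 negatives
imbalanced = dict(tp=8, fp=100, fn=2, tn=900)    # 10 positives, 1000 negatives

p_bal = precision(balanced["tp"], balanced["fp"])
p_imb = precision(imbalanced["tp"], imbalanced["fp"])
c_bal = calibrated_precision(**balanced)
c_imb = calibrated_precision(**imbalanced)
```

In this sketch the raw precision differs sharply between the two subpopulations even though the classifier's TPR and FPR are identical, while the value computed at the reference prior is the same for both.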
Multi-label classification has gained in importance over the last decade, and it now faces the need to process massive raw data from heterogeneous sources. As a result, dimensionality reduction, which aims to reduce the number of features, labels, or both, has seen renewed interest as a way to improve the scalability of classifiers and their predictive performance. In this paper, we review more than fifty papers presenting dimensionality reduction approaches for multi-label classification and propose an analysis in three steps: (i) a typology of the methods describing the main components of their strategies, the problems they tackle, and the way they solve them; (ii) a unified formalization of the problems to help distinguish the similarities and differences between the approaches; and (iii) a meta-analysis of the published experimental results, inspired by consensus theory, to identify the most efficient algorithms.
With the explosion of chatbot applications, Conversational Question Answering (CQA) has generated a lot of interest in recent years. Among the proposed approaches, reading comprehension models that take advantage of the conversation history (previous question-answer pairs) seem to answer better than those that consider only the current question. Nevertheless, we note that the CQA evaluation protocol has a major limitation: at each turn of the conversation, models are allowed to access the ground-truth answers of the previous turns. Not only does this severely limit their applicability in fully autonomous chatbots, it also leads to unsuspected biases in their behavior. In this paper, we highlight this effect and propose new tools for evaluation and training to guard against these issues. Our new results reinforce current state-of-the-art methods.