An Axiomatic Analysis of Diversity Evaluation Metrics

Amigó, Enrique; Spina, Damiano; Carrillo-de-Albornoz, Jorge

doi:10.1145/3209978.3210024

Cited by 41 publications

(39 citation statements)

References 28 publications


“…However, while their results showed that CEM ORD is similar to all of these gold measures, the outcome may differ if we choose a different set of gold measures. Indeed, in the context of evaluating information retrieval evaluation measures, demonstrated that a similar meta-evaluation approach called unanimity (Amigó et al, 2018) depends heavily on the choice of gold measures. Moreover, while Amigó et al (2020) reported that CEM ORD also performs well in terms of consistency of system rankings across different data (which they refer to as "robustness"), experimental details were not provided in their paper.…”

Section: Evaluating Ordinal Classification

mentioning

confidence: 99%

Evaluating Evaluation Measures for Ordinal Classification and Ordinal Quantification

Sakai

2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer


Ordinal Classification (OC) is an important classification task where the classes are ordinal. For example, an OC task for sentiment analysis could have the following classes: highly positive, positive, neutral, negative, highly negative. Clearly, evaluation measures for an OC task should penalise misclassifications by considering the ordinal nature of the classes (e.g., highly positive misclassified as positive vs. misclassified as highly negative). Ordinal Quantification (OQ) is a related task where the gold data is a distribution over ordinal classes, and the system is required to estimate this distribution. Evaluation measures for an OQ task should also take the ordinal nature of the classes into account. However, for both OC and OQ, there are only a small number of known evaluation measures that meet this basic requirement. In the present study, we utilise data from the SemEval and NTCIR communities to clarify the properties of nine evaluation measures in the context of OC tasks, and six measures in the context of OQ tasks.


“…3 presents the EMM standardized performance scores of all metric-based losses except . These overall results show that, in 2 The inclusion of dataset and NSR main effects does not inform the model in any way because of the standardization, but we keep them to follow the hierarchy principle of linear models.…”

Section: Is

mentioning

confidence: 92%

New Insights into Metric Optimization for Ranking-based Recommendation

Urbano

Hanjalic

2021

Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval


Direct optimization of IR metrics has often been adopted as an approach to devise and develop ranking-based recommender systems. Most methods following this approach (e.g. TFMAP, CLiMF, Top-N-Rank) aim at optimizing the same metric being used for evaluation, under the assumption that this will lead to the best performance. A number of studies of this practice bring this assumption, however, into question. In this paper, we dig deeper into this issue in order to learn more about the effects of the choice of the metric to optimize on the performance of a ranking-based recommender system. We present an extensive experimental study conducted on different datasets in both pairwise and listwise learning-to-rank (LTR) scenarios, to compare the relative merit of four popular IR metrics, namely , , and , when used for optimization and assessment of recommender systems in various combinations. For the first three, we follow the practice of loss function formulation available in literature. For the fourth one, we propose novel loss functions inspired by for both the pairwise and listwise scenario. Our results confirm that the best performance is indeed not necessarily achieved when optimizing the same metric being used for evaluation. In fact, we find that -inspired losses perform at least as well as other metrics in a consistent way, and offer clear benefits in several cases. Interesting to see is that -inspired losses, while improving the recommendation performance for all uses, may lead to an individual performance gain that is correlated with the activity level of a user in interacting with items. The more active the users, the more they benefit. Overall, our results challenge the assumption behind the current research practice of optimizing and evaluating the same metric, and point to -based optimization instead as a promising alternative when learning to rank in the recommendation context. 
CCS CONCEPTS: Information systems → Recommender systems; Learning to rank. General and reference → Metrics.


“…Although this work is mainly theoretical, we performed a brief experiment comparing OIE against traditional metrics. Here, we use the meta-metric Metric Unanimity (MU) [4]. MU quantifies to what extent a metric is sensitive to quality aspects captured by other existing metrics.…”

Section: Methods

mentioning

confidence: 99%

“…Traditional metrics and Observational Information Effectiveness (OIE), ranked by Metric Unanimity (MU) [4]. indicates that the metric satisfies the formal constraint, indicates otherwise.…”

mentioning

confidence: 99%

A Formal Account of Effectiveness Evaluation and Ranking Fusion

Amigó

Giner

Mizzaro

et al. 2018

Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval

Self Cite


This paper proposes a theoretical framework which models the information provided by retrieval systems in terms of Information Theory. The proposed framework makes it possible to formalize: (i) system effectiveness as an information-theoretic similarity between system outputs and human assessments, and (ii) ranking fusion as an information quantity measure. As a result, the proposed effectiveness metric improves on popular metrics in terms of formal constraints. In addition, our empirical experiments suggest that it captures quality aspects from traditional metrics, while the reverse is not true. Our work also advances the understanding of the theoretical foundations of the empirically known phenomenon of effectiveness increase when combining retrieval system outputs in an unsupervised manner.

