2022
DOI: 10.1177/20539517221129549

The unbearable (technical) unreliability of automated facial emotion recognition

Abstract: Emotion recognition, and in particular facial emotion recognition (FER), is among the most controversial applications of machine learning, not least because of its ethical implications for human subjects. In this article, we address the controversial conjecture that machines can read emotions from our facial expressions by asking whether this task can be performed reliably. This means, rather than considering the potential harms or scientific soundness of facial emotion recognition systems, focusing on the reli…

Cited by 13 publications (5 citation statements). References 84 publications.
“…We note that representativeness cannot be taken for granted with case-wise majority voting. For instance, in a crowdsourcing study (Cabitza, Campagner, and Mattioli 2022) in the emotion recognition field, we observed that the involved raters agreed with the majority judgement less than half of the time on average (average alpha: .44). A similar agreement was observed in a radiological study (Cabitza et al 2020), where the authors involved 13 experts to diagnose 427 images: there the average agreement was higher (alpha: .76), but no radiologist agreed with the majority decision in more than 89% of cases.…”
Section: Strong and Weak Perspectivism
confidence: 90%
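The per-rater agreement with the case-wise majority decision discussed in this statement can be computed directly. The sketch below uses hypothetical toy data and label names of our own choosing (not taken from the cited studies) to show one minimal way to do it:

```python
from collections import Counter

def majority_label(labels):
    # Most frequent label for one case; ties are broken by first occurrence.
    return Counter(labels).most_common(1)[0][0]

def rater_majority_agreement(ratings):
    """ratings: list of cases, each a list of labels (one per rater).
    Returns, for each rater index, the fraction of cases in which that
    rater's label matched the case-wise majority label."""
    n_raters = len(ratings[0])
    hits = [0] * n_raters
    for case in ratings:
        maj = majority_label(case)
        for r, label in enumerate(case):
            if label == maj:
                hits[r] += 1
    return [h / len(ratings) for h in hits]

# Hypothetical toy data: 4 cases, each rated by 3 raters.
ratings = [
    ["joy", "joy", "fear"],
    ["anger", "joy", "anger"],
    ["joy", "joy", "joy"],
    ["fear", "joy", "fear"],
]
print(rater_majority_agreement(ratings))  # → [1.0, 0.5, 0.75]
```

A score of 1.0 means a rater matched the majority label on every case; values well below 1.0, as in the radiological study cited above, indicate that even the best-aligned expert deviates from the majority verdict on a non-trivial fraction of cases.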
“…In fact, real-world settings show that disagreement is unavoidable and essentially irreducible, especially when the objects to classify are so complex that most of the raters can actually get them wrong, and the real experts are a minority (Basile 2021; Cabitza et al 2019); or when the objects are so ambiguous, as often happens in Natural Language Processing (NLP) (Artstein and Poesio 2008), emotion recognition (Cabitza, Campagner, and Mattioli 2022) or Computer Vision (Yun et al 2021), that disagreement between annotators may embed valuable nuances challenging the very idea of clear-cut classification (Aroyo and Welty 2015). Moreover, the ambiguity and complexity of objects and cases to be interpreted can lead to high disagreement among raters not only in the notoriously subjective domains mentioned above, but also in seemingly objective disciplines like medicine or engineering: for instance, as considered in (Chernova and Veloso 2010), training a self-driving vehicle may involve states in which multiple actions are perfectly reasonable; Schaekermann et al (2019) reported a disagreement rate of over 50% in the identification of Parkinson's disease, which could not be completely eliminated even after Delphi-like group deliberation; similarly, Cabitza et al (2019) reported poor agreement between clinicians even in merely descriptive tasks, when they were called to describe electrocardiograms they had just read or surgical operations they had attended in person.…”
confidence: 99%
“…Contextual variables can be measured and quantified by human annotators [ 86 ]. When facial expressions are presented with perceptually rich contextual information, human annotators show substantially greater agreement for labeling facial expressions than decontextualized faces [ 87 ]. This indicates that the current limitations of evaluating facial expressions with FER systems could be addressed by including contextual cues, as human perceivers can make more robust, reliable emotion inferences.…”
Section: Analyzing Naturalistic Facial Expressions With Deep Learning
confidence: 99%
“…Moreover, when images achieve a high degree of agreement, labels can be defined using only one emotion, as currently done in most existing FER datasets, but it could be interesting to have two (or more) labels when humans tend to disagree. Such low data reliability yields crucial challenges for FER technologies [33]. In this context, FER models could learn from several emotions, as well as a human percentage of confidence, which would help improve their performance.…”
Section: On a New Ground Truth Definition
confidence: 99%
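The idea of training on several emotions together with a human percentage of confidence amounts, in one common reading, to fitting the model to the annotators' label distribution (soft labels) rather than a single one-hot target. The sketch below illustrates that reading; the function name and the toy distributions are our own assumptions, not taken from the cited work:

```python
import math

def soft_label_cross_entropy(pred_probs, label_dist):
    """Cross-entropy between an annotator label distribution and a model's
    predicted distribution. Both arguments are dicts mapping emotion -> probability."""
    eps = 1e-12  # guard against log(0) for emotions the model assigns zero mass
    return -sum(p * math.log(pred_probs.get(e, 0.0) + eps)
                for e, p in label_dist.items() if p > 0)

# Hypothetical example: annotators split 70/30 between joy and surprise,
# instead of being forced into a single majority label.
label_dist = {"joy": 0.7, "surprise": 0.3}
pred = {"joy": 0.6, "surprise": 0.3, "anger": 0.1}
loss = soft_label_cross_entropy(pred, label_dist)
```

A prediction that reproduces the annotators' split exactly attains the minimum loss (the entropy of the label distribution), so the model is rewarded for matching human uncertainty rather than a forced single label.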