2023
DOI: 10.31219/osf.io/rvy5p
Preprint

Testing the Reliability of ChatGPT for Text Annotation and Classification: A Cautionary Remark

Abstract: Recent studies have demonstrated the promising potential of ChatGPT for various text annotation and classification tasks. However, ChatGPT is non-deterministic, which means that, as with human coders, identical input can lead to different outputs. Given this, it seems appropriate to test the reliability of ChatGPT. Therefore, this study investigates the consistency of ChatGPT's zero-shot capabilities for text annotation and classification, focusing on different model parameters, prompt variations, and repetitions o…

Cited by 31 publications (11 citation statements)
References 10 publications
“…ChatGPT) can be successfully used to classify texts or identify false information [11,19,46], the prevalence of factually incorrect responses that we observed in our study stresses the relevance of more critical assessments (e.g. [34]) which call for extreme caution when applying LLMs for such tasks, especially when working with non-English content. Currently, it is unclear in which cases the LLMs perform better or worse with regard to generating factually correct information and how (in)consistent their performance is across different languages.…”
Section: Discussion
confidence: 52%
“…Currently, it is unclear in which cases the LLMs perform better or worse with regard to generating factually correct information and how (in)consistent their performance is across different languages. Hence, human validation is essential at this stage of LLM development before these models can be deployed for downstream tasks [34].…”
Section: Discussion
confidence: 99%
“…This study had several limitations. Primarily, the AI-generated diet plans exhibited a degree of inconsistency, likely owing to inherent randomness in the output generation of the AI chatbot model (23,24). The model was engineered to produce a range of responses rather than consistently offering identical solutions.…”
Section: Discussion
confidence: 99%
“…Thorough validation, including comparisons against human-annotated reference data, is imperative in addressing these concerns and ensuring the accuracy and reliability of ChatGPT's outputs for data labeling. Therefore, crafting clear and precise instructions for data labeling tasks is paramount [23].…”
Section: Relevant Research
confidence: 99%