Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios

Ashida, Mana; Sugawara, Saku

doi:10.48550/arxiv.2209.07760

Cited by 1 publication

(2 citation statements)

References 38 publications


“…identifying the rhyme scheme in a poem or increasing the contrast in an image. In the other direction, there can be evidence of various kinds, beyond performance on a commonsense benchmark, that an AI system does or does not use commonsense reasoning in carrying out a task. In a knowledge-based system, one can often see what knowledge is being used and how; if commonsense knowledge is being used in a significant way in carrying out a particular task, then presumably this is to some extent a commonsense task.…”

Section: An Untrue Claim About Commonsense Knowledge (mentioning)

confidence: 99%

“…Size Construction
PIQA [12]: Physical interaction QA; 20,000 questions; Crowd sourcing
Possible Stories [6]: Counterfactual narratives; 1313 texts, 4533 questions; Crowd sourcing
PROST [5]: Physical reasoning; 18,736 questions; Expert-written cloze task template.
ProtoQA [15]: Reasoning about prototypical situations; 9700 questions; Crowd sourcing
ReCoRD [159]: Cloze question about news stories; 120,000 questions; Extracted from online news source.…”

Section: Task (mentioning)

confidence: 99%


Benchmarks for Automated Commonsense Reasoning: A Survey

Davis

2023

Preprint


More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed and many aspects of common sense remain untested. Consequently, we do not currently have any reliable way of measuring to what extent existing AI systems have achieved these abilities. This paper surveys the development and uses of AI commonsense benchmarks. We discuss the nature of common sense; the role of common sense in AI; the goals served by constructing commonsense benchmarks; and desirable features of commonsense benchmarks. We analyze the common flaws in benchmarks, and we argue that it is worthwhile to invest the work needed to ensure that benchmark examples are consistently high quality. We survey the various methods of constructing commonsense benchmarks. We enumerate 139 commonsense benchmarks that have been developed: 102 text-based, 18 image-based, 12 video-based, and 7 simulated physical environments. We discuss the gaps in the existing benchmarks and aspects of commonsense reasoning that are not addressed in any existing benchmark. We conclude with a number of recommendations for future development of commonsense AI benchmarks.

