ESTER: A Machine Reading Comprehension Dataset for Reasoning about Event Semantic Relations

Han, Rujun; Hsu, I-Hung; Sun, Jiao; Baylon, Julia; Ning, Qiang; Roth, Dan; Peng, Nanyun

doi:10.18653/v1/2021.emnlp-main.597

Cited by 17 publications

(7 citation statements)

References 18 publications

“…Rogers et al [33] proposes an "evidence format" for the explainable part of a dataset composed of Modality (Unstructured text, Semi-structured text, Structured knowledge, Images, Audio, Video, Other combinations) and Amount of evidence (Single source, Multiple sources, Partial source, No sources). (a) spatial reasoning: bAbI [107], SpartQA [108]; (b) temporal reasoning: event order (QuAIL [109], TORQUE [110]), event attribution to time (TEQUILA [111], TempQuestions [112]), script knowledge (MCScript [113]), event duration (MCTACO [114], QuAIL [109]), temporal commonsense knowledge (MCTACO [114], TIMEDIAL [115]), factoid/news questions with answers where the correct answers change with time (ArchivalQA [116], SituatedQA [117]), temporal reasoning in multimodal setting (DAGA [118], TGIF-QA [119]); (c) belief states: Event2Mind [120], QuAIL [109]; (d) causal relations: ROPES [121], QuAIL [109], QuaRTz [122], ESTER [123]; (e) other relations between events: subevents, conditionals, counterfactuals etc. ESTER [123]; (f) entity properties and relations: social interactions (SocialIQa [124]), properties of characters (QuAIL [109]), physical properties (PIQA [125], QuaRel [126]), numerical properties (NumberSense [127]); (g) tracking entities: across locations (bAbI [arXiv:1502.05698]), in coreference chains (Quoref [128],…”

Section: Big Bench Datasets For (mentioning)

confidence: 99%

Complex QA and language models hybrid architectures, Survey

Daull, Bellot, Bruno et al. 2023

Preprint

This paper provides a survey of the state of the art of hybrid language models architectures and strategies for "complex" question-answering (QA, CQA, CPS). Very large language models are good at leveraging public data on standard problems but once you want to tackle more specific complex questions or problems you may need specific architecture, knowledge, skills, tasks, methods, sensitive data, performance, human approval and versatile feedback... This survey extends findings from the robust community edited research papers BIG, BLOOM and HELM which open source, benchmark and analyze limits and challenges of large language models in terms of tasks complexity and strict evaluation on accuracy (e.g. fairness, robustness, toxicity, ...). It identifies the key elements used with Large Language Models (LLM) to solve complex questions or problems. Recent projects like ChatGPT and GALACTICA have allowed non-specialists to grasp the great potential as well as the equally strong limitations of language models in complex QA. Hybridizing these models with different components could allow to overcome these different limits and go much further. We discuss some challenges associated with complex QA, including domain adaptation, decomposition and efficient multi-step QA, long form QA, non-factoid QA, safety and multi-sensitivity data protection, multimodal search, hallucinations, QA explainability and truthfulness, time dimension. Therefore we review current solutions and promising strategies, using elements such as hybrid LLM architectures, human-in-the-loop reinforcement learning, prompting adaptation, neuro-symbolic and structured knowledge grounding, program synthesis, and others. We analyze existing solutions and provide an overview of the current research and trends in the area of complex QA.

“…These datasets are in different formats such as NLI, Question Answering (QA), and Reading Comprehension (RC). They target a large set of skills including monotonicity (Yanaka et al, 2019a), deductive logic, event semantics (Han et al, 2021), physical and social commonsense (Sap et al, 2019; Bisk et al, 2019), defeasible reasoning (Rudinger et al, 2020), and more. Our work brings together a set of challenge datasets to build a benchmark covering a large set of specific linguistic skills.…”

Section: Related Work (mentioning)

confidence: 99%

Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding

Chen, Gao 2022

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Langua

In the age of large transformer language models, linguistic evaluation plays an important role in diagnosing models' abilities and limitations on natural language understanding. However, current evaluation methods show some significant shortcomings. In particular, they do not provide insight into how well a language model captures distinct linguistic skills essential for language understanding and reasoning. Thus they fail to effectively map out the aspects of language understanding that remain challenging to existing models, which makes it hard to discover potential limitations in models and datasets. In this paper, we introduce CURRICULUM as a new format of NLI benchmark for evaluation of broad-coverage linguistic phenomena. CURRICULUM contains a collection of datasets that covers 36 types of major linguistic phenomena and an evaluation procedure for diagnosing how well a language model captures reasoning skills for distinct types of linguistic phenomena. We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality. In addition, our experiments provide insight into the limitation of existing benchmark datasets and state-of-the-art models that may encourage future research on re-designing datasets, model architectures, and learning objectives.

“…When study units are organized textual data, we find it meaningful to further divide observed covariates into two broad categories: "explicit observed covariates" that could be derived from the organized textual data at face value, e.g., the number of theorems/equations/figures in a conference paper, and "implicit observed covariates" that capture deeper aspects intrinsic to the textual data. Some concrete examples of implicit covariates include: bag-of-words embeddings such as Word2Vec (Mikolov et al, 2013) and GloVe (Pennington et al, 2014), and contextual embeddings such as BERT (Devlin et al, 2019) and SentenceBERT (Reimers and Gurevych, 2019); perceived sentiments, tones, and emotions from the text (Barbieri et al, 2020; Pérez et al, 2021); topic modeling and keyword summarizing (Xie et al, 2015; Blei and Lafferty, 2007; Ramage et al, 2009; Wang et al, 2020; Santosh et al, 2020); evaluated trustworthiness of the claims made (Nadeem et al, 2019; Zhang et al, 2021b); temporal relationships and semantic relationships of events mentioned (Zhou et al, 2021; Han et al, 2021); commonsense knowledge reasoning (such as complex relations between events, consequences, and predictions) based on the text (Chaturvedi et al, 2017; Speer et al, 2017; Hwang et al, 2021; Jiang et al, 2021). These are by no means exhaustive; nor are they necessary for each and every causal query.…”

Section: A Dichotomy Of Covariates (mentioning)

confidence: 99%

Some Reflections on Drawing Causal Inference using Textual Data: Parallels Between Human Subjects and Organized Texts

Zhang, Zhang 2022

Preprint

We examine the role of textual data as study units when conducting causal inference by drawing parallels between human subjects and organized texts. We elaborate on key causal concepts and principles, and expose some ambiguity and sometimes fallacies. To facilitate better framing a causal query, we discuss two strategies: (i) shifting from immutable traits to perceptions of them, and (ii) shifting from some abstract concept/property to its constituent parts, i.e., adopting a constructivist perspective of an abstract concept. We hope this article would raise the awareness of the importance of articulating and clarifying fundamental concepts before delving into developing methodologies when drawing causal inference using textual data.
