Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1439

WikiCREM: A Large Unsupervised Corpus for Coreference Resolution

Abstract: Pronoun resolution is a major area of natural language understanding. However, large-scale training sets are still scarce, since manually labelling data is costly. In this work, we introduce WIKICREM (Wikipedia CoREferences Masked), a large-scale, yet accurate dataset of pronoun disambiguation instances. We use a language-model-based approach for pronoun resolution in combination with our WIKICREM dataset. We compare a series of models on a collection of diverse and challenging coreference resolution problems,…
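The abstract describes a language-model-based approach to pronoun resolution used in combination with WikiCREM. As a rough illustration (not the authors' exact model or training procedure), the Python sketch below scores two candidate fillers for a masked mention with an off-the-shelf BERT masked language model from the HuggingFace transformers library; it assumes each candidate is a single token in the model's vocabulary.

# Minimal sketch: score candidate antecedents for a [MASK]ed mention with a
# masked language model. Assumptions: HuggingFace transformers is installed,
# bert-base-uncased is used, and each candidate is one vocabulary token.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def score_candidates(masked_sentence, candidates):
    """Return the masked-LM log-probability of each candidate at the [MASK] slot."""
    inputs = tokenizer(masked_sentence, return_tensors="pt")
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    log_probs = torch.log_softmax(logits, dim=-1)
    return {c: log_probs[tokenizer.convert_tokens_to_ids(c)].item() for c in candidates}

scores = score_candidates("john asked peter to leave because [MASK] was tired.", ["john", "peter"])
print(max(scores, key=scores.get))  # candidate the masked LM prefers for the slot

Here the preferred name is read directly off the masked-LM head; as the citation statements below note, the paper additionally fine-tunes a BERT language model on WikiCREM instances before such evaluation.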

Cited by 24 publications (13 citation statements)
References 20 publications
“…Much research has focused on resolving under-resourced tasks via semi-supervised learning (Pekar et al., 2014; Kocijan et al., 2019; Hou, 2020) and shared representation based transfer learning (Yang et al., 2017; Cotterell and Duh, 2017; Zhou et al., 2019).…”
Section: Approaches for Under-Resourced Tasks
confidence: 99%
“…Recently, another line of research focused on creating synthetic training data from unlabelled/automatically labelled data using some heuristic patterns. Kocijan et al. (2019) used Wikipedia to create WikiCREM, a large pronoun resolution dataset, using heuristic rules based on the occurrence of personal names in sentences. The evaluation of their system on Winograd Schema corpora shows that models pre-trained on WikiCREM consistently outperform models that do not use it.…”
Section: Approaches for Under-Resourced Tasks
confidence: 99%
“…Liu et al. (2019) use a language model objective to train a memory network, which can resolve coreference links. Kocijan et al. (2019) find pairs of sentences with at least two distinct personal names such that one of them is repeated. One non-first occurrence of the repeated candidate is masked, and the goal is to predict the masked name, given the correct and one incorrect candidate.…”
Section: Fine-tuning
confidence: 99%
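The masking heuristic quoted above can be sketched roughly as follows. This is an illustrative simplification rather than the released WikiCREM pipeline: the person names in the passage are supplied explicitly instead of being detected over Wikipedia, and simple whole-word matching stands in for the paper's name handling.

# Rough sketch of the WikiCREM-style instance construction described above.
# Assumptions: the list of person names is given, and names match as whole
# words; the real pipeline detects names over Wikipedia text.
import re

def make_instance(text, names):
    """Mask one non-first occurrence of a repeated name; return the masked
    text, the correct name, and one incorrect candidate, or None."""
    for name in names:
        occurrences = list(re.finditer(rf"\b{re.escape(name)}\b", text))
        if len(occurrences) < 2:
            continue  # the name must appear at least twice
        distractors = [n for n in names if n != name]
        if not distractors:
            continue  # need at least one other distinct name as a candidate
        second = occurrences[1]  # keep the first occurrence as context
        masked = text[:second.start()] + "[MASK]" + text[second.end():]
        return {"text": masked, "correct": name, "candidate": distractors[0]}
    return None

print(make_instance("Ann told Mary that Ann would arrive late.", ["Ann", "Mary"]))
# {'text': 'Ann told Mary that [MASK] would arrive late.', 'correct': 'Ann', 'candidate': 'Mary'}

Masking a non-first occurrence keeps an earlier mention of the correct name in the context, so a model must link the masked slot back to that mention rather than rely on surface frequency alone.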
“…Two recent attempts at pre-training coreference models have focused on tasks such as language modeling (Liu et al., 2019) and masked-word prediction for name resolution (Kocijan et al., 2019). Here we propose self-supervision tasks that train the coreference model directly (rather than just the underlying BERT), resulting in improved mention representations and resolution accuracy.…”
Section: Introduction
confidence: 99%
“…The challenge required systems to resolve such ambiguous pronouns to one of two given antecedents. Eight years later, in 2019, a team from Oxford, Imperial, and the Alan Turing Institute (Kocijan et al. 2019b) showed that fine-tuning a BERT language model with several million artificial Winograd schema sentences extracted from Wikipedia could enable that model to correctly resolve over 75% of the examples in the Challenge, and also improve the performance of systems in resolving more prosaic examples of personal pronoun anaphors (Kocijan et al. 2019a). Now, although this is very impressive and something we should learn from, it is unclear how performance can further improve, given that it reflects general patterns of personal pronoun use and not any specific understanding of coreference.…”
confidence: 99%