Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)
DOI: 10.18653/v1/D17-1008

Train-O-Matic: Large-Scale Supervised Word Sense Disambiguation in Multiple Languages without Manual Training Data

Abstract: Annotating large numbers of sentences with senses is the heaviest requirement of current Word Sense Disambiguation. We present Train-O-Matic, a language-independent method for generating millions of sense-annotated training instances for virtually all meanings of words in a language's vocabulary. The approach is fully automatic: no human intervention is required and the only type of human knowledge used is a WordNet-like resource. Train-O-Matic achieves consistently state-of-the-art performance across gold standard…
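For intuition, the pipeline the abstract describes (and which a citing paper below summarizes as "random walks over WordNet and training a classifier on it") can be sketched as a personalized PageRank over a WordNet graph: seed the walk on the senses of a sentence's context words and label the target word with its highest-ranked sense. The sketch below is a minimal illustration of that idea, not the paper's exact pipeline; the hypernym-only graph, the damping factor, and the function names are our assumptions.

```python
# Minimal sketch: sense-label one occurrence of a target word via a
# personalized PageRank over a WordNet graph. Requires networkx and nltk
# with the 'wordnet' corpus downloaded.
import networkx as nx
from nltk.corpus import wordnet as wn

def build_wordnet_graph():
    """Undirected synset graph linked by hypernymy (an assumption; the
    paper uses a richer WordNet-like semantic network)."""
    g = nx.Graph()
    for syn in wn.all_synsets():
        for hyper in syn.hypernyms():
            g.add_edge(syn.name(), hyper.name())
    return g

def label_occurrence(graph, target, context_words):
    """Seed PageRank on the context words' senses and return the
    highest-ranked candidate sense of `target`."""
    seeds = {s.name(): 1.0
             for w in context_words
             for s in wn.synsets(w)
             if graph.has_node(s.name())}
    if not seeds:
        return None
    ranks = nx.pagerank(graph, alpha=0.85, personalization=seeds)
    candidates = [s.name() for s in wn.synsets(target)]
    return max(candidates, key=lambda s: ranks.get(s, 0.0), default=None)

graph = build_wordnet_graph()
# A financial context should pull "bank" toward a financial sense; the
# (sentence, sense) pairs produced this way would then train a standard
# supervised WSD classifier.
print(label_occurrence(graph, "bank", ["money", "deposit", "loan"]))
```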

Cited by 32 publications (27 citation statements) · References 25 publications

Citation statements (ordered by relevance):

“…Supervised models have been shown to consistently outperform knowledge-based ones in all standard benchmarks (Raganato et al., 2017), at the expense, however, of harder training and limited flexibility. First of all, obtaining reliable sense-annotated corpora is highly expensive and especially difficult when non-expert annotators are involved (de Lacalle and Agirre, 2015), and as a consequence approaches based on unlabeled data and semi-supervised learning are emerging (Taghipour and Ng, 2015b; Başkaya and Jurgens, 2016; Yuan et al., 2016; Pasini and Navigli, 2017).…”
Section: Introduction (mentioning)
Confidence: 99%
“…Baseline Methods: The baselines include several state-of-the-art approaches: MFS, which directly outputs the Most Frequent Sense in WordNet; IMS (Zhong and Ng, 2010), a classifier built on handcrafted features, i.e., POS tags, surrounding words and local collocations; Babelfy (Moro, Raganato, and Navigli, 2014), a state-of-the-art knowledge-based WSD system that exploits random walks to connect synsets and text fragments; Lesk ext+emb (Basile, Caputo, and Semeraro, 2014a), an extension of Lesk that incorporates similarity information from definitions; UKB gloss (Agirre and Soroa, 2009; Agirre, de Lacalle, and Soroa, 2014), another graph-based method for WSD; a joint learning model for WSD and entity linking (EL) that exploits semantic resources (Weissenborn et al., 2015); IMS-s+emb (Iacobacci, Pilehvar, and Navigli, 2016), a combination of the original IMS with word embeddings weighted by exponential decay, with surrounding words removed from the features; Context2vec (Melamud, Goldberger, and Dagan, 2016), a generic model for generating context representations for WSD; an LSTM trained jointly on labeled and unlabeled data (Le, Postma, and Urbani, 2017), whose unlabeled corpus is roughly equal in size to ours, which makes the comparison fairer; a model that jointly learns to predict word senses, POS tags and coarse-grained semantic labels (Raganato, Bovi, and Navigli, 2017); and Train-O-Matic (Pasini and Navigli, 2017), a language-independent approach that automatically generates sense-labeled data via random walks over WordNet and trains a classifier on it. Datasets: We choose SemCor 3.0 (Miller et al., 1994) (226,036 manual sense annotations), which is also used by the baselines, as the manually labeled data.…”
Section: Setup (mentioning)
Confidence: 99%
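Of the baselines quoted above, MFS is the one that is easy to reproduce exactly: WordNet orders each lemma's synsets by decreasing frequency in SemCor, so the first synset is the most frequent sense. A minimal sketch (the helper name is ours):

```python
# MFS (Most Frequent Sense) baseline: WordNet lists a lemma's synsets in
# decreasing SemCor frequency, so synsets[0] is the most frequent sense.
# Requires nltk with the 'wordnet' corpus downloaded.
from nltk.corpus import wordnet as wn

def most_frequent_sense(lemma, pos=None):
    """Return the most frequent WordNet synset for `lemma`, or None."""
    synsets = wn.synsets(lemma, pos=pos)
    return synsets[0] if synsets else None

print(most_frequent_sense("bank", pos=wn.NOUN))  # Synset('bank.n.01')
```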
“…To address this widespread, excessive dependence on external resources, Pasini [10] proposed a multilingual disambiguation system that does not use manually annotated training data. Panchenko [11] likewise proposed an unsupervised disambiguation method that does not rely on external knowledge.…”
Section: State of the Art (mentioning)
Confidence: 99%