Automatic retrieval and clustering of similar words

Lin, Dekang

doi:10.3115/980432.980696

Cited by 643 publications

(822 citation statements)

References 7 publications

(11 reference statements)

Supporting

Mentioning

806

Contrasting

Unclassified

“…The methods implemented in the WordNet::Similarity software package (Pedersen et al 2004) determine how close two words are in WordNet. These methods are J&C (Jiang and Conrath 1997), Res (Resnik 1995), Lin (Lin 1998a), W&P (Wu and Palmer 1994), L&C (Leacock and Chodorow 1998), H&SO (Hirst and St-Onge 1998), Path (counts edges between synsets), Lesk (Banerjee and Pedersen 2002), and finally Vector and Vector Pair (Patwardhan et al 2003). The measure most similar to the edgeScore method is the Path measure in WordNet.…”

Section: Parse Wikipedia With Minipar (Lin 1998a)

mentioning

confidence: 99%

“…We used Wikipedia as a source of data and parsed it with MINIPAR (Lin 1998a). The choice of dependency triples instead of all neighbouring words favours contexts which most directly affect a word's meaning.…”

Section: Building a Word-context Matrix For Semantic Relatedness

mentioning

confidence: 99%

“…2 The API has been built on the work of Jarmasz (2003 Figure 1 outlines the process of updating Roget's Thesaurus. We work with Wikipedia as a corpus and with the parser MINIPAR (Lin 1998a). Raw text is parsed, and a word-context matrix is constructed and re-weighted in both a supervised and an unsupervised manner.…”

Section: Introduction

mentioning

confidence: 99%

Evaluation of automatic updates of Roget’s Thesaurus

Kennedy

Szpakowicz

2014

JLM

Thesauri and similarly organised resources attract increasing interest of Natural Language Processing researchers. Thesauri age fast, so there is a constant need to update their vocabulary. Since a manual update cycle takes considerable time, automated methods are required. This work presents a tuneable method of measuring semantic relatedness, trained on Roget's Thesaurus, which generates lists of terms related to words not yet in the Thesaurus. Using these lists of terms, we experiment with three methods of adding words to the Thesaurus. We add, with high confidence, over 5500 and 9600 new word senses to versions of Roget's Thesaurus from 1911 and 1987 respectively. We evaluate our work both manually and by applying the updated thesauri in three NLP tasks: selection of the best synonym from a set of candidates, pseudo-word-sense disambiguation and SAT-style analogy problems. We find that the newly added words are of high quality. The additions significantly improve the performance of Roget's-based methods in these NLP tasks. The performance of our system compares favourably with that of WordNet-based methods. Our methods are general enough to work with different versions of Roget's Thesaurus.

“…the phenomenon that errors in previous iterations have a deteriorating effect on the accuracy of later iterations McIntosh and Curran (2009). To dampen this effect, distributional similarity (Lin, 1998; van der Plas, 2008) was used to filter instance pairs where the first element is not distributionally similar to the group of soccer players or where the second element is not similar to soccer clubs. The results for this method are given in the final two columns.…”

Section: Capital-of # Patterns # Pairs (P) 1st Ans Ok Mrr

mentioning

confidence: 99%

Relation Extraction for Open and Closed Domain Question Answering

Bouma

Fahmi

Mur

2011

Interactive Multi-Modal Question-Answering

One of the most accurate methods in Question Answering uses off-line information extraction to find answers for frequently asked questions. It requires automatic extraction from text of all relation instances for relations that users frequently ask for. In this chapter, we present two methods for learning relation instances for relations relevant in a closed and open domain (medical) question answering system. Both methods try to learn automatically dependency paths that typically connect two arguments of a given relation. The first (lightly supervised) method starts from a seed list of argument instances, and extracts dependency paths from all sentences in which a seed pair occurs. This method works well for large text collections and for seeds which are easily identified, such as named entities, and is well-suited for open domain question answering. In a second experiment, we concentrate on medical relation extraction for the question answering module of the IMIX system. The IMIX corpus is relatively small and relation instances may contain complex noun phrases that do not occur frequently in the exact same form in the corpus. In this case, learning from annotated data is necessary. We show that dependency patterns enriched with semantic concept labels give accurate results for relations that are relevant for a medical question answering system. Both methods improve the performance of the Dutch question answering system Joost.

“…We used three sources to automatically expand the Seed Lexicon: WordNet [3], Lin's distributional thesaurus [4], and a pivot-based paraphrase generation tool [5]. The resulting lexicons will be called Raw WN, Raw Lin, and Raw Para, respectively; they were created as follows.…”

Section: Automatically Expanding the Seed Lexicon

mentioning

confidence: 99%

Building Subjectivity Lexicon(s) from Scratch for Essay Data

Klebanov

Burstein

Madnani

et al.

2012

Computational Linguistics and Intelligent Text Processing

While there are a number of subjectivity lexicons available for research purposes, none can be used commercially. We describe the process of constructing subjectivity lexicon(s) for recognizing sentiment polarity in essays written by test-takers, to be used within a commercial essay-scoring system. We discuss ways of expanding a manually-built seed lexicon using dictionary-based, distributional in-domain and out-of-domain information, as well as using Amazon Mechanical Turk to help "clean up" the expansions. We show the feasibility of constructing a family of subjectivity lexicons from scratch using a combination of methods to attain competitive performance with state-of-the-art research-only lexicons. Furthermore, this is the first use, to our knowledge, of a paraphrase generation system for expanding a subjectivity lexicon.
