Proceedings of the 39th Annual Meeting on Association for Computational Linguistics - ACL '01 2001
DOI: 10.3115/1073012.1073017

Scaling to very very large corpora for natural language disambiguation

Abstract: The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously b…

Cited by 468 publications (341 citation statements)
References 18 publications
“…An important future direction lies in expanding the corpus. Increasing the amount of data can be beneficial for machine learning algorithms (Banko and Brill 2001). Therefore, we should expand the corpus in terms of size which could be done via focused search (as for LWGC-B) or by annotating random web pages (as for LWGC-R).…”
Section: Discussion (mentioning)
confidence: 99%
“…Do other types of anomaly detectors, or more generally, learning algorithms for other data mining tasks also exhibit the gravity-defiant behaviour? In a complex domain such as natural language processing, millions of additional data has been shown to continue to improve the performance of trained models (Halevy et al 2009; Banko and Brill 2001). Is this the domain for which algorithms always comply with the learning curve?…”
Section: Implications and Potential Future Work (mentioning)
confidence: 99%
“…Earlier work in ESL error correction follows the methodology of the context-sensitive spelling correction task (Golding and Roth, 1996; Golding and Roth, 1999; Banko and Brill, 2001; Carlson et al, 2001; Carlson and Fette, 2007). Most of the effort in ESL error correction so far has been on article and preposition usage errors, as these are some of the most common mistakes among non-native English speakers (Dalgish, 1985; Leacock et al, 2010).…”
Section: Related Work (mentioning)
confidence: 99%