2000
DOI: 10.4218/etrij.00.0100.0203

An Algorithm for Predicting the Relation between Lemmas and Corpus Size

Abstract: Much research on natural language processing (NLP), computational linguistics, and lexicography has relied on linguistic corpora. In recent years, many organizations around the world have been constructing their own large corpora to achieve corpus representativeness and/or linguistic comprehensiveness. However, there is no reliable guideline as to how large machine-readable corpus resources should be compiled to develop practical NLP software and/or complete dictionaries for humans and computational use.

Cited by 8 publications (5 citation statements) | References 3 publications | Citing publications: 2001–2012

“…They concluded that the UMLS Metathesaurus‐based CLIR method was at least equivalent to, if not better than, multilingual‐dictionary‐based approaches. Yang, Gomez, and Song (2000) observed that there was no reliable guideline as to how large, machine readable corpora should be compiled to develop practical NLP software packages and/or complete dictionaries for humans and computational use. They proposed a new mathematical tool—a piecewise curve‐fitting algorithm—and suggested how to determine the tolerance error of the algorithm for good prediction, using a specific corpus.…”
Section: Machine Translation and CLIR (mentioning, confidence: 99%)
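The piecewise curve-fitting idea described in the statement above can be sketched in a few lines. The sketch below is illustrative only: the power-law form V ≈ k·N^β for each piece, the greedy segmentation rule, and the tolerance value are assumptions made for this example, not the procedure actually published by Yang, Gomez, and Song (2000).

```python
import numpy as np

def piecewise_lemma_fit(sizes, lemmas, tol=0.02):
    """Greedy piecewise fit of lemma counts V against corpus sizes N.

    Each piece is a line in log-log space (i.e. V ~ k * N**beta); a new piece
    is opened as soon as the maximum absolute log-space residual of the
    current piece exceeds the tolerance `tol`. Illustrative sketch only.
    """
    logN = np.log(np.asarray(sizes, dtype=float))
    logV = np.log(np.asarray(lemmas, dtype=float))
    pieces, start = [], 0
    while start < len(logN) - 1:
        end = start + 2                          # a piece needs at least two samples
        fit = np.polyfit(logN[start:end], logV[start:end], 1)
        while end < len(logN):
            trial = np.polyfit(logN[start:end + 1], logV[start:end + 1], 1)
            resid = np.polyval(trial, logN[start:end + 1]) - logV[start:end + 1]
            if np.max(np.abs(resid)) > tol:      # tolerance error exceeded: close the piece
                break
            fit, end = trial, end + 1
        beta, logk = fit
        pieces.append((sizes[start], sizes[end - 1], beta, float(np.exp(logk))))
        start = end - 1                          # next piece starts where this one ends
    return pieces                                # each entry: (N_start, N_end, beta, k)
```

Extrapolating the last fitted piece then gives a rough estimate of how many additional lemmas a larger corpus would contribute, which is the kind of prediction the cited algorithm is described as supporting.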
“…Previous approaches have dealt with syntactic term mismatches caused by noun phrases in three ways. The first approach has tried to normalize noun phrases by the statistical term cooccurrence method (Fagan 1989; Yun, Cho, and Rim 1997; Yang 2000). This method defines noun phrases as two or more single terms that occur frequently in close proximity to each other in the texts.…”
Section: Methods to Handle Syntactic Term Mismatch (mentioning, confidence: 99%)
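The statistical term co-occurrence method mentioned in this statement can be illustrated with a small sketch. The sliding-window size and frequency threshold below are arbitrary illustrative parameters, not values taken from Fagan (1989), Yun, Cho, and Rim (1997), or Yang (2000).

```python
from collections import Counter

def cooccurrence_phrase_candidates(tokens, window=3, min_count=5):
    """Collect unordered term pairs that appear within `window` tokens of each
    other at least `min_count` times and treat them as noun-phrase candidates.
    Illustrative sketch only; both thresholds are arbitrary.
    """
    pair_counts = Counter()
    for i, term in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            pair_counts[tuple(sorted((term, tokens[j])))] += 1
    return [pair for pair, n in pair_counts.items() if n >= min_count]
```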
“…However, it is impractical to get all the necessary triphone instances [8]. For the prosodic aspect, each triphone should retain enough instances to cover possible prosodic variations appearing in the utterances as the number of necessary triphone instances may reach over several millions in the end.…”
Section: Design of Recording Sentence Set (mentioning, confidence: 99%)
“…For the prosodic aspect, each triphone should retain enough instances to cover possible prosodic variations appearing in the utterances as the number of necessary triphone instances may reach over several millions in the end. However, it is impractical to get all the necessary triphone instances [8].…”
Section: Design of Recording Sentence Set (mentioning, confidence: 99%)
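The coverage concern raised in the last two statements can be made concrete with a short sketch that counts triphone instances in a candidate recording-sentence set. The phone-list representation and the instance threshold are assumptions for illustration, not part of the cited design procedure.

```python
from collections import Counter

def triphone_coverage(sentences_as_phones, min_instances=3):
    """Count triphone instances over a candidate sentence set and report which
    triphones fall short of `min_instances`. Each sentence is given as a list
    of phone symbols. Illustrative sketch only; the threshold is arbitrary.
    """
    counts = Counter()
    for phones in sentences_as_phones:
        for i in range(len(phones) - 2):
            counts[tuple(phones[i:i + 3])] += 1
    shortfall = {tri: n for tri, n in counts.items() if n < min_instances}
    return counts, shortfall
```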