2020
DOI: 10.1007/978-3-030-47436-2_63

An Empirical Model for n-gram Frequency Distribution in Large Corpora

Abstract: Statistical multiword extraction methods can benefit from knowledge of the n-gram (n ≥ 1) frequency distribution in natural language corpora, for indexing and time/space optimization purposes. The appearance of increasingly large corpora raises new challenges for investigating the large-scale behavior of n-gram frequency distributions, which does not typically emerge in small-scale corpora. We propose an empirical model, based on the assumption of finite n-gram language vocabularies, to estimate the numbe…
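To make the term "n-gram frequency distribution" concrete, the following is a minimal Python sketch that counts n-grams in a token list and then tallies how many distinct n-grams occur exactly k times. It is not the empirical model proposed in the paper; the function names `ngram_counts` and `frequency_spectrum` and the toy sentence are assumptions introduced only for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (stored as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def frequency_spectrum(counts):
    """Map each frequency k to the number of distinct n-grams seen exactly k times."""
    return Counter(counts.values())


# Toy corpus (illustrative only; real corpora are many orders of magnitude larger).
tokens = "the cat sat on the mat and the cat slept".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
bigram_spectrum = frequency_spectrum(bigrams)

print(len(unigrams), "distinct 1-grams;", len(bigrams), "distinct 2-grams")
print("2-gram frequency spectrum:", dict(bigram_spectrum))
```

On a corpus this small the counts are dominated by n-grams that occur once, so the sketch only illustrates the quantities involved, not the large-scale behavior the abstract refers to.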

citations
Cited by 3 publications
(1 citation statement)
references
References 8 publications
“…Bag-of-words approaches typically use 1-grams as features to represent text, while bag-of-terms approaches use a combination of n-grams. According to Silva and Cunha, n-grams do not produce major improvements in the classification performance, but the actual impact is often dependent on the application [14]. Dimensionality reduction can be considered as a compression of the data.…”
Section: Data Pre-processing (mentioning)
confidence: 99%