2017
DOI: 10.1007/978-3-319-69805-2_30

Frequency Consolidation Among Word N-Grams

Abstract. This paper considers the issue of frequency consolidation in lists of different length word n-grams (i.e. recurrent word sequences) extracted from the same underlying corpus. A simple algorithm, enhanced by a preparatory stage, is proposed which allows the consolidation of frequencies among lists of different length n-grams, from 2-grams to 6-grams and beyond. The consolidation adjusts the frequency count of each n-gram to the number of its occurrences minus its occurrences as part of longer n-grams. A…
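The adjustment described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm (its preparatory stage and list-based procedure are not reproduced in this excerpt); it is a simplified, position-based approximation that assumes access to the tokenised corpus. The names `iter_ngrams`, `consolidate`, `max_n` and `min_freq` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def iter_ngrams(tokens, n):
    """Yield (start_index, ngram_tuple) for every n-gram in one sentence."""
    for i in range(len(tokens) - n + 1):
        yield i, tuple(tokens[i:i + n])


def consolidate(sentences, max_n=4, min_freq=2):
    """Return (raw, consolidated) frequency counters for 2-grams up to max_n-grams.

    Assumption: an occurrence counts towards the consolidated frequency only
    if it is not contained in an occurrence of a longer retained n-gram.
    """
    # Raw counts for every n-gram length, computed within sentences.
    raw = Counter()
    for sent in sentences:
        for n in range(2, max_n + 1):
            for _, gram in iter_ngrams(sent, n):
                raw[gram] += 1

    # Frequency threshold: only n-grams at or above min_freq are retained.
    kept = {gram for gram, count in raw.items() if count >= min_freq}

    consolidated = Counter()
    for sent in sentences:
        # All occurrences of retained n-grams in this sentence as (start, end, gram).
        spans = [
            (i, i + n, gram)
            for n in range(2, max_n + 1)
            for i, gram in iter_ngrams(sent, n)
            if gram in kept
        ]
        for start, end, gram in spans:
            # Covered means a strictly longer retained occurrence contains this one.
            covered = any(
                s <= start and end <= e and (e - s) > (end - start)
                for s, e, _ in spans
            )
            if not covered:
                consolidated[gram] += 1
    return raw, consolidated


corpus = [
    "on the other hand".split(),
    "on the other hand".split(),
    "on the table".split(),
]
raw, consolidated = consolidate(corpus, max_n=4, min_freq=2)
print(raw[("on", "the")], consolidated[("on", "the")])
```

In this toy corpus, "on the" occurs three times in total, but two of those occurrences lie inside "on the other hand", so its consolidated count drops to one. Sequences such as "the other", which only ever occur inside a longer retained sequence, receive no consolidated count at all.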

Citations: cited by 3 publications (3 citation statements)
References: references 22 publications (9 reference statements)
“…Remark. N-grams should not be calculated over sentence boundaries (Buerki 2017), but Huston, Moffat, and Croft (2011) showed that using sentence boundaries as separation can greatly reduce the number of 4-grams produced. Since the majority of longer n-grams will be dropped by the thresholding, empirically it does not matter whether sentence boundaries are considered in the n-gramming process.…”
Section: Multiword Expressions: Practical Significance (mentioning)
confidence: 99%
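The point about sentence boundaries in the statement above can be made concrete with a minimal sketch. It assumes the corpus is already split into tokenised sentences; `count_ngrams` and its `cross_boundaries` flag are hypothetical names, not taken from any of the cited works.

```python
from collections import Counter


def count_ngrams(sentences, n, cross_boundaries=False):
    """Count n-grams, either within each sentence or over the whole token stream."""
    if cross_boundaries:
        # Treat the corpus as one long sequence, ignoring sentence breaks.
        streams = [[tok for sent in sentences for tok in sent]]
    else:
        # Each sentence is its own sequence; no n-gram spans a boundary.
        streams = sentences
    counts = Counter()
    for tokens in streams:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


sentences = [["a", "b", "c"], ["d", "e", "f"]]
within = count_ngrams(sentences, 4)
across = count_ngrams(sentences, 4, cross_boundaries=True)
print(len(within), len(across))
```

With two three-word sentences, no 4-gram fits inside a single sentence, while ignoring the boundary yields three 4-grams. This is the reduction in longer n-grams that the statement attributes to using sentence boundaries as separators.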
“…very high-frequency) words. 5 In step 2, the various lengths of identified sequences had their frequencies consolidated and were combined into a single list using Sub-String (Buerki 2017). At step 3, lexico-structural filters were applied to the lists of sequences to remove sequences that were likely to lack semantic unity.…”
Section: Identification Of Fl (mentioning)
confidence: 99%
“…In the final step of the procedure, the frequencies of n-grams of various lengths are consolidated such that shorter sequences that are included in longer sequences are not counted multiple times, and sequences that only occur as part of longer sequences are eliminated (cf. Buerki, 2017).…”
Section: Data (mentioning)
confidence: 99%