2002
DOI: 10.1145/595576.595578
Toward a unified approach to statistical language modeling for Chinese

Abstract: This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques like trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and co…
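The abstract describes segmenting unspaced Chinese training text against an automatically built lexicon. The paper's own segmenter may differ, but a minimal sketch of the classic lexicon-driven baseline, greedy forward maximum matching, illustrates the idea; the toy lexicon entries below are invented for the example.

```python
# Sketch of lexicon-based segmentation: greedy forward maximum matching.
# This is a common baseline, not necessarily the method used in the paper;
# the toy lexicon is invented for illustration.

def max_match(text, lexicon, max_len=4):
    """Segment text greedily, always taking the longest lexicon match;
    fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# With this toy lexicon, "统一语言模型方法" segments as
# ["统一", "语言模型", "方法"]: the 4-character entry wins over "语言" + "模型".
toy_lexicon = {"统一", "方法", "语言", "模型", "语言模型"}
segmented = max_match("统一语言模型方法", toy_lexicon)
```

Greedy maximum matching is sensitive to lexicon quality, which is why the abstract stresses building a high-quality lexicon before segmenting the training data.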

Cited by 97 publications (68 citation statements)
References 14 publications
“…The basic architecture of ASR is inspired by the human auditory system [7]. The speech utterance for a word sequence W is formed in the speaker's mind and delivered through his or her text generator [6]. The speaker's vocal apparatus contains the signal-processing component and produces the speech waveform, which passes through many noise channels.…”
Section: Motivation
confidence: 99%
“…Many methods (Lin et al., 1997; Gao et al., 2002; Klakow, 2000; Moore and Lewis, 2010; Axelrod et al., 2011) rank sentences in the general-domain data according to their similarity to the in-domain data and select only those whose score exceeds some threshold. Such methods are effective and widely used.…”
Section: Related Work
confidence: 99%
“…Language modeling research has explored methods for subselecting new-domain data from a large monolingual target-language corpus for use as language model training data (Lin et al., 1997; Klakow, 2000; Gao et al., 2002; Mansour et al., 2011). Translation modeling research has typically assumed that either (1) two parallel datasets are available, one in the old domain and one in the new, or (2) a large, mixed-domain parallel training corpus is available.…”
Section: Related Work
confidence: 99%