A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents

Wang

IEICE Trans. Inf. & Syst.

2012

Self Cite

SUMMARY: This paper describes the application of two attractive categories of topic modeling techniques to the problem of spoken document retrieval (SDR), viz. document topic model (DTM) and word topic model (WTM). Apart from using the conventional unsupervised training strategy, we explore a supervised training strategy for estimating these topic models, imagining a scenario in which user query logs along with click-through information of relevant documents can be utilized to build an SDR system. This attempt has the potential to associate relevant documents with queries even if they do not share any of the query words, thereby improving retrieval quality over the baseline system. Likewise, we also study a novel use of pseudo-supervised training to associate relevant documents with queries through a pseudo-feedback procedure. Moreover, in order to lessen SDR performance degradation caused by imperfect speech recognition, we investigate leveraging different levels of index features for topic modeling, including words, syllable-level units, and their combination. We provide a series of experiments conducted on the TDT (TDT-2 and TDT-3) Chinese SDR collections. The empirical results show that the methods deduced from our proposed modeling framework are very effective when compared with a few existing retrieval approaches.

Key words: spoken document retrieval, topic model, supervised training, pseudo-supervised training, subword-level indexing

Section: Methods (mentioning)

confidence: 99%

Section: Subword-level Index Units (mentioning)

confidence: 99%

Section: Introduction (mentioning)

confidence: 99%

Spoken Document Retrieval Leveraging Unsupervised and Supervised Topic Modeling Techniques

Wang

IEICE Trans. Inf. & Syst.

2012

Self Cite

“…This method was further improved upon in Song and Croft [1999]. Chen et al. [2004] applied Song and Croft's method to Mandarin SDR using 1-best ASR transcripts. In this task, it was also shown to outperform tf·idf (with logarithmically adjusted document and query term frequencies).…”

Section: Retrieval Via Statistical Language Modeling (mentioning)

confidence: 99%

Statistical lattice-based spoken document retrieval

Chia

Sim

et al. 2010

ACM Trans. Inf. Syst.

Recent research efforts on spoken document retrieval have tried to overcome the low quality of 1-best automatic speech recognition transcripts, especially in the case of conversational speech, by using statistics derived from speech lattices containing multiple transcription hypotheses as output by a speech recognizer. We present a method for lattice-based spoken document retrieval based on a statistical n-gram modeling approach to information retrieval. In this statistical lattice-based retrieval (SLBR) method, a smoothed statistical model is estimated for each document from the expected counts of words given the information in a lattice, and the relevance of each document to a query is measured as a probability under such a model. We investigate the efficacy of our method under various parameter settings of the speech recognition and lattice processing engines, using the Fisher English Corpus of conversational telephone speech. Experimental results show that our method consistently achieves better retrieval performance than using only the 1-best transcripts in statistical retrieval, outperforms a recently proposed lattice-based vector space retrieval method, and also compares favorably with a lattice-based retrieval method based on the Okapi BM25 model.

“…For example, the n-gram modeling (especially the bigram and trigram modeling) approach, which determines the probability of a word given the preceding n-1 word history, is most prominently used [Jelinek and Mercer 1980; Rosenfeld 2000; Bellegarda 2004]. This statistical paradigm was first introduced for the information retrieval (IR) problems by Ponte and Croft [1998], Song and Croft [1999], and Miller et al. [1999], indicating very good potential, and was then extended in a number of publications [Berger and Lafferty 1999; Hoffmann 1999; Lafferty and Zhai 2001; Chen et al. 2004b]. In these approaches, the relevance measure between a query Q and a document D is expressed as P(D|Q); that is, the probability that D is relevant given that the query Q is posed.…”

Section: Introduction (mentioning)

confidence: 99%
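
The statistical retrieval idea summarized in the abstract and citation statement above can be illustrated with a minimal sketch. It is not the implementation from any of the cited papers. It assumes a unigram model per document, Jelinek-Mercer (linear interpolation) smoothing with a collection-wide background model, and a uniform document prior, so that ranking by the query likelihood P(Q|D) gives the same order as ranking by P(D|Q). The fractional counts stand in for expected word counts of the kind a lattice would provide. All function names, the toy documents, and the weight `lam` are hypothetical.

```python
import math
from collections import Counter


def background_model(docs):
    """Collection-wide unigram probabilities from (possibly fractional) counts."""
    total = Counter()
    for counts in docs.values():
        total.update(counts)  # adds the per-document counts together
    n = sum(total.values())
    return {word: count / n for word, count in total.items()}


def query_log_likelihood(query, counts, background, lam=0.5):
    """log P(Q | D) under a Jelinek-Mercer-smoothed unigram document model.

    `lam` is an assumed interpolation weight, not a value from the source.
    """
    doc_total = sum(counts.values())
    score = 0.0
    for word in query:
        p_doc = counts.get(word, 0.0) / doc_total if doc_total > 0 else 0.0
        p = lam * p_doc + (1.0 - lam) * background.get(word, 0.0)
        if p <= 0.0:
            # Word unseen in the whole collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score


# Toy documents; fractional values mimic expected counts (hypothetical data).
docs = {
    "d1": {"stock": 2.0, "market": 1.5, "rain": 0.5},
    "d2": {"rain": 3.0, "storm": 1.0},
}
background = background_model(docs)
query = ["stock", "market"]

# Rank documents by decreasing query log-likelihood.
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], background),
    reverse=True,
)
print(ranking)
```

In this toy setup the smoothing term lets `d2` receive a finite score even though it contains neither query word, which is the practical reason a smoothed model is used rather than raw relative frequencies.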

Word Topic Models for Spoken Document Retrieval and Transcription

ACM Transactions on Asian Language Information Processing

2009

Self Cite

Statistical language modeling (LM), which aims to capture the regularities in human natural language and quantify the acceptability of a given word sequence, has long been an interesting yet challenging research topic in the speech and language processing community. It also has been introduced to information retrieval (IR) problems, and provided an effective and theoretically attractive probabilistic framework for building IR systems. In this article, we propose a word topic model (WTM) to explore the co-occurrence relationship between words, as well as the long-span latent topical information, for language modeling in spoken document retrieval and transcription. The document or the search history as a whole is modeled as a composite WTM model for generating a newly observed word. The underlying characteristics and different kinds of model structures are extensively investigated, while the performance of WTM is thoroughly analyzed and verified by comparison with the well-known probabilistic latent semantic analysis (PLSA) model as well as the other models. The IR experiments are performed on the TDT Chinese collections (TDT-2 and TDT-3), while the large vocabulary continuous speech recognition (LVCSR) experiments are conducted on the Mandarin broadcast news collected in Taiwan. Experimental results seem to indicate that WTM is a promising alternative to the existing models.