A 5-fold cross-validation evaluation of our IMC approach on a representative sequence set yielded a mean correlation coefficient of 0.84 (promoter versus coding sequences) and 0.53 (promoter versus non-coding sequences). Applied to the task of eukaryotic promoter region identification in genomic DNA sequences, our classifier identifies 50% of the promoter regions in the sequences used in the most recent review and comparison by Fickett and Hatzigeorgiou (Genome Res., 7, 861-878, 1997), while having a false-positive rate of 1/849 bp.
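The two figures quoted above can be illustrated with a short sketch. The abstract does not say which correlation coefficient was used; the code below assumes the usual binary-classification form (the Matthews correlation coefficient) computed from a confusion matrix and averaged over folds. All function names are hypothetical and no counts from the paper are reproduced.

```python
import math
from typing import Sequence, Tuple

# (tp, tn, fp, fn) counts for one cross-validation fold; hypothetical layout.
Confusion = Tuple[int, int, int, int]


def correlation_coefficient(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews-style correlation for a two-class problem (assumed form)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def mean_over_folds(folds: Sequence[Confusion]) -> float:
    """Average the per-fold coefficient, as in a k-fold evaluation."""
    return sum(correlation_coefficient(*f) for f in folds) / len(folds)


def bp_per_false_positive(false_positives: int, total_bp: int) -> float:
    """Express a false-positive rate as 'one false positive per N bp'."""
    return total_bp / false_positives
```

Under this reading, a false-positive rate of 1/849 bp means that one false prediction is expected for every 849 base pairs scanned.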
This paper presents automatic methods for the segmentation and classification of dialog acts (DAs). In Verbmobil it is often sufficient to recognize the sequence of DAs occurring during a dialog between the two partners. Since a turn can consist of one or more successive DAs, we conduct the classification of DAs in a two-step procedure: first, each turn has to be segmented into units which correspond to a DA, and second, the DA categories have to be identified. For the segmentation we use polygrams and multi-layer perceptrons, using prosodic features. The classification of DAs is done with semantic classification trees and polygrams.
We present a new statistical approach for eukaryotic polymerase II promoter recognition. We apply stochastic segment models in which each state represents a functional part of the promoter. The segments are trained in an unsupervised way. We compare segment models with three and five states with our previous system, which modeled the promoters as a whole, i.e. as a single state. Results on the classification of a representative collection of human and D. melanogaster promoter and non-promoter sequences show great improvements. The practical importance is demonstrated on the mining of large contiguous sequences.
Summary. We present two concepts for systems with language identification in the context of multilingual information retrieval dialogs. The first one has an explicit module for language identification. It is based on training a common codebook for all the languages and integrating over the output probabilities of language-specific n-gram models trained over the codebook sequences. The system can decide for one language either after a predefined time interval or when the difference between the probabilities of the languages exceeds a certain threshold. This approach makes it possible to recognize languages that the system cannot process and to play a prerecorded message in that language. In the second approach, the trained recognizers of the languages to be recognized, the lexicons, and the language models are combined into one multilingual recognizer. Because transitions are allowed only between words of the same language, each hypothesized word chain contains words from just one language, and language identification is an implicit by-product of the speech recognizer. First results for both language identification approaches are presented.
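The decision rule of the first approach can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes bigram models stored as plain dictionaries over codebook indices, accumulates log probabilities per language, and stops either when the best language leads the runner-up by more than a threshold or when a fixed number of symbols (standing in for the predefined time interval) has been seen. The names `identify_language`, `threshold`, `max_symbols`, and `floor` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical bigram model: (previous symbol, symbol) -> probability.
Bigram = Dict[Tuple[int, int], float]


def identify_language(
    symbols: Sequence[int],
    models: Dict[str, Bigram],
    threshold: float,
    max_symbols: int,
    floor: float = 1e-6,  # assumed back-off probability for unseen bigrams
) -> Tuple[str, int]:
    """Return (language, number of symbols consumed before deciding)."""
    scores = {lang: 0.0 for lang in models}
    prev = None
    for t, sym in enumerate(symbols, start=1):
        if prev is not None:
            # Accumulate the log probability of this transition per language.
            for lang, model in models.items():
                scores[lang] += math.log(model.get((prev, sym), floor))
        prev = sym
        ranked = sorted(scores.values(), reverse=True)
        margin_reached = len(ranked) > 1 and ranked[0] - ranked[1] > threshold
        # Decide early on a clear margin, or when the time limit is hit.
        if margin_reached or t >= max_symbols:
            return max(scores, key=scores.get), t
    # Input ended before either stopping condition was met.
    return max(scores, key=scores.get), len(symbols)
```

Working in the log domain keeps the accumulated scores numerically stable, and the threshold then applies to a log-probability difference rather than to raw probabilities.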