We present an information retrieval system that allows text and speech documents to be searched simultaneously. The retrieval system accepts vague queries and performs a best-match search to find the documents that are relevant to the query. The output of the retrieval system is a ranked list of documents in which the documents at the top of the list best satisfy the user's information need. The relevance of the documents is estimated by means of metadata (document description vectors). The metadata is generated automatically and is organized such that queries can be processed efficiently. We introduce a controlled indexing vocabulary for both speech and text documents. The new indexing vocabulary is small (1,000 features) compared with the indexing vocabularies of conventional text retrieval (10,000 to 100,000 features). We show that the retrieval effectiveness based on such a small indexing vocabulary is similar to the retrieval effectiveness of a Boolean retrieval system.
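The best-match ranking over document description vectors described in this abstract can be illustrated with a minimal sketch. It assumes description vectors are simple counts of controlled-vocabulary features and that relevance is estimated by cosine similarity; the abstract does not state the actual weighting or scoring scheme, and the names `describe`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter


def describe(text, vocabulary):
    """Build a description vector: counts of controlled-vocabulary features.

    Assumption: features are lowercase whitespace-separated tokens.
    """
    tokens = text.lower().split()
    return Counter(t for t in tokens if t in vocabulary)


def cosine(a, b):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(a[f] * b[f] for f in a if f in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents, vocabulary):
    """Return document ids ordered by decreasing estimated relevance.

    Ties are broken by document id so the output is deterministic.
    """
    q = describe(query, vocabulary)
    scored = [
        (cosine(q, describe(text, vocabulary)), doc_id)
        for doc_id, text in documents.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy data, invented for illustration only.
vocabulary = {"speech", "retrieval", "text"}
documents = {
    "d1": "speech retrieval of speech recordings",
    "d2": "text retrieval",
    "d3": "cooking recipes",
}
ranking = rank("speech retrieval", documents, vocabulary)
```

Because the query and the documents are mapped onto the same small feature set, the same ranking function applies whether a description vector was derived from text or from recognized speech.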
We show how the recognition performance of the speech recognition component in a speech retrieval system affects retrieval effectiveness. A speech retrieval system facilitates content-based retrieval of speech documents, i.e. audio recordings containing spoken text. The speech retrieval process receives queries from users, and for every query it ranks the speech documents in decreasing order of their probability of being relevant to the query. The speech recognition component is an important part of a speech retrieval system, since it detects the occurrences of indexing features in the documents. Because the recognition of indexing features in continuous speech is error-prone, the question arises of how much this error-prone recognition affects retrieval effectiveness. As an answer to this question, and as the main contribution of this paper, we simulate the recognition of indexing features in speech documents on standard information retrieval test collections and report the resulting retrieval accuracies.
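One way to picture the simulation of error-prone feature recognition is the sketch below. It assumes a simple independent error model in which each feature that truly occurs is missed with probability `miss_rate`, and each vocabulary feature that does not occur is falsely detected with probability `false_alarm_rate`. The abstract does not specify the error model used in the paper, so this model, the function name `simulate_recognition`, and both parameters are assumptions for illustration.

```python
import random


def simulate_recognition(true_features, vocabulary, miss_rate, false_alarm_rate, rng):
    """Return the set of features a simulated recognizer reports for one document.

    Assumed error model (not taken from the paper): misses and false alarms
    are drawn independently per vocabulary feature.
    """
    recognized = set()
    # Iterate in sorted order so results are reproducible for a given seed.
    for feature in sorted(vocabulary):
        if feature in true_features:
            # A feature that is present is kept unless it is missed.
            if rng.random() >= miss_rate:
                recognized.add(feature)
        elif rng.random() < false_alarm_rate:
            # A feature that is absent is inserted as a false alarm.
            recognized.add(feature)
    return recognized


# Toy data, invented for illustration only.
vocabulary = {"speech", "retrieval", "text", "query"}
true_features = {"speech", "retrieval"}
rng = random.Random(0)

perfect = simulate_recognition(true_features, vocabulary, 0.0, 0.0, rng)
all_missed = simulate_recognition(true_features, vocabulary, 1.0, 0.0, rng)
all_alarms = simulate_recognition(true_features, vocabulary, 0.0, 1.0, rng)
noisy = simulate_recognition(true_features, vocabulary, 0.3, 0.05, rng)
```

Indexing a test collection with the degraded feature sets at several error rates, and ranking as usual, would then show how retrieval effectiveness changes with recognition performance.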