Voice-based information retrieval — how far are we from the text-based information retrieval?

Lee, Lin-Shan; Pan, Yaodong

doi:10.1109/asru.2009.5372952

Cited by 10 publications

(4 citation statements)

References 85 publications

“…With a 20% expected annual growth rate and projected sales of more than 500 million units worldwide in 2024 (Wadhwani and Gankar, 2018), the potential influence of smart speakers recommendations is huge. However, customers cannot process voice-based information as efficiently as visual or even text-based information, mostly because of a lack of accuracy and user-system interaction (Lee and Pan, 2010), so smart speakers need to offer engaging recommendations that generate favorable attitudes toward the recommended product or service as well as purchase or visiting intentions.…”

Section: Introduction (mentioning)

confidence: 99%

Smart Speaker Recommendations: Impact of Gender Congruence and Amount of Information on Users' Engagement and Choice

et al. 2021

The relevance of smart speakers is steadily increasing, allowing users to perform several daily tasks. From a commercial perspective, smart speakers also provide recommendations of products and services that may influence the consumer decision-making process. However, previous studies have mainly focused on the adoption of smart speakers, and there is a lack of proper guidelines to help design the way these devices should offer their consumption recommendations. Based on a stimulus-organism-response approach, we analyze how two features of smart speakers' recommendations (the gender congruence between the customer and the speaker, and the length of the message) influence the effectiveness of such recommendations (i.e., visiting intentions) through their impact on user engagement and attitude. Data was collected from a sample of undergraduate students in Spain using an experimental design that focused on a restaurant recommendation, and analyzed using partial least squares. On the one hand, our results suggest that gender congruence generates user engagement with the smart speaker. On the other hand, message length is positively related to attitudes towards the restaurant, at a declining rate. In addition, while better attitudes lead to higher visiting intentions, the influence of engagement on visiting intentions is partially mediated via attitudes. Thus, our findings contribute to understanding the antecedents of users' engagement with smart speakers, as well as its impact on the customers' willingness to follow smart speakers' recommendations, constituting a base to analyze the impact of artificial intelligence solutions aimed to smooth the transitions of a customer through the stages of the purchase process.

“…Traditionally, the spoken query detection is performed by cascading an automatic speech recognition (ASR) system with text based retrieval techniques [1], [2], [3], [4]. In this approach, the spoken queries as well as the test utterances are first converted into a sequence of words or symbols.…”

Section: Introduction (mentioning)

confidence: 99%

Sparse Subspace Modeling for Query by Example Spoken Term Detection

Ram, Asaei, Bourlard 2018

IEEE/ACM Trans. Audio Speech Lang. Process.

This paper focuses on the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. Current state-of-the-art approaches to this problem rely on dynamic programming based template matching techniques using phone posterior features extracted at the output of a deep neural network (DNN). Previously, it has been shown that the space of phone posteriors is highly structured, as a union of low-dimensional subspaces. To exploit the temporal and sparse structure of the speech data, we investigate here three different QbE-STD systems based on sparse model recovery. More specifically, we use query examples to model the query subspace using a dictionary for sparse coding. Reconstruction errors calculated using the sparse representation of feature vectors are then used to characterize the underlying subspaces. The first approach uses these reconstruction errors in a dynamic programming framework to detect the spoken query, resulting in a much faster search compared to standard template matching. The other two methods aim at merging template matching and sparsity based approaches to further improve the performance. The first one proposes to regularize the template matching local distances using sparse reconstruction errors. The second approach aims at using the sparse reconstruction errors to rescore (improve) the template matching likelihood. Experiments on two different databases (AMI and MediaEval) show that the proposed hybrid systems perform better than a highly competitive QbE-STD baseline system.

“…The objective function in (4) can be the sum of the differences between all positive and negative example pairs here, 4 With the new acoustic models to update in (6), only in (6) have to be changed without generating new lattices, so updating on-line is not computation-intensive [129].…”

Section: Retrieval-oriented Acoustic Modeling Under Relevance (mentioning)

confidence: 99%

Spoken Content Retrieval—Beyond Cascading Speech Recognition with Text Retrieval

Lee, Glass, Lee, et al. 2015

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

Spoken content retrieval refers to directly indexing and retrieving spoken content based on the audio rather than text descriptions. This potentially eliminates the requirement of producing text descriptions for multimedia content for indexing and retrieval purposes, and is able to precisely locate the exact time the desired information appears in the multimedia. Spoken content retrieval has been very successfully achieved with the basic approach of cascading automatic speech recognition (ASR) with text information retrieval: after the spoken content is transcribed into text or lattice format, a text retrieval engine searches over the ASR output to find desired information. This framework works well when the ASR accuracy is relatively high, but becomes less adequate when more challenging real-world scenarios are considered, since retrieval performance depends heavily on ASR accuracy. This challenge leads to the emergence of another approach to spoken content retrieval: to go beyond the basic framework of cascading ASR with text retrieval in order to have retrieval performances that are less dependent on ASR accuracy. This overview article is intended to provide a thorough overview of the concepts, principles, approaches, and achievements of major technical contributions along this line of investigation. 
This includes five major directions: 1) Modified ASR for Retrieval Purposes: cascading ASR with text retrieval, but the ASR is modified or optimized for spoken content retrieval purposes; 2) Exploiting the Information not present in ASR outputs: to try to utilize the information in speech signals inevitably lost when transcribed into phonemes and words; 3) Directly Matching at the Acoustic Level without ASR: for spoken queries, the signals can be directly matched at the acoustic level, rather than at the phoneme or word levels, bypassing all ASR issues; 4) Semantic Retrieval of Spoken Content: trying to retrieve spoken content that is semantically related to the query, but not necessarily including the query terms themselves; 5) Interactive Retrieval and Efficient Presentation of the Retrieved Objects: with efficient presentation of the retrieved objects, an interactive retrieval process incorporating user actions may produce better retrieval results and user experiences.

Index Terms: Spoken content retrieval, spoken term detection, query by example, semantic retrieval, joint optimization, pseudo-relevance feedback, graph-based random walk, unsupervised acoustic pattern discovery, query expansion, interactive retrieval, summarization, key term extraction.
