With the advent of deep learning, a huge number of text-to-speech (TTS) models which produce human-like speech have emerged. Recently, by introducing syntactic and semantic information w.r.t the input text, various approaches have been proposed to enrich the naturalness and expressiveness of TTS models. Although these strategies showed impressive results, they still have some limitations in utilizing language information. First, most approaches only use graph networks to utilize syntactic and semantic information without considering linguistic features. Second, most previous works do not explicitly consider adjacent words when encoding syntactic and semantic information, even though it is obvious that adjacent words are usually meaningful when encoding the current word. To address these issues, we propose Relation-aware Word Encoding Network (RWEN), which effectively allows syntactic and semantic information based on two modules (i.e., Semantic-level Relation Encoding and Adjacent Word Relation Encoding). Experimental results show substantial improvements compared to previous works.
In this paper, we propose on-device voice command assistants for mobile games to increase user experiences even in hands-busy situations such as driving and cooking. Since most of the current mobile games cost large memory (e.g. more than 1GB memory), so it is necessary to reduce memory usage further to integrate voice commands systems on mobile clients. Therefore a need to design an on-device automatic speech recognition system that costs minimal memory and CPU resources rises. To this end, we apply cross layer parameter sharing to Conformer, named MONICA2 which results in lower memory usage for on-device speech recognition. MONICA2 reduces the number of parameters of deep neural network by 58%, with minimal recognition accuracy degradation measured in word error rate on Librispeech benchmark. As an on-device voice command user interface, MONICA2 costs only 12.8MB mobile memory and the average inference time for 3-seconds voice command is about 30ms, which is profiled in Samsung Galaxy S9. As far as we know, MONICA2 is the most memory efficient yet accurate on-device speech recognition which could be applied to various applications such as mobile games, IoT devices, etc.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.