Wireless earphones are pervasive acoustic sensing platforms that can be used for many applications, such as motion tracking and handwriting input. However, wireless earphones suffer from clock offset with the connected smart devices, which causes tracking error to accumulate rapidly over time. Moreover, compared with smartphones and voice assistants, the acoustic signal transmitted by wireless earphones is much weaker due to their poor frequency response. In this paper, we propose MagSound, which uses the built-in magnets to improve the tracking and acoustic sensing performance of Commercial-Off-The-Shelf (COTS) earphones. Leveraging magnetic field strength, MagSound can predict the position of wireless earphones free from clock offset, which can be used to re-calibrate the acoustic tracking. Furthermore, fusing the two modalities mitigates the accumulated clock offset and the multipath effect. In addition, to increase robustness to noise, MagSound employs carefully designed Orthogonal Frequency-Division Multiplexing (OFDM) ranging signals. We implement a prototype of MagSound on COTS earphones and conduct experiments on tracking and handwriting input. The results demonstrate that MagSound maintains millimeter-level error in 2D tracking and improves handwriting recognition accuracy by 49.81%. We believe that MagSound can contribute to practical applications of wireless earphone-based sensing.
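The abstract does not detail the OFDM ranging design, but the core idea (wideband, near-inaudible subcarriers whose cross-correlation peak yields a time-of-flight estimate) can be sketched in a few lines. The Python sketch below is an illustration only, not MagSound's actual signal design; the sample rate, band edges, and symbol length are assumptions chosen for the example.

```python
import numpy as np

FS = 48_000        # assumed audio sample rate (Hz)
N_FFT = 960        # assumed OFDM symbol length: 20 ms at 48 kHz
F_LOW, F_HIGH = 17_000, 20_000   # assumed near-inaudible ranging band (Hz)
C_SOUND = 343.0    # speed of sound (m/s)

def make_ofdm_symbol(seed=0):
    """One OFDM ranging symbol: random-phase subcarriers in the band."""
    rng = np.random.default_rng(seed)
    half = np.zeros(N_FFT // 2 + 1, dtype=complex)
    k_lo = int(F_LOW * N_FFT / FS)
    k_hi = int(F_HIGH * N_FFT / FS)
    half[k_lo:k_hi] = np.exp(1j * rng.uniform(0, 2 * np.pi, k_hi - k_lo))
    sym = np.fft.irfft(half, n=N_FFT)
    return sym / np.max(np.abs(sym))   # normalize peak amplitude

def estimate_delay(tx, rx):
    """Coarse time-of-flight in samples via the cross-correlation peak."""
    corr = np.correlate(rx, tx, mode="full")
    return int(np.argmax(np.abs(corr))) - (len(tx) - 1)

# Toy loopback test: delay the symbol by 24 samples (~17 cm of travel).
tx = make_ofdm_symbol()
rx = np.concatenate([np.zeros(24), tx]) + 0.01 * np.random.randn(len(tx) + 24)
d = estimate_delay(tx, rx)
print(f"delay = {d} samples -> path length ~ {d * C_SOUND / FS * 100:.1f} cm")
```

The wideband, flat-spectrum symbol gives a sharp correlation peak, which is what makes such signals robust to narrowband noise; the magnetic-acoustic fusion described in the abstract would then correct the residual drift of this acoustic estimate.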
Silent speech recognition (SSR) allows users to speak to a device without making a sound, avoiding being overheard or disturbing others. Compared to video-based approaches, wireless signal-based SSR works even when the user is wearing a mask and raises fewer privacy concerns. However, previous wireless-based systems remain far from well studied; e.g., they have only been evaluated on corpora of very limited size, making them feasible only for interaction with dozens of deterministic commands. In this paper, we present mSilent, a millimeter-wave (mmWave) based SSR system that works with a general corpus containing thousands of daily conversation sentences. With this strong recognition capability, mSilent not only supports more complex interaction with assistants, but also enables more general daily applications such as communication and text input. To extract fine-grained articulatory features, we build a signal processing pipeline that uses a clustering-selection algorithm to separate articulatory gestures and generates a multi-scale detrended spectrogram (MSDS). To handle the complexity of the general corpus, we design an end-to-end deep neural network consisting of a multi-branch convolutional front-end and a Transformer-based sequence-to-sequence back-end. We collect a general-corpus dataset of 1,000 daily conversation sentences containing 21K samples of bi-modality data (mmWave and video). Our evaluation shows that mSilent achieves a 9.5% average word error rate (WER) at a distance of 1.5 m, comparable to the performance of the state-of-the-art video-based approach. We also explore deploying mSilent in two typical scenarios, text entry and in-car assistance, where the average WER of less than 6% demonstrates the potential of mSilent for general daily applications.
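The exact MSDS construction is not given in the abstract; the sketch below shows one plausible reading of "multi-scale detrended spectrogram": STFT magnitudes computed at several window scales, linearly detrended per frequency bin to remove slow drift, and stacked as channels on a common grid. The frame rate, window scales, and output shape are assumptions for illustration, not mSilent's published parameters.

```python
import numpy as np
from scipy.signal import stft, detrend

FS = 1_000                 # assumed frame rate of the mmWave range-bin signal (Hz)
SCALES = (64, 128, 256)    # assumed STFT window lengths for the multi-scale view

def msds(x, fs=FS, scales=SCALES, n_freq=33, n_frames=128):
    """Sketch of a multi-scale detrended spectrogram: per-scale STFT
    magnitude, per-bin linear detrending, then nearest-neighbor resampling
    so all scales stack into one (n_scales, n_freq, n_frames) tensor."""
    channels = []
    for win in scales:
        _, _, Z = stft(x, fs=fs, nperseg=win, noverlap=win // 2)
        mag = np.abs(Z)
        mag = detrend(mag, axis=-1)   # remove slow drift per frequency bin
        fidx = np.linspace(0, mag.shape[0] - 1, n_freq).astype(int)
        tidx = np.linspace(0, mag.shape[1] - 1, n_frames).astype(int)
        channels.append(mag[np.ix_(fidx, tidx)])
    return np.stack(channels)

x = np.random.randn(4_000)            # stand-in for one utterance's signal
print(msds(x).shape)                  # (3, 33, 128)
```

A multi-channel tensor like this feeds naturally into the multi-branch convolutional front-end the abstract describes, with one branch per time-frequency scale before the Transformer-based sequence-to-sequence back-end.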