Bone- and air-conduction speech combination method for speaker recognition

Tsuge, Shin; Kuroiwa, Shingo

doi:10.1504/ijbm.2019.096565

Cited by 7 publications

(4 citation statements)

References 15 publications


“…In order to increase the performance of speaker recognition in adverse conditions such as noise, multimodality has been explored by numerous researchers. Alternate supplementary information such as lip reading 28 , speech recorded with non-invasive sensors like throat microphone 29 and bone conduction microphone 30 have been shown to provide large gains in performance in adverse noisy conditions. Information about the speaker is present in all audio modes of speech, whether conducted through air, bone, or skin.…”

Section: Multimodal Systems For Speaker Modeling (mentioning)

confidence: 99%

“…A late integration with standard microphone signals resulted in improved performance of 95.8% accuracy. Other researchers such as [33][34][35][36] too have explored throat microphone, bone conduction microphone GEMS EGG, and non-audible murmur microphone signals' combination with standard speech for improving the speaker modeling. Linear features such as LPCC, MFCC, and i-vectors were used in all these works.…”

Section: Multimodal Systems For Speaker Modeling (mentioning)

confidence: 99%


Recurrence plot embeddings as short segment nonlinear features for multimodal speaker identification using air, bone and throat microphones

Nawas, Shahina, Balachandar et al. 2024, Sci Rep


Speech is produced by a nonlinear, dynamical Vocal Tract (VT) system, and is transmitted through multiple (air, bone and skin conduction) modes, as captured by the air, bone and throat microphones respectively. Speaker specific characteristics that capture this nonlinearity are rarely used as stand-alone features for speaker modeling, and at best have been used in tandem with well known linear spectral features to produce tangible results. This paper proposes Recurrence Plot (RP) embeddings as stand-alone, non-linear speaker-discriminating features. Two datasets, the continuous multimodal TIMIT speech corpus and the consonant-vowel unimodal syllable dataset, are used in this study for conducting closed-set speaker identification experiments. Experiments with unimodal speaker recognition systems show that RP embeddings capture the nonlinear dynamics of the VT system which are unique to every speaker, in all the modes of speech. The Air (A), Bone (B) and Throat (T) microphone systems, trained purely on RP embeddings perform with an accuracy of 95.81%, 98.18% and 99.74%, respectively. Experiments using the joint feature space of combined RP embeddings for bimodal (A–T, A–B, B–T) and trimodal (A–B–T) systems show that the best trimodal system (99.84% accuracy) performs on par with trimodal systems using spectrogram (99.45%) and MFCC (99.98%). The 98.84% performance of the B–T bimodal system shows the efficacy of a speaker recognition system based entirely on alternate (bone and throat) speech, in the absence of the standard (air) speech. The results underscore the significance of the RP embedding, as a nonlinear feature representation of the dynamical VT system that can act independently for speaker recognition. It is envisaged that speech recognition too will benefit from this nonlinear feature.
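The abstract does not say how the recurrence plots are built, so the following is only a generic sketch of the textbook construction: a short frame is delay-embedded into state vectors, and two time indices are marked recurrent when their state vectors lie within a distance threshold. The function names, the embedding dimension, the delay and the threshold are all illustrative assumptions, not the authors' settings.

```python
import math


def delay_embed(signal, dim=3, delay=1):
    """Return time-delay state vectors (hypothetical helper, not from the paper)."""
    n = len(signal) - (dim - 1) * delay
    if n <= 0:
        raise ValueError("signal too short for this dim/delay")
    return [[signal[i + k * delay] for k in range(dim)] for i in range(n)]


def recurrence_plot(signal, dim=3, delay=1, eps=0.1):
    """Binary matrix R with R[i][j] = 1 when states i and j are within eps."""
    states = delay_embed(signal, dim, delay)
    n = len(states)
    return [
        [1 if math.dist(states[i], states[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]


# Toy frame: a sine with a period of 8 samples (assumed data, not speech).
frame = [math.sin(2 * math.pi * t / 8) for t in range(32)]
rp = recurrence_plot(frame, dim=3, delay=2, eps=0.05)
```

For this periodic toy frame the plot shows diagonal lines spaced one period apart. A real system would compute such matrices on short speech segments and pass them to a learned embedding model, a step this sketch leaves out.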


“…In terms of modality for identification, the multimodal approach has been recently applied to SID problem to enhance the SID system's robustness, in which besides air-conducted speech, the complementary sources such as throat microphone [22], bone conduction microphone [23,24], microphone array [25,26], and video [26] are added to the SID system to further improve the system's accuracy.…”

Section: Related Work (mentioning)

confidence: 99%

Speaker Identification in Multi-Talker Overlapping Speech Using Neural Networks

Tran, Tsai 2020, IEEE Access


Although numerous works have studied the problem of automatic speaker identification (SID), there are only few works on the SID for overlapping speech, and none of them consider the case of more than two simultaneous speakers. Recognizing that overlapping speech occurs frequently in real-life scenarios, such as in meetings or debates, this work investigates the methods for overlapping SID (OSID) that can determine identities in the overlapping speech from up to five simultaneous speakers. We propose two deep-learning OSID systems, one is two-stage and the other is single-stage. The two-stage system determines the number of simultaneous speakers firstly, followed by identifying the speaker(s). The single-stage system uses a single classifier to perform OSID directly, which is slightly more computationally efficient than the two-stage system. Our experiments show that the two-stage OSID system achieves better identification accuracy than that of the single-stage system. In addition, both the OSID systems based on one-dimensional convolutional neural networks (1DCNN) perform better than the systems based on multilayer perceptron (MLP) and Gaussian mixture models (GMMs). The proposed 1DCNN-based two-stage OSID system achieves 98.55% OSID accuracy for the clean audio data containing up to five simultaneous speakers. In more challenging experimental conditions involving both background noises and high overlapping energy ratios, the system still attained accuracies of above 90%.

INDEX TERMS: overlapping speech, speaker identification, simultaneous speakers, neural networks, Gaussian mixture models.


“…Pitch detection for BC speech is discussed in [2]. In [3], BC speech was utilized with AC speech for speaker recognition. Speaker verification is also described in [4].…”

Section: Introduction (mentioning)

confidence: 99%

Packet Loss Concealment Estimating Residual Errors of Forward-Backward Linear Prediction for Bone-Conducted Speech

Yasui, Sugiura et al. 2024, IJACSA


This study proposes a suitable model for packet loss concealment (PLC) by estimating the residual error of the linear prediction (LP) method for bone-conducted (BC) speech. Instead of conventional LP-based PLC techniques where the residual error is ignored, we employ forward-backward linear prediction (FBLP), known as the modified covariance (MC) method, by incorporating the residual error estimates. The MC method provides precise LP estimation for a short data length, reduces the numerical difficulties, and produces a stable model, whereas the conventional autocorrelation (ACR) method of LP suffers from numerical problems. The MC method has the effect of compressing the spectral dynamic range of the BC speech, which improves the numerical difficulties. Simulation results reveal that the proposed method provides excellent outcomes from some objective evaluation scores in contrast to conventional PLC techniques.
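As a rough illustration of the forward-backward idea named in the abstract, the sketch below fits one set of predictor coefficients by least squares over the forward and backward prediction errors together, which is the usual description of the modified covariance method for real-valued signals. The function names are hypothetical, and the residual-error estimation and the concealment step proposed in the paper are not reproduced here.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = s / m[r][r]
    return sol


def fblp_coefficients(x, order):
    """Return c such that x[n] ~ sum_k c[k-1] * x[n-k], fitted in both directions."""
    rows, targets = [], []
    for n in range(order, len(x)):
        # Forward error: predict x[n] from the `order` samples before it.
        rows.append([x[n - k] for k in range(1, order + 1)])
        targets.append(x[n])
        # Backward error: predict x[n-order] from the `order` samples after it.
        rows.append([x[n - order + k] for k in range(1, order + 1)])
        targets.append(x[n - order])
    # Normal equations of the stacked least-squares problem.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(order)] for i in range(order)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(order)]
    return solve_linear(gram, rhs)


# Toy signal (assumed): a cosine obeys x[n] = 2*cos(w)*x[n-1] - x[n-2] exactly.
w = 0.3
x = [math.cos(w * n) for n in range(40)]
coeffs = fblp_coefficients(x, order=2)
```

On this noise-free cosine the fit recovers the two-term recursion, which is a convenient sanity check. Real bone-conducted speech would need a higher order and is not expected to fit exactly.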

