Fully complex deep neural network for phase-incorporating monaural source separation

Lee, Yuan-Shan; Wang, Chien-Yao; Wang, Shufan; Wang, Jia-Ching; Wu, Chung-Hsien

doi:10.1109/icassp.2017.7952162

Cited by 30 publications

(11 citation statements)

References 21 publications


“…The deep learning model learns the mapping or masking function to retrieve the clean complex spectrum from the noisy one, and simultaneously estimates the phase and amplitude information of the speech signal. Some studies have confirmed that complex spectral features lead to better performances than (log) PS features [63,64]. The second category suggests that a raw speech waveform can be directly enhanced without transforming it into spectral features [65][66][67][68][69][70].…”

Section: Improving the Intelligibility Of Speech For Simulated Electr… (mentioning)

confidence: 99%

Improving the Intelligibility of Speech for Simulated Electric and Acoustic Stimulation Using Fully Convolutional Neural Networks

Wang et al. 2021

IEEE Trans. Neural Syst. Rehabil. Eng.


The combined electric and acoustic stimulation (EAS) has demonstrated better speech recognition than conventional cochlear implant (CI) and yielded satisfactory performance under quiet conditions. However, when noise signals are involved, both the electric signal and the acoustic signal may be distorted, thereby resulting in poor recognition performance. To suppress noise effects, speech enhancement (SE) is a necessary unit in EAS devices. Recently, a time-domain speech enhancement algorithm based on the fully convolutional neural networks (FCN) with a short-time objective intelligibility (STOI)-based objective function (termed FCN(S) in short) has received increasing attention due to its simple structure and effectiveness of restoring clean speech signals from noisy counterparts. With evidence showing the benefits of FCN(S) for normal speech, this study sets out to assess its ability to improve the intelligibility of EAS simulated speech. Objective evaluations and listening tests were conducted to examine the performance of FCN(S) in improving the speech intelligibility of normal and vocoded speech in noisy environments. The experimental results show that, compared with the traditional minimum-mean square-error SE method and the deep denoising autoencoder SE method, FCN(S) can obtain better gain in the speech intelligibility for normal as well as vocoded speech. This study, being the first to evaluate deep learning SE approaches for EAS, confirms that FCN(S) is an effective SE approach that may potentially be integrated into an EAS processor to benefit users in noisy environments.


“…Williamson et al proposed a twinhead DNN to infer both real and imaginary parts of the target spectrogram [24]. Several authors attempted to construct a fully complex-valued network by updating parameters based on complex back propagation [25,26]. However, to achieve good performance, the network needs to be constrained by sparsity.…”

Section: * Indicates Equal Contribution (mentioning)

confidence: 99%

PhaseNet: Discretized Phase Modeling with Deep Neural Networks for Audio Source Separation

et al. 2018


Previous research on audio source separation based on deep neural networks (DNNs) mainly focuses on estimating the magnitude spectrum of target sources and typically, phase of the mixture signal is combined with the estimated magnitude spectra in an ad-hoc way. Although recovering target phase is assumed to be important for the improvement of separation quality, it can be difficult to handle the periodic nature of the phase with the regression approach. Unwrapping phase is one way to eliminate the phase discontinuity, however, it increases the range of value along with the times of unwrapping, making it difficult for DNNs to model. To overcome this difficulty, we propose to treat the phase estimation problem as a classification problem by discretizing phase values and assigning class indices to them. Experimental results show that our classification-based approach 1) successfully recovers the phase of the target source in the discretized domain, 2) improves signal-to-distortion ratio (SDR) over the regression-based approach in both speech enhancement task and music source separation (MSS) task, and 3) outperforms state-of-the-art MSS.
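The idea of treating phase as a class label can be sketched in a few lines. This is a minimal illustration, not the PhaseNet implementation: the function names, the uniform bin layout, and the use of bin centres for reconstruction are all assumptions made for this example.

```python
import math

TWO_PI = 2.0 * math.pi


def phase_to_class(phase, num_classes):
    """Map a phase angle in radians to a class index in [0, num_classes)."""
    # Wrap first, so that angles differing by a multiple of 2*pi share a class.
    wrapped = phase % TWO_PI
    width = TWO_PI / num_classes
    # The final modulo guards against rounding that lands exactly on num_classes.
    return int(wrapped // width) % num_classes


def class_to_phase(index, num_classes):
    """Return the centre angle of a class bin (assumed reconstruction rule)."""
    width = TWO_PI / num_classes
    return (index + 0.5) * width


def circular_error(a, b):
    """Signed angular difference a - b, wrapped into [-pi, pi)."""
    return (a - b + math.pi) % TWO_PI - math.pi
```

With uniform bins, the reconstruction error measured on the circle is bounded by half a bin width, pi / num_classes, so more classes trade a harder classification problem for finer phase resolution.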


“…In [14], methods such as Wiener filter and iterative procedure that incorporate phase constraints are discussed in singing voice separation systems. Lee et al [15] estimate the complex-valued STFT of music sources by a complex-valued deep neural network. PhaseNet [16] handles phase estimation as a classification problem.…”

Section: Introduction (mentioning)

confidence: 99%

Complex Ratio Masking For Singing Voice Separation

Zhang, Liu, Wang 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Music source separation is important for applications such as karaoke and remixing. Much of previous research focuses on estimating short-time Fourier transform (STFT) magnitude and discarding phase information. We observe that, for singing voice separation, phase can make considerable improvement in separation quality. This paper proposes a complex ratio masking method for voice and accompaniment separation. The proposed method employs DenseUNet with self attention to estimate the real and imaginary components of STFT for each sound source. A simple ensemble technique is introduced to further improve separation performance. Evaluation results demonstrate that the proposed method outperforms recent state-of-the-art models for both separated voice and accompaniment.
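For orientation, a complex ratio mask in its usual ideal form is the per-bin complex quotient of the source STFT and the mixture STFT, so multiplying the mixture by the mask restores both magnitude and phase. The sketch below computes that ideal mask on toy values. It is not the DenseUNet estimator described in the abstract: the function names, the `eps` regulariser, and the example coefficients are assumptions made for illustration.

```python
def ideal_complex_ratio_mask(mixture, source, eps=1e-8):
    """Per-bin complex mask M such that M * mixture is approximately source.

    mixture and source are equal-length sequences of complex STFT coefficients.
    eps (an assumed regulariser) avoids division by zero in silent bins.
    """
    masks = []
    for y, s in zip(mixture, source):
        denom = y.real ** 2 + y.imag ** 2 + eps
        # Real and imaginary parts of s / y, written out explicitly.
        real = (y.real * s.real + y.imag * s.imag) / denom
        imag = (y.real * s.imag - y.imag * s.real) / denom
        masks.append(complex(real, imag))
    return masks


def apply_mask(mixture, masks):
    """Complex multiplication of each mixture bin by its mask."""
    return [m * y for m, y in zip(masks, mixture)]


# Toy example: three STFT bins of a "voice" and an "accompaniment".
voice = [1 + 2j, -0.5 + 0.25j, 0.3 - 1.2j]
accompaniment = [0.5 - 1j, 2 + 1j, -1 + 0.4j]
mixture = [v + a for v, a in zip(voice, accompaniment)]

voice_mask = ideal_complex_ratio_mask(mixture, voice)
voice_estimate = apply_mask(mixture, voice_mask)
```

A magnitude-only mask would scale each mixture bin by a real number and keep the mixture phase; the complex mask can also rotate each bin, which is what lets it correct phase.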

