ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053047
Transferring Neural Speech Waveform Synthesizers to Musical Instrument Sounds Generation

Abstract: Recent neural waveform synthesizers such as WaveNet, WaveGlow, and the neural-source-filter (NSF) model have shown good performance in speech synthesis despite their different methods of waveform generation. The similarity between speech and music audio synthesis techniques suggests interesting avenues to explore in terms of the best way to apply speech synthesizers in the music domain. This work compares three neural synthesizers used for musical instrument sounds generation under three scenarios: training fr…

Cited by 11 publications (4 citation statements) | References 14 publications (24 reference statements)
“…The DDSP autoencoder [1] falls into the first category as it generates control signals for a spectral modelling synthesiser [25]. The neural source-filter (NSF) approach [3,24,26] is in the second category. It learns a nonlinear filter that transforms a sinusoidal exciter to a target signal, guided by a control embedding generated by a separate encoder.…”
Section: Neural Audio Synthesis
confidence: 99%
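The NSF signal flow described in the statement above can be sketched with a toy example: a sinusoidal exciter built from an F0 contour is passed through a nonlinear filter. This is a minimal conceptual sketch, not the actual NSF model; the sample rate, the constant 220 Hz F0, and the stand-in filter (fixed FIR taps followed by tanh) are illustrative assumptions, whereas in NSF the filter is learned and conditioned on a control embedding from a separate encoder.

```python
import numpy as np

# Conceptual sketch of the NSF signal flow: sinusoidal exciter -> nonlinear filter.
# All numeric choices here (8 kHz rate, 220 Hz F0, Hanning taps) are illustrative.
fs = 8000
f0 = np.full(fs, 220.0)                  # one second of sample-level F0 values

# Sinusoidal exciter: phase accumulates the instantaneous frequency.
phase = 2 * np.pi * np.cumsum(f0 / fs)
exciter = np.sin(phase)

# Stand-in "learned" filter: a fixed FIR convolution plus a tanh nonlinearity.
# In the real NSF model these weights come from training.
taps = np.hanning(64) / 32
output = np.tanh(np.convolve(exciter, taps, mode="same"))
```

The essential structure is that the exciter carries the pitch while the filter shapes it into the target timbre; everything learnable in NSF lives in the filter stage.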
“…In addition to the choice of data for the base model, the training strategy is also important and can make a difference to the quality of the output. For instance, [14] found that for copy synthesis of music, starting with vocoder models originally trained on speech data and then fine-tuning on music data worked even better than training only on music data from scratch, indicating that useful additional information for music signals can be learned from speech. For multi-speaker text-to-speech synthesis, [7] found that warm-starting the multi-speaker model from a high-quality single-speaker pretrained model can greatly reduce training time compared to training a multi-speaker model from scratch, and that the warm-starting approach also provides the benefits of a larger vocabulary.…”
Section: Introduction
confidence: 99%
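The warm-starting strategy discussed above can be illustrated mechanically with a toy model. This sketch uses a linear model and synthetic stand-in "speech" and "music" datasets purely to show the mechanics (pretrain on one domain, then continue training from those weights on the other); it makes no claim about the actual vocoder experiments in [14].

```python
import numpy as np

# Toy warm-start sketch: all data, model, and hyperparameters are illustrative.
rng = np.random.default_rng(0)

def train(X, y, w, lr=0.01, steps=200):
    """Plain gradient descent on mean squared error."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Synthetic "speech" domain (large) and related "music" domain (small).
X_speech = rng.normal(size=(100, 5))
y_speech = X_speech @ np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_music = rng.normal(size=(20, 5))
y_music = X_music @ np.array([1.2, 2.1, 2.8, 4.2, 4.9])

w_pre = train(X_speech, y_speech, np.zeros(5))  # "pretrain" on speech
w_ft = train(X_music, y_music, w_pre)           # warm-start fine-tune on music
```

The same pattern applies to neural vocoders: the fine-tuning run starts from the pretrained checkpoint rather than random initialization, so knowledge transferred from the first domain is preserved.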
“…In our previous work on TTS, we used sine-based source signals because their periodicity can be accurately maintained in the generated voiced sounds [13]. Beyond speech waveforms, the sine-based source also helps NSF models produce high-quality music signals for woodwind, string, and brass instruments [15].…”
Section: Introduction
confidence: 99%
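A sine-based source signal of the kind cited above can be sketched as follows. Because the phase is the cumulative sum of the instantaneous F0, periodicity is maintained even when F0 varies; for unvoiced regions (F0 = 0), NSF-style models typically substitute low-level noise. The sample rate, the F0 glide, and the noise level here are illustrative assumptions.

```python
import numpy as np

# Sine-based source: periodic sinusoid for voiced frames, noise for unvoiced.
# Sample rate, F0 trajectory, and noise scale are illustrative choices.
fs = 16000
f0 = np.concatenate([np.linspace(200.0, 400.0, fs // 2),  # voiced F0 glide
                     np.zeros(fs // 2)])                   # unvoiced tail

voiced = f0 > 0
phase = 2 * np.pi * np.cumsum(f0 / fs)   # accumulated instantaneous phase
rng = np.random.default_rng(0)
source = np.where(voiced, np.sin(phase),
                  0.003 * rng.normal(size=f0.size))
```

Accumulating phase sample by sample (rather than computing `sin(2*pi*f0*t)` directly) is what keeps the waveform free of phase discontinuities as F0 changes.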