Dong-Min Byun, Sang-Hoon Lee, Ji-Sang Hwang, and Seong-Whan Lee
In this paper, we propose MIDI-Voice, an approach for expressive singing voice synthesis (SVS) and robust zero-shot singing voice style transfer. Singing voice synthesis models have recently made significant progress thanks to generative models. However, previous SVS models inaccurately predict the prior and the fundamental frequency (F0) for unseen speakers, which results in low-quality generated singing voices. To address these issues, we adopt a score-based diffusion model for singing voice synthesis and introduce a MIDI-based prior for better singing voice style adaptation. We first generate a MIDI-driven prior from the musical score; because this prior includes only note information and no speaker information, it leads to high-quality singing voice adaptation. Furthermore, we propose a DDSP-based MIDI-style prior to synthesize a more expressive singing voice. The experimental results show that MIDI-Voice outperforms previous models in synthesizing expressive singing voices and in zero-shot singing voice style transfer.
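As a rough illustration of what "note information only" can look like, the sketch below expands a list of score notes into a frame-level pitch contour using the standard MIDI tuning formula (note 69 = A4 = 440 Hz). This is not the MIDI-Voice implementation: the function names, the `(pitch, duration)` note format, the 100 frames-per-second rate, and the use of 0 Hz for rests are all assumptions made for this example.

```python
import numpy as np


def midi_to_hz(note):
    """Convert MIDI note number(s) to Hz using standard tuning (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((np.asarray(note, dtype=float) - 69.0) / 12.0)


def notes_to_frame_f0(notes, frame_rate=100.0):
    """Expand (midi_pitch, duration_seconds) pairs into a frame-level F0 contour.

    Hypothetical note format: midi_pitch is an int, or None for a rest.
    Rests are written as 0 Hz. The contour depends only on the score,
    so it carries no speaker-specific information.
    """
    contour = []
    for pitch, duration in notes:
        n_frames = int(round(duration * frame_rate))
        f0 = 0.0 if pitch is None else float(midi_to_hz(pitch))
        contour.extend([f0] * n_frames)
    return np.array(contour)


# Example score: A4 for 0.5 s, a 0.1 s rest, then A5 for 0.2 s.
score = [(69, 0.5), (None, 0.1), (81, 0.2)]
f0 = notes_to_frame_f0(score)
```

A contour like this is the same for every singer of the same score, which is the property the abstract attributes to the MIDI-driven prior. How MIDI-Voice actually turns note information into a prior for the diffusion model is not specified here.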
Mel spectrogram by diffusion step (figure): one row of panels each for the MIDI-driven prior and the MIDI-style prior, shown at diffusion steps 0, 1, 10, 20, 30, 50, and 100.
All speakers are seen during training
All speakers are unseen during training
Script | Source Singing Voice | Target Speaker | Generated
---|---|---|---
언젠가 그 미소 내게 닿을 수 있나요 (Pronunciation: eonjenga geu miso naege daheul su issnayo; English: "Will that smile ever reach me someday?") | GT | GT | VISinger, Grad-TTS, MIDI-Voice (MIDI-driven), MIDI-Voice (MIDI-style)
언젠가 그 미소 내게 닿을 수 있나요 (Pronunciation: eonjenga geu miso naege daheul su issnayo; English: "Will that smile ever reach me someday?") | GT | GT | VISinger, Grad-TTS, MIDI-Voice (MIDI-driven), MIDI-Voice (MIDI-style)
물어봐 난 누굴까 워어어 열심히 걸어가다 보면 (Pronunciation: mureobwa nan nugunga woeoeo yeolsimhi georeogada bomyeon; English: "Ask, who am I? Whoa, if I keep walking on diligently") | GT | GT | VISinger, Grad-TTS, MIDI-Voice (MIDI-driven), MIDI-Voice (MIDI-style)