Dong-Min Byun, Seung-Bin Kim, and Seong-Whan Lee
Singing voice synthesis systems have advanced significantly; however, achieving high-quality singing voices in zero-shot tasks remains challenging. Traditional singing voice synthesis models struggle to predict the fundamental frequency (F0) of unseen speakers. In this study, we propose MIDI-Voice 2, which uses MIDI-driven priors to achieve high-quality singing voice synthesis, even in zero-shot tasks. We introduce a diffusion-based singing voice synthesis model that operates without F0. MIDI-Voice 2 consists of two diffusion models: a prior generator and a singing voice generator. The prior generator takes MIDI-driven priors, which carry accurate melody information, and generates MIDI-style priors; the singing voice generator then combines these MIDI-style priors with content and timbre information to generate singing voices. Disentangling melody from timbre allows speaker adaptation without predicting the F0 of an unseen speaker. Additionally, we use a transformer-based diffusion model to generate higher-quality audio in zero-shot tasks. Our experiments demonstrate that MIDI-Voice 2 improves speaker adaptation without F0 and produces high-quality audio in zero-shot tasks.
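The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`reverse_diffusion`, `prior_step`, `voice_step`, `synthesize`), the list-of-floats features, and the trivial "denoisers" are all hypothetical stand-ins for trained networks. The sketch only shows which inputs reach which stage, and that no F0 input appears anywhere.

```python
import random

def reverse_diffusion(denoise_step, cond, length, steps=4, seed=0):
    """Toy reverse process: start from Gaussian noise, refine step by step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for t in range(steps, 0, -1):
        x = denoise_step(x, cond, t)
    return x

def prior_step(x, midi_prior, t):
    # Hypothetical stand-in for the prior generator's denoiser:
    # pulls the sample toward the MIDI-driven prior (melody information).
    return [xi + (mi - xi) / t for xi, mi in zip(x, midi_prior)]

def voice_step(x, cond, t):
    # Hypothetical stand-in for the singing voice generator's denoiser:
    # conditioned on the MIDI-style prior, content, and timbre (no F0 input).
    midi_style, content, timbre = cond
    target = [m + c + timbre for m, c in zip(midi_style, content)]
    return [xi + (ti - xi) / t for xi, ti in zip(x, target)]

def synthesize(midi_prior, content, timbre):
    # Stage 1: MIDI-driven prior -> MIDI-style prior.
    midi_style = reverse_diffusion(prior_step, midi_prior, len(midi_prior))
    # Stage 2: MIDI-style prior + content + timbre -> output features.
    cond = (midi_style, content, timbre)
    return reverse_diffusion(voice_step, cond, len(midi_prior), seed=1)

# Made-up per-frame values, used only to exercise the pipeline.
features = synthesize([60.0, 62.0], [0.1, 0.2], 0.5)
```

In this toy setup, changing only `timbre` changes the output while the melody-carrying `midi_prior` stays fixed, which mirrors the melody/timbre disentanglement the abstract describes.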
All speakers are seen during inference
Script (Pronunciation) | Source Singing Voice | Target Speaker | Generated
---|---|---|---
마음이 전부 떠난대도 (maeumi jeonbu tteonandaedo) | GT | GT | VISinger, Grad-TTS, MIDI-Voice, MIDI-Voice 2
본다 내 숨결 같은 (bonda nae sumgyeol gateun) | GT | GT | VISinger, Grad-TTS, MIDI-Voice, MIDI-Voice 2
우릴 쉽게 보지 못해 (uril swipge boji moshae) | GT | GT | VISinger, Grad-TTS, MIDI-Voice, MIDI-Voice 2
All speakers are unseen during inference
Script (Pronunciation) | Source Singing Voice | Target Speaker | Generated
---|---|---|---
사랑은 (sarangeun) | GT | GT | VISinger, Grad-TTS, MIDI-Voice, MIDI-Voice 2
혼자 살다 혼자 가는거죠 (honja salda honja ganeungeojyo) | GT | GT | VISinger, Grad-TTS, MIDI-Voice, MIDI-Voice 2
그대와 나의 지난날 (geudaewa naui jinannal) | GT | GT | VISinger, Grad-TTS, MIDI-Voice, MIDI-Voice 2
너를 구해줄게 (neoreul guhaejulge) | GT | GT | VISinger, Grad-TTS, MIDI-Voice, MIDI-Voice 2
All speakers are seen during inference
Script (Pronunciation) | Source Singing Voice | Generated
---|---|---
하늘도 볼거야 (haneuldo bolgeoya) | GT | WaveNet, U-net, Transformer-base, Transformer-large
남들 앞에선 괜찮다고 (namdeul apeseon gwaenchanhdago) | GT | WaveNet, U-net, Transformer-base, Transformer-large
그렇게 내일을 살고 싶어 (geureohge naeireul salgo sipeo) | GT | WaveNet, U-net, Transformer-base, Transformer-large
소리치며 달려 보아도 (sorichimyeo dallyeo boado) | GT | WaveNet, U-net, Transformer-base, Transformer-large