Hierarchical Diffusion Model for Zero-Shot Singing Voice Synthesis with MIDI Priors

 

Dong-Min Byun, Seung-Bin Kim, and Seong-Whan Lee

Abstract

Singing voice synthesis systems have advanced significantly; however, achieving high-quality singing voices in zero-shot tasks remains challenging. Traditional singing voice synthesis models struggle to predict the fundamental frequency (F0) of unseen speakers. In this study, we propose MIDI-Voice 2, a diffusion-based singing voice synthesis model that operates without F0 and uses MIDI-driven priors to achieve high-quality synthesis even in zero-shot tasks. MIDI-Voice 2 consists of two diffusion models: a prior generator and a singing voice generator. The prior generator takes MIDI-driven priors, which carry accurate melody, and generates MIDI-style priors; the singing voice generator combines these MIDI-style priors with content and timbre information to generate singing voices. Disentangling the melody from timbre allows speaker adaptation without predicting the F0 of an unseen speaker. Additionally, we use a transformer-based diffusion model to generate higher-quality audio in zero-shot tasks. Our experiments demonstrate that MIDI-Voice 2 improves speaker adaptation without F0 and produces high-quality audio in zero-shot tasks.
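The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `reverse_diffusion`, `make_toy_denoiser`, and `synthesize` are hypothetical, the "denoiser" is a stand-in that simply pulls the sample toward its conditioning vector instead of a trained network, and summing the conditioning vectors in the second stage is an assumption made only to keep the example short. The point it shows is the ordering: a prior generator produces a MIDI-style prior, and a second generator consumes that prior together with content and timbre, with no F0 input anywhere.

```python
import random
from typing import Callable, List

Vector = List[float]


def reverse_diffusion(denoise_step: Callable[[Vector, int], Vector],
                      dim: int, num_steps: int, seed: int = 0) -> Vector:
    """Generic reverse process: start from Gaussian noise and denoise step by step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
    return x


def make_toy_denoiser(condition: Vector, rate: float = 0.5) -> Callable[[Vector, int], Vector]:
    """Hypothetical stand-in for a trained network: moves x toward the condition."""
    def step(x: Vector, t: int) -> Vector:
        return [xi + rate * (ci - xi) for xi, ci in zip(x, condition)]
    return step


def synthesize(midi_prior: Vector, content: Vector, timbre: Vector,
               num_steps: int = 30) -> Vector:
    # Stage 1 (prior generator): MIDI-driven prior -> MIDI-style prior.
    midi_style_prior = reverse_diffusion(
        make_toy_denoiser(midi_prior), len(midi_prior), num_steps, seed=1)
    # Stage 2 (singing voice generator): conditioned on the MIDI-style prior,
    # content, and timbre. Summation is an illustrative assumption.
    condition = [m + c + s for m, c, s in zip(midi_style_prior, content, timbre)]
    return reverse_diffusion(
        make_toy_denoiser(condition), len(condition), num_steps, seed=2)


if __name__ == "__main__":
    out = synthesize([0.1, 0.2, 0.3], [1.0, 1.0, 1.0], [0.5, 0.0, -0.5])
    print([round(v, 3) for v in out])
```

Because timbre enters only in the second stage, swapping the `timbre` vector changes the output without touching the melody-carrying prior, which mirrors the disentanglement argument in the abstract.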


Seen Speaker Singing Voice Synthesis

All speakers used at inference were seen during training

Source Singing Voice | Target Speaker | Generated
Script: 마음이 전부 떠난대도
(Pronunciation): maeumi jeonbu tteonandaedo

GT

GT

VISinger

Grad-TTS

MIDI-Voice

MIDI-Voice 2

Script: 본다 내 숨결 같은
(Pronunciation): bonda nae sumgyeol gateun

GT

GT

VISinger

Grad-TTS

MIDI-Voice

MIDI-Voice 2

Script: 우릴 쉽게 보지 못해
(Pronunciation): uril swipge boji moshae

GT

GT

VISinger

Grad-TTS

MIDI-Voice

MIDI-Voice 2

Zero-Shot Singing Voice Style Transfer

All speakers used at inference were unseen during training

Source Singing Voice | Target Speaker | Generated
Script: 사랑은
(Pronunciation): sarangeun

GT

GT

VISinger

Grad-TTS

MIDI-Voice

MIDI-Voice 2

Script: 혼자 살다 혼자 가는거죠
(Pronunciation): honja salda honja ganeungeojyo

GT

GT

VISinger

Grad-TTS

MIDI-Voice

MIDI-Voice 2

Script: 그대와 나의 지난날
(Pronunciation): geudaewa naui jinannal

GT

GT

VISinger

Grad-TTS

MIDI-Voice

MIDI-Voice 2

Script: 너를 구해줄게
(Pronunciation): neoreul guhaejulge

GT

GT

VISinger

Grad-TTS

MIDI-Voice

MIDI-Voice 2

Diffusion Backbone Model

All speakers used at inference were seen during training

Source Singing Voice | Generated
Script: 하늘도 볼거야
(Pronunciation): haneuldo bolgeoya

GT

WaveNet

U-net

Transformer-base

Transformer-large

Script: 남들 앞에선 괜찮다고
(Pronunciation): namdeul apeseon gwaenchanhdago

GT

WaveNet

U-net

Transformer-base

Transformer-large

Script: 그렇게 내일을 살고 싶어
(Pronunciation): geureohge naeireul salgo sipeo

GT

WaveNet

U-net

Transformer-base

Transformer-large

Script: 소리치며 달려 보아도
(Pronunciation): sorichimyeo dallyeo boado

GT

WaveNet

U-net

Transformer-base

Transformer-large