MIDI-Voice Demo

Abstract

Singing voice synthesis systems have advanced significantly; however, achieving high-quality singing voices in zero-shot tasks remains challenging. Traditional singing voice synthesis models struggle to predict the fundamental frequency (F0) of unseen speakers. In this study, we propose MIDI-Voice 2, which uses MIDI-driven priors to achieve high-quality singing voice synthesis, even in zero-shot tasks. We introduce a diffusion-based singing voice synthesis model that operates without F0. MIDI-Voice 2 consists of two diffusion models: a prior generator and a singing voice generator. The prior generator takes MIDI-driven priors, which carry accurate melody information, and generates MIDI-style priors; the singing voice generator then combines these MIDI-style priors with content and timbre information to generate singing voices. Disentangling melody from timbre allows speaker adaptation without predicting the F0 of an unseen speaker. Additionally, we use a transformer-based diffusion model to generate higher-quality audio in zero-shot tasks. Our experiments demonstrate that MIDI-Voice 2 improves speaker adaptation without F0 and produces high-quality audio in zero-shot tasks.
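
To illustrate why a MIDI score can supply melody without any F0 prediction, the following minimal sketch expands note-level MIDI into a frame-level pitch track using the standard MIDI-to-frequency mapping. It is not the prior construction used by MIDI-Voice 2, which this page does not describe. The function names, the `hop_seconds` value, and the convention that note 0 marks a rest are all assumptions made for the example.

```python
import numpy as np

def midi_to_hz(note):
    """Standard MIDI-to-frequency mapping (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((np.asarray(note, dtype=float) - 69.0) / 12.0)

def notes_to_frames(notes, hop_seconds=0.0125):
    """Expand (midi_note, duration_seconds) pairs into a frame-level pitch track.

    hop_seconds is a hypothetical frame hop, not a value from the paper.
    A note value of 0 is treated as a rest (an assumption of this sketch).
    """
    frames = []
    for note, duration in notes:
        n_frames = max(1, int(round(duration / hop_seconds)))
        hz = 0.0 if note == 0 else float(midi_to_hz(note))
        frames.extend([hz] * n_frames)
    return np.array(frames)

# Hypothetical three-event score: A4, a rest, then C5.
score = [(69, 0.5), (0, 0.25), (72, 0.5)]
melody = notes_to_frames(score)
```

Because the pitch track comes entirely from the score, it is the same for every target speaker. This matches the abstract's point that melody can be separated from timbre.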

Seen Speaker Singing Voice Synthesis

All speakers are seen during inference

Source Singing Voice	Target Speaker	Generated
Script: 마음이 전부 떠난대도 (Pronunciation): maeumi jeonbu tteonandaedo
GT	GT	VISinger	Grad-TTS
		MIDI-Voice	MIDI-Voice 2
Script: 본다 내 숨결 같은 (Pronunciation): bonda nae sumgyeol gateun
GT	GT	VISinger	Grad-TTS
		MIDI-Voice	MIDI-Voice 2
Script: 우릴 쉽게 보지 못해 (Pronunciation): uril swipge boji moshae
GT	GT	VISinger	Grad-TTS
		MIDI-Voice	MIDI-Voice 2

Zero-shot Singing Voice Style Transfer

All speakers are unseen during inference

Source Singing Voice	Target Speaker	Generated
Script: 사랑은 (Pronunciation): sarangeun
GT	GT	VISinger	Grad-TTS
		MIDI-Voice	MIDI-Voice 2
Script: 혼자 살다 혼자 가는거죠 (Pronunciation): honja salda honja ganeungeojyo
GT	GT	VISinger	Grad-TTS
		MIDI-Voice	MIDI-Voice 2
Script: 그대와 나의 지난날 (Pronunciation): geudaewa naui jinannal
GT	GT	VISinger	Grad-TTS
		MIDI-Voice	MIDI-Voice 2
Script: 너를 구해줄게 (Pronunciation): neoreul guhaejulge
GT	GT	VISinger	Grad-TTS
		MIDI-Voice	MIDI-Voice 2

Diffusion Backbone Model

All speakers are seen during inference

Source Singing Voice	Generated
Script: 하늘도 볼거야 (Pronunciation): haneuldo bolgeoya
GT	WaveNet	U-net
	Transformer-base	Transformer-large
Script: 남들 앞에선 괜찮다고 (Pronunciation): namdeul apeseon gwaenchanhdago
GT	WaveNet	U-net
	Transformer-base	Transformer-large
Script: 그렇게 내일을 살고 싶어 (Pronunciation): geureohge naeireul salgo sipeo
GT	WaveNet	U-net
	Transformer-base	Transformer-large
Script: 소리치며 달려 보아도 (Pronunciation): sorichimyeo dallyeo boado
GT	WaveNet	U-net
	Transformer-base	Transformer-large