MIDI-Voice Demo

Abstract

We propose MIDI-Voice, an approach for expressive singing voice synthesis (SVS) and robust zero-shot singing voice style transfer. Singing voice synthesis models have recently made significant progress with generative models. However, previous SVS models inaccurately predict the prior and the fundamental frequency (F0) for unseen speakers, which results in low-quality generated singing voices. To address these issues, we adopt a score-based diffusion model for singing voice synthesis and introduce a MIDI-based prior for better singing voice style adaptation. We first generate a MIDI-driven prior from the musical score. Because this prior contains only note information and no speaker information, it enables high-quality singing voice adaptation. Furthermore, we propose a DDSP-based MIDI-style prior to synthesize a more expressive singing voice. Experimental results show that MIDI-Voice outperforms previous models in synthesizing expressive singing voices and achieves superior zero-shot singing voice style transfer performance.
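To make the idea of "note information only" concrete, the following is a minimal sketch of how a musical score can be turned into a speaker-independent, frame-level pitch contour. It is an illustration under stated assumptions, not the MIDI-Voice prior itself, which the abstract describes as generated by the model. The function names, the `(note, duration)` score format, and the 100 frames-per-second rate are all hypothetical; only the MIDI-to-frequency formula (A4 = note 69 = 440 Hz) is standard.

```python
def midi_to_hz(note: int) -> float:
    """Convert a MIDI note number to frequency in Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def notes_to_frame_f0(notes, frame_rate=100.0):
    """Expand (midi_note, duration_seconds) pairs into a frame-level F0 contour.

    A note value of None is treated as a rest and mapped to 0.0 Hz.
    The score format and frame rate are assumptions made for this sketch.
    """
    f0 = []
    for note, duration in notes:
        n_frames = round(duration * frame_rate)
        value = 0.0 if note is None else midi_to_hz(note)
        f0.extend([value] * n_frames)
    return f0


# Hypothetical three-event score: A4 for 0.5 s, a 0.25 s rest, then C5 for 0.5 s.
score = [(69, 0.5), (None, 0.25), (72, 0.5)]
contour = notes_to_frame_f0(score, frame_rate=100.0)
```

The resulting contour depends only on the score, so it is identical for every singer. That is the property the abstract attributes to the MIDI-driven prior.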

MIDI-Voice

Mel spectrogram by diffusion step

MIDI-driven prior
Step 0	Step 1	Step 10	Step 20	Step 30	Step 50	Step 100

MIDI-style prior
Step 0	Step 1	Step 10	Step 20	Step 30	Step 50	Step 100


Seen Speaker Singing Voice Synthesis

All speakers are seen during training

Zero-shot Singing Voice Style Transfer

All speakers are unseen during training

Script: 언젠가 그 미소 내게 닿을 수 있나요 (Pronunciation): eonjenga geu miso naege daheul su issnayo
Source Singing Voice	Target Speaker	Generated
GT spk90	GT s10_m	VISinger	Grad-TTS	MIDI-Voice (MIDI-driven)	MIDI-Voice (MIDI-style)

Script: 언젠가 그 미소 내게 닿을 수 있나요 (Pronunciation): eonjenga geu miso naege daheul su issnayo
Source Singing Voice	Target Speaker	Generated
GT spk90	GT s05_f	VISinger	Grad-TTS	MIDI-Voice (MIDI-driven)	MIDI-Voice (MIDI-style)

Script: 물어봐 난 누굴까 워어어 열심히 걸어가다 보면 (Pronunciation): mureobwa nan nugunga woeoeo yeolsimhi georeogada bomyeon
Source Singing Voice	Target Speaker	Generated
GT spk81	GT s02_m	VISinger	Grad-TTS	MIDI-Voice (MIDI-driven)	MIDI-Voice (MIDI-style)