MIDI-Voice: Expressive Zero-shot Singing Voice Synthesis via MIDI-driven Priors

 

Dong-Min Byun, Sang-Hoon Lee, Ji-Sang Hwang, and Seong-Whan Lee

Abstract

We propose MIDI-Voice, an approach for expressive singing voice generation and zero-shot singing voice synthesis (SVS). Recent SVS models have made significant progress with generative models. However, previous SVS models predict the prior and the fundamental frequency (F0) inaccurately for unseen speakers, which lowers the quality of the generated singing voice. To address these issues, MIDI-Voice targets expressive singing voice synthesis and robust zero-shot singing voice style transfer. We adopt a score-based diffusion model for singing voice synthesis and introduce a MIDI-based prior for better singing voice style adaptation. We first generate a MIDI-driven prior from the musical score; because this prior contains only note information and no speaker information, it enables high-quality singing voice adaptation. Furthermore, we propose a DDSP-based MIDI-style prior to synthesize a more expressive singing voice. The experimental results show that MIDI-Voice outperforms previous models in synthesizing expressive singing voices and in zero-shot singing voice style transfer.
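To illustrate why a prior built only from the musical score carries no speaker information, here is a minimal sketch that expands note-level MIDI pitches into a frame-level pitch contour. It is not the authors' implementation: the function names, the convention that note 0 marks a rest, and the frame rate of 100 frames per second are assumptions made for this example. The only fixed fact used is the standard equal-temperament mapping in which MIDI note 69 corresponds to A4 at 440 Hz.

```python
import numpy as np


def midi_to_hz(note):
    """Equal-temperament conversion: MIDI note 69 is A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((np.asarray(note, dtype=float) - 69.0) / 12.0)


def notes_to_frame_f0(notes, durations_sec, frame_rate=100.0):
    """Expand note-level pitches into a frame-level F0 contour.

    notes: MIDI note numbers; 0 is treated as a rest (assumption).
    durations_sec: duration of each note in seconds.
    frame_rate: frames per second (hypothetical value for this sketch).
    """
    frames = []
    for note, dur in zip(notes, durations_sec):
        n_frames = int(round(dur * frame_rate))
        # Rests get F0 = 0; voiced notes get the pitch of the written note.
        f0 = 0.0 if note == 0 else float(midi_to_hz(note))
        frames.append(np.full(n_frames, f0))
    return np.concatenate(frames) if frames else np.zeros(0)


# A4 for 0.5 s, a 0.25 s rest, then A5 for 0.5 s.
f0 = notes_to_frame_f0([69, 0, 81], [0.5, 0.25, 0.5])
print(f0.shape)
```

The resulting contour depends only on the written notes and their durations, so it is identical for every singer; any speaker-specific detail has to be added by later stages of the model.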


MIDI-Voice



Mel spectrogram by diffusion step

[Figure: mel spectrograms generated from the MIDI-driven prior and from the MIDI-style prior, each shown at diffusion steps 0, 1, 10, 20, 30, 50, and 100.]


Seen Speaker Singing Voice Synthesis

All speakers are seen during training


Zero-shot Singing Voice Style Transfer

All speakers are unseen during training

Script: 언젠가 그 미소 내게 닿을 수 있나요
(Pronunciation): eonjenga geu miso naege daheul su issnayo
Source Singing Voice: GT (spk90)
Target Speaker: GT (s10_m)
Generated: VISinger; Grad-TTS; MIDI-Voice (MIDI-driven); MIDI-Voice (MIDI-style)

Script: 언젠가 그 미소 내게 닿을 수 있나요
(Pronunciation): eonjenga geu miso naege daheul su issnayo
Source Singing Voice: GT (spk90)
Target Speaker: GT (s05_f)
Generated: VISinger; Grad-TTS; MIDI-Voice (MIDI-driven); MIDI-Voice (MIDI-style)

Script: 물어봐 난 누굴까 워어어 열심히 걸어가다 보면
(Pronunciation): mureobwa nan nugunga woeoeo yeolsimhi georeogada bomyeon
Source Singing Voice: GT (spk81)
Target Speaker: GT (s02_m)
Generated: VISinger; Grad-TTS; MIDI-Voice (MIDI-driven); MIDI-Voice (MIDI-style)