Audio Samples from "Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN"

Abstract: Singing voice synthesis (SVS) is the computer production of a human-like singing voice from a given musical score. To accomplish end-to-end SVS effectively and efficiently, this work adopts the acoustic model plus neural vocoder architecture established for high-quality speech and singing voice synthesis. Specifically, this work pursues a higher level of expressiveness in the synthesized voice by combining the denoising diffusion probabilistic model (DDPM) and the Wasserstein generative adversarial network (WGAN) to form the backbone of the acoustic model. On top of the proposed acoustic model, a HiFiGAN neural vocoder is adopted with integrated fine-tuning to ensure optimal synthesis quality for the resulting end-to-end SVS system. The end-to-end system was evaluated on the multi-singer Mpop600 Mandarin singing voice dataset. In the experiments, the proposed system showed improvements over previous landmark counterparts in musical expressiveness and high-frequency acoustic detail. Moreover, the adversarial acoustic model converged stably without enforcing any reconstruction objective, demonstrating the convergence stability of the combined DDPM and WGAN architecture over alternative GAN-based SVS systems.
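
To make the training idea in the abstract concrete, below is a minimal sketch (not the authors' implementation) of a DDPM denoiser whose predicted clean mel frames are scored by a Wasserstein critic instead of an L1 reconstruction loss. The per-frame MLPs stand in for the real score-conditioned networks, and T, the noise schedule, and all dimensions are illustrative assumptions.

```python
# Minimal sketch (NOT the authors' implementation) of a DDPM denoiser
# trained against a Wasserstein critic rather than an L1 reconstruction
# loss.  Per-frame MLPs stand in for the real score-conditioned networks;
# T, the noise schedule, and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

T = 100                                        # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal level

class Denoiser(nn.Module):
    """Predicts the noise mixed into a mel frame at diffusion step t."""
    def __init__(self, mel_dim=80, hidden=256):
        super().__init__()
        self.t_emb = nn.Embedding(T, hidden)
        self.net = nn.Sequential(
            nn.Linear(mel_dim + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, mel_dim))

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, self.t_emb(t)], dim=-1))

class Critic(nn.Module):
    """Scalar Wasserstein critic over clean or denoised mel frames."""
    def __init__(self, mel_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)

G, D = Denoiser(), Critic()
g_opt = torch.optim.RMSprop(G.parameters(), lr=5e-5)
d_opt = torch.optim.RMSprop(D.parameters(), lr=5e-5)

def training_step(x0):
    """One adversarial update; x0 is a [batch, 80] tensor of real mel frames."""
    t = torch.randint(0, T, (x0.shape[0],))
    a = alpha_bar[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps               # forward diffusion
    x0_hat = (x_t - (1 - a).sqrt() * G(x_t, t)) / a.sqrt()   # predicted clean frame

    # Critic update: minimize D(fake) - D(real); weight clipping keeps the
    # critic approximately 1-Lipschitz, as in the original WGAN.
    d_loss = D(x0_hat.detach()).mean() - D(x0).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    for p in D.parameters():
        p.data.clamp_(-0.01, 0.01)

    # Generator update: purely adversarial, with no reconstruction term;
    # this is the property the abstract highlights.
    g_opt.zero_grad(); (-D(x0_hat).mean()).backward(); g_opt.step()

training_step(torch.randn(8, 80))  # smoke test on random "mel" frames
```

The generator objective here is purely adversarial, mirroring the abstract's claim that the model converges without a reconstruction term; the actual acoustic model would also condition the denoiser on the encoded musical score and singer identity.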


Samples synthesized by different systems

One musical score segment was sampled from the testing data of each of Mpop600's two female and two male singers.

Female1 Sample

Reference Audio

HiFiGAN Resynthesized

Model-FFT

Model-Diff-L1

Model-Diff-Mixed

Model-Diff-WGAN

Female2 Sample

Reference Audio

HiFiGAN Resynthesized

Model-FFT

Model-Diff-L1

Model-Diff-Mixed

Model-Diff-WGAN

Male1 Sample

Reference Audio

HiFiGAN Resynthesized

Model-FFT

Model-Diff-L1

Model-Diff-Mixed

Model-Diff-WGAN

Male2 Sample

Reference Audio

HiFiGAN Resynthesized

Model-FFT

Model-Diff-L1

Model-Diff-Mixed

Model-Diff-WGAN


Same musical score, different singer vector

Two musical score segments were sampled, one from a female singer and one from a male singer. During synthesis, each model was given the same musical score segment but different singer identity vectors to test its multi-singer synthesis capability. Note that no pitch-range adjustment was applied to the input musical scores.
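
As a rough illustration of what swapping singer identity vectors means at inference time, the hypothetical sketch below fuses one encoded score with different learned singer embeddings. The SingerConditioner module, the four-entry embedding table, and the additive fusion are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of inference-time singer swapping: one encoded score
# is combined with different learned singer embeddings.  SingerConditioner,
# the four-entry table, and additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class SingerConditioner(nn.Module):
    def __init__(self, n_singers=4, dim=256):
        super().__init__()
        self.table = nn.Embedding(n_singers, dim)   # one vector per singer

    def forward(self, score_enc, singer_id):
        # Broadcast-add the singer vector onto every frame of the score.
        return score_enc + self.table(singer_id).unsqueeze(1)

cond = SingerConditioner()
score = torch.randn(1, 120, 256)  # [batch, frames, dim] encoded musical score
for name, sid in [("Female1", 0), ("Female2", 1), ("Male1", 2), ("Male2", 3)]:
    z = cond(score, torch.tensor([sid]))  # same score, different singer vector
    print(name, tuple(z.shape))           # z would then drive the diffusion decoder
```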

Sample1 original singer reference:

Synthesized by Model-Diff-L1

with Female1 singer vector

with Female2 singer vector

with Male1 singer vector

with Male2 singer vector

Synthesized by Model-Diff-WGAN

with Female1 singer vector

with Female2 singer vector

with Male1 singer vector

with Male2 singer vector

Sample2 original singer reference:

Synthesized by Model-Diff-L1

with Female1 singer vector

with Female2 singer vector

with Male1 singer vector

with Male2 singer vector

Synthesized by Model-Diff-WGAN

with Female1 singer vector

with Female2 singer vector

with Male1 singer vector

with Male2 singer vector