Table of Links
Abstract and 1 Introduction
2 Related Work
2.1 Neural Codec Language Models and 2.2 Non-autoregressive Models
2.3 Diffusion Models and 2.4 Zero-shot Voice Cloning
3 HierSpeech++ and 3.1 Speech Representations
3.2 Hierarchical Speech Synthesizer
3.3 Text-to-Vec
3.4 Speech Super-resolution
3.5 Model Architecture
4 Speech Synthesis Tasks
4.1 Voice Conversion and 4.2 Text-to-Speech
4.3 Style Prompt Replication
5 Experiment and Result, and 5.1 Dataset
5.2 Preprocessing and 5.3 Training
5.4 Evaluation Metrics
5.5 Ablation Study
5.6 Zero-shot Voice Conversion
5.7 High-diversity but High-fidelity Speech Synthesis
5.8 Zero-shot Text-to-Speech
5.9 Zero-shot Text-to-Speech with 1s Prompt
5.10 Speech Super-resolution
5.11 Additional Experiments with Other Baselines
6 Limitation and Quick Fix
7 Conclusion, Acknowledgement and References
6 LIMITATION AND QUICK FIX
Although our model significantly improves zero-shot speech synthesis performance, it also synthesizes the environmental noise contained in a noisy prompt. Because we do not disentangle voice and noise during voice modeling, the model reproduces the background noise encoded in the global voice style representation. To address this issue, we utilize a denoiser [58] to remove the noisy components from the voice style representation: before being fed to the style encoder, the audio is first passed through the denoiser, the denoised audio is transformed by STFT, and the resulting denoised Mel-spectrogram is then fed to the style encoder.

TABLE 12 shows that simply using the denoiser for the style encoder improves audio quality in terms of UTMOS. However, the denoised style also degrades reconstruction quality in terms of CER and WER. Since the denoised ground truth (GT) is degraded on all metrics, we found that the denoiser we used also removes part of the speech itself, which in turn degrades the pronunciation of the synthetic speech. To mitigate this, we interpolate the style representations extracted from the original and denoised speech with a denoising ratio ratio_d, as sketched below. This simple interpolation significantly improves audio quality by removing the environmental noise without degrading CER or WER. Regarding SECS, the denoised speech also shows a low SECS, which indicates that Resemblyzer is likewise affected by environmental information such as reverberation. It is worth noting that we only utilize the denoiser for the style encoder during inference; applying the denoiser to the synthetic speech instead degrades all metrics, as shown by the results of HierSpeech++♠.
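As a concrete illustration, the snippet below is a minimal sketch of the inference-time style extraction described above: the prompt is denoised, both the original and denoised prompts are converted to Mel-spectrograms and encoded, and the two global style embeddings are linearly interpolated by the denoising ratio ratio_d. The names `denoiser`, `mel_transform`, and `style_encoder`, as well as the default ratio, are illustrative placeholders for the paper's components, not its actual implementation.

```python
import torch


def blended_style(prompt_audio: torch.Tensor,
                  denoiser,        # placeholder: speech denoiser of [58]
                  mel_transform,   # placeholder: STFT-based Mel transform
                  style_encoder,   # placeholder: the model's style encoder
                  ratio_d: float = 0.5) -> torch.Tensor:
    """Return an interpolated voice-style embedding for inference.

    ratio_d = 0.0 keeps the original (possibly noisy) style;
    ratio_d = 1.0 uses only the denoised style, which can hurt CER/WER
    because the denoiser may also remove parts of the speech itself.
    """
    denoised_audio = denoiser(prompt_audio)                # strip background noise
    s_orig = style_encoder(mel_transform(prompt_audio))    # style of noisy prompt
    s_den = style_encoder(mel_transform(denoised_audio))   # style of denoised prompt
    # Linear interpolation of the two global style representations.
    return ratio_d * s_den + (1.0 - ratio_d) * s_orig
```

Choosing an intermediate ratio_d trades off the UTMOS gain from denoising against the CER/WER loss from over-aggressive noise removal; the denoiser is applied only at this style-extraction step, never to the synthesized output.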
Authors:
(1) Sang-Hoon Lee, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(2) Ha-Yeong Choi, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(3) Seung-Bin Kim, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea;
(4) Seong-Whan Lee, Fellow, IEEE, with the Department of Artificial Intelligence, Korea University, Seoul 02841, South Korea, and the corresponding author.