Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder

Porteš, David; Horák, Aleš

Authors	PORTEŠ David, HORÁK Aleš
Year of publication	2024
Type	Article in Proceedings
Conference	Text, Speech, and Dialogue
MU Faculty or unit	Faculty of Informatics
Doi	http://dx.doi.org/10.1007/978-3-031-70566-3_13
Keywords	Fundamental Frequency; Prosody; VQ-VAE; Vector Embeddings
Description	Language models operating on discrete audio representations are increasingly becoming the go-to framework for many speech-processing tasks. Recently, discrete embeddings of the fundamental frequency (F0) have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions of the F0 curve, which are regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters in order to achieve the highest possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and, in total, comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the HuggingFace website.
Related projects:	Using artificial intelligence techniques for data processing, complex analysis and visualization of large-scale data
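
The F0 Frame Error (FFE) quoted in the description is commonly defined as the share of frames with either a voicing decision error or a gross pitch error, that is, a voiced frame whose predicted F0 deviates from the reference by more than 20%. The record does not state which exact variant the paper uses, so the sketch below is only an illustration of that common definition. The function name `f0_frame_error`, the 20% tolerance, the convention of encoding unvoiced frames as 0.0, and the sample values are all assumptions made for this example.

```python
def f0_frame_error(reference, predicted, tolerance=0.2):
    """Fraction of frames with a voicing error or a gross pitch error.

    Assumptions of this sketch: both tracks are frame-aligned lists of
    F0 values in Hz, and unvoiced frames are encoded as 0.0.
    """
    if len(reference) != len(predicted):
        raise ValueError("F0 tracks must have the same number of frames")
    if not reference:
        raise ValueError("F0 tracks must not be empty")

    errors = 0
    for ref, pred in zip(reference, predicted):
        ref_voiced = ref > 0
        pred_voiced = pred > 0
        if ref_voiced != pred_voiced:
            # Voicing decision error: one track is voiced, the other is not.
            errors += 1
        elif ref_voiced and abs(pred - ref) > tolerance * ref:
            # Gross pitch error: both voiced, but the deviation is too large.
            errors += 1
    return errors / len(reference)


# Made-up five-frame example: one gross pitch error, one voicing error.
reference = [0.0, 120.0, 125.0, 130.0, 0.0]
predicted = [0.0, 121.0, 160.0, 0.0, 0.0]
ffe = f0_frame_error(reference, predicted)
print(f"FFE = {ffe:.1%}")  # FFE = 40.0%
```

Under this definition, a reconstructed F0 curve scoring 0.53% FFE would have roughly one erroneous frame in every 190.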