Keywords [eng] |
Keywords: natural language processing, NLP, speech synthesis, text-to-speech, TTS, phonemic orthography, automatic text stressing, automatic accentuation, speech dataset, speech corpus, Tacotron 2, WaveGlow, VITS, kalbos sintezė, sintezatorius, automatinis kirčiuoklis, kirčiuoklis, kalbos duomenų rinkinys |
Abstract [eng] |
This master’s thesis proposes using stressed text instead of phonemes as the input to TTS neural networks, in order to solve the pronunciation problem of synthesized speech for languages with a higher degree of phonemic orthography. The Tacotron 2 and VITS architectures were used to train neural networks on multiple Lithuanian language datasets. Three single-speaker Lithuanian speech corpora were collected for the model training experiments, containing 6, 27, and 92 hours of speech data, respectively. A survey was then conducted to calculate mean opinion scores (MOS) and evaluate each trained TTS neural network. Furthermore, the initial experimental results of training a neural network-based accentuation model are detailed. The accentuation model is required as a pre-processing component of the TTS model to solve the synthesized speech pronunciation problem. The best trained model achieves a character-level accuracy of 93%, but it is not practical, since it assigns stress marks to all the letters in the input sequence instead of assigning a single pitch accent to each word in the sequence. A link is provided to a website demonstrating speech samples generated by the developed synthesizers. The base pre-trained neural network models are also provided in the links below. |