Title Lithuanian speech synthesis using neural networks
Translation of Title Kalbą generuojantys neuroniniai tinklai lietuvių kalbai.
Authors Radzevičius, Arnas
Pages 63
Keywords [eng] natural language processing, NLP, speech synthesis, text-to-speech, TTS, phonemic orthography, automatic text stressing, automatic accentuation, speech dataset, speech corpus, Tacotron 2, WaveGlow, VITS, kalbos sintezė, sintezatorius, automatinis kirčiuoklis, kirčiuoklis, kalbos duomenų rinkinys
Abstract [eng] This master’s thesis proposes using stressed text instead of phonemes as the input to TTS neural networks, in order to solve the pronunciation problem of synthesized speech for languages with a higher degree of phonemic orthography. Tacotron 2 and VITS architectures were used to train neural networks on multiple Lithuanian datasets. Three single-speaker Lithuanian speech corpora were collected for the training experiments, totaling 6, 27, and 92 hours of speech data, respectively. A survey was conducted to calculate mean opinion scores (MOS) and evaluate each trained TTS neural network. The thesis also details the initial experimental results of training a neural network-based accentuation model, which is required as a pre-processing component for the TTS model to solve the synthesized speech pronunciation problem. The best trained accentuation model achieves a character-level accuracy of 93%, but it is not practical, since it assigns stress marks to all the letters in the input sequence instead of assigning a single pitch accent to each word. Readers are provided a link to a website demonstrating speech samples generated by the developed synthesizers. The base pre-trained neural network models are also provided in the links below.
Dissertation Institution Vilniaus universitetas.
Type Master’s thesis
Language English
Publication date 2022