Title Transformer-Based Lithuanian text stressing for speech synthesis
Translation of Title Transformeriu grįstas lietuviško teksto kirčiavimas kalbos sintezei.
Authors Mackevič, Robert
Pages 52
Keywords [eng] Transformer, Lithuanian text stressing, speech synthesis, natural language processing, context awareness.
Abstract [eng] This thesis explores the integration of automatic stress prediction into Lithuanian text-to-speech (TTS) systems by developing and evaluating a Transformer-based model for assigning stress marks in written text. Lithuanian is a stress-sensitive and morphologically rich language in which stress is typically omitted in writing and must be inferred from context, which poses challenges for natural-sounding speech synthesis. The proposed approach introduces a modular neural network that explicitly assigns stress marks prior to synthesis, aiming to improve stress realization in synthesized speech. Two core hypotheses were investigated: (1) that a neural network model can outperform prominent rule-based stressing tools (such as “Kirčiuoklis”, developed by Vytautas Magnus University) in stress prediction by leveraging sentence-level context, and (2) that using automatically stressed text as input leads to more context-aware stress realization in synthesized speech than using plain-text or phonemic input. Experimental results largely support these hypotheses. The Transformer model achieved higher stress prediction accuracy than the prominent rule-based tools and was capable of generalizing to unseen words during inference. However, in the context-awareness evaluation, the rule-based “Kirčiuoklis” tool outperformed the Transformer, revealing limitations in the model’s ability to leverage context effectively. Nevertheless, TTS models trained on stress-annotated text produced significantly more accurate stress realization in audio than plain-text or phoneme-based models. This confirms the practical benefit of explicit stress modeling, especially for low-resource languages. The findings suggest that while end-to-end TTS remains the long-term goal, a modular approach to stress assignment offers immediate and scalable improvements for Lithuanian speech synthesis. Future work should focus on expanding the annotated dataset and refining the model architecture to enhance context awareness and generalization.
Dissertation Institution Vilniaus universitetas.
Type Master thesis
Language English
Publication date 2026