Title Kai kurių lietuvių kalbos teksto kirčiavimo aspektų matematinis modeliavimas
Translation of Title Mathematical modelling of some aspects of stressing a Lithuanian text.
Authors Anbinderis, Tomas
Pages 160
Keywords [eng] clitics ; homographs ; text stressing ; text-to-speech synthesis
Abstract [eng] This dissertation deals with one component of a speech synthesizer, automatic text stressing, and with two related tasks: disambiguation of homographs (words that can be stressed in several ways) and detection of clitics (unstressed words). To stress Lithuanian text, a method was applied that uses decision trees to find letter sequences that unambiguously determine the stress of a word. The decision trees were built from a large corpus of stressed words. Stressing rules based on letter sequences at the beginning, in the middle, and at the end of a word were formulated. The proposed algorithm reaches an accuracy of about 95.5%. The homograph disambiguation algorithm proposed by the author is based on frequencies of lexemes and morphological features obtained from a corpus of about one million words. Such methods had not previously been used for the Lithuanian language. The proposed algorithm selects the correct stress variant with an accuracy of 85.01%. In addition, the author proposes four types of methods for finding clitics in Lithuanian text: methods based on recognising combinational forms, on the statistical stressed/unstressed frequency of a word, on grammar rules, and on the stressing of adjacent words. The dissertation explains how to combine all these methods into a single algorithm. On the test data, errors amounted to 4.1% of all words, and the ratio of errors to unstressed words was 18.8%.
Type Doctoral thesis
Language Lithuanian
Publication date 2010
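
The abstract describes homograph disambiguation based on frequencies of lexemes and morphological features. The following is only a minimal sketch of that general idea, not the author's algorithm: the function name pick_variant, the placeholder lexemes, feature tags and counts, and the choice to score a candidate by multiplying the two frequencies are all assumptions made for illustration.

```python
from collections import Counter


def pick_variant(candidates, lexeme_freq, feature_freq):
    """Return the stressed form whose lexeme and features are most frequent.

    candidates: list of (stressed_form, lexeme, features) tuples.
    The product of the two counts is an illustrative scoring rule (assumption).
    """
    def score(candidate):
        _, lexeme, features = candidate
        # Counter returns 0 for unseen keys, so unknown readings score 0.
        return lexeme_freq[lexeme] * feature_freq[features]

    return max(candidates, key=score)[0]


# Hypothetical corpus counts; placeholders, not data from the dissertation.
lexeme_freq = Counter({"lexeme_a": 120, "lexeme_b": 15})
feature_freq = Counter({"noun.sg.nom": 900, "verb.3.prs": 400})

# Two possible readings of one homograph, each with its own stress variant.
candidates = [
    ("form_with_stress_1", "lexeme_a", "noun.sg.nom"),
    ("form_with_stress_2", "lexeme_b", "verb.3.prs"),
]

best = pick_variant(candidates, lexeme_freq, feature_freq)
print(best)
```

A real system would take its counts from a morphologically annotated corpus and would need a policy for ties and unseen readings, which this sketch does not address.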