Title Lithuanian text difficulty characterization with syllables frequencies
Translation of Title Lietuviškų tekstų sudėtingumo analizė.
Authors Štulaitė, Laima
Pages 52
Keywords [eng] Zipf’s law ; Yule model ; Beta model ; Zipf-Mandelbrot ; rank-frequency distribution ; syllable’s entropy rate ; syllable’s conditional entropy ; complex text classification ; gradient boost classification.
Abstract [eng] The frequency of words in a language is well described by Zipf's (1949) law. However, studies at the syllable level are relatively rare in quantitative linguistics, and Zipf's law does not necessarily describe the distribution of syllables. In examining the frequency of syllable occurrence in the Lithuanian language, I found that the ranked frequencies of syllables are best described by the Yule distribution model. The Yule equation fits the distribution of Lithuanian syllable rank frequencies better than the Zipf, Beta, and Zipf-Mandelbrot models. To characterize the complexity of the Lithuanian language, I employed Shannon and conditional entropy measures. The Shannon entropy rate averaged 8.91 bits of information per syllable across the Lithuanian text corpus, and the conditional entropy, conditioned on the preceding syllable, averaged 6.45. The Shannon entropy rate was used to classify more complex texts, and the gradient boost classification algorithm demonstrated the best accuracy and balance in classifying fractions of syllables from 80 Lithuanian texts into complex and not complex categories.
Dissertation Institution Vilniaus universitetas.
Type Master thesis
Language English
Publication date 2024