Abstract [eng] |
The frequency of words in a language is well described by Zipf's (1949) law. However, studies at the syllable level are relatively rare in quantitative linguistics, and Zipf's law does not necessarily describe the distribution of syllables. In examining the frequency of syllable occurrence in the Lithuanian language, I found that the ranked frequencies of syllables are best described by the Yule distribution model. The Yule equation fits the distribution of Lithuanian syllable rank frequencies better than the Zipf, Beta, and Zipf-Mandelbrot models. To characterize the complexity of the Lithuanian language, I employed Shannon and conditional entropy measures. The Shannon entropy rate averaged 8.91 bits of information per syllable across the Lithuanian text corpus, and the conditional entropy averaged 6.45 when conditioned on the preceding syllable. The Shannon entropy rate was used to identify the more complex texts, and the gradient boosting classification algorithm demonstrated the best accuracy and balance in classifying fractions of syllables from 80 Lithuanian texts into complex and non-complex categories. |
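The two entropy measures named above can be illustrated with a minimal Python sketch. It assumes the text has already been split into a list of syllables, and it uses simple plug-in (relative-frequency) estimates; the function names `shannon_entropy` and `conditional_entropy` are hypothetical and are not taken from the work itself, whose exact estimation procedure is not stated in the abstract.

```python
from collections import Counter
from math import log2


def shannon_entropy(syllables):
    """Plug-in estimate of H(X) in bits per syllable.

    Assumption: `syllables` is a non-empty list of already segmented
    syllable strings.
    """
    counts = Counter(syllables)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def conditional_entropy(syllables):
    """Plug-in estimate of H(X_n | X_{n-1}) in bits.

    Uses adjacent syllable pairs: the joint probability p(a, b) is the
    relative frequency of the pair, and p(b | a) is the pair count
    divided by the count of `a` as a preceding syllable.
    """
    pairs = Counter(zip(syllables, syllables[1:]))
    preceding = Counter(syllables[:-1])
    total = sum(pairs.values())
    return -sum(
        (c / total) * log2(c / preceding[a])
        for (a, _b), c in pairs.items()
    )


# Toy input only; these are not syllables from the Lithuanian corpus.
toy = ["la", "ba", "la", "ba"]
h = shannon_entropy(toy)
h_cond = conditional_entropy(toy)
```

In the toy sequence each syllable fully determines the next one, so the conditional estimate is lower than the unconditional one; the same direction of difference appears in the corpus averages reported above (6.45 versus 8.91).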