Abstract [eng] |
Rapid extraction of key information from text documents is a pressing problem, because text makes up the bulk of the unstructured data being generated. The complexity of the Lithuanian language, with its variety of word forms, complicates the task even more. This motivates the search for text-characterizing elements that are structurally simpler than the word. The application of methods based on Lithuanian syllables has not been studied before. In this work, the syllable properties of Lithuanian fiction texts (original or translated into Lithuanian) are explored, and the possibility of using syllable characteristics for text classification is investigated. A new algorithm for classifying text fragments by genre is developed using two-stage logistic regression. In the first stage, syllable odds are modeled using binomial logistic regression. In the second stage, the characteristics of these odds are modeled, and other syllable features are also used for classification. The developed algorithm is compared with other classification algorithms. |
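The two-stage idea summarized above can be illustrated with a minimal sketch. This is not the algorithm developed in the work, only one way such a scheme could look under stated assumptions: fragments are already split into syllables, there are only two genres (labeled 0 and 1), the only second-stage feature is the mean per-syllable log-odds ratio, and all function names, the smoothing constant `alpha`, and the toy data are hypothetical. For stage one, the sketch relies on the fact that a binomial logistic regression with a single binary predictor (genre) has a slope equal to the empirical log-odds ratio, so it is computed in closed form; the additive smoothing is an extra assumption that departs slightly from the maximum-likelihood fit.

```python
import math
from collections import Counter


def syllable_log_odds(fragments, labels, alpha=0.5):
    """Stage 1 (assumed form): log-odds ratio of each syllable, genre 1 vs genre 0.

    alpha is additive smoothing (an assumption) that avoids log(0)
    for syllables seen in only one genre.
    """
    counts = {0: Counter(), 1: Counter()}
    for syllables, y in zip(fragments, labels):
        counts[y].update(syllables)
    totals = {y: sum(c.values()) for y, c in counts.items()}
    log_odds = {}
    for s in set(counts[0]) | set(counts[1]):
        odds = []
        for y in (0, 1):
            hit = counts[y][s] + alpha                # tokens equal to s
            miss = totals[y] - counts[y][s] + alpha   # all other tokens
            odds.append(hit / miss)
        log_odds[s] = math.log(odds[1] / odds[0])
    return log_odds


def fragment_feature(syllables, log_odds):
    """Assumed stage-2 feature: mean log-odds ratio (unseen syllables count as 0)."""
    vals = [log_odds.get(s, 0.0) for s in syllables]
    return sum(vals) / len(vals) if vals else 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.5, epochs=500):
    """Stage 2 (assumed form): one-feature logistic regression, batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            gw += (p - y) * x
            gb += p - y
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def predict(syllables, log_odds, w, b):
    """Return the predicted genre label (0 or 1) for a syllabified fragment."""
    return 1 if w * fragment_feature(syllables, log_odds) + b > 0 else 0


# Toy, hand-made data for illustration only (not taken from the study).
train = [
    (["va", "sa", "ra"], 0),
    (["sa", "ra", "va", "sa"], 0),
    (["nak", "tis", "nak"], 1),
    (["tis", "nak", "tis", "ta"], 1),
]
fragments = [f for f, _ in train]
labels = [y for _, y in train]

log_odds = syllable_log_odds(fragments, labels)
features = [fragment_feature(f, log_odds) for f in fragments]
w, b = fit_logistic(features, labels)

print(predict(["nak", "ta", "tis"], log_odds, w, b))
print(predict(["ra", "sa"], log_odds, w, b))
```

In a realistic setting the second stage would presumably take several features per fragment and more than two genres, which would call for a multinomial model and a proper syllabifier; both are left out here to keep the sketch short.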