Title |
The n-grams based text similarity detection approach using self-organizing maps and similarity measures |
Authors |
Stefanovič, Pavel ; Kurasova, Olga ; Štrimaitis, Rokas |
DOI |
10.3390/app9091870 |
Is Part of |
Applied sciences. Basel : MDPI AG. 2019, vol. 9, iss. 9, art. no. 1870, p. [1-14]. eISSN 2076-3417 |
Keywords [eng] |
self-organizing maps ; text mining ; text similarity measures ; n-grams ; frequency matrix |
Abstract [eng] |
This paper proposes a word-level n-gram approach for detecting similarity between texts. The approach combines two independent techniques: the self-organizing map (SOM) and text similarity measures. The SOM is distinctive in that it presents the results of data clustering and dimensionality reduction in visual form. Four similarity measures are evaluated: cosine, Dice, extended Jaccard, and overlap. The texts must first be converted to a numerical representation: each text is split into word-level n-grams, from which a bag of n-grams is created. The n-gram frequencies are then calculated to form the frequency matrix of the dataset. Various filters are applied when creating the bag of n-grams, including stemming algorithms, number and punctuation removal, and stop-word removal. All experiments were carried out on the corpus of plagiarized short answers dataset. |
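The pipeline described in the abstract (word-level n-grams, a frequency matrix, then pairwise similarity measures) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the two toy sentences, the simple regex filter standing in for the number and punctuation filters, and the commonly used vector-space forms of the four measures, which may differ in detail from the authors' definitions. Stemming, stop-word removal, and the SOM step are omitted.

```python
from collections import Counter
import math
import re


def word_ngrams(text, n=2):
    """Split text into word-level n-grams (hypothetical helper)."""
    # Lowercase and keep only runs of letters: a crude stand-in for the
    # number and punctuation filters. No stemming or stop-word removal.
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def frequency_matrix(texts, n=2):
    """Return the sorted bag of n-grams and one frequency row per text."""
    counts = [Counter(word_ngrams(t, n)) for t in texts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[g] for g in vocab] for c in counts]


def similarities(x, y):
    """Four similarity measures in their common vector-space forms (assumed)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sum(a * a for a in x)  # squared norm of x
    ny = sum(b * b for b in y)  # squared norm of y
    if nx == 0 or ny == 0:
        # An empty text shares nothing with any other text.
        return dict.fromkeys(
            ("cosine", "dice", "extended_jaccard", "overlap"), 0.0
        )
    return {
        "cosine": dot / math.sqrt(nx * ny),
        "dice": 2 * dot / (nx + ny),
        "extended_jaccard": dot / (nx + ny - dot),
        "overlap": dot / min(nx, ny),
    }


# Toy input, invented for this example.
texts = ["The cat sat on the mat.", "The cat sat on the sofa."]
vocab, matrix = frequency_matrix(texts, n=2)
scores = similarities(matrix[0], matrix[1])
print(vocab)
print(matrix)
print(scores)
```

The two toy sentences share four of their five bigrams, so cosine, Dice, and overlap each evaluate to 0.8 and extended Jaccard to about 0.67. In the approach summarized above, the rows of the frequency matrix would also serve as the input vectors for the SOM.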
Published |
Basel : MDPI AG |
Type |
Journal article |
Language |
English |
Publication date |
2019 |