Abstract [eng] |
Today, highly scalable computing environments make it possible to carry out various data-intensive natural language processing and machine-learning tasks. One of them is the classification of textual data, several aspects of which have recently been investigated by many data scientists. In this dissertation, big-data classification tasks are carried out with the machine learning toolkit MLlib on Apache Spark, an in-memory data analytics framework. Such intensive in-memory computations open the door to classification methods that are effective in solving big-data multi-class text-classification tasks. In this thesis, the Naïve Bayes, Random Forest, Decision Tree, Support Vector Machines, Logistic Regression and Multilayer Perceptron classifiers are experimentally examined and compared on multi-class classification, with a focus on how classification accuracy depends on the size of the training dataset and the number of n-grams. The proposed feature selection, a combination of n-grams, term frequency, inverse document frequency, part of speech, and noise reduction, together with the classifiers used, solves the multi-class classification problem with higher classification accuracy. The findings indicate the optimal feature selection, which can be applied to a variety of short texts, such as product-review classification within sentiment analysis. The applied data analytics frameworks are horizontally scalable in a multi-node cloud computing environment and allow us to run the most widely known classification algorithms to understand and predict textual data, supporting knowledge gathering and decision-making processes. In the experiments, short product-review texts from Amazon were analyzed. |