Abstract [eng] |
The goal of this thesis is to estimate the proportion of e-commerce enterprises in Lithuania using machine learning methods. Enterprises were first manually classified as e-commerce or non-e-commerce to obtain true labels for model training and performance testing. The companies' websites were scraped to collect their text. Machine learning algorithms were combined with natural language processing (NLP) methods for the classification task. Logistic regression, Naïve Bayes, Support Vector Machines, Extreme Gradient Boosting, and BERT models were used to classify enterprises as e-commerce or non-e-commerce based on the text extracted from their websites. The inverse probability weighting estimator was then applied to estimate the proportion of e-commerce enterprises in Lithuania. The estimated proportion is 0.25. The BERT model achieved the best classification performance. |
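
As an illustration of the inverse probability weighting step, the following minimal sketch computes a weighted proportion from binary e-commerce labels. It uses one common form of the estimator (the Hájek-type ratio estimator), which is an assumption here, since the abstract does not state which variant was applied. The function name `ipw_proportion`, the inclusion probabilities, and the sample values are all hypothetical and serve only to show the arithmetic.

```python
from typing import Sequence


def ipw_proportion(labels: Sequence[int], inclusion_probs: Sequence[float]) -> float:
    """Hajek-type inverse probability weighted estimate of a proportion.

    labels: 1 if the enterprise is classified as e-commerce, else 0.
    inclusion_probs: assumed probability that each enterprise ends up in
        the observed sample (hypothetical inputs, not from the thesis).
    """
    if len(labels) != len(inclusion_probs):
        raise ValueError("labels and inclusion_probs must have the same length")
    if not labels:
        raise ValueError("at least one observation is required")
    if any(not 0.0 < p <= 1.0 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must lie in (0, 1]")

    # Each observed enterprise is weighted by the inverse of its
    # inclusion probability, so under-represented units count more.
    weights = [1.0 / p for p in inclusion_probs]
    weighted_positives = sum(w * y for w, y in zip(weights, labels))

    # Dividing by the sum of weights gives a proportion between 0 and 1.
    return weighted_positives / sum(weights)


# Made-up example values, for illustration only.
predicted_labels = [1, 0, 0, 1, 0]
inclusion_probs = [0.5, 0.25, 0.25, 0.5, 0.5]
estimate = ipw_proportion(predicted_labels, inclusion_probs)
print(round(estimate, 3))
```

When all inclusion probabilities are equal, this estimate reduces to the plain sample share of enterprises labelled as e-commerce.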