| Abstract [eng] |
This master's thesis analyses the application of machine learning models for imputing missing values in economic indicators at the NUTS 2 regional level. NUTS 2 regional economic indicators serve as a fundamental instrument for the allocation of European Union structural funds and the formulation of regional policy. However, the systematic occurrence of missing values in statistical datasets impedes objective cross-regional comparisons and undermines evidence-based policy-making. The objective of this study is to implement and optimise Random Forest and XGBoost machine learning models for economic indicator imputation and to evaluate their performance across varying levels of data missingness. The research methodology comprises a systematic literature review conducted in accordance with the PRISMA guidelines, statistical identification of the missing-data mechanism using Little's MCAR test, and an experimental framework employing a synthetic hold-out approach with three levels of missingness (10%, 20%, and 30%). Hyperparameter optimisation was performed using RandomizedSearchCV. The compiled dataset encompasses 78 economic indicators across 244 NUTS 2 regions over a 25-year period, enabling a comprehensive assessment of model stability and predictive accuracy under heterogeneous missing-data conditions. The experimental results demonstrate that the XGBoost model consistently outperforms the Random Forest model across all evaluation metrics, including normalised Root Mean Square Error (nRMSE), normalised Mean Absolute Error (nMAE), coefficient of determination (R²), and symmetric Mean Absolute Percentage Error (sMAPE). Although hyperparameter optimisation substantially enhances the stability and predictive accuracy of the Random Forest model, the XGBoost model achieves superior performance even with baseline parameter configurations across all missingness scenarios.
The results confirm that the optimised XGBoost model constitutes the most appropriate solution for imputing missing economic indicators at the NUTS 2 regional level, ensuring both high predictive accuracy and robust performance. These findings carry practical implications for European Union statistical agencies and regional policy-makers, providing a methodological foundation for more reliable assessments of regional economic conditions and resilience. |
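
The synthetic hold-out evaluation described in the abstract can be illustrated with a minimal sketch. It is not the thesis code: the data are randomly generated, the helper names (`mask_at_random`, `column_mean_impute`, `score`) are hypothetical, and a column-mean imputer stands in for the Random Forest and XGBoost models. The abstract does not define the normalisation or the sMAPE variant, so the sketch assumes normalisation by the range of the held-out true values and the sMAPE form bounded by 200%.

```python
import numpy as np

def mask_at_random(X, rate, rng):
    """Hide a fraction `rate` of the observed cells.

    Returns the masked copy and a boolean mask of the held-out cells.
    """
    observed_idx = np.flatnonzero(~np.isnan(X))
    n_hide = int(round(rate * observed_idx.size))
    hide = rng.choice(observed_idx, size=n_hide, replace=False)
    holdout = np.zeros(X.shape, dtype=bool)
    holdout.flat[hide] = True
    X_masked = X.copy()
    X_masked[holdout] = np.nan
    return X_masked, holdout

def column_mean_impute(X_masked):
    """Placeholder imputer (assumption): fill gaps with the column mean."""
    col_means = np.nanmean(X_masked, axis=0)
    return np.where(np.isnan(X_masked), col_means, X_masked)

def score(y_true, y_pred):
    """Compute the four metrics named in the abstract on held-out cells."""
    err = y_pred - y_true
    spread = y_true.max() - y_true.min()  # assumed normaliser: range
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    # Assumed sMAPE variant, bounded between 0% and 200%.
    smape = 100 * np.mean(2 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred)))
    return {
        "nRMSE": rmse / spread,
        "nMAE": mae / spread,
        "R2": 1 - ss_res / ss_tot,
        "sMAPE": smape,
    }

# Synthetic, strictly positive stand-in data (200 rows, 5 indicators).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=0.5, size=(200, 5))

results = {}
for rate in (0.10, 0.20, 0.30):
    X_masked, holdout = mask_at_random(X, rate, rng)
    X_imputed = column_mean_impute(X_masked)
    # Score only the cells that were deliberately hidden.
    results[rate] = score(X[holdout], X_imputed[holdout])
    print(rate, {k: round(float(v), 3) for k, v in results[rate].items()})
```

Replacing `column_mean_impute` with a model-based imputer leaves the masking and scoring steps unchanged, which allows different models to be compared on the same hidden cells.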