Title Feature selection for a better imbalanced data classification: a financial fraud detection case
Translation of Title Požymių atranka siekiant gerinti nesubalansuotų duomenų klasifikavimą: finansinio sukčiavimo atvejis.
Authors Breskuvienė, Dalia
DOI 10.15388/vu.thesis.848
Pages 168
Keywords [eng] imbalanced data ; fraud detection ; SOM ; feature selection
Abstract [eng] Fraud detection remains a critical challenge in the financial sector, requiring innovative approaches to detect and prevent losses caused by increasingly sophisticated fraudulent activities. This dissertation addresses several aspects of improving fraud detection: using clustering as a preprocessing step, encoding strategies for imbalanced data, and the importance of feature selection. First, we propose a clustering-based classification method to increase recall in credit card fraud detection. By optimizing feature selection and the number of clusters to form more homogeneous subsets for training, and by strategically undersampling each cluster, we improved recall from 0.845 to 0.867, a statistically significant reduction of 13.9% in the number of misclassified fraudulent cases. Second, we investigate the impact of categorical feature encoding on model performance. Through experiments on datasets with less than 1% fraud prevalence and the application of six encoding methods, we find that target-based encoding, especially James-Stein and Weight of Evidence (WOE), significantly outperforms alternatives such as CatBoost encoding in imbalanced settings. Our results highlight the importance of careful preprocessing, especially when dealing with high-cardinality categorical features and the curse of dimensionality. Finally, we introduce FID-SOM (Feature Selection for Imbalanced Data Using SOM), a novel feature selection method tailored for highly imbalanced datasets. Leveraging self-organizing maps, FID-SOM identifies and ranks features by their contribution to the variability of the weight-vector attributes of best-matching units, enabling effective dimensionality reduction without losing critical information. The experimental results show that FID-SOM can match or surpass traditional feature selection techniques in fraud detection tasks.
Our findings offer a comprehensive framework to enhance machine learning-based fraud detection in real-world, large-scale, and highly imbalanced datasets.
Dissertation Institution Vilniaus universitetas.
Type Doctoral thesis
Language English
Publication date 2025