Feature selection for a better imbalanced data classification: a financial fraud detection case

Dalia Breskuvienė

doi:10.15388/vu.thesis.848

Title	Feature selection for a better imbalanced data classification: a financial fraud detection case
Translation of Title	Požymių atranka siekiant gerinti nesubalansuotų duomenų klasifikavimą: finansinio sukčiavimo atvejis.
Authors	Breskuvienė, Dalia
DOI	10.15388/vu.thesis.848
Pages	168
Keywords [eng]	imbalanced data ; fraud detection ; SOM ; feature selection
Abstract [eng]	Fraud detection remains a critical challenge in the financial sector, requiring innovative approaches to detect and prevent losses caused by increasingly sophisticated fraudulent activities. This dissertation addresses several aspects of improving fraud detection: using clustering as a preprocessing step, encoding strategies for imbalanced data, and the importance of feature selection. First, we propose a clustering-based classification method to increase recall in credit card fraud detection. By optimizing feature selection and the number of clusters to form more homogeneous training subsets, and by strategically undersampling each cluster, we improved recall from 0.845 to 0.867, a statistically significant improvement that reduced the number of misclassified fraudulent cases by 13.9%. Second, we investigate the impact of categorical feature encoding on model performance. Through experiments with six encoding methods on datasets with less than 1% fraud prevalence, we find that target-based encoding, especially James-Stein and Weight of Evidence (WOE), significantly outperforms alternatives such as CatBoost encoding in imbalanced settings. Our results highlight the importance of careful preprocessing, especially when dealing with high-cardinality categorical features and the curse of dimensionality. Finally, we introduce FID-SOM (Feature Selection for Imbalanced Data Using SOM), a novel feature selection method tailored for highly imbalanced datasets. Leveraging self-organizing maps, FID-SOM identifies and ranks features by their contribution to the variability of the weight-vector attributes of the best-matching units, enabling effective dimensionality reduction without losing critical information. The experimental results show that FID-SOM can match or surpass traditional feature selection techniques in fraud detection tasks. Our findings offer a comprehensive framework to enhance machine learning-based fraud detection in real-world, large-scale, and highly imbalanced datasets.
Dissertation Institution	Vilniaus universitetas.
Type	Doctoral thesis
Language	English
Publication date	2025
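
Illustrative note: the Weight of Evidence (WOE) encoding named in the abstract can be sketched as follows. This is a minimal standard-library sketch and not the dissertation's implementation. The function name `woe_encode`, the additive `smoothing` term, and the sign convention (fraud share over non-fraud share) are all assumptions made here for illustration; other conventions invert the ratio.

```python
import math
from collections import defaultdict


def woe_encode(categories, labels, smoothing=0.5):
    """Map each category to ln(P(category | fraud) / P(category | non-fraud)).

    `smoothing` is an assumed additive constant that keeps the logarithm
    finite for categories seen in only one class, which is common when
    fraud prevalence is very low.
    """
    fraud = defaultdict(float)
    legit = defaultdict(float)
    for cat, y in zip(categories, labels):
        if y == 1:
            fraud[cat] += 1
        else:
            legit[cat] += 1

    levels = set(fraud) | set(legit)
    # Smoothed class totals, so the per-class shares still sum to 1.
    total_fraud = sum(fraud.values()) + smoothing * len(levels)
    total_legit = sum(legit.values()) + smoothing * len(levels)

    woe = {}
    for cat in levels:
        p_fraud = (fraud[cat] + smoothing) / total_fraud
        p_legit = (legit[cat] + smoothing) / total_legit
        woe[cat] = math.log(p_fraud / p_legit)
    return woe


# Toy data (hypothetical): channel of each transaction and its fraud label.
channels = ["web", "web", "pos", "pos", "pos", "atm"]
labels = [1, 0, 0, 0, 0, 1]
mapping = woe_encode(channels, labels)
encoded = [mapping[c] for c in channels]
```

Under this convention, a positive value marks a category that is over-represented among fraud cases and a negative value marks one that is over-represented among legitimate cases. Each categorical feature becomes a single numeric column, which avoids the dimensionality growth of one-hot encoding on high-cardinality features.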
