Title |
Categorical feature encoding techniques for improved classifier performance when dealing with imbalanced data of fraudulent transactions |
Authors |
Breskuvienė, Dalia ; Dzemyda, Gintautas |
DOI |
10.15837/ijccc.2023.3.5433 |
Is Part of |
International journal of computers communications & control. Oradea : Agora University. 2023, vol. 18, iss. 3, art. no. 5433, p. [1-17]. ISSN 1841-9836. eISSN 1841-9844 |
Keywords [eng] |
imbalanced data ; classifier ; feature encoding ; high-cardinality ; fraud detection |
Abstract [eng] |
Fraudulent transaction data tend to have several categorical features with high cardinality. This complicates data preprocessing when the categories in such features have no order or meaningful mapping to numerical values. Even though many encoding techniques exist, their impact on highly imbalanced massive data sets has not been thoroughly evaluated. Two transaction datasets in which fraud accounts for less than 1% of records were used in our study. Six encoding methods, each either target-agnostic or target-based, were employed. The experiments used several machine-learning techniques, including ensemble learning and both linear and non-linear learning approaches. Our study emphasizes the significance of carefully selecting an appropriate encoding approach for imbalanced datasets and machine learning algorithms. Using target-based encoding techniques can enhance model performance significantly. Among the various encoding methods assessed, the James-Stein and Weight of Evidence (WOE) encoders were the most effective, whereas the CatBoost encoder may not be optimal for imbalanced datasets. Moreover, it is crucial to bear in mind the curse of dimensionality when employing encoding techniques like hashing and One-Hot encoding. |
Published |
Oradea : Agora University |
Type |
Journal article |
Language |
English |
Publication date |
2023 |