Maisarah, Zorkeflee (2015) An enhanced resampling technique for imbalanced data sets. Masters thesis, Universiti Utara Malaysia.
Abstract
A data set is considered imbalanced if the instances of one class (the majority class) outnumber those of the other class (the minority class). The main problem with binary imbalanced data sets is that classifiers tend to ignore the minority class. Numerous resampling techniques, such as undersampling, oversampling, and combinations of both, have been widely used. However, undersampling and oversampling techniques suffer from the elimination and addition of relevant data, respectively, which may lead to poor classification results. Hence, this study aims to improve classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, Fuzzy Distance-based Undersampling (FDUS) is proposed. Entropy estimation is used to produce fuzzy thresholds that categorise the instances in the majority and minority classes into membership functions. FDUS is then combined with the Synthetic Minority Oversampling TEchnique (SMOTE), a combination known as FDUS+SMOTE, in which the two techniques are executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE are compared with four techniques based on classification accuracy, F-measure and G-mean. FDUS achieved better classification accuracy, F-measure and G-mean than the other techniques, with averages of 80.57%, 0.85 and 0.78, respectively. This showed that fuzzy logic, when incorporated into the Distance-based Undersampling technique, was able to reduce the elimination of relevant data. Further, the findings showed that FDUS+SMOTE performed better than the combination of SMOTE and Tomek Links, and the combination of SMOTE and Edited Nearest Neighbour, on benchmark data sets. FDUS+SMOTE minimised the removal of relevant data from the majority class and avoided overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small data sets containing approximately 100 to 800 instances.
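The three evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions (F-measure as the harmonic mean of precision and recall, G-mean as the square root of sensitivity times specificity) with the minority class treated as positive; the thesis may use a different variant. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and are not taken from the thesis.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Return (accuracy, F-measure, G-mean) from confusion-matrix counts.

    Assumption: the minority class is the positive class.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    g_mean = math.sqrt(recall * specificity)
    return accuracy, f_measure, g_mean


# Hypothetical data set: 90 majority and 10 minority instances.
# A classifier that ignores the minority class labels everything as majority.
ignore_minority = binary_metrics(tp=0, fn=10, fp=0, tn=90)

# A classifier that recovers 8 of the 10 minority instances
# at the cost of 9 false alarms.
detects_minority = binary_metrics(tp=8, fn=2, fp=9, tn=81)

print("ignores minority:", ignore_minority)
print("detects minority:", detects_minority)
```

In this made-up example the first classifier has the higher accuracy, yet its F-measure and G-mean are both zero, which is why studies of imbalanced data commonly report these measures alongside accuracy.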
Item Type: | Thesis (Masters) |
---|---|
Supervisor: | Mohamed Din, Aniza and Ku Mahamud, Ku Ruhana |
Item ID: | 5330 |
Uncontrolled Keywords: | Imbalanced data, Resampling technique, Undersampling technique, Oversampling technique, Fuzzy logic |
Subjects: | Q Science > QA Mathematics > QA76 Computer software > QA76.76 Fuzzy System. |
Divisions: | Awang Had Salleh Graduate School of Arts & Sciences |
Date Deposited: | 20 Dec 2015 02:14 |
Last Modified: | 04 Apr 2021 07:31 |
URI: | https://etd.uum.edu.my/id/eprint/5330 |