UUM ETD | Universiti Utara Malaysian Electronic Theses and Dissertation
FAQs | Feedback | Search Tips | Sitemap

An enhanced resampling technique for imbalanced data sets

Maisarah, Zorkeflee (2015) An enhanced resampling technique for imbalanced data sets. Masters thesis, Universiti Utara Malaysia.

[img] Text
s814594.pdf
Restricted to Registered users only

Download (954kB)
[img]
Preview
Text
s814594_abstract.pdf

Download (396kB) | Preview

Abstract

A data set is considered imbalanced if the distribution of instances in one class (majority class) outnumbers the other class (minority class). The main problem related to binary imbalanced data sets is classifiers tend to ignore the minority class. Numerous resampling techniques such as undersampling, oversampling, and a combination of both techniques have been widely used. However, the undersampling and oversampling techniques suffer from elimination and addition of relevant data which may lead to poor classification results. Hence, this study aims to increase classification metrics by enhancing the undersampling technique and combining it with an existing oversampling technique. To achieve this objective, a Fuzzy Distancebased Undersampling (FDUS) is proposed. Entropy estimation is used to produce fuzzy thresholds to categorise the instances in majority and minority class into membership functions. FDUS is then combined with the Synthetic Minority Oversampling TEchnique (SMOTE) known as FDUS+SMOTE, which is executed in sequence until a balanced data set is achieved. FDUS and FDUS+SMOTE are compared with four techniques based on classification accuracy, F-measure and Gmean. From the results, FDUS achieved better classification accuracy, F-measure and G-mean, compared to the other techniques with an average of 80.57%, 0.85 and 0.78, respectively. This showed that fuzzy logic when incorporated with Distance-based Undersampling technique was able to reduce the elimination of relevant data. Further, the findings showed that FDUS+SMOTE performed better than combination of SMOTE and Tomek Links, and SMOTE and Edited Nearest Neighbour on benchmark data sets. FDUS+SMOTE has minimised the removal of relevant data from the majority class and avoid overfitting. On average, FDUS and FDUS+SMOTE were able to balance categorical, integer and real data sets and enhanced the performance of binary classification. Furthermore, the techniques performed well on small record size data sets that have of instances in the range of approximately 100 to 800.

Item Type: Thesis (Masters)
Uncontrolled Keywords: Imbalanced data, Resampling technique, Undersampling technique, Oversampling technique, Fuzzy logic
Subjects: Q Science > QA Mathematics > QA76 Computer software > QA76.76 Fuzzy System.
Divisions: Awang Had Salleh Graduate School of Arts & Sciences
Depositing User: Mr. Badrulsaman Hamid
Date Deposited: 20 Dec 2015 02:14
Last Modified: 24 Apr 2016 04:13
URI: http://etd.uum.edu.my/id/eprint/5330

Actions (login required)

View Item View Item