You are a machine learning engineer, and you are given a dataset related to a rare but fatal disease. You are asked to design an ML model that can classify whether a person suffers from that disease or not. After performing EDA and trying different algorithms, you fit the data to your highest-performing model. Eureka! The accuracy of the model is 98.37%. Satisfied with the high accuracy score, you test the model on real-time data. Investigations reveal that the model predicts only 1.63% of cases as “diagnosed positive”. This results in False Negative classifications, costing the lives of patients who were actually positive but were predicted negative by the model. This is where accuracy as an evaluation metric fails miserably. The aim of this article is to focus on the evaluation metrics used in classification modeling and on handling classification on an imbalanced dataset.
In a classification model, the evaluation metric “accuracy” can be misleading, just as we saw in the example above. Evaluation metrics such as the Confusion Matrix, Precision, Recall, F1 score and AUC ROC score are more reliable on imbalanced data and give a truer picture of how well the model predicts. Please read about evaluation metrics in classification models before you continue reading this article.
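To make the failure concrete, here is a minimal sketch. The cohort of 10,000 patients and the "always predict healthy" model are assumptions chosen for illustration; only the 1.63% positive rate echoes the opening example.

```python
# Hypothetical cohort: 10,000 patients, 1.63% of whom have the disease.
n_patients = 10_000
n_positive = 163
y_true = [1] * n_positive + [0] * (n_patients - n_positive)

# A degenerate "model" (an assumption for illustration) that labels everyone healthy.
y_pred = [0] * n_patients

# Accuracy = correct predictions / all predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / n_patients

# Recall = detected positives / all actual positives.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_positive

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

The accuracy looks excellent even though not a single sick patient is detected, which is exactly why recall has to be inspected alongside it.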
Interpreting Precision & Recall values
· high recall & high precision: Classification by the model has been handled very well.
· low recall & high precision: The model misses many of the positive cases, but whenever it does classify a case as positive, the prediction can be trusted.
· high recall & low precision: The model finds most of the positive cases, but many of its positive predictions are false alarms.
· low recall & low precision: Poor classification by the model.
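The two mixed cases are easier to tell apart with numbers. The sketch below computes precision, recall and F1 from confusion-matrix counts; the counts themselves are made up for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts, 100 actual positives in each case.
# Low recall, high precision: few positives found, almost no false alarms.
cautious = precision_recall_f1(tp=20, fp=1, fn=80)
# High recall, low precision: most positives found, many false alarms.
eager = precision_recall_f1(tp=95, fp=300, fn=5)

for name, (p, r, f1) in [("cautious", cautious), ("eager", eager)]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```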
Why does the model yield high accuracy but a low ROC score?
Sometimes the classes in a dataset are distributed so unequally that the rare class makes up only a very small share of the data. And usually, the classification solution is expected to predict precisely this rare class. Because of the class imbalance, the most commonly used classification algorithms do not perform very well, scoring low on precision and recall. The algorithms also tilt towards predicting the majority class, since their loss functions try to minimise the overall error rate without taking the class distribution into account.
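A small sketch shows how the two scores come apart. It assumes a toy set of 100 labels with 5 positives and a classifier that always predicts the majority class, and it computes ROC AUC directly from its rank interpretation (the chance that a random positive is scored above a random negative, with ties counting half).

```python
def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical data: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A classifier tilted entirely towards the majority class.
always_negative = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
auc = roc_auc(y_true, always_negative)

print(f"accuracy = {accuracy:.2f}, ROC AUC = {auc:.2f}")
```

Accuracy rewards the model for the 95 easy negatives, while the ROC score stays at chance level because the model cannot separate the two classes at all.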
How to handle an imbalanced dataset?
With an imbalanced dataset in hand, the developer is haunted by questions such as: How should the precision / recall score be improved? Several years of research on this aspect have brought forth a few methods for handling such a dataset. Which is the best classification algorithm to fit to an imbalanced dataset? There is no concrete answer to this. What are the factors that affect performance on an imbalanced dataset? Lack of correlation between the features and low variance are key factors in low performance. So how should the dataset be handled?
Understand the goal
Before we deal with the problem, we must understand the final “goal” of the classification. For example, accuracy can be considered if the goal is to classify the majority class, because accuracy is at its highest when the model responds to this class. Lower scores on other metrics should then not be viewed as a problem.
Use the metric that gives us the best performance with respect to the goal that has been set. Let’s take a tour of the various methods for handling an imbalanced dataset. There are generally two major approaches to classification on imbalanced data.
The first is cost-sensitive classification: assigning a high cost to incorrect classification and thus trying to reduce the overall cost. The second approach is to use a sampling technique (undersampling or oversampling) to minimise the imbalance in the data.
Cost-sensitive learning trains the algorithm to take the overall cost of its predictions into account when predicting a class. Consider an imbalanced binary classification problem in which the majority class is labelled Negative or Class 0, while the minority or rare class is labelled Positive or Class 1. The challenge in an imbalanced set is to detect the positive class correctly, since it is a rare or exceptional event. Just as in the example above, diagnosing whether the person has the disease, i.e. detecting the positive class correctly, is the GOAL. Here are a few examples of imbalanced datasets.
- Loan Disbursement: In this example, the positive class is the loan being disbursed to the customer. Denying a loan to a good customer is acceptable, but disbursing a loan to a bad customer is considered a misclassification. This is critical because the cost incurred will be high if the loan is disbursed to a bad customer with a poor credit history. (We will work through this example.)
- Detecting Disease: Detecting a disease such as cancer is crucial. True positive detection must be accurate, which means the model must perform with high precision and high recall. False negative cases strictly cannot be accepted: classifying a patient as healthy when he actually has cancer can jeopardise his life.
Cost-sensitive learning is best understood using a cost matrix. Just as a confusion matrix gives us a summary of the predictions made by the model, a cost matrix assigns a cost to each cell of the confusion matrix, so that the overall cost can be minimised during model training.
From the above cost matrix, we can infer that the cost of wrong detection is higher than correct detection. The cost for False Positive and False Negative is (1,0) while that of True Positive and True Negative is (1,1).
The effectiveness of cost sensitive learning during training totally depends on the weights assigned in the cost matrix. The value of the cost must be carefully chosen, so as to get the optimum performance.
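As a rough illustration of how a cost matrix changes which model looks best, the sketch below multiplies a confusion matrix cell by cell with a cost matrix. All counts and cost values here are hypothetical (they are not taken from the article's cost matrix): correct predictions cost nothing, a false positive costs 1 and a false negative costs 10.

```python
import numpy as np

# Layout for both matrices: rows = actual class, columns = predicted class,
# i.e. [[TN, FP], [FN, TP]].
cost = np.array([[0, 1],
                 [10, 0]])  # hypothetical costs: FN is 10x worse than FP

# Two hypothetical models evaluated on the same 10,000 cases (163 positives).
model_a = np.array([[9800, 37],
                    [120, 43]])   # few false alarms, many missed positives
model_b = np.array([[9500, 337],
                    [40, 123]])   # more false alarms, fewer missed positives

def total_cost(confusion, cost_matrix):
    """Sum of count * cost over every cell of the confusion matrix."""
    return int((confusion * cost_matrix).sum())

def accuracy(confusion):
    """Share of cases on the diagonal (TN + TP)."""
    return float(np.trace(confusion) / confusion.sum())

for name, cm in [("A", model_a), ("B", model_b)]:
    print(f"model {name}: accuracy={accuracy(cm):.4f} cost={total_cost(cm, cost)}")
```

Under these assumed costs the less accurate model B is the cheaper one, because it misses far fewer positive cases.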
Cost sensitive learning algorithms
The scikit-learn library provides the parameter class_weight to enable cost-sensitive learning in classifiers such as DecisionTreeClassifier and SVC. Cost-sensitive learning using DecisionTreeClassifier has gained a lot of attention. I have chosen XGBClassifier.
XGBClassifier is a highly effective machine learning algorithm with an array of hyperparameters that let us tinker with the training of the model. It is also an efficient implementation of stochastic gradient boosting, which helps the algorithm converge faster. It has also performed well in imbalanced classification. XGBClassifier provides the hyperparameter scale_pos_weight to incorporate cost-sensitive learning during training, tuning the model to compensate for class imbalance. Gradient boosting fits each new tree using the gradient of the loss function (the error of the current model). To tune the model for class imbalance, XGBClassifier uses scale_pos_weight to scale the gradient for the positive class, so that errors on positive examples have a larger effect on training.
scale_pos_weight = total_negative_examples / total_positive_examples.
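The ratio above is easy to compute from the training labels. This is a minimal sketch: the helper name compute_scale_pos_weight and the label counts are assumptions for illustration, with 0 as the negative class and 1 as the positive class.

```python
from collections import Counter

def compute_scale_pos_weight(labels):
    """Return total_negative_examples / total_positive_examples."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive examples in the labels")
    return counts[0] / counts[1]

# Hypothetical training labels: 9,837 negatives and 163 positives.
y_train = [0] * 9837 + [1] * 163
spw = compute_scale_pos_weight(y_train)
print(f"scale_pos_weight = {spw:.2f}")
```

The resulting value could then be passed to the classifier as XGBClassifier(scale_pos_weight=spw); it is a starting point rather than a guaranteed optimum, which is why trying several weights is still worthwhile.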
Let us see how imbalance can be treated in the existing classifiers.
Training on an Imbalanced dataset
Let us consider ROC as the scoring measure for classifying the dataset provided by Analytics Vidhya in Online Hackathon 3.X, predict-customer-worth-for-happy-customer-bank. The dataset consists of loan data. Based on the given features, the goal is to classify whether the loan is disbursed or not.
Shape of the training data: (87020, 26)
Shape of the test data: (37717, 24)
- Too many unique values in City, so replaced the cities with count < 125 as “Others”
- Replaced Employer_Name and Salary_Account with count < 30 as “Others”
- Replaced missing values for all categorical variables with “Others”. Processing_Fee, Existing_EMI and Loan_Amount_Applied missing values were replaced with 0
- Converted variables to numeric values using Label_Encoding
- Identified outliers using box-plot for the feature Monthly_Income and dropped values > 1000000
- Split DOB into three features — DOB_day, DOB_month and DOB_year. Dropped Lead_Creation_Date
- Removed the feature LoggedIn from the training data, since the feature is missing in the test file
- Created new features — Current EMI and Debt_to_Income
- Did not fix the missing values of the other float fields, since XGBoost handles missing values.
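The category clean-up steps in the list above can be sketched in a few lines. The helper name group_rare and the toy city list are assumptions for illustration; the article uses thresholds of 125 and 30 on the real columns, while the toy example uses 3 so the effect is visible.

```python
from collections import Counter

def group_rare(values, min_count, other="Others"):
    """Replace categories seen fewer than min_count times, and missing
    values (None), with the placeholder category `other`."""
    counts = Counter(v for v in values if v is not None)
    return [
        v if v is not None and counts[v] >= min_count else other
        for v in values
    ]

# Hypothetical City column with one rare city and one missing value.
cities = ["Delhi"] * 4 + ["Mumbai"] * 3 + ["Agra", None]
grouped = group_rare(cities, min_count=3)
print(Counter(grouped))
```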
Using XGBoost as the classifier (you may try other classifiers too), the model fits the data with a high accuracy of 98.4% but a low ROC score of 50.4%. Accuracy cannot be taken as the evaluation metric in this case. The class imbalance has resulted in such low Precision / Recall scores.
Now let us try tuning the XGBClassifier hyperparameter scale_pos_weight:
weights = [1, 10, 25, 50, 75, 99, 100, 1000]
The ROC score has improved after tuning the hyperparameter scale_pos_weight, and Precision / Recall have improved as well. The ROC score improved by 17.6% after adjusting the hyperparameter. Thus, by choosing an appropriate value for the hyperparameter, the model can be tuned to minimize misclassification and improve prediction.
Whenever there is an imbalance in the data, the usual line of understanding is that the data is not a true representation of the population / reality. If this is so, the next step towards resolving the issue is to collect more representative data. It is not always possible to collect more real data. Sampling techniques can be applied to such a dataset to bridge the disproportion between classes. Let us see the various techniques, their merits and demerits, and when they should be used. Sampling refers to a technique designed to balance a class distribution that is skewed.
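A weight sweep of this kind can be sketched as follows. To keep the example dependent on scikit-learn only, it swaps in LogisticRegression with class_weight as a stand-in for XGBClassifier with scale_pos_weight, and it uses synthetic data from make_classification rather than the loan dataset, so the numbers it prints say nothing about the results reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 2% positives.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

weights = [1, 10, 25, 50, 75, 99, 100, 1000]
results = {}
for w in weights:
    # class_weight={0: 1, 1: w} makes an error on a positive example
    # count w times as much as an error on a negative example.
    model = LogisticRegression(class_weight={0: 1, 1: w}, max_iter=1000)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[w] = (precision_score(y_test, pred, zero_division=0),
                  recall_score(y_test, pred))

for w, (precision, recall) in results.items():
    print(f"weight={w:>4}: precision={precision:.3f} recall={recall:.3f}")
```

Larger weights typically buy recall at the price of precision, so the weight should be chosen against whichever metric matches the goal of the problem.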
Undersampling, Oversampling, SMOTE, Bagging With Random Undersampling
Undersampling is a technique in which data belonging to the majority class is removed from the training data, reducing the skew and balancing the distribution to 1:1 or 1:2. Although it is an effective method, the data points are deleted at random, without any concern for how important they might be in decision making. The technique has been improved by heuristic undersampling methods, which identify redundant data points for deletion or protect useful data points from being removed. There is a strong improvement in the ROC score (ROC score = 74.81%) after applying this sampling technique to the training dataset. The precision and recall scores after undersampling the data are better than those obtained with cost-sensitive learning on this dataset.
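Random undersampling can be sketched with NumPy alone. The function name random_undersample, its ratio parameter and the toy data are assumptions for illustration; label 1 is taken to be the minority class.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, seed=0):
    """Keep every minority row (label 1) and a random subset of majority
    rows (label 0). `ratio` is the number of majority rows kept per
    minority row: 1.0 gives 1:1, 2.0 gives 1:2."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_keep = min(len(majority_idx), int(ratio * len(minority_idx)))
    # Majority rows are dropped purely at random, regardless of usefulness.
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([minority_idx, kept_majority]))
    return X[keep], y[keep]

# Toy data: 10 positives and 90 negatives.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 10 + [0] * 90)

X_res, y_res = random_undersample(X, y, ratio=1.0)
print(X_res.shape, int(y_res.sum()))
```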
Oversampling refers to duplicating data points from the minority class and adding them to the dataset; the resampled dataset then replaces the original one for training. This technique can be effective for models that are influenced by a skewed class distribution, because the duplicated minority points carry more weight when the model is fit. It can be very effective on the training dataset but may perform very poorly on unseen data. The redundant copies in the minority class lead to a high chance of overfitting. They also increase the computational cost of fitting the data, and the overfitting can result in poor performance on the test dataset.
Undersampling performs better than oversampling on this training dataset. Although the accuracy score is higher when the data is oversampled, the ROC value (61.18%) is significantly lower, indicating strong misclassification of classes.
SMOTE (Synthetic Minority Oversampling Technique) is another oversampling technique. Instead of duplicating existing data points, it synthesizes new samples for the minority class.
SMOTE works by drawing new samples that lie close to existing minority samples in the feature space. SMOTE first selects a random data point a from the minority class. Then it finds the k nearest neighbours of a and picks one of them, b. A synthetic data instance is then created as a convex combination of the chosen data point and this neighbour. The approach is effective because the new synthetic data points are close to the selected one in the feature space. The SMOTE class, provided by the imbalanced-learn library, acts like a data transform object from the scikit-learn library. It must be imported and defined, fit on a dataset, then applied to create a new transformed version of the dataset.
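Random oversampling is the mirror image of random undersampling. In this sketch the function name random_oversample and the toy data are assumptions for illustration; label 1 is taken to be the minority class.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows (label 1) until both
    classes have the same number of rows."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_extra = max(0, len(majority_idx) - len(minority_idx))
    # Sampling with replacement: the extra rows are exact copies.
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

# Toy data: 10 positives and 90 negatives.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 10 + [0] * 90)

X_res, y_res = random_oversample(X, y)
n_distinct_minority = len(np.unique(X_res[y_res == 1], axis=0))
print(X_res.shape, int(y_res.sum()), n_distinct_minority)
```

The class counts are now equal, but the number of distinct minority rows has not changed, which is where the overfitting risk comes from.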
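The interpolation step can be sketched with NumPy. This is a simplified illustration of the idea, not the imbalanced-learn implementation; the function name smote_sketch and the toy points are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise squared distances; a point is never its own neighbour.
    dist = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                 # random minority point a
        b = neighbours[a, rng.integers(k)]  # one of its k nearest neighbours
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synthetic

# Toy minority class: the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(X_min, n_new=6, k=2)
print(new_points)
```

In practice the SMOTE class from imblearn.over_sampling would normally be used instead, via its fit_resample method.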
Bagging With Random Undersampling
Ensembling is a machine learning technique in which many models are combined to form a robust model. The models being combined are known as weak learners, which, when combined, produce a stronger and more effective model.
What is a weak learner?
In machine learning, two properties that play a very important role in model building are variance and bias. The bias-variance tradeoff means we do not want our model to have too many degrees of freedom, so as to avoid high variance and keep the model robust. Basic models do not perform very well on their own because of high variance or high bias, and are thus known as weak learners. To get the balance of variance and bias right, these weak learners are combined to create a strong learner that achieves better performance. When homogeneous weak learners are trained independently, in parallel, and then combined, they produce an averaged outcome. This ensembling technique is bagging. The focus of bagging is to obtain an ensemble model with less variance.
There are several ways to use bagging to handle class imbalance.
The most commonly used technique is resampling within bagging, such as overbagging the minority class or underbagging the majority class. There is also a method that combines both of the above: overunderbagging.
In ensemble classifiers, bagging methods build several estimators on different randomly selected subsets of the data. Most of these methods tilt towards the majority class when trained on an imbalanced dataset. However, imblearn.ensemble provides the class BalancedBaggingClassifier, in which each bootstrap sample is further resampled to achieve the desired sampling strategy. The resampling behaviour of this classifier is controlled by two parameters, sampling_strategy and replacement. The classifier performs random undersampling of the majority class before fitting each decision tree, improving the performance of the model on an imbalanced dataset. After applying BalancedBaggingClassifier to the imbalanced loan dataset, the model performance improved to an ROC score of 77.61%.
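The idea of bagging with random undersampling can be sketched by hand. This is a simplified illustration built from scikit-learn decision trees, not the imbalanced-learn implementation of BalancedBaggingClassifier; the function names and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_bagging(X, y, n_estimators=10, seed=0):
    """Fit each tree on a balanced sample: a bootstrap of the minority
    class (label 1) plus an equally sized random draw of majority rows."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        boot_min = rng.choice(minority_idx, size=len(minority_idx), replace=True)
        boot_maj = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        idx = np.concatenate([boot_min, boot_maj])
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
        models.append(tree.fit(X[idx], y[idx]))
    return models

def predict_proba_positive(models, X):
    """Average the positive-class probability over all trees."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Synthetic imbalanced data (assumption): roughly 5% positives.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
models = fit_balanced_bagging(X, y)
proba = predict_proba_positive(models, X)
print(len(models), proba.shape)
```

Every tree sees a different balanced view of the data, and averaging their outputs is what brings the variance of the ensemble down.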
- Choose evaluation metrics carefully when dealing with classification. The evaluation metrics should be chosen keeping in mind the actual GOAL of the problem.
- Create new features to enrich the dataset; informative features carry more weight during training and help minimize the tilt towards the majority class.
- Set up cross-validation before sampling the data, and resample only the training folds, so that each validation fold keeps the original class distribution; this also minimizes the chances of overfitting.
- Consider cost-sensitive learning on imbalanced classes before resampling the data. This may not increase accuracy, but it targets lower prediction costs.
- Before applying sampling techniques to an imbalanced dataset, try considering these questions: Should the classes be rebalanced to contain the same proportion of data? If not, what should the proportion of the positive class be? Is the sample a true representation of the data? Should the majority class stay represented, or does the problem pose another goal?
- Sampling methods can be used to alter the balance of the dataset, but they must be applied with caution, keeping in mind the outcome expected from the algorithm.
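The tip about combining cross-validation with sampling can be sketched as follows: the folds are created first, and only the training part of each fold is undersampled. The synthetic data and the choice of LogisticRegression are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data (assumption): roughly 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    # Undersample the majority class inside the training fold only.
    pos = train_idx[y[train_idx] == 1]
    neg = train_idx[y[train_idx] == 0]
    neg_kept = rng.choice(neg, size=len(pos), replace=False)
    fit_idx = np.concatenate([pos, neg_kept])

    model = LogisticRegression(max_iter=1000).fit(X[fit_idx], y[fit_idx])

    # Evaluate on the untouched fold, which keeps the original imbalance.
    proba = model.predict_proba(X[test_idx])[:, 1]
    scores.append(roc_auc_score(y[test_idx], proba))

print([round(s, 3) for s in scores])
```

Resampling before splitting would let copies or near-copies of the same points land on both sides of the split, which makes the validation score look better than it really is.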
Thanks for your patience!