
Machine Learning Imbalanced Data Part 2

The project focuses on a machine learning exercise based on an imbalanced dataset. Knowledge of the particularities of imbalanced datasets, common scoring methods, tools and nomenclature is assumed in this part. If an introduction to imbalanced datasets is needed, the link below is a good starting point before reading this project.

Part 1 focused on introducing the main challenges of an imbalanced dataset, such as selecting a proper scoring method and performing a model evaluation:

Part 2 focuses on cross validation and grid search using a pipeline to optimize AUC for model tuning, and on interpreting the results:

As a reference, the work is done with Python and scikit-learn, and the dataset is an open dataset from Kaggle. The code and dataset can be found in the links below.


The data is very extensive, containing information for almost 300k transactions with a total of 30 features, plus a target output class indicating whether the transaction was a fraud or not. However, there are only 492 fraud transactions, which corresponds to 0.172% of the cases.

Let's proceed to select different supervised machine learning algorithms potentially suitable for the current dataset. Memory usage and computational time are important factors to consider in a dataset of almost 300k transactions with 30 features each (close to 9 million data points). Therefore, it is not feasible to select all algorithms and run a grid search cross validation with a large number of parameter combinations, since the computational time would be excessive. With this number of features and samples, linear models such as Linear Support Vector Classification (Linear SVC) and Logistic Regression tend to work well. Additionally, Gaussian Naive Bayes might also be a good fit here. Finally, the Decision Tree algorithm will also be used. Other more complex algorithms such as Random Forest, Gradient Boosting, Support Vector Machines or Neural Networks are discarded due to their complexity. The expectation is to have good performance with the linear models, slightly inferior performance with Naive Bayes but with faster execution, and reasonable but also inferior performance with the Decision Tree.

The GridSearchCV function is run in a pipeline with two steps: scaling preprocessing and the algorithm classifier. However, as explained in part 1 of the project, this data already comes scaled because PCA was applied by the data owner for confidentiality reasons, so the scaling step will be ignored.

The code for the grid search model parametrization is as follows. The pipeline cases are defined in the algorithm and scale lists. Each position corresponds to one pipeline configuration, meaning that the first run, for example, will use the algorithm defined in algorithm[0] and the scaler defined in scale[0].

To avoid scaling, the scale list is simply defined as an empty list so that the cross_grid_validation function ignores the preprocessing step. Regarding the grid parametrization, the C parameter of the linear models is swept to control regularization (a higher C is more likely to produce a more complex model and lead to overfitting). Gaussian Naive Bayes has no parameters to tune. The Decision Tree sweeps the max_depth parameter, where higher values are more likely to overfit the model. The model and scaler estimators are left empty since they will be automatically filled in when running the cross_grid_validation function using the information in the algorithm and scale lists.

One important remark is that the grid search is defined with ROC AUC scoring, since accuracy was demonstrated not to be useful for imbalanced datasets.
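Since the original code is shown as figures in the published post, the snippet below is only a minimal sketch of an equivalent setup written directly with scikit-learn: one pipeline per candidate algorithm, an AUC-scored grid search, and the scaling step omitted. The parameter ranges and the helper structure are assumptions, not the project's exact code.

```python
# Hedged sketch: pipeline-based grid search scored with ROC AUC.
# Parameter ranges are illustrative assumptions, not the project's exact sweeps.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# One (classifier, parameter grid) pair per pipeline case. The scaling step is
# left out because the features are already PCA-transformed by the data owner.
cases = [
    (LinearSVC(), {"model__C": [1e-5, 1e-4, 1e-3, 1e-2]}),
    (LogisticRegression(max_iter=1000), {"model__C": [1e-3, 1e-2, 1e-1, 1]}),
    (GaussianNB(), {}),                        # no hyperparameters to tune
    (DecisionTreeClassifier(), {"model__max_depth": [2, 3, 5, 10]}),
]

def run_grid_searches(X_train, y_train):
    """Run one ROC AUC grid search per candidate algorithm."""
    results = {}
    for model, param_grid in cases:
        pipe = Pipeline([("model", model)])    # scaling step intentionally omitted
        grid = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
        grid.fit(X_train, y_train)
        results[type(model).__name__] = (grid.best_score_, grid.best_params_)
    return results
```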
The create_model and create_scaler functions are shown below. They create a model for the selected algorithm/scaler with no parametrization, since that is filled in during the grid search according to the defined grid parametrization.
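The helper functions themselves are also shown as figures in the post; the following is a plausible sketch of what they could look like, under the assumption that they simply map a name to an unparametrized estimator. The supported names are assumptions.

```python
# Hedged sketch of create_model / create_scaler: return a default estimator for
# the requested name; the grid search fills in hyperparameters afterwards.
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler

def create_model(algorithm):
    """Return an unparametrized estimator for the given algorithm name."""
    models = {
        "linear svc": LinearSVC(),
        "logistic regression": LogisticRegression(),
        "naive bayes": GaussianNB(),
        "decision tree": DecisionTreeClassifier(),
    }
    return models[algorithm.lower()]

def create_scaler(scaler):
    """Return an unparametrized scaler for the given scaler name."""
    scalers = {
        "standard": StandardScaler(),
        "minmax": MinMaxScaler(),
    }
    return scalers[scaler.lower()]
```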
The results for each algorithm are depicted in the colormesh plots below. The sweep parametrization is correct since the highest value is always in the middle of the matrix and not at the extremes. Having the maximum value at one of the extremes or corners would raise the doubt of whether a better parametrization is possible by extending the sweep range. Instead of running a single grid search with lots of parameters, which would be very demanding in memory and computational time, it is common practice to run several smaller grid searches until the proper parameter range to assess is found.
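Such a colormesh can be produced from the cv_results_ attribute of a fitted grid search. The sketch below assumes a two-parameter sweep purely for illustration (the project's exact parameter pairs are not reproduced) and builds the score matrix explicitly so that it does not depend on the candidate ordering.

```python
# Hedged sketch: colormesh of mean cross-validation AUC over a two-parameter grid.
import numpy as np
import matplotlib.pyplot as plt

def plot_grid_scores(grid, param_x, param_y):
    """Plot mean CV AUC from a fitted GridSearchCV over two swept parameters."""
    x_values = list(grid.param_grid[param_x])
    y_values = list(grid.param_grid[param_y])
    scores = np.zeros((len(y_values), len(x_values)))
    # Fill the matrix from cv_results_ to avoid assumptions on candidate order.
    for params, score in zip(grid.cv_results_["params"],
                             grid.cv_results_["mean_test_score"]):
        scores[y_values.index(params[param_y]),
               x_values.index(params[param_x])] = score
    plt.pcolormesh(scores, cmap="viridis")
    plt.colorbar(label="mean CV AUC")
    plt.xticks(np.arange(len(x_values)) + 0.5, x_values)
    plt.yticks(np.arange(len(y_values)) + 0.5, y_values)
    plt.xlabel(param_x)
    plt.ylabel(param_y)
    plt.show()
```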

The best cross validation AUC scores are summarized below.

LINEAR SVC --> 0.9822 with C = 0.0001
LOGISTIC REGRESSION --> 0.9852 with C = 0.01
NAIVE BAYES --> 0.9619
DECISION TREE --> 0.9154 with max_depth = 3

The results are consistent with expectations: linear models tend to behave very well on large, high dimensional datasets. Naive Bayes does too, and tends to be faster in exchange for a slight reduction in score. The Decision Tree provides reasonable performance, but in a lower range compared to the other algorithms.

However, the definitive score must always be measured against the isolated testing dataset, which, as discussed earlier, is not perfectly isolated in this particular dataset due to the common scaling applied for PCA. That being said, the results against the testing set are as follows. They show the same trend as the cross validation results, but with all values slightly reduced, which is a sign of potential minor overfitting in the models.

LINEAR SVC --> 0.9768 with C = 0.0001
LOGISTIC REGRESSION --> 0.978 with C = 0.01
NAIVE BAYES --> 0.9547
DECISION TREE --> 0.9028 with max_depth = 3
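For reference, these test-set figures could be obtained along the following lines. This is only a sketch, not the project's exact evaluation code; the fitted grid-search object and the train/test split are assumed from earlier.

```python
# Hedged sketch: test-set ROC AUC for a tuned model, using predict_proba when
# available (e.g. Logistic Regression) or decision_function otherwise (Linear SVC).
from sklearn.metrics import roc_auc_score

def test_auc(fitted_grid, X_test, y_test):
    """Return the test-set ROC AUC of the best estimator found by the grid search."""
    estimator = fitted_grid.best_estimator_
    if hasattr(estimator, "predict_proba"):
        scores = estimator.predict_proba(X_test)[:, 1]
    else:
        scores = estimator.decision_function(X_test)
    return roc_auc_score(y_test, scores)
```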
Let's plot the ROC curves for the optimal configuration of each of the four models. Remember that the ideal point is the top left corner, the circles refer to the performance at the default threshold of predict_proba or decision_function, and the triangles refer to the optimal threshold of predict_proba or decision_function. Looking at the four ROC curves, the Decision Tree (purple) is the worst one. Naive Bayes (green) works better, and finally both linear models (blue and red) offer the best performance, with the Logistic Regression model being slightly better. Therefore, both Logistic Regression with C = 0.01 and Linear SVC with C = 0.0001 look like the best model parametrizations for the current dataset.
It can be observed that a higher AUC leads to overall better performance and a better match with the dataset trend and variance. However, having a higher AUC does not mean the ROC curve is superior to a lower AUC curve at every point. For example, the Decision Tree ROC curve is better than Naive Bayes for very small FPR values, and Linear SVC is better than Logistic Regression for FPR around 0.15. Therefore, there might be some particular threshold at which a lower AUC model outperforms a higher AUC model. However, that would be true only for that particular condition since, as mentioned earlier, the higher AUC model captures the dataset trend and variance better overall.
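A sketch of how one such curve with both operating points could be drawn is shown below. The "optimal" threshold is taken here as the point maximizing TPR minus FPR, which is one common choice and an assumption about the exact criterion used in the project; the default threshold would be 0.5 for predict_proba and 0 for decision_function.

```python
# Hedged sketch: ROC curve with the default operating point (circle) and an
# "optimal" operating point (triangle), assumed to maximize TPR - FPR.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(y_test, scores, label, default_threshold):
    """Plot one ROC curve and mark the default and optimal operating points."""
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label=f"{label} (AUC = {roc_auc_score(y_test, scores):.4f})")
    # Default operating point: threshold closest to 0.5 or 0, depending on the model.
    default_idx = np.argmin(np.abs(thresholds - default_threshold))
    plt.plot(fpr[default_idx], tpr[default_idx], "o")
    # "Optimal" operating point: threshold maximizing TPR - FPR.
    optimal_idx = np.argmax(tpr - fpr)
    plt.plot(fpr[optimal_idx], tpr[optimal_idx], "^")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
```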

As an additional exercise, the ROC curves for different C values of the Logistic Regression model are compared. It can be observed how the AUC increases when reducing C, since the model is less overfitted, which improves the testing performance. Reducing C too much would make the model too simple, also leading to poor testing results, as demonstrated above. However, the conclusion is that the sensitivity to C is low, since no major change in AUC occurs when C changes.

AUC C = 1 --> 0.9678
AUC C = 0.1 --> 0.9705
AUC C = 0.01 --> 0.978
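This comparison could be reproduced along the following lines. It is only a sketch: the train/test arrays are assumed from the earlier split, and the C values match the sweep above.

```python
# Hedged sketch: test-set ROC curves of Logistic Regression for the three C values.
# X_train, X_test, y_train, y_test are assumed from the earlier train/test split.
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

for C in [1, 0.1, 0.01]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=f"C = {C} (AUC = {roc_auc_score(y_test, probs):.4f})")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```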
The same exercise is repeated for the Decision Tree model with max_depth values of 3, 10 and 20. Higher values of max_depth lead to model overfit, and thus poor testing set performance. Therefore, when max_depth is reduced, the AUC increases. The sensitivity of the AUC to max_depth in the Decision Tree looks higher than that of the Logistic Regression to the parameter C.

AUC max_depth = 3 --> 0.9028
AUC max_depth = 10 --> 0.8788
AUC max_depth = 20 --> 0.8546
As a summary, the table below depicts all the cases analyzed in the project, showing the AUC and both the TPR and FPR ratios for the default and optimal threshold scenarios. The default case is just for reference, since the model should be tuned to the optimal threshold to take advantage of a more robust and better modeling, and in turn, more accurate predictions. The most promising models (Logistic Regression with C = 0.01 and Linear SVC with C = 0.0001) have very similar performance. Looking into the details of both models, Logistic Regression with C = 0.01 has TPR = 0.8673 and FPR = 0.0096, whereas Linear SVC has a higher TPR (0.9082) but also a higher FPR (0.05). Therefore, the trade-off for the definitive model selection is whether increasing the TPR by 4% is worth also increasing the FPR by 5%, or in other words, whether reducing the false negative cases by 4% compensates for increasing the false positive cases by 5%.
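For reference, the TPR and FPR figures quoted in the table could be computed from a confusion matrix at a given threshold, as in the sketch below; the score vector and threshold are assumed to come from the ROC analysis above.

```python
# Hedged sketch: TPR and FPR at a chosen score threshold.
from sklearn.metrics import confusion_matrix

def tpr_fpr_at_threshold(y_test, scores, threshold):
    """Return (TPR, FPR) for predictions made at the given score threshold."""
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return tp / (tp + fn), fp / (fp + tn)
```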

The above trade-off decision is fully business case related. For example, in a cancer detection model it would look pretty straightforward to accept the extra cost of that 5% of additional false positives in order to reduce the false negatives by 4%, since it might translate into saving more lives. In the current application of fraud transactions, that 4% reduction in false negatives in exchange for a 5% increase in false positives also looks acceptable, since more fraud transactions would be identified at the cost of more workload checking additional transactions that end up being confirmed as non-fraudulent. Therefore, with that trade-off decision more money would be in safe hands, and the final recommendation is to use the Linear SVC model with the parameter C equal to 0.0001.
I appreciate your attention and I hope you find this work interesting.

Luis Caballero