Getting a low ROC AUC score but a high accuracy

Understanding the Discrepancy Between ROC AUC and Accuracy

When evaluating classification models, we often rely on several metrics to paint an accurate picture of the model's performance. Among these metrics, accuracy and the Area Under the Receiver Operating Characteristic Curve (ROC AUC) are widely used. However, scenarios exist where a model might exhibit high accuracy but a low ROC AUC score, leading to possible confusion about the model’s effectiveness. This article will dissect why this discrepancy can occur and how it impacts model evaluation.

Accuracy: A Basic Overview

Accuracy is the simplest and most intuitive metric for classification models. It is defined as the ratio of the number of correct predictions to the total number of predictions made:

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

While easy to understand, accuracy can be misleading, especially in imbalanced datasets. For instance, if 90% of your dataset consists of one class, a naive model that predicts the majority class all the time will have 90% accuracy but lacks meaningful predictive power on minority class examples.
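The 90% example can be checked with a few lines of plain Python. This is a minimal sketch on a made-up 100-row dataset; the helper names `accuracy` and `recall` are illustrative, not taken from any library.

```python
# Hypothetical dataset: 90 negatives (label 0) and 10 positives (label 1).
y_true = [0] * 90 + [1] * 10

# A naive "model" that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


baseline_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred)
print(f"accuracy={baseline_accuracy:.2f}, minority recall={minority_recall:.2f}")
```

The baseline scores 0.90 on accuracy while finding none of the minority-class examples.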

ROC AUC: A More Nuanced View

The ROC AUC score evaluates the quality of a model's predictions over all possible classification thresholds. It measures the model's capability to distinguish between classes and is particularly useful for binary classification problems. Higher values indicate better performance, with a score of 1.0 representing a perfect model and a score of 0.5 indicating performance no better than random guessing.

The ROC curve itself is a plot of the true positive rate (TPR) against the false positive rate (FPR) at various threshold levels:

True Positive Rate (TPR), also known as recall or sensitivity: $\text{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$

False Positive Rate (FPR): $\text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}}$
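To make the two rates concrete, the sketch below sweeps every distinct score as a threshold, records the resulting (FPR, TPR) pairs, and integrates them with the trapezoid rule. The labels and scores are a toy example, and `roc_points` and `auc_trapezoid` are hypothetical helper names; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, starting at (0, 0), one per distinct threshold.

    Assumes y_true contains both classes, encoded as 0 and 1.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    # Lowering the threshold step by step moves along the curve toward (1, 1).
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: two negatives and two positives with made-up scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = auc_trapezoid(points)
print(points)
print(f"ROC AUC = {roc_auc:.2f}")
```

One positive (0.35) is scored below one negative (0.4), so three of the four positive/negative pairs are ranked correctly and the area comes out at 0.75.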

The Discrepancy: High Accuracy, Low ROC AUC

Imbalanced Datasets

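When dealing with imbalanced datasets, it is not uncommon to encounter models with high accuracy and low ROC AUC. This occurs because accuracy is dominated by the majority class: a model can score well simply by getting most majority-class examples right. ROC AUC, by contrast, reflects how well the model's scores separate positive from negative examples across different thresholds, so poor performance on the minority class cannot hide behind the majority.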

Example Scenario

Consider a binary classification problem where 95% of the examples belong to the negative class, and 5% belong to the positive class. Suppose a model predicts every observation as the negative class. Here’s the breakdown:

• Accuracy: 95%, because every majority-class example is predicted correctly.
• TPR: 0. The model fails to predict any positive instances.
• FPR: 0. No negative instances are incorrectly predicted as positive.

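In this scenario, accuracy is high, but every example receives the same prediction, so the model cannot rank any positive above any negative. The ROC AUC is therefore 0.5, the same as random guessing, revealing no capacity to discriminate between classes.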
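The scenario can be reproduced numerically. The sketch below assumes the all-negative model emits a constant score of 0.0 for every example, and it computes ROC AUC through its pairwise-ranking interpretation: the fraction of (positive, negative) pairs in which the positive scores higher, with ties counted as half. `pairwise_auc` is an illustrative helper, not a library function.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of positive/negative pairs ranked correctly.

    Ties count as half a correct ranking. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# 95 negatives and 5 positives, as in the scenario above.
y_true = [0] * 95 + [1] * 5

# Assumption: the model outputs the same score for everything,
# so thresholding at 0.5 labels every example as negative.
scores = [0.0] * len(y_true)
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
auc = pairwise_auc(y_true, scores)
print(f"accuracy={acc:.2f}, ROC AUC={auc:.2f}")
```

Every pair is a tie, so each contributes one half and the AUC lands at 0.5 while accuracy stays at 0.95.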

Threshold Insensitivity

Accuracy is computed at a single fixed threshold (often 0.5 for logistic regression), while ROC AUC summarizes performance across all thresholds. High accuracy can arise when that one threshold happens to suit the class distribution, for example by labeling nearly everything as the majority class. A low ROC AUC indicates that the underlying scores rank positives and negatives poorly, whichever threshold is chosen.
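A small sketch of this threshold dependence follows. The ten labels and scores are invented for illustration, and `accuracy_at` and `pairwise_auc` are hypothetical helpers: the first applies a cutoff and measures accuracy, the second computes ROC AUC from pairwise rankings without reference to any cutoff.

```python
def accuracy_at(threshold, y_true, scores):
    """Accuracy when scores >= threshold are labeled positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def pairwise_auc(y_true, scores):
    """Share of positive/negative pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up data: 8 negatives, 2 positives, all scores below 0.5.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.12, 0.22]

acc_default = accuracy_at(0.5, y_true, scores)  # everything labeled negative
acc_lowered = accuracy_at(0.2, y_true, scores)  # a different cutoff
auc = pairwise_auc(y_true, scores)              # no cutoff involved

print(f"accuracy@0.5={acc_default:.2f}, accuracy@0.2={acc_lowered:.2f}, AUC={auc:.3f}")
```

Accuracy swings from 0.80 to 0.40 as the cutoff moves, while the AUC of 0.375 stays put and shows that the scores order the two classes worse than chance.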

Techniques to Reconcile Discrepancies

Class Re-balancing: Adjust training data so the model doesn't become biased towards the majority class. Methods include oversampling, undersampling, or generating synthetic samples of the minority class.

Evaluation Metrics: In imbalanced scenarios, report F1-score, precision, recall, and ROC AUC alongside accuracy rather than relying on accuracy alone.

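Customized Thresholds: Tailor the decision threshold to find a good balance between sensitivity and specificity. This changes accuracy, precision, and recall at the chosen operating point. ROC AUC itself does not change, because it already summarizes every threshold, so a low AUC has to be addressed by improving the model's scores rather than by moving the cutoff.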
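As one possible way to tailor the threshold, the sketch below picks the cutoff that maximizes Youden's J statistic (TPR minus FPR). That criterion is an assumption made here for illustration; the right trade-off between sensitivity and specificity depends on the costs in the actual problem. The data and the helper `best_threshold_by_youden` are likewise hypothetical.

```python
def best_threshold_by_youden(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR over observed scores.

    Examples with score >= threshold are labeled positive.
    Assumes y_true contains both classes, encoded as 0 and 1.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_thr, best_j = None, float("-inf")
    for thr in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_thr, best_j = thr, j
    return best_thr, best_j


# Made-up validation data: 8 negatives followed by 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.60, 0.40, 0.70]

best_thr, best_j = best_threshold_by_youden(y_true, scores)
print(f"best threshold={best_thr}, J={best_j:.2f}")
```

On this toy data the default cutoff of 0.5 catches only one of the two positives, whereas the selected cutoff of 0.4 catches both at the price of two false positives. The threshold should be chosen on validation data, not on the test set used for the final evaluation.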

Key Takeaways

Understanding the dynamics between accuracy and ROC AUC can help mitigate misconceptions regarding model performance. Here’s a summary of the key points:

| Metric | Definition | Use Case | Limitations |
| --- | --- | --- | --- |
| Accuracy | Ratio of correct predictions to total predictions | Good for balanced datasets; simple to compute | Misleading in imbalanced settings; threshold-dependent |
| ROC AUC | Area under the ROC curve | Evaluates all thresholds; measures discriminative ability | Can be harder to interpret; can look optimistic when the positive class is rare |

In conclusion, always evaluate the context and the specific problem characteristics when choosing the right metric. Balancing the strengths and weaknesses of different evaluation measures is critical for building reliable and robust machine learning models.

