Getting a low ROC AUC score but a high accuracy
Understanding the Discrepancy Between ROC AUC and Accuracy
When evaluating classification models, we often rely on several metrics to paint an accurate picture of the model's performance. Among these metrics, accuracy and the Area Under the Receiver Operating Characteristic Curve (ROC AUC) are widely used. However, scenarios exist where a model might exhibit high accuracy but a low ROC AUC score, leading to possible confusion about the model’s effectiveness. This article will dissect why this discrepancy can occur and how it impacts model evaluation.
Accuracy: A Basic Overview
Accuracy is the simplest and most intuitive metric for classification models. It is defined as the ratio of the number of correct predictions to the total number of predictions made:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
While easy to understand, accuracy can be misleading, especially on imbalanced datasets. For instance, if 90% of your dataset consists of one class, a naive model that always predicts the majority class will reach 90% accuracy yet have no meaningful predictive power on minority-class examples.
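The majority-class example above can be reproduced in a few lines. This is a minimal sketch on made-up labels (90 of one class, 10 of the other); the helper names `accuracy` and `recall` are illustrative, not taken from any library.

```python
# Hypothetical toy labels: 90 examples of class 0 and 10 of class 1.
y_true = [0] * 90 + [1] * 10

# A naive "model" that always predicts the majority class (0).
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


baseline_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred)
print(f"accuracy={baseline_accuracy:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.90, minority recall=0.00
```

The accuracy looks respectable, while recall on the minority class shows that the model never finds a single positive example.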
ROC AUC: A More Nuanced View
The ROC AUC score evaluates the quality of a model's predictions over all possible classification thresholds. It measures the model's capability to distinguish between classes and is particularly useful for binary classification problems. Higher values indicate better performance, with a score of 1.0 representing a perfect model and a score of 0.5 indicating performance no better than random guessing.
The ROC curve itself is a plot of the true positive rate (TPR) against the false positive rate (FPR) at various threshold levels:
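• True Positive Rate (TPR), also known as recall or sensitivity: $\text{TPR} = \frac{TP}{TP + FN}$
• False Positive Rate (FPR): $\text{FPR} = \frac{FP}{FP + TN}$
Here TP, FP, TN, and FN are the counts of true positives, false positives, true negatives, and false negatives at a given threshold.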
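To make the two rates concrete, the sketch below computes one point of the ROC curve per threshold. The function name `roc_point`, the toy labels, and the scores are assumptions made for illustration; an example is predicted positive when its score is greater than or equal to the threshold.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (fp + tn)  # share of actual negatives that were flagged
    return fpr, tpr


# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Sweeping the threshold from high to low traces the curve from (0, 0) to (1, 1).
roc_points = [roc_point(y_true, scores, t) for t in (0.9, 0.5, 0.3, 0.0)]
print(roc_points)
# prints: [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Plotting these (FPR, TPR) pairs and measuring the area underneath gives the ROC AUC.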
The Discrepancy: High Accuracy, Low ROC AUC
Imbalanced Datasets
On imbalanced datasets it is not uncommon to see high accuracy alongside a low ROC AUC. Accuracy is dominated by the majority class: predicting that class almost everywhere is enough to score well. ROC AUC instead measures how well the model's scores separate positives from negatives across all thresholds, so a model that cannot rank positives above negatives scores poorly no matter how large the majority class is.
Example Scenario
Consider a binary classification problem where 95% of the examples belong to the negative class, and 5% belong to the positive class. Suppose a model predicts every observation as the negative class. Here’s the breakdown:
• Accuracy: 95%, because every majority-class example is predicted correctly.
• TPR: 0. The model fails to predict any positive instances.
• FPR: 0. No negative instances are incorrectly predicted as positive.
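In this scenario accuracy is high, but the ROC AUC is 0.5: a constant prediction gives every example the same score, so the ROC curve collapses to the diagonal from (0, 0) to (1, 1) and the model shows no ability to discriminate between classes.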
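The scenario can be checked numerically. The sketch below uses the rank interpretation of ROC AUC: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the constant score of 0.0 are illustrative assumptions; the pairwise loop is written for clarity, not speed.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# 95% negative, 5% positive, as in the scenario above.
y_true = [0] * 95 + [1] * 5

# The model outputs the same score for everything, i.e. "always negative".
constant_scores = [0.0] * len(y_true)
y_pred = [1 if s >= 0.5 else 0 for s in constant_scores]

model_accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
model_auc = roc_auc(y_true, constant_scores)
print(f"accuracy={model_accuracy:.2f}, ROC AUC={model_auc:.2f}")
# prints: accuracy=0.95, ROC AUC=0.50
```

Every positive ties with every negative, so each pair contributes one half and the AUC lands on 0.5.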
Threshold Insensitivity
Accuracy is computed at a single fixed threshold (often 0.5 for models that output probabilities, such as logistic regression), while ROC AUC summarizes performance across all thresholds. High accuracy may simply mean that this one threshold happens to suit the class distribution; a low ROC AUC indicates that the underlying scores rank examples poorly, so performance degrades at other decision thresholds.
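A small numeric sketch of this effect follows. The scores are invented for illustration: every score sits below 0.5, so the default threshold labels everything negative, yet most positives are scored lower than the negatives. `accuracy_at` and `roc_auc` are hypothetical helper names.

```python
def accuracy_at(y_true, scores, threshold):
    """Accuracy when scores >= threshold are predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: 95 negatives and 5 positives, all scored below 0.5.
neg_scores = [0.4] * 60 + [0.3] * 35
pos_scores = [0.1, 0.2, 0.3, 0.35, 0.45]
y_true = [0] * len(neg_scores) + [1] * len(pos_scores)
scores = neg_scores + pos_scores

acc_default = accuracy_at(y_true, scores, 0.5)   # everything predicted negative
acc_lowered = accuracy_at(y_true, scores, 0.35)  # threshold moved into the score range
auc = roc_auc(y_true, scores)
print(f"accuracy@0.5={acc_default:.2f}, accuracy@0.35={acc_lowered:.2f}, AUC={auc:.2f}")
# prints: accuracy@0.5=0.95, accuracy@0.35=0.37, AUC=0.31
```

The 0.95 accuracy exists only at the one threshold where the model never predicts the positive class. An AUC below 0.5 means the scores rank positives below negatives more often than not, which is what the accuracy figure hides.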
Techniques to Reconcile Discrepancies
• Class Re-balancing: Adjust training data so the model doesn't become biased towards the majority class. Methods include oversampling, undersampling, or generating synthetic samples of the minority class.
• Evaluation Metrics: Focus on metrics such as F1-score, precision, recall, and specifically ROC AUC in imbalanced scenarios.
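• Customized Thresholds: Tailor the decision threshold to find a good balance between sensitivity and specificity. This changes threshold-dependent metrics such as accuracy, precision, and recall. ROC AUC itself is unaffected, because it already summarizes every threshold, so a low AUC can only be raised by a model that produces better scores.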
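One common way to pick a custom threshold is to maximize Youden's J statistic (TPR minus FPR), which is the same as maximizing sensitivity plus specificity. The article does not prescribe a selection rule, so treat this as one possible choice; the data and the helper names `rates` and `best_threshold_by_youden` are illustrative assumptions.

```python
def rates(y_true, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)


def best_threshold_by_youden(y_true, scores):
    """Try every distinct score as a threshold and keep the one maximizing TPR - FPR."""
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tpr, fpr = rates(y_true, scores, t)
        j = tpr - fpr
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical validation labels and scores: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.45, 0.3, 0.4]

best_threshold, best_j = best_threshold_by_youden(y_true, scores)
default_tpr, default_fpr = rates(y_true, scores, 0.5)
tuned_tpr, tuned_fpr = rates(y_true, scores, best_threshold)
print(f"default 0.5: TPR={default_tpr}, FPR={default_fpr}")
print(f"tuned {best_threshold}: TPR={tuned_tpr}, FPR={tuned_fpr}, J={best_j}")
```

On this toy data the default threshold of 0.5 catches no positives, while the tuned threshold catches both at the cost of some false positives. In practice the threshold should be chosen on a validation set, not on the data used for the final evaluation.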
Key Takeaways
Understanding the dynamics between accuracy and ROC AUC can help mitigate misconceptions regarding model performance. Here’s a summary of the key points:
| Metric | Definition | Use Case | Limitations |
| Accuracy | Ratio of correct predictions to total predictions | Balanced datasets; simple to compute and explain | Misleading on imbalanced data; depends on a single threshold |
| ROC AUC | Area under the ROC curve (TPR against FPR) | Threshold-independent measure of how well the model ranks positives above negatives | Less intuitive to interpret; does not reflect performance at the chosen operating threshold |
In conclusion, always evaluate the context and the specific problem characteristics when choosing the right metric. Balancing the strengths and weaknesses of different evaluation measures is critical for building reliable and robust machine learning models.

