Macro VS Micro VS Weighted VS Samples F1 Score
Understanding Different Types of F1 Scores
In the domain of classification problems, especially when dealing with imbalanced datasets, simply measuring accuracy can be misleading. A better approach is to use the F1 score, a metric that considers both precision and recall to provide a single value indicating the quality of a model's predictions. However, in multi-class and imbalanced scenarios, a singular F1 score might not suffice, leading to the advent of various F1 score variants such as Macro, Micro, Weighted, and Samples F1 scores. Each of these scores offers unique insights into the performance of a classification model.
Technical Background
Before delving into variants, let's lay the foundation with basic definitions:
- Precision measures the fraction of predicted positive observations that are truly positive. With $TP$ true positives and $FP$ false positives, it's calculated as: $\text{Precision} = \frac{TP}{TP + FP}$
- Recall (also known as Sensitivity) measures the fraction of actual positive observations that the model captured by labelling them as positive. With $FN$ false negatives, it's: $\text{Recall} = \frac{TP}{TP + FN}$
- F1 Score is the harmonic mean of precision and recall, so both metrics are weighted equally: $F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
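As a minimal sketch of these three definitions, the snippet below computes them from raw counts. The function name `precision_recall_f1` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than part of the definitions.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Assumed convention: any ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.667 f1=0.727
```

Because the harmonic mean is pulled toward the smaller value, the F1 of 0.727 sits closer to the recall of 0.667 than the arithmetic mean (0.733) would.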
Variants of F1 Score
1. Macro F1 Score
Macro F1 Score computes the F1 score independently for each class and then averages the values. This treats all classes equally without taking into account their frequency in the dataset.
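- Use Case: Useful when all classes matter equally. With balanced classes it summarizes overall performance; with imbalanced classes it keeps rare classes from being drowned out by frequent ones.
- Calculation: For $C$ classes, where $F1_i$ is the F1 score of class $i$: $\text{Macro-F1} = \frac{1}{C} \sum_{i=1}^{C} F1_i$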
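The macro average can be sketched in plain Python as follows. The helper names `per_class_f1` and `macro_f1` and the toy labels are hypothetical; each class is scored one-vs-rest, and a class with no true or predicted instances is assumed to score 0.0.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, scoring each label one-vs-rest."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: class 0 is frequent, class 2 appears once and is never predicted.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]

print(per_class_f1(y_true, y_pred))  # {0: 0.75, 1: 0.8, 2: 0.0}
print(round(macro_f1(y_true, y_pred), 3))  # 0.517
```

The single missed example of class 2 removes a full third of the achievable score, which is exactly the "every class counts equally" behaviour described above.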
2. Micro F1 Score
Micro F1 Score calculates the global precision and recall by considering the sum of true positives, false negatives, and false positives across all classes. It gives equal weight to each instance.
- Use Case: Effective when focusing on average performance per instance and when classes are of uneven sizes.
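- Calculation: Pool the counts over all $C$ classes, then compute precision, recall, and F1 once: $\text{Micro-P} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FP_i)}$, $\text{Micro-R} = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C} (TP_i + FN_i)}$, $\text{Micro-F1} = 2 \cdot \frac{\text{Micro-P} \cdot \text{Micro-R}}{\text{Micro-P} + \text{Micro-R}}$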
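A minimal sketch of the pooled calculation is shown below; the function name `micro_f1` and the toy labels are hypothetical. In single-label multi-class problems every misclassification is one false positive for the predicted class and one false negative for the true class, so micro F1 coincides with accuracy, which the last line checks.

```python
def micro_f1(y_true, y_pred):
    """F1 computed once from TP, FP, FN pooled over all classes."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            tp += (t == label and p == label)
            fp += (t != label and p == label)
            fn += (t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(micro_f1(y_true, y_pred), 3))  # 0.714
print(abs(micro_f1(y_true, y_pred) - accuracy) < 1e-12)  # True
```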
3. Weighted F1 Score
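Weighted F1 Score averages the per-class F1 scores, weighting each class by its support, i.e., the number of true instances of that class.
- Use Case: Suitable for datasets with class imbalance when each class should count in proportion to how often it occurs.
- Calculation: With $n_i$ the support of class $i$ and $N = \sum_{i=1}^{C} n_i$ the total number of instances: $\text{Weighted-F1} = \sum_{i=1}^{C} \frac{n_i}{N} F1_i$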
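To illustrate the weighting, the sketch below starts from per-class F1 scores and supports that are assumed to be already known. The function name `weighted_f1` and the numbers are hypothetical: three classes with supports 4, 2, and 1, where the rarest class scores zero.

```python
def weighted_f1(f1_by_class, support_by_class):
    """Support-weighted mean of per-class F1 scores."""
    total = sum(support_by_class.values())
    return sum(
        f1_by_class[label] * support_by_class[label] / total
        for label in f1_by_class
    )


# Hypothetical per-class results: the rare class 2 is never predicted correctly.
f1_by_class = {0: 0.75, 1: 0.8, 2: 0.0}
support_by_class = {0: 4, 1: 2, 2: 1}

weighted = weighted_f1(f1_by_class, support_by_class)
macro = sum(f1_by_class.values()) / len(f1_by_class)
print(round(weighted, 3))  # 0.657
print(round(macro, 3))     # 0.517
```

The failing class carries only 1/7 of the weight here, so the weighted score stays well above the macro score; a high weighted F1 can therefore coexist with a class the model never gets right.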
4. Samples F1 Score
Samples F1 treats each sample equally, calculating the F1 score for each instance and then averaging these scores. This form of measurement is primarily used with multi-label problems.
- Use Case: Useful when dealing with multi-label classification tasks.
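- Calculation: Compute F1 for each sample, then average over all samples: $\text{Samples-F1} = \frac{1}{N} \sum_{j=1}^{N} F1_j$
where $N$ is the number of samples and $F1_j$ is the F1 score between the true and predicted label sets of sample $j$.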
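For a multi-label problem, the per-sample calculation can be sketched with Python sets. The names `sample_f1` and `samples_f1` and the toy label sets are hypothetical, and scoring a sample as 0.0 when both its true and predicted sets are empty is an assumed convention.

```python
def sample_f1(true_labels, pred_labels):
    """F1 between the true and predicted label sets of one sample."""
    true_labels, pred_labels = set(true_labels), set(pred_labels)
    if not true_labels and not pred_labels:
        return 0.0  # assumed convention for the undefined 0/0 case
    overlap = len(true_labels & pred_labels)
    return 2 * overlap / (len(true_labels) + len(pred_labels))


def samples_f1(y_true, y_pred):
    """Mean of the per-sample F1 scores."""
    scores = [sample_f1(t, p) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


# Three samples, each tagged with a set of labels.
y_true = [{"a", "b"}, {"c"}, {"a", "c"}]
y_pred = [{"a"}, {"c"}, {"b"}]

print([round(sample_f1(t, p), 3) for t, p in zip(y_true, y_pred)])  # [0.667, 1.0, 0.0]
print(round(samples_f1(y_true, y_pred), 3))  # 0.556
```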
Summary Table
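| F1 Score Type | Calculation | Use Case |
| --- | --- | --- |
| Macro | Unweighted mean of per-class F1 scores | Equal class importance; balanced classes |
| Micro | F1 from true positives, false positives, and false negatives summed over all classes | Focus on per-instance performance; imbalanced class sizes |
| Weighted | Mean of per-class F1 scores weighted by class support | Imbalanced datasets; real-world class distribution |
| Samples | Mean of per-sample F1 scores | Multi-label tasks; equal importance of each sample |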
Additional Considerations
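- Choice of Metric: The choice of F1 score variant depends on the specific context of the application. On highly imbalanced datasets, the weighted F1 score reflects performance under the observed class distribution, which a Macro F1 does not. Because it weights by frequency, however, it can hide poor results on rare classes; when those rare classes are the critical ones, Macro F1 or the individual per-class scores are usually more informative.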
- Computation Tools: Scikit-learn, a popular Python library for machine learning, provides functions that compute these scores directly (for example `precision_recall_fscore_support`), making it easier to evaluate classification models in practice.
- Interpreting Scores: A nuanced understanding of each score type can guide system improvement decisions. A low macro F1 in a balanced system might suggest issues in handling particular classes, while a low micro F1 might suggest overall system deficiencies.
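As a hedged sketch of the scikit-learn route, the snippet below uses `f1_score` and `precision_recall_fscore_support` from `sklearn.metrics`, selecting the variant through the `average` parameter. The toy labels are hypothetical, and `zero_division=0` is passed so that undefined ratios are reported as 0 without a warning. The "samples" variant needs multi-label input, given here as a binary indicator matrix.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Single-label multi-class toy data (hypothetical).
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]

scores = {
    avg: f1_score(y_true, y_pred, average=avg, zero_division=0)
    for avg in ("macro", "micro", "weighted")
}

# average=None returns one value per class instead of a single number.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

# Multi-label toy data: rows are samples, columns are labels.
y_true_ml = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 1]])
y_pred_ml = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
scores["samples"] = f1_score(
    y_true_ml, y_pred_ml, average="samples", zero_division=0
)

for name, value in scores.items():
    print(f"{name:>8}: {value:.3f}")
print("per-class F1:", np.round(f1, 3), "support:", support)
```

Printing the per-class scores next to the averages is a cheap way to see which class is responsible when the macro and weighted values diverge.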
Conclusion
Understanding these variants of the F1 score helps practitioners diagnose and fix the weak points of their models. As machine learning spreads into fields with different requirements, selecting an evaluation metric that matches the application becomes essential for drawing meaningful conclusions and improving model performance.

