Macro VS Micro VS Weighted VS Samples F1 Score

Understanding Different Types of F1 Scores

In classification problems, especially those with imbalanced datasets, accuracy alone can be misleading. A better approach is often the F1 score, a metric that combines precision and recall into a single value indicating the quality of a model's predictions. In multi-class and multi-label settings, however, a single binary F1 score is not enough, and several averaging variants are used instead: Macro, Micro, Weighted, and Samples F1. Each offers a different view of a classifier's performance.

Technical Background

Before delving into variants, let's lay the foundation with basic definitions:

Precision is the fraction of predicted positive observations that are actually positive. It's calculated as: $\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$
Recall (also known as Sensitivity) is the fraction of actual positive observations that the model labels as positive. Mathematically, it's: $\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
F1 Score is the harmonic mean of precision and recall, which weights the two equally and stays low if either one is low: $\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
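
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function name `precision_recall_f1` and the counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts.

    Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these counts precision is 8/10 and recall is 8/12, and the F1 score falls between the two, closer to the lower value.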

Variants of F1 Score

1. Macro F1 Score

Macro F1 Score computes the F1 score independently for each class and then averages the values. This treats all classes equally without taking into account their frequency in the dataset.

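Use Case: Useful when every class should count equally toward the overall score, regardless of how often it occurs. With perfectly balanced classes it coincides with the weighted F1; with imbalanced classes it makes poor performance on rare classes visible.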
Calculation: For C classes: $\text{Macro F1} = \frac{1}{C} \sum_{i=1}^{C} \text{F1}_i$
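
As a rough sketch of the macro calculation, the snippet below computes a one-vs-rest F1 for each class and takes the plain mean. The helper names and the 10-sample, three-class toy data are hypothetical; per-class F1 is written as 2TP / (2TP + FP + FN), which is algebraically the same as the harmonic mean of precision and recall, and an undefined F1 is assumed to be 0.0.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: F1} using one-vs-rest counts for each class."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # Assumption: F1 is 0.0 when the class has no TP, FP or FN at all.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: class 0 is frequent, class 2 is rare and never predicted.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

scores = per_class_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(scores)
print(f"macro F1 = {macro:.3f}")
```

Here the single missed sample of class 2 contributes an F1 of 0 with the same weight as the other two classes, which pulls the macro score down to about 0.55 even though 8 of 10 predictions are correct.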

2. Micro F1 Score

Micro F1 Score calculates the global precision and recall by considering the sum of true positives, false negatives, and false positives across all classes. It gives equal weight to each instance.

Use Case: Effective when every instance should count equally. Because frequent classes contribute more instances, they dominate the score when class sizes are uneven.
Calculation: With sums taken over all $C$ classes: $\text{Micro Precision} = \frac{\sum_{i=1}^{C} \text{True Positives}_i}{\sum_{i=1}^{C} (\text{True Positives}_i + \text{False Positives}_i)}$
$\text{Micro Recall} = \frac{\sum_{i=1}^{C} \text{True Positives}_i}{\sum_{i=1}^{C} (\text{True Positives}_i + \text{False Negatives}_i)}$
$\text{Micro F1} = 2 \cdot \frac{\text{Micro Precision} \times \text{Micro Recall}}{\text{Micro Precision} + \text{Micro Recall}}$
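
The pooling step can be sketched as follows: counts are summed over all classes first, and precision, recall and F1 are computed once from the totals. The function name `micro_f1` and the toy labels are hypothetical, and a zero denominator is assumed to yield 0.0.

```python
def micro_f1(y_true, y_pred):
    """Pool TP/FP/FN over all classes, then compute one global F1."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        tp += sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp += sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn += sum(t == c and p != c for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: 10 samples, three classes, 8 correct predictions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

micro_p, micro_r, micro = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"micro P={micro_p:.2f} R={micro_r:.2f} F1={micro:.2f} accuracy={accuracy:.2f}")
```

In single-label multi-class problems every misclassification is one false positive for the predicted class and one false negative for the true class, so micro precision, micro recall, micro F1 and accuracy all coincide. The variants only diverge from accuracy in multi-label settings or when some labels are excluded.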

3. Weighted F1 Score

Weighted F1 Score computes the F1 score for each class and then averages them, weighting each class by its support, i.e., the number of true instances of that class.

Use Case: Suitable for imbalanced datasets when each class should influence the score in proportion to how often it occurs.
Calculation: $\text{Weighted F1} = \sum_{i=1}^{C} \frac{\text{Support}_i}{\text{Total Support}} \cdot \text{F1}_i$
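
A minimal sketch of the support-weighted average is shown below. The function name `weighted_f1` and the toy labels are hypothetical; support is taken as the number of true instances per class, and an undefined per-class F1 is assumed to be 0.0.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_c = 2 * tp / denom if denom else 0.0
        # Classes that never occur in y_true have support 0 and add nothing.
        score += support[c] / total * f1_c
    return score


# Toy data: supports are 6, 3 and 1 for classes 0, 1 and 2.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

weighted = weighted_f1(y_true, y_pred)
print(f"weighted F1 = {weighted:.3f}")
```

On this data the rare class 2 has an F1 of 0 but carries only a tenth of the weight, so the weighted score (about 0.75) stays much higher than an unweighted mean over classes would.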

4. Samples F1 Score

Samples F1 computes an F1 score for each sample by comparing its predicted label set with its true label set, then averages these scores so that every sample counts equally. It is only meaningful for multi-label problems, where a sample can carry several labels at once.

Use Case: Useful when dealing with multi-label classification tasks.
Calculation: Compute F1 for each sample, then average over all samples: $\text{Samples F1} = \frac{1}{N} \sum_{j=1}^{N} \text{F1}_j$
where $N$ is the number of samples.
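
The per-sample calculation can be sketched with Python sets. For one sample with true label set T and predicted label set P, F1 equals 2|T ∩ P| / (|T| + |P|), which is the harmonic mean of that sample's precision and recall. The function name `sample_f1`, the label names and the data are hypothetical, and a sample with no true and no predicted labels is assumed to score 0.0.

```python
def sample_f1(true_labels, pred_labels):
    """F1 between the true and predicted label sets of one sample."""
    overlap = len(true_labels & pred_labels)
    total = len(true_labels) + len(pred_labels)
    # Assumption: both sets empty counts as 0.0.
    return 2 * overlap / total if total else 0.0


# Toy multi-label data: each sample has a set of labels.
y_true = [{"a", "b"}, {"c"}, {"a", "c"}]
y_pred = [{"a"}, {"c"}, {"b", "c"}]

per_sample = [sample_f1(t, p) for t, p in zip(y_true, y_pred)]
samples_f1 = sum(per_sample) / len(per_sample)
print(per_sample)
print(f"samples F1 = {samples_f1:.3f}")
```

The three samples score 2/3, 1 and 1/2 respectively, and the samples F1 is their plain average.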

Summary Table

F1 Score Type	Calculation	Use Case
Macro	$\text{Macro F1} = \frac{1}{C} \sum_{i=1}^{C} \text{F1}_i$	Equal class importance; rare classes count as much as frequent ones
Micro	$\text{Micro F1} = 2 \cdot \frac{\text{Micro Precision} \times \text{Micro Recall}}{\text{Micro Precision} + \text{Micro Recall}}$	Per-instance performance; dominated by frequent classes
Weighted	$\text{Weighted F1} = \sum_{i=1}^{C} \frac{\text{Support}_i}{\text{Total Support}} \cdot \text{F1}_i$	Class importance proportional to support; reflects the real-world class distribution
Samples	$\text{Samples F1} = \frac{1}{N} \sum_{j=1}^{N} \text{F1}_j$	Multi-label tasks; Equal importance of each sample

Additional Considerations

Choice of Metric: The choice of F1 score variant depends on the specific context of the application. On highly imbalanced datasets, the weighted F1 score reflects the real class distribution, whereas the macro F1 score exposes weak performance on rare classes. Which of the two is more informative depends on whether the rare classes are the critical ones.
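Computation Tools: Scikit-learn, a popular Python library for machine learning, provides functions to compute these scores directly: f1_score takes an average argument ('macro', 'micro', 'weighted' or 'samples'), and precision_recall_fscore_support returns precision, recall, F1 and support together, making it easier to evaluate classification models in practice.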
Interpreting Scores: Understanding what each score measures can guide system improvement decisions. A low macro F1 in a balanced system might suggest issues in handling particular classes, while a low micro F1 might suggest overall system deficiencies.
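
As an illustration of the scikit-learn functions mentioned under Computation Tools, the sketch below computes all four variants on hypothetical toy data. It assumes scikit-learn 0.22 or later (for the `zero_division` argument) and NumPy are installed; the label arrays are invented for the example.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Single-label, three-class toy data (class 2 is rare and never predicted).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# zero_division=0 reports an undefined precision or F1 as 0 without a warning.
averages = {
    avg: f1_score(y_true, y_pred, average=avg, zero_division=0)
    for avg in ("macro", "micro", "weighted")
}

# average=None returns one value per class instead of a single summary.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

# Multi-label toy data as indicator matrices (rows: samples, columns: labels).
Y_true = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 1]])
Y_pred = np.array([[1, 0, 0], [0, 0, 1], [0, 1, 1]])
samples = f1_score(Y_true, Y_pred, average="samples", zero_division=0)

print(averages)
print("per-class F1:", f1, "support:", support)
print("samples F1:", samples)
```

Comparing the per-class output with the averaged scores is often the quickest way to see which class is responsible for a gap between macro and micro F1.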

Conclusion

Understanding these variants of the F1 score helps practitioners diagnose and fix the weak points of their models. As machine learning spreads into fields with different requirements, choosing the right evaluation metric becomes essential for drawing meaningful conclusions and improving model performance.