Scikit-learn, get accuracy scores for each class

Scikit-learn is a widely used open-source machine learning library for the Python programming language. Built on NumPy and SciPy and designed to interoperate with pandas, Scikit-learn provides simple, efficient tools for data mining and data analysis, which makes it approachable for newcomers and practical for experts.

Key Features

Scikit-learn includes a variety of supervised and unsupervised learning algorithms. It is particularly praised for its consistency, as the API is designed with an emphasis on reusable interfaces, performance, and quality of documentation.
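The consistent interface described above can be illustrated with a minimal sketch. The dataset (`load_iris`) and the two estimators are picked only for illustration and are not part of the original example; the point is that unrelated algorithms expose the same `fit`, `predict`, and `score` methods.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset, used here only for illustration
X, y = load_iris(return_X_y=True)

# Two unrelated algorithms that share the same estimator interface
estimators = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
]

scores = {}
for est in estimators:
    est.fit(X, y)                # same training call for every estimator
    preds = est.predict(X[:5])   # same prediction call
    # score() returns mean accuracy; measured on the training data here,
    # so it is optimistic and only demonstrates the API
    scores[type(est).__name__] = est.score(X, y)
    print(type(est).__name__, preds, round(scores[type(est).__name__], 3))
```

Because the calls are identical, swapping one algorithm for another usually means changing a single line.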

Core Functionality

  1. Classification: Identifies which category an object belongs to. Some examples include:
    • K-Nearest Neighbors
    • Support Vector Machines
    • Decision Trees
  2. Regression: Predicts a continuous-valued attribute associated with an object. Examples include:
    • Linear Regression
    • Ridge Regression
    • LASSO
  3. Clustering: Groups sets of similar data. Examples:
    • K-Means
    • DBSCAN
    • Hierarchical clustering
  4. Dimensionality Reduction: Reduces the number of random variables to consider. Examples:
    • Principal Component Analysis (PCA)
    • Singular Value Decomposition (SVD)
  5. Model Selection: Compares, validates, and chooses parameters and models, including:
    • Grid Search
    • Cross-Validation
  6. Preprocessing: Feature extraction and normalization. Examples:
    • StandardScaler
    • MinMaxScaler
    • PolynomialFeatures

Understanding Model Evaluation with Scikit-learn

A critical component of machine learning is evaluating a model's performance. Scikit-learn's sklearn.metrics module provides functions for measuring how well a model performs. For classification tasks, the most common starting point is the accuracy score: the fraction of samples predicted correctly.

Accuracy Scores for Each Class

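In multi-class classification problems, it's often important to understand how well the model performs on each class, not just overall. The accuracy of a single class, meaning the fraction of that class's samples that were predicted correctly, is the same quantity as that class's recall. The classification report therefore covers it in its recall column, alongside precision and F1-score. Here is an example of how to compute these per-class scores using Scikit-learn: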

python
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Generate a random multi-class classification dataset
# (random_state fixed so the numbers are reproducible)
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3, n_informative=10, random_state=42)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = model.predict(X_test)

# Detailed classification report as a nested dict keyed by class label
report = classification_report(y_test, y_pred, output_dict=True)
print(report)
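If only the per-class accuracy is needed, one common approach is to read it off the confusion matrix: the diagonal holds the correctly predicted samples of each class, and each row sum is the number of true samples of that class. The sketch below rebuilds a small dataset so it runs on its own; the `random_state` values and `max_iter=1000` are assumptions chosen for reproducibility, not settings from the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data (seeds are arbitrary, fixed for reproducibility)
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)

# Per-class accuracy: correct predictions for a class / true samples of that class
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)

# The same numbers, obtained as per-class recall
per_class_recall = recall_score(y_test, y_pred, labels=model.classes_, average=None)

for label, acc in zip(model.classes_, per_class_accuracy):
    print(f"class {label}: accuracy {acc:.3f}")
```

The two arrays agree element by element, which is why the recall column of the classification report can be read as per-class accuracy.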

Breakdown of the Report

The classification report provides key insights, such as precision, recall, f1-score, and support, for each class:

  • Precision: The ratio of correctly predicted positive observations to the total predicted positives.
  • Recall (Sensitivity): The ratio of correctly predicted positive observations to all the actual positives.
  • F1-Score: The harmonic mean of Precision and Recall, 2 × (Precision × Recall) / (Precision + Recall).
  • Support: The number of actual occurrences of the class in the true labels being evaluated (here, the test set).
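To make the definitions above concrete, here is a small worked sketch. The label lists are made up for illustration; the counts for class 0 are worked out by hand and then compared with `precision_recall_fscore_support`.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels, chosen so the counts are easy to verify by hand
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 1, 0, 2, 2, 2]

# Class 0 by hand:
#   true positives  = 2 (first two samples)
#   false positives = 1 (a class-1 sample predicted as 0)
#   false negatives = 2 (class-0 samples predicted as 1 and 2)
tp, fp, fn = 2, 1, 2
precision = tp / (tp + fp)                            # 2/3
recall = tp / (tp + fn)                               # 1/2
f1 = 2 * precision * recall / (precision + recall)    # 4/7, the harmonic mean

# The same metrics for every class, computed by scikit-learn
p, r, f, s = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1, 2])

print(f"by hand: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"sklearn: precision={p[0]:.3f} recall={r[0]:.3f} f1={f[0]:.3f} support={s[0]}")
```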

Example of Reported Metrics

Class   Precision   Recall   F1-score   Support
0       0.90        0.85     0.87       100
1       0.80        0.75     0.77       95
2       0.83        0.90     0.86       105

The table provides a summary of key performance metrics for each class. Understanding these metrics can help in adjusting model parameters or choosing alternative algorithms for better performance.

Advanced Topics in Scikit-learn

Pipelines

Scikit-learn's Pipeline is a simple way to automate workflows by chaining together transformations and estimators. It is particularly useful for ensuring that all preprocessing steps are applied consistently during both training and testing phases.

python
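from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# X_train, X_test and y_train come from the train/test split in the earlier example

# Create a pipeline: scaling is fitted on the training data only
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# Fit the scaler and the model
pipeline.fit(X_train, y_train)

# Predict (the fitted scaler is applied to X_test automatically)
y_pred = pipeline.predict(X_test)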
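One place where this consistency matters is cross-validation. The following is a minimal sketch, assuming a small synthetic dataset and 5 folds (both arbitrary choices): passing the whole pipeline to `cross_val_score` means the scaler is refitted on the training part of every fold, so no statistics from the held-out part leak into preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data (size and seed are arbitrary)
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold fits scaler + model on its training part and scores on its held-out part
scores = cross_val_score(pipeline, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(scores.mean(), 3))
```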

Grid Search with Cross-Validation

Grid Search combined with Cross-Validation can be used to find the best parameters for a model:

python
from sklearn.model_selection import GridSearchCV

# Define parameter range
param_grid = {'model__C': [0.1, 1, 10, 100]}

# Setup the grid search
grid = GridSearchCV(pipeline, param_grid, refit=True, cv=3)

# Fit the model
grid.fit(X_train, y_train)

# Best parameters found
print(grid.best_params_)
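When per-class performance is the goal, the search can be scored on it directly. The sketch below is one possible setup rather than a recommendation: it assumes `scoring="recall_macro"` (the unweighted mean of per-class recall) as the selection criterion, rebuilds a small synthetic dataset so it runs on its own, and then reports per-class recall of the refitted best model on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data (size and seeds are arbitrary)
X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
param_grid = {"model__C": [0.1, 1, 10, 100]}

# Select C by macro-averaged recall instead of the default accuracy
grid = GridSearchCV(pipeline, param_grid, cv=3, scoring="recall_macro")
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))

# grid.predict uses the best pipeline, refitted on all of X_train
per_class_recall = recall_score(y_test, grid.predict(X_test), average=None)
for label, rec in zip(grid.best_estimator_.classes_, per_class_recall):
    print(f"class {label}: recall {rec:.3f}")
```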

Conclusion

Scikit-learn remains a versatile tool for anyone working with machine learning in Python. Its consistent API, comprehensive documentation, and model selection utilities serve practitioners and researchers alike. With features that span preprocessing, model training, per-class evaluation, and hyperparameter optimization, Scikit-learn helps streamline the process of building predictive models.

