Scikit-learn χ² chi-squared statistic and corresponding contingency table
Scikit-learn provides an efficient implementation of statistical models and machine learning algorithms, facilitating various tasks such as classification, regression, clustering, and data preprocessing. Among the library's many tools is the chi-squared (χ²) statistic, an essential method for categorical data analysis, especially in feature selection.
Chi-Squared (χ²) Statistic
The chi-squared test is a non-parametric statistical method used to determine if there is a significant difference between observed and expected frequencies in categorical datasets. This often helps to infer relationships between categorical variables in a dataset, which is crucial in feature selection for machine learning models.
Technical Explanation
The chi-squared statistic is calculated as follows:
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
where:
• $O_i$ is the observed frequency count in the i-th category,
• $E_i$ is the expected frequency count in the i-th category, based on the hypothesis of independence.
In a feature selection context, a high chi-squared value indicates that the observed frequency is far from the expected frequency under the independence assumption, suggesting a potential relationship between the feature and the target.
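As a minimal sketch of the formula above, the statistic can be computed directly from paired observed and expected counts. The helper name `chi_squared` and the sample counts are hypothetical, chosen only for illustration; they do not come from scikit-learn.

```python
def chi_squared(observed, expected):
    """Return the chi-squared statistic for paired observed/expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Sum the squared deviations, each scaled by its expected count.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three categories (illustration only).
observed = [18, 22, 20]
expected = [20, 20, 20]

statistic = chi_squared(observed, expected)
print(statistic)
```

The larger the gap between an observed count and its expected count, the more that category contributes to the total.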
Contingency Table
A contingency table, or cross-tabulation, displays the joint frequency distribution of two or more categorical variables. It is the starting point for calculating the chi-squared statistic: each cell holds the number of observations for one combination of categories, which makes the interaction between the variables easy to inspect. A binary feature and a binary target give a simple 2×2 table, and the same layout extends to larger tables for multi-class problems.
Example Contingency Table
Consider a dataset that records each individual's gender and whether they purchased a product:
| Purchase / Gender | Male | Female | Total |
| Yes | 60 | 90 | 150 |
| No | 40 | 60 | 100 |
| Total | 100 | 150 | 250 |
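From this table, we can calculate expected frequencies for each cell assuming independence: the expected count is the row total multiplied by the column total, divided by the grand total. For example, the expected frequency of males who purchased is:
$$E_{\text{Yes, Male}} = \frac{150 \times 100}{250} = 60$$
This equals the observed count of 60. In this particular table every cell matches its expected value, so the chi-squared statistic is 0 and the data show no association between gender and purchase.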
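The expected-frequency calculation can be sketched for the whole table in plain Python. The variable names are illustrative; the counts are the ones from the example table.

```python
# Observed counts from the example table:
# rows are Purchase (Yes, No), columns are Gender (Male, Female).
observed = [
    [60, 90],
    [40, 60],
]

row_totals = [sum(row) for row in observed]        # Yes, No
col_totals = [sum(col) for col in zip(*observed)]  # Male, Female
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic summed over every cell of the table.
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(chi2_stat)
```

Because both genders purchase at the same rate (60%), the expected counts coincide with the observed ones and the statistic comes out as zero.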
Chi-Squared Feature Selection in Scikit-learn
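Scikit-learn supports chi-squared feature selection through the `SelectKBest` transformer paired with the `chi2` scoring function. `chi2` scores each feature against the target, and `SelectKBest` keeps the `k` highest-scoring features, which makes the pair useful during the feature selection phase of predictive modeling. Note that `chi2` requires non-negative feature values, such as counts or frequencies.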
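A minimal sketch of this workflow follows. The iris dataset and `k=2` are assumptions made for illustration, not choices from the original article; iris is convenient here because all of its features are non-negative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative dataset (an assumption): 150 samples, 4 non-negative features.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-squared score against the target.
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, X_selected.shape)
print(selector.scores_)    # chi-squared score per original feature
print(selector.pvalues_)   # corresponding p-values
print(selector.get_support(indices=True))  # indices of the kept features
```

After fitting, `scores_` and `pvalues_` can be inspected to see how strongly each feature is associated with the target before committing to a value of `k`.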
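Other feature-selection scores can be used in place of chi-squared:
• ANOVA F-value (`f_classif`): For continuous features with a categorical target; `f_regression` is the counterpart for continuous target variables.
• Mutual Information: Captures more complex, including non-linear, relationships between variables.
• Tree-based feature importance: Model-based approach providing importance scores.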
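The alternatives listed above can be sketched on the same kind of data. The dataset, the random forest as the tree-based model, and the hyperparameters are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import f_classif, mutual_info_classif

# Illustrative dataset (an assumption), as in a typical classification task.
X, y = load_iris(return_X_y=True)

# ANOVA F-value: one F statistic and p-value per feature.
f_scores, f_pvalues = f_classif(X, y)

# Mutual information: non-negative dependency estimate per feature.
mi_scores = mutual_info_classif(X, y, random_state=0)

# Tree-based importance: taken from a fitted random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_

print(f_scores)
print(mi_scores)
print(importances)
```

`f_classif` and `mutual_info_classif` can also be passed as `score_func` to `SelectKBest`, whereas tree-based importances come from a fitted model rather than a univariate test.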

