How to extract sklearn decision tree rules to pandas boolean conditions?
Introduction
Decision trees are a popular machine learning model because they are easy to interpret and visualize. Sometimes, however, you need the decision rules in a form that code can consume, for example as Boolean conditions for further analysis or for integration into systems that cannot accept tree-based models directly. In this article, we will explore how to extract the rules from a decision tree trained with `sklearn` and convert them into Boolean conditions that can be stored in, and applied to, a pandas DataFrame.
Prerequisites
To follow along, you'll need:
- Python environment set up with `pandas`, `numpy`, and `scikit-learn` packages installed.
- Basic understanding of how decision trees work and familiarity with pandas and numpy.
Technical Explanation
Decision Trees in Scikit-learn
Scikit-learn provides a user-friendly interface for creating decision trees through the `DecisionTreeClassifier` and `DecisionTreeRegressor` classes. Once fitted, these estimators expose the learned tree through their `tree_` attribute, which we can traverse to obtain rules.
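As a minimal sketch of this step, the snippet below fits a small classifier and looks at the fitted `tree_` object. The Iris dataset, `max_depth=3`, and `random_state=0` are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset and hyperparameters (assumptions, not requirements).
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# The fitted structure lives on the `tree_` attribute.
print(type(clf.tree_).__name__)
print("nodes:", clf.tree_.node_count)
print("leaves:", clf.get_n_leaves(), "depth:", clf.get_depth())
```

Because every internal node has exactly two children, the node count is always twice the leaf count minus one.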
Structure of Decision Trees in Scikit-learn
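Under the hood, a decision tree is represented by a set of nodes. Each node either splits into two child nodes or ends as a leaf. Scikit-learn stores this binary tree as parallel arrays indexed by node id, with the root at index `0`:
- `tree_.feature`: The index of the feature used for splitting at each node. A value of `-2` marks a leaf node.
- `tree_.threshold`: The threshold for the split at each node. Samples with `feature <= threshold` go to the left child; all others go to the right child.
- `tree_.children_left`: The node id of each node's left child (`-1` for a leaf).
- `tree_.children_right`: The node id of each node's right child (`-1` for a leaf).
- `tree_.value`: The output stored at each node. For a classifier this is the per-class distribution of the training samples that reached the node (counts or proportions, depending on the scikit-learn version), so the predicted class is the one with the largest entry. For a regressor it is the predicted value.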
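To make these arrays concrete, here is a small sketch that lays them out side by side, one row per node. The Iris data and the column names of the `nodes` DataFrame are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
tree = clf.tree_

# One row per node; the row index is the node id (0 is the root).
nodes = pd.DataFrame({
    "feature": tree.feature,         # -2 at leaves
    "threshold": tree.threshold,     # only meaningful at internal nodes
    "left": tree.children_left,      # -1 at leaves
    "right": tree.children_right,    # -1 at leaves
})
# A leaf is a node whose left and right child ids are equal (both -1).
nodes["is_leaf"] = nodes["left"] == nodes["right"]
# tree.value has shape (n_nodes, n_outputs, n_classes); take the top class.
nodes["majority_class"] = tree.value[:, 0, :].argmax(axis=1)
print(nodes)
```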
Extracting Rules
To extract the rules, we'll traverse the decision tree and record the rules leading to each leaf. Here's a step-by-step guide.
- Traverse the Tree: Use a recursive function to visit each node.
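- Capture Condition: For each internal node, capture the splitting rule as a feature and its threshold: `feature <= threshold` for the left branch and `feature > threshold` for the right branch.
- Record Path: Record the path from the root to each leaf. The conjunction (logical AND) of the conditions along that path is the Boolean rule for the leaf.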
- Build Pandas DataFrame: Using paths from tree traversal, construct Boolean conditions and store them in a DataFrame.
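One way to sketch these four steps is to push a Boolean row mask down the tree, so that each leaf ends up with a pandas Boolean Series selecting the rows that satisfy its rule. The helper name `leaf_masks`, the `leaf_` column prefix, and the Iris data are assumptions for illustration, not part of scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, iris.target)
tree = clf.tree_


def leaf_masks(node, mask, out):
    """Hypothetical helper: collect one Boolean row mask per leaf in `out`."""
    if tree.children_left[node] == tree.children_right[node]:  # leaf reached
        out[int(node)] = mask
        return
    column = X.columns[tree.feature[node]]
    goes_left = X[column] <= tree.threshold[node]
    # Left branch: condition holds; right branch: its negation.
    leaf_masks(tree.children_left[node], mask & goes_left, out)
    leaf_masks(tree.children_right[node], mask & ~goes_left, out)


masks = {}
leaf_masks(0, pd.Series(True, index=X.index), masks)

# One Boolean column per leaf; each row is True in exactly one column.
rule_frame = pd.DataFrame(masks).add_prefix("leaf_")
print(rule_frame.sum())
```

Each column of `rule_frame` can be used directly as a filter, for example `X[rule_frame.iloc[:, 0]]`. Note that scikit-learn converts inputs to 32-bit floats internally, so a value lying extremely close to a threshold could in principle be routed differently by this float64 comparison than by `clf.predict`.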
Implementation
- Load and Train: Load an example dataset and train a decision tree model.
- Recursive Function: `traverse_tree()` visits nodes recursively, accumulating the conditions along the current path until it reaches a leaf node.
- Path Conditions: For each leaf, print the path as a conjunction of Boolean conditions that define the rule leading to that output.
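The outline above could be realized roughly as follows. This is a sketch under stated assumptions: the Iris dataset, `max_depth=2`, the exact signature and return format of `traverse_tree()`, and the three-decimal rounding of thresholds (for display only) are all illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load and Train
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
tree = clf.tree_


def traverse_tree(node=0, conditions=()):
    """Return a list of (conditions, predicted_class) pairs, one per leaf."""
    if tree.feature[node] == -2:  # -2 marks a leaf node
        predicted = iris.target_names[tree.value[node][0].argmax()]
        return [(conditions, predicted)]
    name = iris.feature_names[tree.feature[node]]
    threshold = tree.threshold[node]
    left = traverse_tree(
        tree.children_left[node], conditions + (f"({name} <= {threshold:.3f})",)
    )
    right = traverse_tree(
        tree.children_right[node], conditions + (f"({name} > {threshold:.3f})",)
    )
    return left + right


# Path Conditions: print each rule as a conjunction of conditions.
rules = traverse_tree()
for conditions, predicted in rules:
    print(" & ".join(conditions), "->", predicted)
```

Passing `conditions` as an immutable tuple means each recursive call gets its own copy of the path, so sibling branches cannot interfere with one another.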

