Data Prediction Using Decision Trees with rpart
Introduction
The rpart package is one of the standard ways to build decision trees in R. It works for both classification and regression, and it is especially useful when you want an interpretable model that shows which splits lead to a prediction.
What rpart Does Well
A decision tree repeatedly splits a dataset into smaller groups based on feature values. For classification, the goal is to separate classes. For regression, the goal is to reduce prediction error inside each branch.
rpart is a good first choice when you need:
- a model that is easy to explain
- support for numeric and categorical predictors
- simple baseline predictions before trying more complex models
The main tradeoff is that a single tree can overfit if you let it grow too deep.
Fit a Classification Tree
The iris dataset is a convenient example because it is built into R. The code below splits the data, trains a classifier, and predicts on a test set.
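A minimal sketch of those steps follows. The seed, the 70/30 split, and cp = 0.01 are illustrative assumptions, not values the text prescribes.

```r
library(rpart)

set.seed(42)  # assumed seed, only to make the split reproducible

# Hold out 30% of the rows for testing (the ratio is an illustrative choice)
train_idx <- sample(seq_len(nrow(iris)), size = round(0.7 * nrow(iris)))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Target on the left of the formula; "." means all remaining columns
fit <- rpart(Species ~ ., data = train, method = "class",
             control = rpart.control(cp = 0.01))

# type = "class" returns predicted labels rather than probabilities
pred <- predict(fit, newdata = test, type = "class")

# Share of held-out rows predicted correctly
accuracy <- mean(pred == test$Species)
print(accuracy)
```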
Important pieces of the call:
- the formula puts the target on the left side
- '
method = "class"selects classification' - '
cplimits unnecessary splits' - '
type = "class"returns the predicted label instead of probabilities'
Inspect Probabilities and the Tree Structure
For some tasks, the predicted class is not enough. You may want class probabilities to choose your own threshold or inspect uncertainty.
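As a rough sketch, class probabilities can be requested with type = "prob". The model here is fit on the full iris data for brevity, and the 0.9 cutoff is a hypothetical threshold chosen only for illustration.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# type = "prob" returns one column per class; each row sums to 1
probs <- predict(fit, newdata = iris[c(1, 51, 101), ], type = "prob")
print(probs)

# Hypothetical rule: flag rows where the top class probability is below 0.9
uncertain <- apply(probs, 1, max) < 0.9
print(uncertain)
```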
printcp shows the complexity parameter table. Its xerror column reports the cross-validated error for each tree size, which helps you see whether the tree has grown beyond the point where additional splits still improve it.
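One way to look at that table, sketched on the full iris data (the seed is an assumption, set because the cross-validated columns depend on random fold assignment):

```r
library(rpart)

set.seed(42)  # xerror and xstd vary with the random cross-validation folds
fit <- rpart(Species ~ ., data = iris, method = "class")

# Prints CP, nsplit, rel error, xerror, and xstd for each candidate tree size
printcp(fit)

# The same table is stored on the fitted object as a matrix
cp_table <- fit$cptable
print(colnames(cp_table))
```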
You can also visualize the tree itself:
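A minimal sketch using the plot and text methods that ship with rpart. The layout arguments are illustrative choices.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# Draw the branches, then label the splits and leaves
plot(fit, uniform = TRUE, margin = 0.1)
text(fit, use.n = TRUE, cex = 0.8)

# Text alternative: printing the fit lists every node and its split rule
print(fit)
```

The separate rpart.plot package offers more polished drawings if installing an extra package is an option.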
That view is one of the biggest reasons people still use decision trees in teaching and operational reporting.
Fit a Regression Tree
If the target is numeric, switch the method to regression. Here is a runnable example using the mtcars dataset to predict fuel efficiency.
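The sketch below assumes mpg is modeled from wt, hp, and cyl. The predictor choice, the 24/8 split, and the lowered minsplit are illustrative assumptions.

```r
library(rpart)

set.seed(42)  # assumed seed for a reproducible split
train_idx <- sample(seq_len(nrow(mtcars)), size = 24)  # 24 of 32 rows for training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# mtcars is small, so minsplit is lowered from its default of 20 (assumed setting)
fit <- rpart(mpg ~ wt + hp + cyl, data = train, method = "anova",
             control = rpart.control(minsplit = 5))

# For regression, predict() returns numeric values directly
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((pred - test$mpg)^2))
print(rmse)
```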
For regression trees, method = "anova" is the usual setting.
Pruning Usually Improves Prediction
A full tree often memorizes noise. rpart supports pruning so you can cut back weak splits.
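One common approach, sketched here under the assumption that the cp value with the lowest cross-validated error is an acceptable choice, is to grow a large tree and cut it back with prune:

```r
library(rpart)

set.seed(42)  # cross-validated error depends on random folds
# Grow a deliberately large tree by setting cp = 0 (no complexity penalty)
full <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(cp = 0, minsplit = 2))

# Pick the cp value from the row with the lowest cross-validated error
best_row <- which.min(full$cptable[, "xerror"])
best_cp  <- full$cptable[best_row, "CP"]

pruned <- prune(full, cp = best_cp)

# Compare the number of leaves before and after pruning
leaves <- function(tree) sum(tree$frame$var == "<leaf>")
cat("leaves before:", leaves(full), "after:", leaves(pruned), "\n")
```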
This is a useful habit even for small models. A slightly smaller tree is often easier to explain and generalizes better.
Choose Predictors Carefully
rpart will happily fit a model with poor features, but tree quality still depends on the data you give it. Before training:
- remove obvious leakage variables
- handle missing values deliberately
- make sure categorical values are consistent between train and test data
- evaluate with held-out data, not training accuracy alone
A tree that looks interpretable can still be wrong for operational reasons if the feature pipeline is inconsistent.
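As an illustration of the factor-consistency check, the sketch below re-encodes test data with the training levels so that unseen categories surface as NA before prediction. The data frames and the region column are hypothetical.

```r
# Hypothetical data: "region" is an illustrative categorical predictor
train <- data.frame(
  region = factor(c("north", "south", "east", "north", "south", "east")),
  spend  = c(10, 25, 14, 12, 30, 15)
)
test <- data.frame(region = c("south", "west"), stringsAsFactors = FALSE)

# Re-encode test values with the training levels; unseen values become NA
test$region <- factor(test$region, levels = levels(train$region))
print(test$region)

# Make the gap visible before predicting instead of discovering it later
unseen <- is.na(test$region)
cat("rows with unseen levels:", sum(unseen), "\n")
```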
When a Tree Is a Good Baseline
A single decision tree is often strong enough when:
- the dataset is modest in size
- interpretability matters more than maximum accuracy
- you need a quick baseline before random forest or boosting
If performance is not good enough, move to ensemble methods after you understand what the single tree is doing.
Common Pitfalls
Using training accuracy as the main metric. Trees can overfit quickly, so always test on unseen data.
Skipping pruning. The default fitted tree is not always the best final tree.
Using the wrong type in predict. Classification often needs "class" or "prob", while regression returns numeric values directly.
Ignoring class imbalance. A model can appear accurate while still failing on the minority class.
Feeding inconsistent factor levels into prediction data. Train and test categorical values must line up.
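To look past overall accuracy, per-class recall can be read off a confusion matrix built from held-out predictions. This is a minimal sketch on iris, whose classes are balanced, so it only shows the mechanics. The seed and split size are assumptions.

```r
library(rpart)

set.seed(42)  # assumed seed for a reproducible split
train_idx <- sample(seq_len(nrow(iris)), size = 105)
fit  <- rpart(Species ~ ., data = iris[train_idx, ], method = "class")
test <- iris[-train_idx, ]
pred <- predict(fit, newdata = test, type = "class")

# Rows are actual classes, columns are predictions
cm <- table(actual = test$Species, predicted = pred)
print(cm)

# Recall per class: correct predictions divided by the actual count for that class
recall <- diag(cm) / rowSums(cm)
print(round(recall, 2))
```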
Summary
- rpart builds interpretable decision trees for classification and regression in R.
- Use method = "class" for labels and method = "anova" for numeric targets.
- Evaluate on held-out data, not just the training set.
- Inspect the complexity parameter table and prune when needed.
- Treat a single tree as a useful baseline before moving to more complex models.

