Data Prediction using Decision Tree of rpart

Introduction

The rpart package is one of the standard ways to build decision trees in R. It works for both classification and regression, and it is especially useful when you want an interpretable model that shows which splits lead to a prediction.

What `rpart` Does Well

A decision tree repeatedly splits a dataset into smaller groups based on feature values. For classification, the goal is to separate classes. For regression, the goal is to reduce prediction error inside each branch.
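To make "separating classes" concrete, the sketch below scores two candidate splits on iris with Gini impurity, which is the default splitting criterion rpart uses for classification. The helpers gini and split_impurity are hypothetical names written for this illustration, not part of rpart, and the two thresholds are picked by hand.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
# (hypothetical helper, not an rpart function)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two child nodes created by a candidate split
# (hypothetical helper; assumes both children are non-empty)
split_impurity <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

parent <- gini(iris$Species)
after_petal <- split_impurity(iris$Species, iris$Petal.Length < 2.45)
after_sepal <- split_impurity(iris$Species, iris$Sepal.Width < 3.0)

# Lower impurity after the split means the split separates classes better
print(c(parent = parent, petal_split = after_petal, sepal_split = after_sepal))
```

A tree-growing algorithm evaluates many such candidates at each node and keeps the one with the largest impurity reduction.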

rpart is a good first choice when you need:

a model that is easy to explain
support for numeric and categorical predictors
simple baseline predictions before trying more complex models

The main tradeoff is that a single tree can overfit if you let it grow too deep.
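One rough way to see that tradeoff is to fit a default tree and a deliberately unconstrained one on the same split, then compare leaf counts with training and held-out accuracy. This is a sketch under assumptions: the helper names accuracy and n_leaves are hypothetical, and the control values for the deep tree are chosen only to remove the usual stopping rules.

```r
library(rpart)

set.seed(42)
index <- sample(seq_len(nrow(iris)), size = floor(0.7 * nrow(iris)))
train <- iris[index, ]
test <- iris[-index, ]

# Hypothetical helper: share of correct class predictions on a data frame
accuracy <- function(fit, data) {
  mean(predict(fit, newdata = data, type = "class") == data$Species)
}

# Hypothetical helper: number of terminal nodes in a fitted tree
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Default controls versus a tree that may split until nodes are tiny
default_tree <- rpart(Species ~ ., data = train, method = "class")
deep_tree <- rpart(Species ~ ., data = train, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

comparison <- data.frame(
  tree = c("default", "deep"),
  leaves = c(n_leaves(default_tree), n_leaves(deep_tree)),
  train_accuracy = c(accuracy(default_tree, train), accuracy(deep_tree, train)),
  test_accuracy = c(accuracy(default_tree, test), accuracy(deep_tree, test))
)
print(comparison)
```

A gap between training and held-out accuracy that widens as the tree gets deeper is the usual sign of overfitting. On a dataset as clean as iris the gap may be small.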

Fit a Classification Tree

The iris dataset is a convenient example because it is built into R. The code below splits the data, trains a classifier, and predicts on a test set.

library(rpart)

# Fix the seed so the train/test split is reproducible
set.seed(42)
index <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train <- iris[index, ]
test <- iris[-index, ]

# Classification tree on all four measurements
model <- rpart(
  Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = train,
  method = "class",
  control = rpart.control(cp = 0.01, minsplit = 10)
)

# Predict labels for the held-out rows and tabulate them against the truth
pred_class <- predict(model, newdata = test, type = "class")
confusion <- table(Actual = test$Species, Predicted = pred_class)
print(confusion)

Important pieces of the call:

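the formula puts the target on the left side and the predictors on the right
method = "class" selects classification
cp sets the minimum relative improvement a split must deliver to be kept, which limits unnecessary splits
minsplit = 10 is the smallest node size that rpart will attempt to split
type = "class" returns the predicted label instead of probabilities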
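The confusion table can be reduced to a few summary numbers. The sketch below refits the same kind of tree so that it runs on its own, then derives overall accuracy and per-class recall; the variable names are illustrative, and it assumes every species appears at least once in the test rows.

```r
library(rpart)

set.seed(42)
index <- sample(seq_len(nrow(iris)), size = floor(0.7 * nrow(iris)))
train <- iris[index, ]
test <- iris[-index, ]

model <- rpart(Species ~ ., data = train, method = "class",
               control = rpart.control(cp = 0.01, minsplit = 10))

pred_class <- predict(model, newdata = test, type = "class")
confusion <- table(Actual = test$Species, Predicted = pred_class)

# Overall accuracy: correct predictions sit on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class recall: correct predictions divided by the actual count per class
recall <- diag(confusion) / rowSums(confusion)

print(accuracy)
print(recall)
```

Per-class recall is worth reading alongside accuracy, because a high overall number can hide a class the tree rarely gets right.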

Inspect Probabilities and the Tree Structure

For some tasks, the predicted class is not enough. You may want class probabilities to choose your own threshold or inspect uncertainty.

pred_prob <- predict(model, newdata = test, type = "prob")
head(pred_prob)
printcp(model)
plotcp(model)

printcp prints the complexity parameter table, and plotcp plots the cross-validated error from that table against cp. Both help you see whether the tree has grown beyond the point where additional splits still reduce cross-validated error.
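The same table is stored on the fitted object as model$cptable, so it can be inspected programmatically. The sketch below is an illustration rather than a recommended workflow: it rescales the relative errors by the root node error (computed here from model$frame, which should match the "Root node error" line that printcp prints) to get absolute misclassification rates. The names root_error and summary_table are made up for this example.

```r
library(rpart)

set.seed(42)
index <- sample(seq_len(nrow(iris)), size = floor(0.7 * nrow(iris)))
train <- iris[index, ]

model <- rpart(Species ~ ., data = train, method = "class",
               control = rpart.control(cp = 0.01, minsplit = 10))

# Matrix with one row per candidate tree size
cp_table <- model$cptable
print(colnames(cp_table))

# Misclassification rate of a tree with no splits at all
root_error <- model$frame$dev[1] / model$frame$n[1]

# "rel error" and "xerror" are relative to the root, so rescale them
summary_table <- data.frame(
  nsplit = cp_table[, "nsplit"],
  cp = cp_table[, "CP"],
  train_error = cp_table[, "rel error"] * root_error,
  cv_error = cp_table[, "xerror"] * root_error
)
print(summary_table)
```

Training error falls with every extra split, while the cross-validated error typically flattens or rises once the extra splits stop helping.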

You can also visualize the tree itself:

plot(model, uniform = TRUE, margin = 0.1)
text(model, use.n = TRUE, cex = 0.8)

That view is one of the biggest reasons people still use decision trees in teaching and operational reporting.

Fit a Regression Tree

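If the target is numeric, set method = "anova" to fit a regression tree. Here is a runnable example using the mtcars dataset to predict fuel efficiency. mtcars has only 32 rows, so the test set holds 8 cars and the resulting RMSE is a rough estimate.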

library(rpart)

# Fix the seed so the train/test split is reproducible
set.seed(7)
index <- sample(seq_len(nrow(mtcars)), size = 0.75 * nrow(mtcars))
train <- mtcars[index, ]
test <- mtcars[-index, ]

# Regression tree: predict miles per gallon from four predictors
reg_model <- rpart(
  mpg ~ wt + hp + cyl + disp,
  data = train,
  method = "anova",
  control = rpart.control(cp = 0.01, minsplit = 4)
)

# Root mean squared error on the held-out rows
pred_mpg <- predict(reg_model, newdata = test)
rmse <- sqrt(mean((test$mpg - pred_mpg)^2))
print(rmse)

For regression trees, method = "anova" is the usual setting.

Pruning Usually Improves Prediction

A full tree often memorizes noise. rpart supports pruning so you can cut back weak splits.

# Pick the cp value with the lowest cross-validated error (xerror)
best_cp <- reg_model$cptable[which.min(reg_model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(reg_model, cp = best_cp)

# Re-evaluate the pruned tree on the same held-out rows
pred_pruned <- predict(pruned_model, newdata = test)
pruned_rmse <- sqrt(mean((test$mpg - pred_pruned)^2))
print(pruned_rmse)

This is a useful habit even for small models. A slightly smaller tree is often easier to explain and generalizes better.
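A common alternative to picking the minimum xerror is the one-standard-error rule: choose the smallest tree whose cross-validated error is within one standard error of the best. This is a different selection rule from the one used above, sketched here under the assumption that the fitted tree has at least one split so the xerror and xstd columns exist. The variable names are illustrative.

```r
library(rpart)

set.seed(7)
index <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
train <- mtcars[index, ]

reg_model <- rpart(mpg ~ wt + hp + cyl + disp, data = train, method = "anova",
                   control = rpart.control(cp = 0.01, minsplit = 4))

cp_table <- reg_model$cptable

# Row with the lowest cross-validated error, plus one standard error of slack
best_row <- which.min(cp_table[, "xerror"])
threshold <- cp_table[best_row, "xerror"] + cp_table[best_row, "xstd"]

# Rows are ordered from smallest to largest tree, so take the first that qualifies
one_se_row <- min(which(cp_table[, "xerror"] <= threshold))
one_se_cp <- cp_table[one_se_row, "CP"]

pruned_1se <- prune(reg_model, cp = one_se_cp)

# Compare tree sizes before and after pruning
leaves_full <- sum(reg_model$frame$var == "<leaf>")
leaves_1se <- sum(pruned_1se$frame$var == "<leaf>")
print(c(full = leaves_full, one_se = leaves_1se))
```

With very small datasets the cross-validated errors are noisy, so both rules can change their answer from one seed to the next.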

Choose Predictors Carefully

rpart will happily fit a model with poor features, but tree quality still depends on the data you give it. Before training:

remove obvious leakage variables
handle missing values deliberately
make sure categorical values are consistent between train and test data
evaluate with held-out data, not training accuracy alone

A tree that looks interpretable can still be wrong for operational reasons if the feature pipeline is inconsistent.
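Two of these checks are easy to script in base R. The sketch below uses a made-up toy dataset (the color and size columns are hypothetical) to count missing values and to detect categories that appear in the test data but never in training.

```r
# Hypothetical toy data; column names and values are illustrative only
train <- data.frame(
  color = factor(c("red", "blue", "red", "green")),
  size = c(1.2, NA, 3.1, 2.4)
)
test <- data.frame(
  color = c("blue", "purple", "red"),
  size = c(2.0, 1.5, NA),
  stringsAsFactors = FALSE
)

# Count missing values per column before fitting
missing_train <- colSums(is.na(train))
print(missing_train)

# Categories in the test data that training never saw
unseen <- setdiff(unique(test$color), levels(train$color))
print(unseen)

# Re-encode the test column with the training levels; unseen values become NA
test$color <- factor(test$color, levels = levels(train$color))
missing_test <- colSums(is.na(test))
print(missing_test)
```

By default rpart keeps rows with missing predictor values and routes them with surrogate splits, so a missing value does not stop a fit. It is still worth knowing how many rows rely on that fallback.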

When a Tree Is a Good Baseline

A single decision tree is often strong enough when:

the dataset is modest in size
interpretability matters more than maximum accuracy
you need a quick baseline before random forest or boosting

If performance is not good enough, move to ensemble methods after you understand what the single tree is doing.

Common Pitfalls

Using training accuracy as the main metric. Trees can overfit quickly, so always test on unseen data.

Skipping pruning. The default fitted tree is not always the best final tree.

Using the wrong type in predict. For a classification tree, predict returns class probabilities unless you pass type = "class", while a regression tree returns numeric values directly.
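A quick way to check what predict gives back is to look at the shape of each result. The sketch below fits two small trees on the full built-in datasets purely for illustration, with no train/test split, and inspects the return types.

```r
library(rpart)

class_tree <- rpart(Species ~ ., data = iris, method = "class")
reg_tree <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")

# No type given on a classification tree: a matrix of class probabilities
default_out <- predict(class_tree, newdata = iris)

# type = "class": a factor of predicted labels
label_out <- predict(class_tree, newdata = iris, type = "class")

# Regression tree: a plain numeric vector
numeric_out <- predict(reg_tree, newdata = mtcars)

str(default_out)
str(label_out)
str(numeric_out)
```

Comparing the probability matrix directly against the true labels is a common source of silent bugs, so check the shape before computing any metric.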

Ignoring class imbalance. A model can appear accurate while still failing on the minority class.

Feeding inconsistent factor levels into prediction data. Train and test categorical values must line up.

Summary

rpart builds interpretable decision trees for classification and regression in R.
Use method = "class" for labels and method = "anova" for numeric targets.
Evaluate on held-out data, not just the training set.
Inspect the complexity parameter table and prune when needed.
Treat a single tree as a useful baseline before moving to more complex models.
