A Guide to Using Caret in R (2023)

Learn how to work with the caret (Classification and Regression Training) package using R


Caret is a pretty powerful machine learning library in R. With flexibility as its main feature, caret enables you to train different types of algorithms using a simple train function. This layer of abstraction provides a common interface to train models in R, just by tweaking an argument — the method.

caret (short for Classification and Regression Training) is one of the most popular machine learning libraries in R. On top of that flexibility, we get embedded hyperparameter tuning and cross-validation — two techniques that will improve our algorithm's generalization power.

In this guide, we are going to explore the package in four different dimensions:

  • We’ll start by learning how to train different models by changing the method argument. With each algorithm we train, we’ll be exposed to new concepts such as hyperparameter tuning, cross-validation, factors and other meaningful details.
  • Then, we’ll learn how to set up our own custom cross-validation function, followed by tweaking our algorithms for different optimization metrics.
  • Experimenting with our own hyperparameter tuning.
  • We’ll wrap everything up by checking how the predict method works with different caret models.

For simplicity, and because we want to focus on the library itself, we’ll use two of the most famous toy datasets available in R:

  • The iris dataset, a very well known dataset that will represent our classification task.
  • The mtcars dataset that will be used as our regression task.

I’ll do a small tweak to make the iris problem a binary one (unfortunately glm, the logistic regression implementation in R, doesn’t support multiclass problems):

iris$target <- ifelse(
iris$Species == 'setosa',
1,
0
)
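
A quick sanity check on the new column (this check is my own addition, not part of the original post) confirms the class balance we expect from iris: 50 setosa rows against 100 others:

# 1 = setosa (50 rows), 0 = everything else (100 rows)
table(iris$target)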

As we’ll want to evaluate the predict method, let’s split our two dataframes into train and test first — I’ll use a personal favorite, caTools:

library(caTools)

# Train Test Split on both Iris and Mtcars
train_test_split <- function(df) {
  set.seed(42)
  sample = sample.split(df, SplitRatio = 0.8)
  train = subset(df, sample == TRUE)
  test = subset(df, sample == FALSE)
  return (list(train, test))
}
# Unwrapping mtcars
mtcars_train <- train_test_split(mtcars)[[1]]
mtcars_test <- train_test_split(mtcars)[[2]]
# Unwrapping iris
iris_train <- train_test_split(iris)[[1]]
iris_test <- train_test_split(iris)[[2]]

All set, let’s start!

Our first caret example consists of fitting a simple linear regression. Linear regression is one of the most well-known algorithms. In base R, you can fit one using the lm function. In caret, you can do the same using:

lm_model <- train(mpg ~ hp + wt + gear + disp, 
                  data = mtcars_train,
                  method = "lm")

train is the central function of the caret library. It fits the algorithm to the data using the specified method. In this case, we are fitting a regression line on the mtcars data frame, trying to predict miles-per-gallon using the car's horsepower, weight, number of gears and engine displacement.

Normally the method is deeply tied to the standalone function you would use to train the model. For example, method='lm' will produce something similar to the lm function. Cool thing: we can treat the caret-generated lm_model like any other linear model in R — calling summary is possible:

summary(lm_model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6056 -1.8792 -0.4769  1.0658  5.6892 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.67555    6.18241   5.123 7.11e-05 ***
hp          -0.04911    0.01883  -2.609  0.01777 *  
wt          -3.89388    1.28164  -3.038  0.00707 ** 
gear         1.49329    1.29311   1.155  0.26327    
disp         0.01265    0.01482   0.854  0.40438    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.915 on 18 degrees of freedom
Multiple R-squared:  0.8246, Adjusted R-squared:  0.7856
F-statistic: 21.15 on 4 and 18 DF,  p-value: 1.326e-06

If you fit a linear model using lm(mpg ~ hp + wt + gear + disp, data = mtcars_train), you’ll obtain exactly the same coefficients.
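
If you want to verify that yourself, here's a minimal check (base_lm is just a name I'm introducing for the comparison; it isn't part of the original post):

# Fit the same model with base R's lm() and compare the coefficients
base_lm <- lm(mpg ~ hp + wt + gear + disp, data = mtcars_train)
coef(base_lm)
coef(lm_model$finalModel)  # same estimates as the base R fit

So... what, exactly, is the advantage of using caret?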

One of the most important is what we’ll see in the next section — changing models is sooo easy!

To change between models in caret, we just have to change the method inside our train function — let’s fit a logistic regression on our iris data frame:

glm_model <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
                   data = iris_train,
                   method = "glm",
                   family = "binomial")

Using method='glm' we open the possibility of training a logistic regression. Notice that I can also access other arguments of the method by passing them into the train function — family='binomial' is the parameter we use to let R know that we want to train a logistic regression. Being able to extend parameters this easily inside the train function, depending on the method we are using, is another great feature of the caret library.

We can, of course, see the logistic regression result with summary:

summary(glm_model)

Call:
NULL

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.487e-05  -2.110e-08  -2.110e-08   2.110e-08   1.723e-05

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    45.329 575717.650       0        1
Sepal.Length   -5.846 177036.957       0        1
Sepal.Width    11.847  88665.772       0        1
Petal.Length  -16.524 126903.905       0        1
Petal.Width    -7.199 185972.824       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.2821e+02  on 99  degrees of freedom
Residual deviance: 1.7750e-09  on 95  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25

Don’t worry too much about the results right now — the important part is that you understand how easy it was to switch between models using the train function.

A jump from linear to logistic regression doesn’t seem so impressive... are we restricted to linear models? Nope! Let’s look at some tree-based models next.

We can also fit tree-based models with train just by switching the method, exactly as we did when we switched from linear to logistic regression:

d.tree <- train(mpg ~ hp + wt + gear + disp, 
                data = mtcars_train,
                method = "rpart")

The cool thing? This is an rpart object — we can even plot it using rpart.plot:

library(rpart.plot)
rpart.plot(d.tree$finalModel)
[Figure: the fitted decision tree, drawn with rpart.plot]

To plot the decision tree, we just need to access the finalModel object of d.tree, which mimics its rpart counterpart.

A significant difference between caret and using the rpart function standalone is that the latter will not perform any hyperparameter tuning. caret's rpart, on the other hand, already performs a mini hyperparameter search over the complexity parameter (cp), as we can see when we print d.tree:

d.tree

CART

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE
  0.0000000  4.805918  0.4932426  3.944907
  0.3169698  4.805918  0.4932426  3.944907
  0.6339397  4.953153  0.5061275  4.105045

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.3169698.

In our case, the lowest RMSE (Root Mean Squared Error) was tied between cp = 0 and cp = 0.32, and caret chose cp = 0.3169698 as the final value for our decision tree. We can also perform our own custom hyperparameter tuning inside the train function — something we will see a bit later in this post!

To fit a random forest, there are several methods we can use — personally, I enjoy the ranger implementation, which we select by passing it as the method argument of the train function:

r.forest <- train(mpg ~ hp + wt + gear + disp, 
                  data = mtcars_train,
                  method = "ranger")

Let’s see our r.forest object:

r.forest

Random Forest

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  mtry  splitrule   RMSE      Rsquared   MAE
  2     variance    2.672097  0.8288142  2.160861
  2     extratrees  2.844143  0.8054631  2.285504
  3     variance    2.700335  0.8212250  2.189822
  3     extratrees  2.855688  0.8024398  2.295482
  4     variance    2.724485  0.8144731  2.213220
  4     extratrees  2.892709  0.7918827  2.337236

Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 2, splitrule = variance
and min.node.size = 5.

Just like in our decision tree example, caret is performing some simple hyperparameter tuning of its own before choosing the final model. In this case, train runs a two-parameter search over mtry and splitrule, and it chose mtry = 2 and splitrule = variance for the final model.

Notice that the hyperparameters the train function tunes are deeply tied to the ones available for the chosen method!
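
If you're curious which hyperparameters a given method exposes to train, caret's modelLookup function lists them; a quick sketch:

library(caret)
# List the tunable parameters for a couple of methods
modelLookup("rpart")   # cp
modelLookup("ranger")  # mtry, splitrule, min.node.size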

Keep in mind that this is the first train example where we’ve used a method that is not native to R — when you try to run the code above, R will prompt you to install the ranger library if you don’t have it. That’s another cool caret detail — we can use a lot of models that are not available in base R.

Bottom line: the train function is a high-level API that manages everything about model training for us, through a common interface.

One of the most famous types of models used in Kaggle competitions is gradient boosting. These tree-based models are famous for their stability and performance.

XGBoost is one implementation of these boosting models, which rely on previous models' errors to improve their performance. As with the random forest, we can access different boosting implementations in caret — here’s a small example using the xgbTree method:

xg.boost <- train(mpg ~ hp + wt + gear + disp, 
                  data = mtcars_train,
                  method = "xgbTree")

The xgbTree implementation performs some extensive tuning (it may take a while to run) — if we look at the output:

xg.boost

eXtreme Gradient Boosting

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  eta  max_depth  colsample_bytree  subsample  nrounds  RMSE
  0.3  1          0.6               0.50       50       2.946174
  0.3  1          0.6               0.50       100      2.944830
  0.3  1          0.6               0.50       150      2.962090
  0.3  1          0.6               0.75       50       3.112695
  0.3  1          0.6               0.75       100      3.099010
  0.3  1          0.6               0.75       150      3.110219
  0.3  1          0.6               1.00       50       3.077528
  0.3  1          0.6               1.00       100      3.097813
  0.3  1          0.6               1.00       150      3.109069
  0.3  1          0.8               0.50       50       3.097415
  0.3  1          0.8               0.50       100      3.097322
  0.3  1          0.8               0.50       150      3.098146
  0.3  1          0.8               0.75       50       3.078441
  0.3  1          0.8               0.75       100      3.120153
  0.3  1          0.8               0.75       150      3.124199
  0.3  1          0.8               1.00       50       3.174089
  0.3  1          0.8               1.00       100      3.194984
  ...

Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'min_child_weight' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 50, max_depth = 3,
eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
and subsample = 1.

I’m just showcasing a couple of examples of the full hyperparameter tuning that train performed. Notice how we could, just with a single tweak, train three different types of tree-based models without any issue. No complicated arguments, no hard interfaces between the data and the training, nothing! caret took care of everything for us.

Let’s see two more models before checking cross-validation, hyperparameter tuning and metrics!

We also have k-nearest neighbors available as a method:

knn <- train(mpg ~ hp + wt + gear + disp, 
             data = mtcars_train,
             method = "knn")

For knn, caret tries different values of k to select a final model:

knn

k-Nearest Neighbors

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE
  5  3.541489  0.7338904  2.906301
  7  3.668886  0.7033751  2.909202
  9  3.868597  0.6580107  2.965640

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
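
By default, train only evaluates a handful of candidate k values. If you'd rather have caret generate a longer grid without writing one yourself, the tuneLength argument does that. A small sketch (knn_wider is a name I'm introducing; it isn't used elsewhere in this post):

# Evaluate 10 candidate values of k instead of the default 3
knn_wider <- train(mpg ~ hp + wt + gear + disp,
                   data = mtcars_train,
                   method = "knn",
                   tuneLength = 10)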

caret also enables us to train our own neural networks — although you’d probably be better off using other packages (h2o, keras, etc.), you can also train one in caret using the nnet, neuralnet or mxnet methods. Here’s an example using neuralnet:

neural.network <- train(mpg ~ hp + wt + gear + disp, 
                        data = mtcars_train,
                        method = "neuralnet")

And seeing the result of our network:

Neural Network

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  layer1  RMSE      Rsquared   MAE
  1       5.916693  0.5695443  4.854666
  3       5.953915  0.2311309  4.904835
  5       5.700600  0.4514841  4.666083

Tuning parameter 'layer2' was held constant at a value of 0
Tuning parameter 'layer3' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were layer1 = 5, layer2 = 0 and
layer3 = 0.

One main issue with neuralnet is that it is not suited for classification tasks — for a method that is, we can use nnet, another implementation:

neural.network.class <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
                              data = iris_train,
                              method = "nnet")

If we check neural.network.class, something odd shows up:

neural.network.class

Neural Network

100 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:

  size  decay  RMSE         Rsquared   MAE
  1     0e+00  0.153228973  0.9370545  0.106767943
  1     1e-04  0.074206333  0.9759492  0.052561227
  1     1e-01  0.116518767  0.9977164  0.112402210
  3     0e+00  0.229154294  0.9616348  0.124433888
  3     1e-04  0.027172779  0.9887259  0.021662284
  3     1e-01  0.080126891  0.9956402  0.074390107
  5     0e+00  0.098585595  0.9999507  0.060258715
  5     1e-04  0.004105796  0.9999710  0.003348211
  5     1e-01  0.073134836  0.9944261  0.065199393

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 5 and decay = 1e-04.

Our neural network is still treating this as a regression problem (according to the metrics it is trying to optimize) — why?

Because nnet needs the target to be a factor to understand that we are doing a classification task — this is one of the gotchas of caret: you still need to understand some of the specifics of each method to avoid bugs in your code. Let’s turn our target into a factor column:

iris_train$target = factor(iris_train$target)

And now fit our neural network again:

neural.network.class.2 <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
                                data = iris_train,
                                method = "nnet")

And viewing our neural.network.class.2 object:

Neural Network

100 samples
  4 predictor
  2 classes: '0', '1'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:

  size  decay  Accuracy   Kappa
  1     0e+00  0.9591083  0.8776744
  1     1e-04  0.9507270  0.8400000
  1     1e-01  1.0000000  1.0000000
  3     0e+00  0.9845919  0.9474580
  3     1e-04  1.0000000  1.0000000
  3     1e-01  1.0000000  1.0000000
  5     0e+00  0.9988889  0.9969231
  5     1e-04  1.0000000  1.0000000
  5     1e-01  1.0000000  1.0000000

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 1 and decay = 0.1.

Cool! Our object now shows Accuracy, a metric that is only relevant for classification tasks. And which parameters is caret tuning here? size and decay!

We got good coverage of different methods here — you may be asking, did we leave some out? Yes, we did — at the time I’m writing this, about 231 of them! Feel free to experiment with a few more!
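
If you're curious about the full list, caret can enumerate every method that train knows about via getModelInfo; a quick sketch:

# How many methods does train() currently support, and what are some of them?
all_methods <- names(getModelInfo())
length(all_methods)
head(all_methods)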

In caret you can also give a custom cross-validation method to the train function. For instance, let’s use k-fold cross-validation on a decision tree in the example below:

ctrl <- trainControl(method = "cv", number = 10)

d.tree.kfold <- train(mpg ~ hp + wt + gear + disp, 
                      data = mtcars_train,
                      trControl = ctrl,
                      method = "rpart")

So... what’s new? Two things:

  • We are defining a ctrl object with the trainControl function. In it, we set the method of our experiment to cv (cross-validation) with number equal to 10 — number is the same as k (other resampling schemes are sketched right after this list).
  • We are passing our k-fold cross-validation setup to the trControl argument inside train. This lets caret know that we want to apply this cross-validation method inside our training process.
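
trainControl supports other resampling schemes besides plain k-fold. For example, a repeated cross-validation setup looks like this (a sketch; ctrl_repeated is my own name and it isn't used elsewhere in this post):

# 10-fold cross-validation, repeated 3 times
ctrl_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3)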

Let’s check our decision tree:

d.tree.kfold

CART

23 samples
 4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21, 20, 21, 21, 21, 21, ...
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE
  0.0000000  5.717836  0.7373485  4.854541
  0.3169698  5.717836  0.7373485  4.854541
  0.6339397  6.071459  0.6060227  5.453586

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.3169698.

Notice that the output now reports Resampling: Cross-Validated (10 fold), confirming that we are using a different cross-validation method.

We can also select our algorithm's final model according to different metrics. For example, in the output of our decision tree, take a closer look at the final two lines:

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.3169698.

Our algorithm was optimizing for the root mean squared error. What if we want to optimize for another metric, such as R-squared? We can do that with the metric argument:

d.tree.kfold.rsquared <- train(mpg ~ hp + wt + gear + disp, 
                               data = mtcars_train,
                               trControl = ctrl,
                               metric = "Rsquared",
                               method = "rpart")

If we check the output, we have the following:

d.tree.kfold.rsquared

CART

23 samples
 4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21, 21, 21, 21, 21, 20, ...
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE
  0.0000000  4.943094  0.8303833  4.342684
  0.3169698  4.943094  0.8303833  4.342684
  0.6339397  6.022911  0.7031709  5.472432

Rsquared was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.3169698.

In this example it does not make any difference, as all three metrics (RMSE, Rsquared and MAE) point to the same cp. With more hyperparameters in play that is not guaranteed, and the chosen models may differ.

Nevertheless, it’s really cool that we can choose different model parameters just by tweaking the metric argument! Check all the metrics you can use here.

Although caret will do some default hyperparameter tuning by itself, it will normally be a bit too simplistic for most machine learning tasks.

For example, what if I want to check other cp values? We can do that using the tuneGrid argument of our train function:

d.tree.hyperparam <- train(mpg ~ hp + wt + gear + disp, 
                           data = mtcars_train,
                           trControl = ctrl,
                           metric = "Rsquared",
                           method = "rpart",
                           tuneGrid = data.frame(
                             cp = c(0.0001, 0.001, 0.35, 0.65)))

By passing a data.frame to tuneGrid, I can optimize over custom cp values for my decision tree. One of the shortcomings of caret is that other parameters (such as maxdepth, minsplit and others) are not tunable via the train function for the rpart method. You can view the parameters you can tune for each method on the available models page of caret’s official documentation.
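
One common workaround (my own sketch, not from the post or the official docs): train forwards extra arguments to the underlying fitting function, so you can still fix those rpart settings through rpart.control, even though they won't be part of the tuning grid. d.tree.depth is a hypothetical name:

library(rpart)
# cp is still tuned via tuneGrid; maxdepth and minsplit are held fixed via rpart.control
d.tree.depth <- train(mpg ~ hp + wt + gear + disp,
                      data = mtcars_train,
                      method = "rpart",
                      tuneGrid = data.frame(cp = c(0.001, 0.01, 0.35)),
                      control = rpart.control(maxdepth = 3, minsplit = 5))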

An example of tuning several parameters using the ranger method:

r.forest.hyperparam <- train(mpg ~ hp + wt + gear + disp, 
                             data = mtcars_train,
                             trControl = ctrl,
                             metric = "Rsquared",
                             method = "ranger",
                             tuneGrid = data.frame(
                               mtry = c(2, 3, 4),
                               min.node.size = c(2, 4, 10),
                               splitrule = c('variance')))

This will produce the following output:

Random Forest

23 samples
 4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21, 20, 21, 21, 21, 20, ...
Resampling results across tuning parameters:

  mtry  min.node.size  RMSE      Rsquared   MAE
  2     2              2.389460  0.9377971  2.192705
  3     4              2.325312  0.9313336  2.112829
  4     10             2.723303  0.9147017  2.425553

Tuning parameter 'splitrule' was held constant at a value of variance
Rsquared was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = variance
and min.node.size = 2.

As we are holding splitrule constant at variance, our random forest is optimized over mtry and min.node.size. Notice how we’ve extended our hyperparameter tuning to more variables simply by adding extra columns to the data.frame we feed into tuneGrid.
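
A note on that design choice: a data.frame like the one above is evaluated row by row, so only those three specific combinations are tried. If you want every combination of the listed values instead, expand.grid is the usual way to build the grid (a sketch of my own, not used in the post):

# Every combination of the values below: 3 x 3 x 1 = 9 candidate models
grid <- expand.grid(mtry = c(2, 3, 4),
                    min.node.size = c(2, 4, 10),
                    splitrule = 'variance')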

Finally, how do we predict on new data? Just like with any other model! Using predict is as simple as it gets with caret models — here it is with our different regression models:

# Linear Regression
predict(lm_model, mtcars_test)
# Decision Tree
predict(d.tree, mtcars_test)
# Random Forest
predict(r.forest, mtcars_test)
# XGBoost
predict(xg.boost, mtcars_test)
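
The same works for our classification models, and, assuming the method supports class probabilities (nnet does), type = "prob" returns those instead of the predicted class:

# Neural network classifier: predicted classes vs. class probabilities
predict(neural.network.class.2, iris_test)
predict(neural.network.class.2, iris_test, type = "prob")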

How cool is that? We can even put the different predictions in a scatter plot to see how our models score the same examples — comparing the random forest with our XGBoost:

library(ggplot2)
# Combine both models' test-set predictions into a single data frame
predictions <- data.frame(rf = predict(r.forest, mtcars_test),
                          xgboost = predict(xg.boost, mtcars_test))
ggplot(data = predictions, aes(x = rf, y = xgboost)) + 
  geom_point()
[Figure: scatter plot of random forest predictions (x-axis) vs. XGBoost predictions (y-axis) on the test set]

I hope you’ve enjoyed this post and that it helps you master this great machine learning library. caret has many features that make it a go-to library for building data science models in R.

If you would like to drop by my R courses, feel free to join here (R Programming for Absolute Beginners) or here (Data Science Bootcamp). My courses teach both R Programming and Data Science and I would love to have you around!


Here’s a small gist with the code we’ve used in this post:
