Learn how to work with the caret (Classification and Regression Training) package using R
Caret is a pretty powerful machine learning library in R. With flexibility as its main feature, caret enables you to train different types of algorithms using a simple train function. This layer of abstraction provides a common interface to train models in R just by tweaking one argument: the method.
caret (short for Classification and Regression Training) is one of the most popular machine learning libraries in R. On top of that flexibility, we get embedded hyperparameter tuning and cross-validation, two techniques that will improve our algorithm’s generalization power.
In this guide, we are going to explore the package in four different dimensions:
- We’ll start by learning how to train different models by changing the method argument. With each algorithm we train, we’ll be exposed to new concepts such as hyperparameter tuning, cross-validation, factors and other meaningful details.
- Then, we’ll learn how to set up our own custom cross-validation function, followed by tweaking our algorithms for different optimization metrics.
- Experimenting with our own hyperparameter tuning.
- We’ll wrap everything up by checking how predict works with different caret models.
For simplicity, and because we want to focus on the library itself, we’ll use two of the most famous toy datasets available in R:
- The iris dataset, a very well known dataset that will represent our classification task.
- The mtcars dataset that will be used as our regression task.
I’ll do a small tweak to make the iris problem a binary one (unfortunately glm, the logistic regression implementation in R, doesn’t support multiclass problems):
iris$target <- ifelse(
iris$Species == 'setosa',
1,
0
)
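If you want to double-check the tweak, a quick frequency table shows the class balance (iris has 50 setosa rows, so we expect 50 ones versus 100 zeros):
# 1 = setosa, 0 = every other species; expect 50 vs. 100
table(iris$target)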
As we’ll want to evaluate the predict method, let’s split our two data frames into train and test first. I’ll use a personal favorite, caTools:
library(caTools)

# Train Test Split on both Iris and Mtcars
train_test_split <- function(df) {
  set.seed(42)
  sample = sample.split(df, SplitRatio = 0.8)
  train = subset(df, sample == TRUE)
  test = subset(df, sample == FALSE)
  return (list(train, test))
}

# Unwrapping mtcars
mtcars_train <- train_test_split(mtcars)[[1]]
mtcars_test <- train_test_split(mtcars)[[2]]

# Unwrapping iris
iris_train <- train_test_split(iris)[[1]]
iris_test <- train_test_split(iris)[[2]]
All set, let’s start!
Our first caret example consists of fitting a simple linear regression. Linear regression is one of the most well-known algorithms. In base R, you can fit one using the lm function. In caret, you can do the same using:
lm_model <- train(mpg ~ hp + wt + gear + disp,
data = mtcars_train,
method = "lm")
train is the central function of the caret library. It fits the algorithm to the data using the specified method. In this case, we are fitting a regression line on the mtcars data frame, trying to predict the miles-per-gallon using the car’s horsepower, weight, number of gears and engine displacement.
Normally the method is deeply tied to the standalone function you would use to train the model. For example, method='lm' will produce something similar to the lm function. The cool thing is that we can treat the caret-generated lm_model like any other linear model in R; calling summary on it works as usual:
summary(lm_model)

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
-3.6056 -1.8792 -0.4769  1.0658  5.6892

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.67555    6.18241   5.123 7.11e-05 ***
hp          -0.04911    0.01883  -2.609  0.01777 *
wt          -3.89388    1.28164  -3.038  0.00707 **
gear         1.49329    1.29311   1.155  0.26327
disp         0.01265    0.01482   0.854  0.40438
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.915 on 18 degrees of freedom
Multiple R-squared:  0.8246,  Adjusted R-squared:  0.7856
F-statistic: 21.15 on 4 and 18 DF,  p-value: 1.326e-06
If you fit a linear model using lm(mpg ~ hp + wt + gear + disp, data = mtcars_train), you’ll obtain exactly the same coefficients. So, what exactly is the advantage of using caret?
One of the most important advantages is what we’ll see in the next section: changing models is incredibly easy!
To change between models in caret, we just have to change the method inside our train function. Let’s fit a logistic regression on our iris data frame:
glm_model <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = iris_train,
method = "glm",
family = "binomial")
Using method='glm' we open the possibility of training a logistic regression. Notice that I can also pass other arguments of the underlying method into the train function: family='binomial' is the parameter that lets R know we want to train a logistic regression. Being able to extend the parameters of train so easily, depending on the method we are using, is another great feature of the caret library.
We can, of course, see the logistic regression result with summary:
summary(glm_model)

Call:
NULL

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.487e-05  -2.110e-08  -2.110e-08   2.110e-08   1.723e-05

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    45.329 575717.650       0        1
Sepal.Length   -5.846 177036.957       0        1
Sepal.Width    11.847  88665.772       0        1
Petal.Length  -16.524 126903.905       0        1
Petal.Width    -7.199 185972.824       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.2821e+02  on 99  degrees of freedom
Residual deviance: 1.7750e-09  on 95  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25
Don’t worry too much about the results right now. The important part is that you understand how easy it was to switch between models using the train function.
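As a side note, many caret helpers work the same way regardless of the method. For instance, varImp gives a quick variable importance ranking straight from a train object; a minimal sketch using the model we just fit:
# Variable importance from the fitted train object
# (for glm-style models, caret ranks predictors by the absolute value of their test statistic)
varImp(glm_model)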
A jump from linear to logistic regression doesn’t seem so impressive, though. Are we just restricted to linear models? Nope! Let’s see some tree-based models next.
We can also fit tree-based models with train just by switching the method, exactly as we did when we switched between linear and logistic regression:
d.tree <- train(mpg ~ hp + wt + gear + disp,
data = mtcars_train,
method = "rpart")
The cool thing? This is an rpart object, so we can even plot it using rpart.plot:
library(rpart.plot)
rpart.plot(d.tree$finalModel)
To plot the decision tree, we just need to access the finalModel object of d.tree, which mimics its rpart counterpart.
A significant difference between caret and using the rpart function as a standalone is that the latter will not perform any hyperparameter tuning. Caret’s rpart, on the other hand, already performs a small “hyperparameter tuning” on the complexity parameter (cp), as we can see when we print our d.tree:
d.tree

CART

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE
  0.0000000  4.805918  0.4932426  3.944907
  0.3169698  4.805918  0.4932426  3.944907
  0.6339397  4.953153  0.5061275  4.105045

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.3169698.
In our case, the lowest RMSE (Root Mean Squared Error) was tied between cp = 0 and cp = 0.32, and caret chose one of those values as the final one for our decision tree. We can also perform our own custom hyperparameter tuning inside the train function, something we will see a bit later in this post!
To fit a random forest, there are several methods we can use. Personally, I enjoy the ranger implementation, which we get by passing it as the method of the train function:
r.forest <- train(mpg ~ hp + wt + gear + disp,
data = mtcars_train,
method = "ranger")
Let’s see our r.forest object:
r.forest

Random Forest

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  mtry  splitrule   RMSE      Rsquared   MAE
  2     variance    2.672097  0.8288142  2.160861
  2     extratrees  2.844143  0.8054631  2.285504
  3     variance    2.700335  0.8212250  2.189822
  3     extratrees  2.855688  0.8024398  2.295482
  4     variance    2.724485  0.8144731  2.213220
  4     extratrees  2.892709  0.7918827  2.337236

Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 2, splitrule = variance
and min.node.size = 5.
Just like in our decision tree example, caret performs some simple hyperparameter tuning on its own before choosing the final model. In this case, train tunes two parameters, mtry and splitrule, and it chose mtry = 2 and splitrule = variance for the final model.
Notice that the hyperparameters the train function tunes are deeply tied to the ones available in the method!
Keep in mind that this is the first train example where we’ve used a method that is not native to R. When you try to run the code above, R will prompt you to install the ranger library if you don’t have it. That’s another cool caret detail: we can use a lot of models that are not available in base R.
Bottom line: the train function is a high-level API that manages everything regarding model training for us, through a common interface.
Gradient boosting models are among the most famous types of models used in Kaggle competitions. These tree-based models are famous for their stability and performance.
XGBoost is one implementation of these boosting models, which rely on the errors of previous models to improve their performance. As with random forests, we can access different boosting implementations in caret. Here’s a small example using the xgbTree method:
xg.boost <- train(mpg ~ hp + wt + gear + disp,
data = mtcars_train,
method = "xgbTree")
The xgbTree implementation performs some extensive tuning (it may take a while to run). If we look at the output:
xg.boost

eXtreme Gradient Boosting

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  eta  max_depth  colsample_bytree  subsample  nrounds  RMSE
  0.3  1          0.6               0.50        50      2.946174
  0.3  1          0.6               0.50       100      2.944830
  0.3  1          0.6               0.50       150      2.962090
  0.3  1          0.6               0.75        50      3.112695
  0.3  1          0.6               0.75       100      3.099010
  0.3  1          0.6               0.75       150      3.110219
  0.3  1          0.6               1.00        50      3.077528
  0.3  1          0.6               1.00       100      3.097813
  0.3  1          0.6               1.00       150      3.109069
  0.3  1          0.8               0.50        50      3.097415
  0.3  1          0.8               0.50       100      3.097322
  0.3  1          0.8               0.50       150      3.098146
  0.3  1          0.8               0.75        50      3.078441
  0.3  1          0.8               0.75       100      3.120153
  0.3  1          0.8               0.75       150      3.124199
  0.3  1          0.8               1.00        50      3.174089
  0.3  1          0.8               1.00       100      3.194984
  ...

Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'min_child_weight' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 50, max_depth = 3,
eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
and subsample = 1.
I’m only showing a fraction of the full hyperparameter tuning that train performed. Notice how, with a single tweak, we could train three different types of tree-based models without any issue. No complicated arguments, no hard interfaces between the data and the training, nothing! caret took care of everything for us.
Let’s see two more models before checking cross-validation, hyperparameter tuning and metrics!
We also have the k-nearest neighbors algorithm available as a method:
knn <- train(mpg ~ hp + wt + gear + disp,
data = mtcars_train,
method = "knn")
For knn, caret tries different values of k to select a final model:
knn

k-Nearest Neighbors

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE
  5  3.541489  0.7338904  2.906301
  7  3.668886  0.7033751  2.909202
  9  3.868597  0.6580107  2.965640

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
caret also enables us to train our own neural networks. Although you’d probably be better off using dedicated packages (h2o, keras, etc.), you can also train one in caret using the nnet, neuralnet or mxnet methods. Here’s an example using neuralnet:
neural.network <- train(mpg ~ hp + wt + gear + disp,
data = mtcars_train,
method = "neuralnet")
And looking at the result of our network:
Neural Network

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  layer1  RMSE      Rsquared   MAE
  1       5.916693  0.5695443  4.854666
  3       5.953915  0.2311309  4.904835
  5       5.700600  0.4514841  4.666083

Tuning parameter 'layer2' was held constant at a value of 0
Tuning parameter 'layer3' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were layer1 = 5, layer2 = 0 and
layer3 = 0.
One main issue with neuralnet is that it is not suited for classification tasks. For a method that handles classification, we can use nnet, another implementation:
neural.network.class <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = iris_train,
method = "nnet")
If we check neural.network.class, something odd shows up:
neural.network.class

Neural Network

100 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:

  size  decay  RMSE         Rsquared   MAE
  1     0e+00  0.153228973  0.9370545  0.106767943
  1     1e-04  0.074206333  0.9759492  0.052561227
  1     1e-01  0.116518767  0.9977164  0.112402210
  3     0e+00  0.229154294  0.9616348  0.124433888
  3     1e-04  0.027172779  0.9887259  0.021662284
  3     1e-01  0.080126891  0.9956402  0.074390107
  5     0e+00  0.098585595  0.9999507  0.060258715
  5     1e-04  0.004105796  0.9999710  0.003348211
  5     1e-01  0.073134836  0.9944261  0.065199393

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 5 and decay = 1e-04.
Our neural network is still treating this as a regression problem (according to the metrics it is trying to optimize) — why?
Because nnet needs the target to be a factor to understand that we are doing a classification task. This is one of the gotchas of caret: you still need to understand some of the specifics of each method to avoid bugs in your code. Let’s turn our target into a factor column:
iris_train$target = factor(iris_train$target)
And now fit our neural network again:
neural.network.class.2 <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
data = iris_train,
method = "nnet")
And viewing our object neural.network.class.2:
Neural Network

100 samples
  4 predictor
  2 classes: '0', '1'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:

  size  decay  Accuracy   Kappa
  1     0e+00  0.9591083  0.8776744
  1     1e-04  0.9507270  0.8400000
  1     1e-01  1.0000000  1.0000000
  3     0e+00  0.9845919  0.9474580
  3     1e-04  1.0000000  1.0000000
  3     1e-01  1.0000000  1.0000000
  5     0e+00  0.9988889  0.9969231
  5     1e-04  1.0000000  1.0000000
  5     1e-01  1.0000000  1.0000000

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 1 and decay = 0.1.
Cool! Our object is now showing Accuracy, a metric that is only relevant for classification tasks. And which parameters is caret modifying in the tuning process? size and decay!
We’ve covered a good range of methods here. You may be asking: did we leave some out? Yes, we did. At the time I’m writing this, caret supports around 231 methods. Feel free to experiment with some of them!
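If you want to see that list from inside R, caret exposes it through getModelInfo; a quick sketch:
# All method names caret knows about in your installed version
available_methods <- names(getModelInfo())
length(available_methods)   # roughly 230+, depending on the caret version
head(available_methods)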
In caret, you can also pass your own custom cross-validation setup to the train function. For instance, let’s use k-fold cross-validation on a decision tree in the example below:
ctrl <- trainControl(method = "cv", number = 10)

d.tree.kfold <- train(mpg ~ hp + wt + gear + disp,
                      data = mtcars_train,
                      trControl = ctrl,
                      method = "rpart")
So, what’s new? Two things:
- We are defining a ctrl object with the trainControl function. In it, we set the method of our experiment to cv (cross-validation) with number equal to 10; number is the same as k.
- We are passing our k-fold cross-validation setup to the trControl argument inside train. This lets caret know that we want to apply this cross-validation method inside our training process.
Let’s check our decision tree:
d.tree.kfold

CART

23 samples
 4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21, 20, 21, 21, 21, 21, ...
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE
  0.0000000  5.717836  0.7373485  4.854541
  0.3169698  5.717836  0.7373485  4.854541
  0.6339397  6.071459  0.6060227  5.453586

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.3169698.
Notice that the results are now evaluated with Resampling: Cross-Validated (10 fold), confirming that we are using a different cross-validation method.
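The same trControl mechanism covers other resampling schemes as well. As a hedged sketch (the object names below are just illustrative), repeated k-fold or leave-one-out cross-validation are one method string away:
# Repeated 10-fold cross-validation (3 repeats)
ctrl_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Leave-one-out cross-validation
ctrl_loocv <- trainControl(method = "LOOCV")

d.tree.repeatedcv <- train(mpg ~ hp + wt + gear + disp,
                           data = mtcars_train,
                           trControl = ctrl_repeated,
                           method = "rpart")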
We can also have caret select the final model according to different metrics. For example, take a closer look at the final lines of our decision tree’s output:
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.3169698.
Our algorithm was optimizing for the root mean squared error. What if we want to optimize for another metric, such as R-squared? We can do that with the metric argument:
d.tree.kfold.rsquared <- train(mpg ~ hp + wt + gear + disp,
data = mtcars_train,
trControl = ctrl,
metric = "Rsquared",
method = "rpart")
If we check the output, we have the following:
d.tree.kfold.rsquared

CART

23 samples
 4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21, 21, 21, 21, 21, 20, ...
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE
  0.0000000  4.943094  0.8303833  4.342684
  0.3169698  4.943094  0.8303833  4.342684
  0.6339397  6.022911  0.7031709  5.472432

Rsquared was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.3169698.
In this example, it doesn’t make any difference, as the three metrics (RMSE, Rsquared and MAE) all point to the same cp. With more hyperparameters, that is not guaranteed and the chosen models may differ.
Nevertheless, it’s really cool that we can select different model parameters just by tweaking the metric argument! Check all the metrics you can use here.
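For classification tasks you can go beyond Accuracy and optimize for ROC, sensitivity or specificity by combining classProbs with a summaryFunction in trainControl. Here’s a hedged sketch (the target_lbl column and the object names are just illustrative, twoClassSummary requires the pROC package, and class levels must be valid R names, so “0”/“1” won’t work directly):
# Relabel the target with valid R names so class probabilities can be computed
iris_train$target_lbl <- factor(ifelse(iris_train$target == "1", "setosa", "other"))

ctrl_class <- trainControl(method = "cv",
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

glm_roc <- train(target_lbl ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                 data = iris_train,
                 trControl = ctrl_class,
                 metric = "ROC",
                 method = "glm",
                 family = "binomial")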
Although caret does some default hyperparameter tuning by itself, it is normally a bit too simplistic for most machine learning tasks. For example, what if we want to try other cp values? We can do that with the tuneGrid argument of the train function:
d.tree.hyperparam <- train(mpg ~ hp + wt + gear + disp,
data = mtcars_train,
trControl = ctrl,
metric = "Rsquared",
method = "rpart",
tuneGrid = data.frame(
cp = c(0.0001, 0.001, 0.35, 0.65)))
By passing a data.frame to tuneGrid, I can try custom cp values for my decision tree. One of the shortcomings of caret is that other parameters (such as maxdepth, minsplit and others) are not tunable via the train function for the rpart method. You can view all the parameters you can tune for each method by visiting the available models page in caret’s official documentation.
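You can also query this directly from R with modelLookup, which lists the tunable parameters for any method:
# Which hyperparameters can train() tune for these methods?
modelLookup("rpart")    # only cp
modelLookup("ranger")   # mtry, splitrule and min.node.size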
Here’s an example of tuning several parameters using the ranger method:
r.forest.hyperparam <- train(mpg ~ hp + wt + gear + disp,
data = mtcars_train,
trControl = ctrl,
metric = "Rsquared",
method = "ranger",
tuneGrid = data.frame(
mtry = c(2, 3, 4),
min.node.size = c(2, 4, 10),
splitrule = c('variance')))
This will produce the following output:
Random Forest

23 samples
 4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21, 20, 21, 21, 21, 20, ...
Resampling results across tuning parameters:

  mtry  min.node.size  RMSE      Rsquared   MAE
  2      2             2.389460  0.9377971  2.192705
  3      4             2.325312  0.9313336  2.112829
  4     10             2.723303  0.9147017  2.425553

Tuning parameter 'splitrule' was held constant at a value of variance
Rsquared was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = variance
and min.node.size = 2.
As we are keeping splitrule constant at variance, our random forest is optimized over mtry and min.node.size. Notice how we’ve extended the hyperparameter tuning to more parameters simply by adding extra columns to the data.frame we feed into tuneGrid.
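If you’d rather not build a grid by hand, train also accepts a tuneLength argument that tells caret how many candidate values to evaluate per tuning parameter. A minimal sketch (r.forest.auto is just an illustrative name):
r.forest.auto <- train(mpg ~ hp + wt + gear + disp,
                       data = mtcars_train,
                       trControl = ctrl,
                       metric = "Rsquared",
                       method = "ranger",
                       tuneLength = 5)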
Finally, how do we predict on new data? Just like we do with other models! Using predict is as simple as it gets with caret models, and it works the same way across the different models we’ve trained:
# Linear Regression
predict(lm_model, mtcars_test)

# Decision Tree
predict(d.tree, mtcars_test)

# Random Forest
predict(r.forest, mtcars_test)

# XGBoost
predict(xg.boost, mtcars_test)
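To score those predictions against the hold-out sets, caret also ships a couple of helpers. A small sketch, assuming the models trained earlier are still in your session:
# Regression: RMSE, R-squared and MAE on the mtcars hold-out set
postResample(pred = predict(r.forest, mtcars_test), obs = mtcars_test$mpg)

# Classification: confusion matrix on the iris hold-out set
iris_test$target <- factor(iris_test$target)
confusionMatrix(data = predict(neural.network.class.2, iris_test),
                reference = iris_test$target)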
How cool is that? We can even put the predictions from two models into a scatter plot to understand how they score the same examples, comparing the random forest with our xgboost:
library(ggplot2)

# Collect both models' predictions on the test set
predictions <- data.frame(rf = predict(r.forest, mtcars_test),
                          xgboost = predict(xg.boost, mtcars_test))
ggplot(data = predictions, aes(x = rf, y = xgboost)) +
  geom_point()
I hope you’ve enjoyed this post and that it helps you master this great machine learning library. caret has a lot of features that make it a go-to library for building interesting data science models with R.
If you would like to drop by my R courses, feel free to join here (R Programming for Absolute Beginners) or here (Data Science Bootcamp). My courses teach both R Programming and Data Science and I would love to have you around!
Here’s a small gist with the code we’ve used in this post: