## Learn how to work with the caret (Classification and Regression Training) package using R

Caret is a pretty powerful machine learning library in R. With flexibility as its main feature, `caret` enables you to train different types of algorithms using a simple `train` function. This layer of abstraction provides a common interface to train models in R: you just tweak a single argument, the `method`.

`caret` (short for Classification and Regression Training) is one of the most popular machine learning libraries in R. On top of that flexibility, we get embedded hyperparameter tuning and cross-validation, two techniques that improve our algorithm's generalization power.

In this guide, we are going to explore the package in four different dimensions:

- We'll start by learning **how to train different models** by changing the `method` argument. With each algorithm we train, we'll be exposed to new concepts such as hyperparameter tuning, cross-validation, factors, and other meaningful details.
- Then, we'll learn how to set up our own **custom cross-validation function** and tweak our algorithms for **different optimization metrics**.
- Next, we'll experiment with our own **hyperparameter tuning**.
- Finally, we'll wrap everything up by checking how `predict` works with different `caret` models.

For simplicity, and because we want to focus on the library itself, we'll use two of the most famous toy datasets available in R:

- The *iris* dataset, a very well known dataset that will represent our classification task.
- The *mtcars* dataset, which will be used as our regression task.

I'll make a small tweak to turn the *iris* problem into a binary one (unfortunately, `glm`, the logistic regression implementation in R, doesn't support multiclass problems):

```r
iris$target <- ifelse(
  iris$Species == 'setosa',
  1,
  0
)
```

As we'll want to evaluate the `predict` method, let's split our two data frames into train and test sets first. I'll use a personal favorite, `caTools`:

```r
library(caTools)

# Train-test split on both iris and mtcars
train_test_split <- function(df) {
  set.seed(42)
  sample = sample.split(df, SplitRatio = 0.8)
  train = subset(df, sample == TRUE)
  test = subset(df, sample == FALSE)
  return(list(train, test))
}

# Unwrapping mtcars
mtcars_train <- train_test_split(mtcars)[[1]]
mtcars_test <- train_test_split(mtcars)[[2]]

# Unwrapping iris
iris_train <- train_test_split(iris)[[1]]
iris_test <- train_test_split(iris)[[2]]
```
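For reference, `caret` itself also ships a splitting helper, `createDataPartition`, which samples in a stratified way on the outcome. Here's a minimal sketch of an equivalent split using it (the `idx` and `*_strat` names are my own):

```r
library(caret)

set.seed(42)
# Stratified 80/20 split: sampling happens within each Species level
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
iris_train_strat <- iris[idx, ]
iris_test_strat  <- iris[-idx, ]
```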

All set, let’s start!

Our first `caret` example consists of fitting a simple linear regression. Linear regression is one of the most well known algorithms. In base R, you can fit one using the `lm` function. In `caret`, you can do the same using:

```r
lm_model <- train(mpg ~ hp + wt + gear + disp,
                  data = mtcars_train,
                  method = "lm")
```

`train` is the central function of the `caret` library. It fits the algorithm to the data using the specified `method`. In this case, we are fitting a regression line on the `mtcars` data frame, trying to predict miles per gallon using the car's horsepower, weight, number of gears, and engine displacement.

Normally the method is deeply tied to the standalone function you would use to train the model; for example, `method = 'lm'` produces something very similar to the `lm` function. The cool thing is that we can treat the `caret`-generated `lm_model` like any other linear model in R, so calling `summary` is possible:

```r
summary(lm_model)
```

```
Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
-3.6056 -1.8792 -0.4769  1.0658  5.6892

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.67555    6.18241   5.123 7.11e-05 ***
hp          -0.04911    0.01883  -2.609  0.01777 *
wt          -3.89388    1.28164  -3.038  0.00707 **
gear         1.49329    1.29311   1.155  0.26327
disp         0.01265    0.01482   0.854  0.40438
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.915 on 18 degrees of freedom
Multiple R-squared:  0.8246, Adjusted R-squared:  0.7856
F-statistic: 21.15 on 4 and 18 DF,  p-value: 1.326e-06
```

If you fit a linear model using `lm(mpg ~ hp + wt + gear + disp, data = mtcars_train)`, you'll obtain exactly the same coefficients. So what, exactly, is the advantage of using `caret`?

One of the most important advantages is what we'll see in the next section: changing models is extremely easy!

To change between models in `caret`, we just have to change the `method` inside our `train` function. Let's fit a logistic regression on our iris data frame:

```r
glm_model <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                   data = iris_train,
                   method = "glm",
                   family = "binomial")
```

Using `method = 'glm'` opens up the possibility of training a logistic regression. Notice that we can also pass other arguments of the underlying method into the `train` function: `family = 'binomial'` is the parameter we use to let R know that we want to train a logistic regression. Being able to extend parameters so easily inside the `train` function, depending on the `method` we are using, is another great feature of the `caret` library.

We can, of course, see the logistic regression result with `summary`:

```r
summary(glm_model)
```

```
Call:
NULL

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.487e-05  -2.110e-08  -2.110e-08   2.110e-08   1.723e-05

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)    45.329 575717.650       0        1
Sepal.Length   -5.846 177036.957       0        1
Sepal.Width    11.847  88665.772       0        1
Petal.Length  -16.524 126903.905       0        1
Petal.Width    -7.199 185972.824       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.2821e+02  on 99  degrees of freedom
Residual deviance: 1.7750e-09  on 95  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25
```

Don't worry too much about the results right now; the important part is that you understand how easy it was to switch between models using the `train` function.

A jump from linear to logistic regression doesn't seem so impressive, though. Are we restricted to linear models? Nope! Let's look at some tree-based models next.

We can also fit tree-based models using `train`, just by switching the `method` as we did when we moved from linear to logistic regression:

```r
d.tree <- train(mpg ~ hp + wt + gear + disp,
                data = mtcars_train,
                method = "rpart")
```

The cool thing? This is an `rpart` object, so we can even plot it using `rpart.plot`:

```r
library(rpart.plot)

rpart.plot(d.tree$finalModel)
```

To plot the decision tree, we just need to access the `finalModel` element of `d.tree`, which mimics its `rpart` counterpart.
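A related convenience: `caret` exposes a uniform `varImp` function that works on the fitted `train` object itself, so you can look at variable importance the same way across methods without learning each underlying package's own interface:

```r
# Variable importance through caret's common interface
varImp(d.tree)
```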

A significant difference between using `caret` and using the `rpart` function standalone is that the latter will not perform any hyperparameter tuning. Caret's `rpart`, on the other hand, already performs some mini "hyperparameter tuning" over the complexity parameter (`cp`), as we can see when we print `d.tree`:

```r
d.tree
```

```
CART

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE
  0.0000000  4.805918  0.4932426  3.944907
  0.3169698  4.805918  0.4932426  3.944907
  0.6339397  4.953153  0.5061275  4.105045

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.3169698.
```

In our case, the lowest RMSE (root mean squared error) was tied between `cp = 0` and `cp = 0.32`, and `caret` chose one of those values as the final one for our decision tree. We can also perform our own custom hyperparameter tuning inside the `train` function, something we'll see a bit later in this post!

To fit a random forest, there are several `method`s we can use. Personally, I enjoy the `ranger` implementation, which we select by providing it in the `method` argument of the `train` function:

```r
r.forest <- train(mpg ~ hp + wt + gear + disp,
                  data = mtcars_train,
                  method = "ranger")
```

Let's see our `r.forest` object:

```r
r.forest
```

```
Random Forest

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  mtry  splitrule   RMSE      Rsquared   MAE
  2     variance    2.672097  0.8288142  2.160861
  2     extratrees  2.844143  0.8054631  2.285504
  3     variance    2.700335  0.8212250  2.189822
  3     extratrees  2.855688  0.8024398  2.295482
  4     variance    2.724485  0.8144731  2.213220
  4     extratrees  2.892709  0.7918827  2.337236

Tuning parameter 'min.node.size' was held constant at a value of 5
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 2, splitrule = variance
and min.node.size = 5.
```

Just like in our decision tree example, `caret` performs some simple hyperparameter tuning on its own before choosing the final model. In this case, `train` tunes two arguments, `mtry` and `splitrule`, and chooses `mtry = 2` and `splitrule = variance` for the final model.

Notice that the hyperparameters the `train` function tunes are deeply tied to the ones available in the `method`!
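You can check which hyperparameters a given `method` exposes without leaving R, using caret's `modelLookup`:

```r
# For ranger, this lists mtry, splitrule and min.node.size
modelLookup("ranger")
```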

Keep in mind that this is the first `train` example where we've used a `method` that is not native to R: when you run the code above, R will prompt you to install the `ranger` library if you don't have it. That's another cool `caret` detail; we can use a lot of models that are not available in base R.

The bottom line is that the `train` function is a high-level API that manages everything regarding model training for us, through a common interface.

One of the most famous types of models used in Kaggle competitions is the gradient boosting model. These tree-based models are famous for their stability and performance.

*XGBoost* is one implementation of these boosting models, relying on the previous models' errors to improve performance. As with random forests, we can access different boosting implementations in `caret`. Here's a small example using the `xgbTree` method:

```r
xg.boost <- train(mpg ~ hp + wt + gear + disp,
                  data = mtcars_train,
                  method = "xgbTree")
```

The `xgbTree` implementation performs some extensive tuning (it may take a while to run). Let's look at the output:

```r
xg.boost
```

```
eXtreme Gradient Boosting

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  eta  max_depth  colsample_bytree  subsample  nrounds  RMSE
  0.3  1          0.6               0.50        50      2.946174
  0.3  1          0.6               0.50       100      2.944830
  0.3  1          0.6               0.50       150      2.962090
  0.3  1          0.6               0.75        50      3.112695
  0.3  1          0.6               0.75       100      3.099010
  0.3  1          0.6               0.75       150      3.110219
  0.3  1          0.6               1.00        50      3.077528
  0.3  1          0.6               1.00       100      3.097813
  0.3  1          0.6               1.00       150      3.109069
  0.3  1          0.8               0.50        50      3.097415
  0.3  1          0.8               0.50       100      3.097322
  0.3  1          0.8               0.50       150      3.098146
  0.3  1          0.8               0.75        50      3.078441
  0.3  1          0.8               0.75       100      3.120153
  0.3  1          0.8               0.75       150      3.124199
  0.3  1          0.8               1.00        50      3.174089
  0.3  1          0.8               1.00       100      3.194984
  ...

Tuning parameter 'gamma' was held constant at a value of 0
Tuning parameter 'min_child_weight' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were nrounds = 50, max_depth = 3,
eta = 0.3, gamma = 0, colsample_bytree = 0.6, min_child_weight = 1
and subsample = 1.
```

I'm only showing part of the full hyperparameter search that `train` performed. Notice how, with a single tweak, we trained three different types of tree-based models without any issue. No complicated arguments, no hard interfaces between the data and the training, nothing! `caret` took care of everything for us.
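A small tip if you don't want to read the whole tuning table: every fitted `train` object stores the winning combination in `bestTune`:

```r
# A one-row data.frame with the selected hyperparameter values
xg.boost$bestTune
```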

Let’s see two more models before checking cross-validation, hyperparameter tuning and metrics!

We also have k-nearest neighbors available as a `method`:

```r
knn <- train(mpg ~ hp + wt + gear + disp,
             data = mtcars_train,
             method = "knn")
```

For `knn`, `caret` tries different values of `k` to select the final model:

```r
knn
```

```
k-Nearest Neighbors

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  k  RMSE      Rsquared   MAE
  5  3.541489  0.7338904  2.906301
  7  3.668886  0.7033751  2.909202
  9  3.868597  0.6580107  2.965640

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was k = 5.
```

`caret` also enables us to train our own neural networks. Although you'd probably be better off using other packages (h2o, keras, etc.), you can also train one in `caret` using the `nnet`, `neuralnet`, or `mxnet` methods. Here's an example using `neuralnet`:

```r
neural.network <- train(mpg ~ hp + wt + gear + disp,
                        data = mtcars_train,
                        method = "neuralnet")
```

And here's the result of our network:

```
Neural Network

23 samples
 4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 23, 23, 23, 23, 23, 23, ...
Resampling results across tuning parameters:

  layer1  RMSE      Rsquared   MAE
  1       5.916693  0.5695443  4.854666
  3       5.953915  0.2311309  4.904835
  5       5.700600  0.4514841  4.666083

Tuning parameter 'layer2' was held constant at a value of 0
Tuning parameter 'layer3' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were layer1 = 5, layer2 = 0 and
layer3 = 0.
```

One main issue with `neuralnet` is that it is not suited for classification tasks. For a `method` that is, we can turn to `nnet`, another neural network implementation:

```r
neural.network.class <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                              data = iris_train,
                              method = "nnet")
```

If we check `neural.network.class`, something odd shows up:

```r
neural.network.class
```

```
Neural Network

100 samples
  4 predictor

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:

  size  decay  RMSE         Rsquared   MAE
  1     0e+00  0.153228973  0.9370545  0.106767943
  1     1e-04  0.074206333  0.9759492  0.052561227
  1     1e-01  0.116518767  0.9977164  0.112402210
  3     0e+00  0.229154294  0.9616348  0.124433888
  3     1e-04  0.027172779  0.9887259  0.021662284
  3     1e-01  0.080126891  0.9956402  0.074390107
  5     0e+00  0.098585595  0.9999507  0.060258715
  5     1e-04  0.004105796  0.9999710  0.003348211
  5     1e-01  0.073134836  0.9944261  0.065199393

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were size = 5 and decay = 1e-04.
```

Our neural network is still treating this as a regression problem (judging by the metrics it is trying to optimize). Why?

Because `nnet` needs the target to be a factor to understand that we are dealing with a classification task. This is one of the *gotchas* of using `caret`: you still need to understand part of each specific `method` to avoid bugs in your code. Let's turn our `target` into a factor column:

```r
iris_train$target <- factor(iris_train$target)
```
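A quick sanity check that the conversion worked:

```r
# Should print "0" "1": target is now a two-level factor
levels(iris_train$target)
```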

And now let's fit our neural network again:

```r
neural.network.class.2 <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                                data = iris_train,
                                method = "nnet")
```

And viewing our `neural.network.class.2` object:

```
Neural Network

100 samples
  4 predictor
  2 classes: '0', '1'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 100, 100, 100, 100, 100, 100, ...
Resampling results across tuning parameters:

  size  decay  Accuracy   Kappa
  1     0e+00  0.9591083  0.8776744
  1     1e-04  0.9507270  0.8400000
  1     1e-01  1.0000000  1.0000000
  3     0e+00  0.9845919  0.9474580
  3     1e-04  1.0000000  1.0000000
  3     1e-01  1.0000000  1.0000000
  5     0e+00  0.9988889  0.9969231
  5     1e-04  1.0000000  1.0000000
  5     1e-01  1.0000000  1.0000000

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were size = 1 and decay = 0.1.
```

Cool! Our object now shows `Accuracy`, a metric that is only relevant for classification tasks. And which parameters is `caret` tuning here? `size` and `decay`!

We've covered a good range of `method`s here. You may be asking: did we leave any out? Yes, we did; at the time of writing, there are around 231 methods available! Feel free to experiment with some of them.
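If you want the full list from within R rather than from the documentation, `getModelInfo` has you covered:

```r
library(caret)

# Every method string that train() currently accepts
names(getModelInfo())
```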

In `caret`, you can also pass a custom cross-validation setup to the `train` function. For instance, let's use k-fold cross-validation on a decision tree:

```r
ctrl <- trainControl(method = "cv", number = 10)

d.tree.kfold <- train(mpg ~ hp + wt + gear + disp,
                      data = mtcars_train,
                      trControl = ctrl,
                      method = "rpart")
```

So what's new? Two things:

- We define a `ctrl` object with the `trainControl` function, setting the `method` of the experiment to `cv` (cross-validation) with `number` equal to 10; `number` is the same as k.
- We pass our k-fold cross-validation to the `trControl` argument inside `train`. This lets `caret` know that we want to apply this cross-validation method during the training process.
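`trainControl` supports other resampling schemes through those same two arguments. As a sketch, repeated k-fold cross-validation only changes the `method` and adds `repeats` (the `repeated_ctrl` and `d.tree.repeated` names below are my own):

```r
# 10-fold CV repeated 3 times: metrics are averaged over 30 resamples
repeated_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

d.tree.repeated <- train(mpg ~ hp + wt + gear + disp,
                         data = mtcars_train,
                         trControl = repeated_ctrl,
                         method = "rpart")
```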

Let’s check our decision tree:

```r
d.tree.kfold
```

```
CART

23 samples
 4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21, 20, 21, 21, 21, 21, ...
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE
  0.0000000  5.717836  0.7373485  4.854541
  0.3169698  5.717836  0.7373485  4.854541
  0.6339397  6.071459  0.6060227  5.453586

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.3169698.
```

Notice that the results now reference `Resampling: Cross-Validated (10 fold)`, confirming that we are using a different cross-validation method.
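If you want the fold-by-fold numbers behind that aggregate, the fitted object keeps them in `resample`:

```r
# One row per fold: RMSE, Rsquared and MAE on each held-out fold
d.tree.kfold$resample
```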

We can also select our algorithm's final result according to different metrics. For example, take a closer look at the final two lines in the output of our decision tree:

```
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.3169698.
```

Our algorithm was optimizing for root mean squared error. What if we want to optimize for another metric, such as R-squared? We can do that with the `metric` argument:

```r
d.tree.kfold.rsquared <- train(mpg ~ hp + wt + gear + disp,
                               data = mtcars_train,
                               trControl = ctrl,
                               metric = "Rsquared",
                               method = "rpart")
```

If we check the output, we have the following:

```r
d.tree.kfold.rsquared
```

```
CART

23 samples
 4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21, 21, 21, 21, 21, 20, ...
Resampling results across tuning parameters:

  cp         RMSE      Rsquared   MAE
  0.0000000  4.943094  0.8303833  4.342684
  0.3169698  4.943094  0.8303833  4.342684
  0.6339397  6.022911  0.7031709  5.472432

Rsquared was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.3169698.
```

In this example, it makes no difference, since all three metrics (RMSE, Rsquared, and MAE) point to the same `cp`. With more hyperparameters in play, that is not guaranteed, and the chosen models may differ.

Nevertheless, it's really cool that we can select different model parameters just by tweaking the `metric` argument! Check all the metrics you can use here.
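For two-class problems there is also `twoClassSummary`, which lets you optimize for ROC instead. It requires class probabilities and factor levels that are valid R names, so our 0/1 target needs relabeling first. A sketch under those assumptions (the `iris_roc`, `roc_ctrl`, and `glm_roc` names, and the "other"/"setosa" labels, are my own):

```r
# Relabel the factor levels: 0 -> "other", 1 -> "setosa" (valid R names)
iris_roc <- iris_train
levels(iris_roc$target) <- c("other", "setosa")

# classProbs = TRUE is required so the ROC can be computed
roc_ctrl <- trainControl(method = "cv",
                         number = 10,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)

glm_roc <- train(target ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                 data = iris_roc,
                 method = "glm",
                 family = "binomial",
                 trControl = roc_ctrl,
                 metric = "ROC")
```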

Although `caret` does some default hyperparameter tuning by itself, it is normally a bit too simplistic for most machine learning tasks.

For example, what if we want to try other `cp` values? We can do that using the `tuneGrid` argument of our `train` function:

```r
d.tree.hyperparam <- train(mpg ~ hp + wt + gear + disp,
                           data = mtcars_train,
                           trControl = ctrl,
                           metric = "Rsquared",
                           method = "rpart",
                           tuneGrid = data.frame(
                             cp = c(0.0001, 0.001, 0.35, 0.65)))
```

By passing a `data.frame` into `tuneGrid`, I can optimize over custom `cp` values for my decision tree. One of the shortcomings of `caret` is that other parameters (such as `maxdepth`, `minsplit`, and others) are not tunable via the `train` function for the `rpart` method. You can view the parameters you can tune for each `method` on the available models page of caret's official documentation.
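The same `modelLookup` trick from earlier confirms this from the console:

```r
# Only cp is exposed for method = "rpart"; the related "rpart2"
# method tunes maxdepth instead
modelLookup("rpart")
modelLookup("rpart2")
```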

Here's an example of tuning several parameters with the `ranger` method:

```r
r.forest.hyperparam <- train(mpg ~ hp + wt + gear + disp,
                             data = mtcars_train,
                             trControl = ctrl,
                             metric = "Rsquared",
                             method = "ranger",
                             tuneGrid = data.frame(
                               mtry = c(2, 3, 4),
                               min.node.size = c(2, 4, 10),
                               splitrule = c('variance')))
```

This will produce the following output:

```
Random Forest

23 samples
 4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 21, 20, 21, 21, 21, 20, ...
Resampling results across tuning parameters:

  mtry  min.node.size  RMSE      Rsquared   MAE
  2      2             2.389460  0.9377971  2.192705
  3      4             2.325312  0.9313336  2.112829
  4     10             2.723303  0.9147017  2.425553

Tuning parameter 'splitrule' was held constant at a value of variance
Rsquared was used to select the optimal model using the largest value.
The final values used for the model were mtry = 2, splitrule = variance
and min.node.size = 2.
```

As we keep `splitrule` constant at `variance`, our random forest is optimized over `mtry` and `min.node.size`. Notice how we've extended our hyperparameter tuning to more variables simply by giving extra columns to the `data.frame` we feed into `tuneGrid`.
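One detail worth flagging: a `data.frame` pairs its columns row by row, which is why only three combinations show up above. To search the full cross-product of values, `expand.grid` is the usual tool (a small sketch; `full_grid` is my own name):

```r
# 3 x 3 x 1 = 9 combinations instead of 3
full_grid <- expand.grid(
  mtry = c(2, 3, 4),
  min.node.size = c(2, 4, 10),
  splitrule = 'variance'
)
```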

Finally, how do we predict on new data? Just like with any other model in R! Using `predict` is as simple as it gets with `caret` models, whatever the underlying method:

```r
# Linear Regression
predict(lm_model, mtcars_test)

# Decision Tree
predict(d.tree, mtcars_test)

# Random Forest
predict(r.forest, mtcars_test)

# XGBoost
predict(xg.boost, mtcars_test)
```
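The same goes for classifiers. A sketch with our `nnet` model (note that `type = "prob"` works for methods that can produce class probabilities, and `confusionMatrix` expects both arguments to be factors with the same levels):

```r
# Class labels by default
class_preds <- predict(neural.network.class.2, iris_test)

# Class probabilities, for methods that support them
predict(neural.network.class.2, iris_test, type = "prob")

# Accuracy, kappa, sensitivity, etc. in one call
confusionMatrix(class_preds, factor(iris_test$target))
```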

How cool is that? We can even put different models' predictions into a scatter plot to see how they score the same examples. Comparing the random forest with our XGBoost, after gathering both sets of predictions into a `predictions` data frame:

```r
library(ggplot2)

# Gather both models' test-set predictions into one data frame
predictions <- data.frame(
  rf = predict(r.forest, mtcars_test),
  xgboost = predict(xg.boost, mtcars_test)
)

ggplot(
  data = predictions,
  aes(x = rf, y = xgboost)
) + geom_point()
```

I hope you've enjoyed this post and that it helps you master this great machine learning library. `caret` has so many features that it is a go-to library for building interesting data science models using R Programming.

If you would like to drop by my R courses, feel free to join here (R Programming for Absolute Beginners) or here (Data Science Bootcamp). My courses teach both R Programming and Data Science, and I would love to have you around!

Here's a small gist with the code we've used in this post: