Optimizing model parameters faster with tidymodels

A couple small changes can greatly speed up the hyperparameter tuning process with tidymodels.

Published

August 4, 2023

Especially for large amounts of data or large grids of parameters, the time and resources needed to tune statistical models can quickly become a barrier to getting machine learning models deployed. Many examples online demonstrating how to tune hyperparameters with tidymodels use tune_grid() on one CPU core. Making use of parallel processing and using a near-drop-in replacement for tune_grid() can speed up hyperparameter tuning by 20-30x!

Parallel processing

The first optimization is parallel processing. At the moment, I’m writing this blog post on a computer with 10 cores. Without any additional configuration, R will only make use of one of those cores to carry out computations. Parallel processing tools let me distribute the computations happening under the hood in tune_grid() across all of the cores available to me.
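
If you’re not sure how many cores you have to work with, a quick check with base R’s parallel package will tell you (the parallelly package’s availableCores() is a more conservative alternative, if you have it installed):

# how many CPU cores does this machine report?
parallel::detectCores()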

tidymodels supports a variety of parallel processing backends. I’ll register one using the doMC package:

library(doMC)
registerDoMC(cores = 10)
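
doMC relies on forking, which isn’t available on Windows. If that’s your situation, a near-equivalent setup with the parallel and doParallel packages looks something like this (a sketch; adjust the number of workers for your machine):

library(doParallel)

# create a PSOCK cluster (works on Windows too) and register it as the
# foreach backend that tune_grid() will pick up on
cl <- parallel::makePSOCKcluster(10)
registerDoParallel(cl)

# when you're done tuning, release the workers:
# parallel::stopCluster(cl)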

Running the same tuning code as before, tune_grid() will now pick up on the newly registered backend and distribute its computations across the available cores:

set.seed(4)

bt_time_par <- system.time(
  bt_res_par <- tune_grid(bt_wflow, ames_folds, grid = 50)
)

In total, these results took 136 seconds to generate, about 7.4 times faster than the single-core approach! There’s a bit of overhead associated with sending computations out to different cores and compiling the results back together, and some of the computations needed to tune hyperparameters can’t be parallelized, so the speedup from parallel processing is somewhat less than the increase in the number of cores.
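
The parallelized results are used exactly like single-core ones, by the way. A quick sketch, assuming RMSE is among the metrics computed for this regression workflow:

# summarize performance across resamples for each candidate configuration
collect_metrics(bt_res_par)

# pull out the parameter combination with the lowest RMSE
select_best(bt_res_par, metric = "rmse")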

Already, a 7.4x speedup is a game-changer for many machine learning workflows! This optimization assumes you have idle computing resources available, though, which may not be the case for you; the next one doesn’t require any extra computing power.

Racing

In these previous examples, we’ve evaluated 50 model configurations against 20 resamples, resulting in 1000 model fits. We can use an approach called racing to cut down on that number of model fits.

Racing via ANOVA models, implemented in the tune_race_anova() function in finetune, works by first fitting each model configuration to only a few of the resamples and then, with each additional resample, removing the model configurations that are unlikely to outperform the current best.

library(finetune)

Visualization helpers in the finetune package are great for developing intuition about this process, so let’s tune using racing and then use the output to learn more about how it works. First, we’ll drop the number of cores back down to one so that we can isolate the effect of racing alone:

registerDoMC(cores = 1)

Now, tuning using racing:

set.seed(5)

bt_time_race <- system.time(
  bt_res_race <- tune_race_anova(bt_wflow, ames_folds, grid = 50)
)

The resulting object bt_res_race looks quite similar to the typical grid search output:

bt_res_race
## # Tuning results
## # 20-fold cross-validation 
## # A tibble: 20 × 5
##    splits             id     .order .metrics           .notes          
##    <list>             <chr>   <int> <list>             <list>          
##  1 <split [2087/110]> Fold01      3 <tibble [100 × 8]> <tibble [0 × 3]>
##  2 <split [2087/110]> Fold05      1 <tibble [100 × 8]> <tibble [0 × 3]>
##  3 <split [2087/110]> Fold10      2 <tibble [100 × 8]> <tibble [0 × 3]>
##  4 <split [2087/110]> Fold16      4 <tibble [8 × 8]>   <tibble [0 × 3]>
##  5 <split [2087/110]> Fold15      5 <tibble [6 × 8]>   <tibble [0 × 3]>
##  6 <split [2087/110]> Fold11      6 <tibble [4 × 8]>   <tibble [0 × 3]>
##  7 <split [2087/110]> Fold04      7 <tibble [4 × 8]>   <tibble [0 × 3]>
##  8 <split [2087/110]> Fold02     14 <tibble [2 × 8]>   <tibble [0 × 3]>
##  9 <split [2087/110]> Fold03     19 <tibble [2 × 8]>   <tibble [0 × 3]>
## 10 <split [2087/110]> Fold06     15 <tibble [2 × 8]>   <tibble [0 × 3]>
## 11 <split [2087/110]> Fold07     11 <tibble [2 × 8]>   <tibble [0 × 3]>
## 12 <split [2087/110]> Fold08     16 <tibble [2 × 8]>   <tibble [0 × 3]>
## 13 <split [2087/110]> Fold09     20 <tibble [2 × 8]>   <tibble [0 × 3]>
## 14 <split [2087/110]> Fold12     10 <tibble [2 × 8]>   <tibble [0 × 3]>
## 15 <split [2087/110]> Fold13      8 <tibble [2 × 8]>   <tibble [0 × 3]>
## 16 <split [2087/110]> Fold14     13 <tibble [2 × 8]>   <tibble [0 × 3]>
## 17 <split [2087/110]> Fold17      9 <tibble [2 × 8]>   <tibble [0 × 3]>
## 18 <split [2088/109]> Fold18     18 <tibble [2 × 8]>   <tibble [0 × 3]>
## 19 <split [2088/109]> Fold19     12 <tibble [2 × 8]>   <tibble [0 × 3]>
## 20 <split [2088/109]> Fold20     17 <tibble [2 × 8]>   <tibble [0 × 3]>
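
Those shrinking .metrics tibbles tell the story in miniature. A quick sketch with dplyr and tidyr counts how many model configurations were actually evaluated against each resample:

library(dplyr)
library(tidyr)

bt_res_race %>%
  select(id, .order, .metrics) %>%
  unnest(.metrics) %>%
  distinct(id, .order, .config) %>%
  count(id, .order, name = "n_configs") %>%
  arrange(.order)

From the output above, the first three resamples (by .order) evaluate all 50 configurations; by the eighth, only a single configuration is left in the race.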

The plot_race() helper visualizes the racing process:

plot_race(bt_res_race) + labs(x = "Resample #")

A ggplot2 line plot, with resample number on the x axis and root mean squared error on the y axis. For resamples 1 through 3, 50 lines show the distributions of error for each model configuration. After the 3rd resample, a vast majority of the lines disappear, and only the lines associated with the smallest error remain. Moving from left to right along the x axis, lines gradually disappear, until only one line is left, which stretches horizontally all the way to resample 20.

In this plot, we see 50 differently colored lines, each corresponding to a model configuration. For a given resample, each model configuration has an associated RMSE, plotted on the y axis.

In racing, the resamples are shuffled into a random order and then evaluated one at a time in that order, so the x axis here roughly corresponds to elapsed time.

Moving from left to right, we see that every model configuration is evaluated against the first three resamples. After those first three resamples (the “burn-in” period), the racing procedure pauses and compares the distributions of RMSEs across model configurations. Using a repeated measures ANOVA model, it tests which configurations’ RMSE distributions are statistically significantly different from (that is, worse than) the current “best” distribution. The configurations that can’t be distinguished from the current best, along with the current best itself, are resampled again; the rest are set aside and not evaluated against any more resamples.

Intuitively, this makes sense: in the plot above, the upper-most lines are clearly not centered at the same error values as the lowest ones. Each point on that plot is a model fit, which takes time to generate. If we’re confident that a model configuration isn’t as performant as the current best one, why keep fitting it to more and more resamples? We can instead reroute those computing resources toward evaluating the models that are genuine contenders with the current best.

So, after the burn-in period, we determine that many of the model configurations are no good and evaluate only the remaining ones against the fourth resample. This adds a point to each surviving configuration’s distribution of errors, and those distributions are again passed to the repeated measures ANOVA model, which might find a few more configurations to remove. And so on. Ultimately, we end up fully resampling (i.e. evaluating against all 20 resamples) only a small fraction of the model configurations we started with.

The analogous plot for a regular grid search, like the one that generated bt_res_grid, would just show 50 lines stretching all the way across the x axis, since every model configuration is fully resampled:

A ggplot2 line plot like the one above, except that all 50 lines stretch all the way across the x axis. The lines vary slightly as they move from left to right, and tend not to cross each other.

A little silly.

Not having to carry out all of those additional model fits saves us a bunch of time! In total, these results took 189 seconds to generate, about 5.3 times faster than the single-core, grid search approach.

Do we see any loss in performance, though? No free lunch, Simon! Well, kind of. With some probability (see the racing control settings), the repeated measures ANOVA model will mistakenly boot a model configuration whose performance was actually comparable to the “best” one. Note, though, that evaluating every model configuration against every resample doesn’t guarantee that the configuration with the best observed metric really is the best one, either. Variation abounds! So, I don’t lose sleep over it.
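
Those settings live in control_race(), passed to tune_race_anova() via its control argument. A sketch of the most relevant knobs (burn_in, alpha, and randomize are shown at what I believe are their defaults; verbose_elim is turned on here so that eliminations are reported as they happen):

race_ctrl <- control_race(
  burn_in = 3,         # resamples evaluated before any eliminations happen
  alpha = 0.05,        # significance level used when removing configurations
  randomize = TRUE,    # evaluate the resamples in a random order
  verbose_elim = TRUE  # print a message for each round of eliminations
)

tune_race_anova(bt_wflow, ames_folds, grid = 50, control = race_ctrl)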

Again, on its own, this method is really powerful in cutting down on time-to-tune. In combination with parallel processing, we see marked speedups.

All together, now!

I’ll quickly re-register that parallel backend and then run the same racing code:

registerDoMC(cores = 10)

set.seed(6)

bt_time_race_par <- system.time(
  bt_res_race_par <- tune_race_anova(bt_wflow, ames_folds, grid = 50)
)

These results took 47 seconds to generate, 21.5 times faster than the single-core, conventional grid search approach. Scootin’! As the help-file for tune_race_anova() reads:

There is a compounding effect of racing and parallel processing but its magnitude depends on the type of model, number of resamples, number of tuning parameters, and so on.
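
For reference, here’s a quick sketch collecting those elapsed timings into one table; bt_time_grid is assumed to be the system.time() result from the single-core grid search earlier in the post, and the exact values will differ from machine to machine:

tibble::tibble(
  approach = c("grid", "grid, parallelized", "racing", "racing, parallelized"),
  seconds = c(
    bt_time_grid[["elapsed"]],  # hypothetical name for the earlier grid search timing
    bt_time_par[["elapsed"]],
    bt_time_race[["elapsed"]],
    bt_time_race_par[["elapsed"]]
  )
)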

This blog post is adapted from that help-file, which includes benchmarks from tuning a different model on a different data set, resulting in a similarly impressive speedup. Check that page out for another example!

Happy modeling. :)
