Setup

# Set all of this to get the EXACT SAME results on all platforms
set.seed(2022, "Mersenne-Twister", "Inversion", "Rejection")

library(tidyverse)
library(tidymodels)
library(kernlab) # install this if you don't have it
tidymodels_prefer()

water <- read_csv("https://tinyurl.com/mlr-water")

Hands-on Activity

Our goal is to build a model to predict Potability.

  1. Split the data into 80% training and 20% testing, stratified by the outcome variable.
  2. Prepare a 10-fold cross-validation within the training set, similarly stratified.
  3. Create a recipe that predicts the outcome variable from all other variables.
    • Add a step to drop all predictors with near-zero variance.
    • Add a step to drop all highly correlated predictors.
    • Add a step to drop all linear combination predictors.
    • Add a step to transform all predictors using the Yeo-Johnson approach.
  4. Set up a model using svm_linear()
    • Tune the cost parameter
    • Set the mode to classification
    • Set the engine to "kernlab"
  5. Combine the model and recipe into a workflow.
  6. Prepare the hyperparameters for tuning.
  7. Tune the hyperparameters using a grid search of size 20.
    • If you are on a slower computer, consider reducing this to size 10.
    • (For reference, my desktop finished size=20 in around 2 minutes.)
  8. Finalize the workflow using the parameters with the best ROC AUC.
  9. Calculate and examine the final model’s testing set metrics. Did it do okay?
  10. Repeat this process but change the model to svm_rbf(). Was this better?
    • You may want to give these new objects new names (to avoid overwriting the older ones).

Answer key

Part 1

pot_split <- initial_split(water, prop = 0.8, strata = Potability)
pot_train <- training(pot_split)
pot_test <- testing(pot_split)
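
If you want to double-check the split, a quick sketch like this (using the objects defined above) verifies the 80/20 proportion:

# Roughly 80% of rows should land in the training set
nrow(pot_train) / nrow(water)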

Part 2

pot_folds <- vfold_cv(pot_train, v = 10, strata = Potability)

Part 3

pot_recipe <-
  recipe(pot_train, formula = Potability ~ .) %>%
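  # Convert the outcome to a factor ("unsafe" vs. "safe"), since classification requires a factor outcome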
  step_mutate(Potability = factor(Potability, levels = c("unsafe", "safe"))) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_predictors()) %>%
  step_lincomb(all_predictors()) %>%
  step_YeoJohnson(all_predictors())
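
To see what the recipe actually does, an optional sketch like this preps it on the training data and glimpses the processed result (bake() with new_data = NULL returns the processed training set):

pot_recipe %>% prep() %>% bake(new_data = NULL) %>% glimpse()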

Part 4

svm_model <-
  svm_linear(cost = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
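
As an optional check, parsnip's translate() shows how this specification maps onto the underlying kernlab call, with a placeholder where the tuned cost value will go:

svm_model %>% translate()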

Part 5

pot_wflow <-
  workflow() %>%
  add_recipe(pot_recipe) %>%
  add_model(svm_model)

Part 6

pot_param <-
  svm_model %>%
  extract_parameter_set_dials() %>%
  finalize(pot_folds)
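
Note that finalize() has little to do here: unlike data-dependent parameters such as mtry(), cost already has a known default range. Printing the parameter set (or dials' cost() itself) shows the range the grid will draw from, which is on a log2 scale:

pot_param

cost()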

Part 7

pot_tune <-
  pot_wflow %>%
  tune_grid(
    resamples = pot_folds,
    grid = 20,
    param_info = pot_param
  )

autoplot(pot_tune)
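
Besides the plot, show_best() gives a numeric view of the top candidate values, sorted by the chosen metric:

show_best(pot_tune, metric = "roc_auc")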

Part 8

pot_param_final <- select_best(pot_tune, metric = "roc_auc")

pot_wflow_final <- finalize_workflow(pot_wflow, pot_param_final)

Part 9

pot_final <- last_fit(pot_wflow_final, pot_split)

collect_metrics(pot_final)
#> # A tibble: 3 × 4
#>   .metric     .estimator .estimate .config             
#>   <chr>       <chr>          <dbl> <chr>               
#> 1 accuracy    binary         0.596 Preprocessor1_Model1
#> 2 roc_auc     binary         0.474 Preprocessor1_Model1
#> 3 brier_class binary         0.241 Preprocessor1_Model1

An ROC AUC of 0.47 is not very good at all; it is actually a bit worse than chance (0.50)! The accuracy is nearly 60%, but then again, around 60% of the water sources were unsafe (so if you just guessed that they were all unsafe, your accuracy would be about 60%).
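
To verify that baseline claim, a quick count of the training set shows the class balance:

pot_train %>% count(Potability) %>% mutate(prop = n / sum(n))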

Part 10

rbf_model <-
  svm_rbf(cost = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

rbf_wflow <-
  workflow() %>%
  add_recipe(pot_recipe) %>%
  add_model(rbf_model)

rbf_param <-
  rbf_model %>%
  extract_parameter_set_dials() %>%
  finalize(pot_folds)

rbf_tune <-
  rbf_wflow %>%
  tune_grid(
    resamples = pot_folds,
    grid = 20,
    param_info = rbf_param
  )

rbf_param_final <- select_best(rbf_tune, metric = "roc_auc")

rbf_wflow_final <- finalize_workflow(rbf_wflow, rbf_param_final)

rbf_final <- last_fit(rbf_wflow_final, pot_split)

collect_metrics(rbf_final)
#> # A tibble: 2 × 4
#>   .metric  .estimator .estimate .config             
#>   <chr>    <chr>          <dbl> <chr>               
#> 1 accuracy binary         0.667 Preprocessor1_Model1
#> 2 roc_auc  binary         0.716 Preprocessor1_Model1

An ROC AUC of 0.72 looks a lot better. The RBF kernel seemed to help a lot.
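
If you want the two models' testing metrics side by side, a sketch like this stacks the collect_metrics() output with a label for each model:

bind_rows(
  linear = collect_metrics(pot_final),
  rbf = collect_metrics(rbf_final),
  .id = "model"
)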