Setup
Hands-on Activity
Our goal is to build a model to predict Potability.
1. Split the data into 80% training and 20% testing, stratified by the outcome variable.
2. Prepare a 10-fold cross-validation within the training set, similarly stratified.
3. Create a recipe that predicts the outcome variable from all other variables.
   - Add a step to drop all predictors with near-zero variance.
   - Add a step to drop all highly correlated predictors.
   - Add a step to drop all linear combination predictors.
   - Add a step to transform all predictors using the Yeo-Johnson approach.
4. Set up a model using svm_linear().
   - Tune the cost parameter.
   - Set the mode to classification.
   - Set the engine to "kernlab".
5. Combine the model and recipe into a workflow.
6. Prepare the hyperparameters for tuning.
7. Tune the hyperparameters using a grid search of size 20.
   - If you are on a weak computer, maybe reduce to size 10.
   - (For reference, my desktop finished size=20 in around 2 minutes.)
8. Finalize the workflow using the parameters with the best AUC ROC.
9. Calculate and examine the final model’s testing set metrics. Did it do okay?
10. Repeat this process but change the model to svm_rbf(). Was this better?
   - You may want to give these new objects new names (to avoid overwriting the older ones).
Answer key
For the linear SVM (Part 9), an ROC AUC of about 0.47 is not very good at all; it is no better than chance. The accuracy is nearly 60%, but then again, around 60% of the water sources were unsafe (so if you just guessed that they were all unsafe, your accuracy would also be about 60%).
For the RBF SVM (Part 10), an ROC AUC of about 0.72 looks a lot better. The RBF kernel seemed to help a lot.
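The baseline argument above can be checked with a quick calculation. The following is a minimal sketch using a made-up outcome vector with a 60/40 class balance (the vector and its level names are assumptions for illustration, not the course data):

```r
# Hypothetical outcome vector with the 60/40 class balance described above;
# this is not the course data.
outcome <- factor(c(rep("unsafe", 60), rep("safe", 40)))

# Accuracy obtained by always predicting the most common class
baseline_accuracy <- max(table(outcome)) / length(outcome)
baseline_accuracy
#> [1] 0.6
```

Any model whose accuracy is close to this value has learned little beyond the class balance, which is why ROC AUC is the more informative metric here.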
Part 1
Part 3
Part 4
Part 7
Part 8
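The code for Parts 1 through 8 is not shown here. The following is a minimal sketch of what it might look like, mirroring the structure of the Part 10 code. Only pot_split, pot_folds, pot_recipe, and pot_wflow_final appear in the answer key itself; the other object names (pot_train, pot_test, pot_model, pot_wflow, pot_param, pot_tune, pot_param_final) are assumptions chosen by analogy with Part 10. The water data frame is a simulated stand-in so that the block runs on its own, and it should be replaced with the actual course dataset. The kernlab package is assumed to be installed.

```r
library(tidymodels)

set.seed(2024)

# Hypothetical stand-in data (NOT the course dataset): three numeric
# predictors and a two-level outcome that depends weakly on ph.
n <- 400
water <- tibble(
  ph = rnorm(n, mean = 7, sd = 1.5),
  Hardness = rnorm(n, mean = 195, sd = 30),
  Sulfate = rnorm(n, mean = 330, sd = 40)
) %>%
  mutate(Potability = factor(ifelse(ph + rnorm(n) > 7.4, "safe", "unsafe")))

# Parts 1-2: stratified 80/20 split and stratified 10-fold cross-validation
pot_split <- initial_split(water, prop = 0.8, strata = Potability)
pot_train <- training(pot_split)
pot_test <- testing(pot_split)
pot_folds <- vfold_cv(pot_train, v = 10, strata = Potability)

# Part 3: recipe with the four preprocessing steps
pot_recipe <-
  recipe(Potability ~ ., data = pot_train) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_predictors()) %>%
  step_lincomb(all_predictors()) %>%
  step_YeoJohnson(all_predictors())

# Part 4: linear SVM with a tunable cost parameter
pot_model <-
  svm_linear(cost = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

# Part 5: combine recipe and model
pot_wflow <-
  workflow() %>%
  add_recipe(pot_recipe) %>%
  add_model(pot_model)

# Part 6: parameter set for tuning (same pattern as Part 10; cost has no
# data-dependent range, so finalize() is expected to leave it unchanged)
pot_param <-
  pot_model %>%
  extract_parameter_set_dials() %>%
  finalize(pot_folds)

# Part 7: grid search of size 20 (reduce to 10 on a slower machine)
pot_tune <-
  pot_wflow %>%
  tune_grid(
    resamples = pot_folds,
    grid = 20,
    param_info = pot_param
  )

# Part 8: keep the cost value with the best cross-validated ROC AUC
pot_param_final <- select_best(pot_tune, metric = "roc_auc")
pot_wflow_final <- finalize_workflow(pot_wflow, pot_param_final)
```

Because the data here are simulated, the metrics this sketch produces will not match the values reported in Parts 9 and 10.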
Part 9
pot_final <- last_fit(pot_wflow_final, pot_split)
collect_metrics(pot_final)
#> # A tibble: 3 × 4
#>   .metric     .estimator .estimate .config
#>   <chr>       <chr>          <dbl> <chr>
#> 1 accuracy    binary         0.596 Preprocessor1_Model1
#> 2 roc_auc     binary         0.474 Preprocessor1_Model1
#> 3 brier_class binary         0.241 Preprocessor1_Model1
Part 10
# Same model settings as the linear SVM, but with a radial basis function kernel
rbf_model <-
  svm_rbf(cost = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

# Reuse the recipe from Part 3
rbf_wflow <-
  workflow() %>%
  add_recipe(pot_recipe) %>%
  add_model(rbf_model)

rbf_param <-
  rbf_model %>%
  extract_parameter_set_dials() %>%
  finalize(pot_folds)

# Grid search of size 20 over the cross-validation folds
rbf_tune <-
  rbf_wflow %>%
  tune_grid(
    resamples = pot_folds,
    grid = 20,
    param_info = rbf_param
  )
rbf_param_final <- select_best(rbf_tune, metric = "roc_auc")
rbf_wflow_final <- finalize_workflow(rbf_wflow, rbf_param_final)
rbf_final <- last_fit(rbf_wflow_final, pot_split)
collect_metrics(rbf_final)
#> # A tibble: 2 × 4
#>   .metric  .estimator .estimate .config
#>   <chr>    <chr>          <dbl> <chr>
#> 1 accuracy binary         0.667 Preprocessor1_Model1
#> 2 roc_auc  binary         0.716 Preprocessor1_Model1