class: center, middle, inverse, title-slide

.title[
# Support Vector Machines
]
.subtitle[
## 🚧 🛣️ 🦿
]
.author[
### Machine Learning in R
SMaRT Workshops
]
.date[
### Day 4A Jeffrey Girard
]

---
class: inverse, center, middle

# Overview

---
class: onecol
## Roadmap

.left-column[
.pt3[
<img src="../figs/map.jpg" width="100%" />
]
]

.right-column[
1. Lecture: Maximal Margin Classifier 🚶
1. Lecture: Support Vector Classifier 🏃
1. Lecture: Support Vector Machine 🚴
1. Lecture: Support Vector Regression 🚵
1. Live Coding: SVM Regression
1. Live Coding: SVM Classification
]

---
class: onecol
## Notices and Disclaimers

The ideas underlying SVM are really *clever* and *interesting*! 😃

--

SVM is also a good algorithm for *smaller*, *messy* datasets!! 😍

--

However, there is a lot of *terminology* and *math* involved... 😱

--

<p style="padding-top:15px;">I will try to shield you from this and give only <b>the necessities</b></p>

- That means there will be some things I need to "hand-wave"
- I may also need to skip questions with very technical answers

--

<p style="padding-top:15px;">But you should get a <b>strong intuition</b> and <b>applied knowledge</b></p>

- This will prepare you nicely to dive into a longer course on the topic

---
class: inverse, center, middle

# SVM Intuitions

---
class: onecol
## A Tale of Two Classes

If this is our training data, how do we **predict the class** of new data?

<img src="../figs/maxmargin1.png" width="100%" />

---
class: onecol
## Drawing a Line in the Sand

With one feature, we could find a **point** that separates the classes (as higher or lower)

<img src="../figs/maxmargin2.png" width="100%" />

---
class: onecol
## Analysis Paralysis

But there are many possible decision points, so **which should we use?**

<img src="../figs/maxmargin3.png" width="100%" />

---
class: onecol
## Maximal Margin Classifier (MMC)

The MMC algorithm finds and uses the point with the **largest** (i.e., maximal) .imp[margin]

<img src="../figs/maxmargin4.png" width="100%" />

---
class: onecol
## Maximal Margin Classifier

If we have two features, we can extend this idea using a 2D plot and a decision **line**

<img src="../figs/maxmargin5.png" width="80%" />

---
class: onecol
## Maximal Margin Classifier

If we have three features, we will need a 3D plot and a decision **plane** (i.e., flat surface)

<img src="../figs/3d_plane.gif" width="45%" />

.footnote[[1] Credit to [Zahra Elhamraoui](https://medium.datadriveninvestor.com/support-vector-machine-svm-algorithm-in-a-fun-easy-way-fc23a008c22) for this visualization.]

---
class: onecol
## Maximal Margin Classifier

If we have four or more features, we will need a decision **hyperplane**

--

.bg-light-yellow.b--light-red.ba.bw1.br3.pl4[
**Caution:** You may hurt yourself if you try to imagine what a hyperplane looks like.
]

--

.pt1[
**Margins still exist** in higher-dimensional space and we still want to maximize them
]

- Our goal is thus to locate the class-separating hyperplane with the largest margin
- The math behind this is beyond the scope of our workshop, but that's the idea

--

.pt1[
We can still **classify new observations**: which side of the hyperplane do they fall on?
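
For intuition, here is a minimal sketch of that decision rule (all numbers are made up for illustration):

``` r
w <- c(0.8, -0.5, 0.3)       # hypothetical learned weights
b <- -0.2                    # hypothetical learned intercept
x_new <- c(1.2, 0.7, -0.4)   # a new observation (three features)
sign(sum(w * x_new) + b)     # +1 = one class, -1 = the other
```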
]

---
class: onecol
## Maximal Margin Classifier

Only the observations that define the margin, called .imp[support vectors], are used

Because it only uses a subset of the data anyway, MMC does well with smaller datasets

<img src="../figs/maxmargin8.png" width="80%" />

---
class: onecol
## Maximal Margin Classifier

This means that **outliers can have an outsized impact** on what is learned

For instance, this margin is likely to misclassify examples in new data

<img src="../figs/maxmargin9.png" width="80%" />

---
## Comprehension Check \#1

.pull-left[
### Question 1
**What is the maximum number of features you can use with the MMC algorithm?**

a) 1 feature

b) 2 features

c) 3 features

d) Any number (no maximum)
]

.pull-right[
### Question 2
**Which observations does the MMC algorithm learn from?**

a) The most ambiguous/difficult ones

b) The most obvious/easy ones

c) A random selection of observations

d) All of the observations
]

---
class: onecol
## Support Vector Classifier (SVC)

The SVC algorithm is like MMC but it **allows examples to be misclassified** (i.e., wrong)

This will **increase bias** (training errors) but hopefully **decrease variance** (testing errors)

<img src="../figs/svc1.png" width="80%" />

---
class: onecol
## Support Vector Classifier

SVCs also enable a model to be trained when the classes are **not perfectly separable**

A straight line is never going to separate these classes without errors (sorry MMC...)

<img src="../figs/svc2.png" width="80%" />

---
class: onecol
## Support Vector Classifier

But if we allow a few errors and points within the margin...

...we may be able to find a hyperplane that generalizes pretty well

<img src="../figs/svc3.png" width="80%" />

---
class: onecol
## Support Vector Classifier

When points are on the wrong side of the margin, they are called "violations"

A **softer margin** allows more violations, whereas a **harder margin** allows fewer

SVCs have a hyperparameter `\(C\)` that controls how soft vs. hard the margin is

--

- A **lower `\(C\)` value** makes the margin harder (allows fewer violations)<sup>1</sup>

  As a result, the model has **lower bias** and more flexibility but may overfit

- A **higher `\(C\)` value** makes the margin softer (allows more violations)

  As a result, the model has less flexibility but may also have **lower variance**

.footnote[[1] If you set `\(C=0\)` (i.e., a fully hard margin), SVC will allow no violations and behave the same as MMC.]
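
---
class: onecol
## Support Vector Classifier

As a preview of the live coding, here is a minimal sketch of how an SVC might be specified with **parsnip** (the `cost` argument plays the role of `\(C\)`; the value shown is arbitrary):

``` r
library(tidymodels)

# A linear-kernel SVM (i.e., an SVC) for classification
# The cost value is arbitrary here; in practice we would tune it
svc_model <- svm_linear(cost = 1) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
```

One caution: engines may define `cost` in the opposite direction from the `\(C\)` "budget" described above (e.g., in **kernlab**, a *higher* cost makes the margin *harder*), so check your engine's documentation.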

---
## Comprehension Check \#2

.pull-left[
### Question 1
**What is the main difference between the MMC and SVC algorithms?**

a) The MMC algorithm uses hyperplanes

b) The SVC algorithm allows training errors

c) The MMC algorithm uses a softer margin

d) The SVC algorithm prevents "violations"
]

.pull-right[
### Question 2
**Which value of `\(C\)` would be least likely to overfit?**

a) `\(C=-1\)`

b) `\(C=0\)`

c) `\(C=1\)`

d) `\(C=10\)`
]

---
class: onecol
## Support Vector Machine

So far, MMC and SVCs have both used linear (i.e., flat) hyperplanes

But there are many times when the classes are not **linearly separable**

<img src="../figs/svm1.png" width="80%" />

.footnote[[1] These classes seem separable, but not with a single decision point...]

---
class: onecol
## Support Vector Machine

But if we enlarge the feature space, the classes might then become linearly separable

There are many ways to do this enlargement, but one is to add polynomial expansions

<img src="../figs/svm2.png" width="80%" />

---
class: onecol
## Support Vector Machine

The classes are now linearly separable in this new enlarged feature space!

The hyperplane is flat in this new space, but it would not look flat in the original space

<img src="../figs/svm3.png" width="80%" />

---
class: onecol
## Support Vector Machine

Here is a more complex example of a nonlinear (and non-polynomial) expansion

<img src="../figs/svm4.png" width="75%" />

.footnote[[1] Credit to [Erik Kim](https://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html) for this example and visualization.]

---
class: onecol
## Support Vector Machine

And here is the hyperplane: it is linear in 3D but not when "projected" back into 2D

<img src="../figs/svm5.png" width="75%" />

.footnote[[1] Credit to [Erik Kim](https://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html) for this example and visualization.]

---
class: onecol
## Support Vector Machine

The .imp[Support Vector Machine] (SVM) allows us to efficiently enlarge the feature space

- Part of what makes SVMs efficient is that they **only consider the support vectors**
- They also use **kernel functions** to quantify the similarity of pairs of support vectors<sup>1</sup>

.footnote[[1] These similarity estimates are used to efficiently find the optimal hyperplane, but that process is complex.]

--

<p style="padding-top:15px;">The SVC can actually be considered a simple version of the SVM with a <b>linear kernel</b></p>

- A linear kernel quantifies similarity using the inner product (closely related to the Pearson correlation)

`$$k(x, x') = \langle x, x'\rangle$$`

--

<p style="padding-top:15px;">Linear kernels are efficient but <b>nonlinear kernels</b> may provide better performance</p>

---
exclude: true
class: onecol
## Support Vector Machine

It is common to also use **nonlinear** kernels, such as the .imp[polynomial kernel]

`$$k(x, x')=(\text{scale} \cdot \langle x, x' \rangle + \text{offset})^\text{degree}$$`

With larger values of `\(\text{degree}\)`, the decision boundary can become more complex

- You are essentially adding polynomial expansions of `\(\text{degree}\)` to each predictor
- You have expanded the feature space and may now have linear separation
- This is the same idea we just used in fitting a hyperplane in the `\(x\text{-by-}x^2\)` space!

If you center or normalize all your predictors, you can drop the `\(\text{offset}\)` term

.footnote[[1] When `\(\text{degree}=1\)`, the polynomial kernel reduces to the linear kernel and SVM becomes SVC again.]
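
---
class: onecol
## Support Vector Machine

To make kernel functions concrete, here is a minimal sketch using the **kernlab** package (the engine we will use in live coding); the two observations are made up for illustration:

``` r
library(kernlab)

# Two hypothetical observations (feature vectors)
x1 <- c(1.0, 2.0)
x2 <- c(0.5, 1.5)

# Linear kernel: similarity is the inner product
k_lin <- vanilladot()
k_lin(x1, x2)               # same as sum(x1 * x2)

# RBF kernel: similarity decays with squared distance (see the next slide)
k_rbf <- rbfdot(sigma = 1)
k_rbf(x1, x2)               # same as exp(-1 * sum((x1 - x2)^2))
```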

---
class: onecol
## Support Vector Machine

Perhaps the most common nonlinear kernel is the .imp[radial basis function] (RBF)

`$$k(x, x') = \exp\left(-\sigma \|x - x'\|^2\right)$$`

--

The intuition here is that similarity is weighted by how *close* the observations are

- Only support vectors near new observations influence classification strongly
- As the `\(\sigma\)` hyperparameter<sup>1</sup> increases, the fit becomes more *local* and complex

--

.pt1[
The RBF kernel computes similarity between points in *infinite* dimensions 🤯

- It is considered an ideal "general purpose" nonlinear kernel<sup>2</sup>
]

--

.footnote[
[1] Note that `\(\sigma\)` is also sometimes called `\(\gamma\)` or the "scale" hyperparameter.<br />
[2] Other nonlinear kernels are popular for special purposes and in specialized subfields.
]

---
## Comprehension Check \#3

.pull-left[
### Question 1
**What is the main difference between the SVM and SVC algorithms?**

a) The SVM algorithm uses kernel functions

b) SVM will automatically square your features

c) SVC is just the SVM with an RBF kernel

d) They are the same exact thing, silly!
]

.pull-right[
### Question 2
**What are the hyperparameters for an SVM with a Radial kernel?**

a) `\(\sigma\)` and `\(\gamma\)`

b) `\(C\)` and `\(\sigma\)`

c) `\(\alpha\)` and `\(\lambda\)`

d) `\(C\)` and `\(\Delta\)`
]

---
class: onecol
## Support Vector Regression Machine (SVR)

So far we have been concerned only with classification, but what about regression?

In another very clever turn, you can **adapt SVM for regression** by reversing its goal

- Instead of trying to separate classes and keep data points outside of the margin...
- ...SVR instead tries to **keep all the data points** .imp[inside the margin] (with no classes)

.center[
<img src="../figs/svr.png" width="50%" />
]

---
class: onecol
## Support Vector Regression Machine

All the benefits of the SVM algorithm can be applied to the regression problem

- Using a subset of the data points to define the margin, i.e., **support vectors**
- Allowing violations (now points *outside* the margin) with a **soft margin**
- Efficiently expanding the feature space through the use of **various kernels**

.pv1[
We can use the same model type(s) and hyperparameters for SVM and SVR

- We will use `\(C\)` to control the softness vs. hardness of the margin
- If we use an RBF kernel, we will use `\(\sigma\)` to control the kernel scaling
- Optionally, we can tune the `\(\varepsilon\)` hyperparameter to control the margin size
]

---
class: onecol
## Support Vector Regression Machine

<img src="../figs/epsilonsvr.png" width="90%" />
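
---
class: onecol
## Support Vector Regression Machine

To make the `\(\varepsilon\)` margin concrete, here is a minimal sketch of the standard "epsilon-insensitive" error (all numbers are made up): points inside the margin cost nothing

``` r
y_obs <- 10.0     # hypothetical observed value
y_pred <- 10.3    # hypothetical predicted value
epsilon <- 0.5    # hypothetical margin size

# The error is zero whenever the point falls inside the margin
max(0, abs(y_obs - y_pred) - epsilon)   # 0 because |10.0 - 10.3| < 0.5
```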

---
class: inverse, middle, center

# Live Coding

---
## Live Coding: SVM Regression

.scroll.h-0l[
``` r
library(tidyverse)
library(tidymodels)
tidymodels_prefer()

# Install model package (if needed)
install.packages("kernlab")

# Set random number generation seed for reproducibility
set.seed(2022)

# Read in dataset
penguins <- palmerpenguins::penguins %>% na.omit()

# Create initial split for holdout validation
bmg_split <- initial_split(penguins, prop = 0.8, strata = body_mass_g)
bmg_train <- training(bmg_split)
bmg_test <- testing(bmg_split)

# Create 10-fold CV within training set for tuning
bmg_folds <- vfold_cv(bmg_train, v = 10, repeats = 1, strata = body_mass_g)

# Set up recipe based on the data and model
bmg_recipe <- recipe(bmg_train, formula = body_mass_g ~ .) %>%
  step_mutate(year = factor(year)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_predictors()) %>%
  step_lincomb(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_YeoJohnson(all_predictors())

# Set up model with tuning parameters
svr_model <- svm_rbf(cost = tune(), margin = tune()) %>%
  set_mode("regression") %>%
  set_engine("kernlab")

# Combine recipe and model specification into a workflow
bmg_wflow <- workflow() %>%
  add_recipe(bmg_recipe) %>%
  add_model(svr_model)

# Pick reasonable boundaries for the tuning parameters
bmg_param <- svr_model %>%
  extract_parameter_set_dials() %>%
  finalize(bmg_folds)

# If desired, view the reasonable boundary values for the tuning parameters
extract_parameter_dials(bmg_param, "cost")
extract_parameter_dials(bmg_param, "margin")

# Create list of values within boundaries and grid search (may take a bit)
bmg_tune <- bmg_wflow %>%
  tune_grid(
    resamples = bmg_folds,
    grid = 20,
    param_info = bmg_param
  )

# If desired, plot the tuning results
autoplot(bmg_tune)

# Select the best parameter values
bmg_param_final <- select_best(bmg_tune, metric = "rmse")
bmg_param_final

# Finalize the workflow with the best parameter values
bmg_wflow_final <- bmg_wflow %>%
  finalize_workflow(bmg_param_final)
bmg_wflow_final

# Fit the finalized workflow to the training set and evaluate in testing set
bmg_final <- bmg_wflow_final %>%
  last_fit(bmg_split)

# View the metrics (from the holdout test set)
collect_metrics(bmg_final)

# Collect the predictions
bmg_pred <- collect_predictions(bmg_final)
bmg_pred

# Create regression prediction plot
ggplot(bmg_pred, aes(x = body_mass_g, y = .pred)) +
  geom_point(alpha = 0.33) +
  geom_abline(color = "darkred") +
  coord_obs_pred() +
  labs(x = "Observed Body Mass (g)", y = "Predicted Body Mass (g)")

# Note that vip doesn't work with SVM currently
```
]

---
## Live Coding: SVM Classification

.scroll.h-0l[
``` r
# Set random number generation seed for reproducibility
set.seed(2022)

# Read in dataset
peng <- palmerpenguins::penguins

# Create initial split for holdout validation
psp_split <- initial_split(peng, prop = 0.8, strata = species)
psp_train <- training(psp_split)
psp_test <- testing(psp_split)

# Create 10-fold CV within training set for tuning
psp_folds <- vfold_cv(psp_train, v = 10, repeats = 1, strata = species)

# Set up recipe based on the data and model
summary(peng)
psp_recipe <- recipe(psp_train) %>%
  update_role(species, new_role = "outcome") %>%
  update_role(island, sex, year, new_role = "predictor") %>%
  update_role(bill_length_mm:body_mass_g, new_role = "ignore") %>%
  step_mutate(year = factor(year)) %>%
  step_impute_knn(sex) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_predictors()) %>%
  step_lincomb(all_predictors())

# Set up model with tuning parameters
svm_model <- svm_rbf(cost = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

# Combine recipe and model specification into a workflow
psp_wflow <- workflow() %>%
  add_recipe(psp_recipe) %>%
  add_model(svm_model)

# Pick reasonable boundaries for the tuning parameters
psp_param <- svm_model %>%
  extract_parameter_set_dials() %>%
  finalize(psp_folds)

# If desired, view the reasonable boundary values for the tuning parameters
extract_parameter_dials(psp_param, "cost")

# Create list of values within boundaries and grid search (may take a bit)
psp_tune <- psp_wflow %>%
  tune_grid(
    resamples = psp_folds,
    grid = 20,
    param_info = psp_param
  )

# If desired, plot the tuning results
autoplot(psp_tune)

# Select the best parameter values
psp_param_final <- select_best(psp_tune, metric = "roc_auc")
psp_param_final

# Finalize the workflow with the best parameter values
psp_wflow_final <- psp_wflow %>%
  finalize_workflow(psp_param_final)
psp_wflow_final

# Fit the finalized workflow to the training set and evaluate in testing set
psp_final <- psp_wflow_final %>%
  last_fit(psp_split)

# View the metrics (from the holdout test set)
collect_metrics(psp_final)

# Collect the predictions
psp_pred <- collect_predictions(psp_final)
psp_pred

# Calculate and plot confusion matrix
psp_cm <- conf_mat(psp_pred, truth = species, estimate = .pred_class)
psp_cm
autoplot(psp_cm, type = "heatmap")
summary(psp_cm)

# Plot the ROC curve
psp_rc <- roc_curve(psp_pred, truth = species, .pred_Adelie:.pred_Gentoo)
autoplot(psp_rc)

# Note that vip doesn't work with SVM currently
```
]

---
class: inverse, center, middle

# Time for a Break!