class: center, middle, inverse, title-slide

.title[
# Regularization and Elastic Net
]

.subtitle[
## .big[ ⛰️ 🤠 🕸️ ]
]

.author[
### Machine Learning in R
SMaRT Workshops
]
.date[
### Day 3A Shirley Wang
]

---
class: inverse, center, middle

# Overview

---
class: onecol
## Lecture Topics

So far, we have learned all about machine learning **methods**.

Today we will take our first dive into machine learning .imp[algorithms]!

--

This lecture will cover **regularized regression**, including:

- Ridge regression
- Lasso regression
- Elastic net regression

We will review the **theory and rationale** for regularization.

The next lecture (3B) will show a worked example in R.

---
class: inverse, center, middle

# Linear Regression Review

---
class: onecol
## Linear Regression

Linear regression and closely related models (ridge, lasso, elastic net) can be written as:

`$$y_i = b_0 + b_1x_{i1} + b_2x_{i2} + ... + b_Px_{iP} + e_i$$`

where:

- `\(y_i\)`: value of the response for the `\(i\)`th observation
- `\(b_0\)`: estimated intercept
- `\(b_j\)`: estimated parameter for the `\(j\)`th predictor
- `\(x_{ij}\)`: value of the `\(j\)`th predictor for the `\(i\)`th observation
- `\(e_i\)`: random error unexplained by the model for the `\(i\)`th observation

---
class: onecol
## Ordinary Least Squares Regression

In OLS regression, the parameters are estimated to .imp[minimize model bias].

Unfortunately, this comes at the expense of .imp[increasing model variance]<sup>1</sup>.

.footnote[
[1] Remember that model bias is a lack of predictive accuracy in the original data, whereas model variance is a lack of predictive accuracy in new data.
]

--

<p style="padding-top:30px;">Specifically, OLS regression aims to minimize the **sum-of-squared errors (SSE)**:

`$$SSE = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2$$`

That is, it always attempts to **minimize error** between observed and predicted values.

---
class: onecol
## A Problem

Any dataset is influenced by the underlying data-generating process and sampling error.

By definition, sampling error varies between samples drawn from the same population.

Therefore, the sampling error in one dataset may not generalize to new data.

--

<p style="padding-top:30px;">Aiming to make our predictions as close to the observed data as possible can be risky.

We might be .imp[overfitting] to sampling error or other forms of noise.

---
class: onecol
## Pros and Cons

OLS regression is .imp[interpretable] and easy to compute, but it has important limitations:

- Risk of overfitting; poor predictive accuracy in new datasets
- Inflated parameter estimates
- Sensitivity to outliers<sup>1</sup>
- Difficulty handling datasets with high multicollinearity
- Difficulty handling datasets with more predictors than observations

.footnote[
[1] OLS adjusts parameter estimates to accommodate outlier observations with large residuals, in order to minimize SSE.
]

--

.bg-light-green.b--dark-green.ba.bw1.br3.pl4[
Regularization addresses many of these problems.
]

---
class: inverse, center, middle

# Regularization

---
class: onecol
## What is Regularization?

Regularization adds a .imp[penalty term] to the loss function<sup>1</sup>.

This has the effect of **shrinking slopes** towards zero.

Compared to OLS models, regularized models have .imp[higher bias but lower variance].

.footnote[
[1] Recall that the loss function for OLS regression is the sum-of-squared errors (SSE). This is what the model tries to minimize when it is being fit.
]

--

<p style="padding-top:30px;">In other words, regularization makes a model **less sensitive to the training data**.

This allows it to achieve **higher accuracy in the test set**.

--

.bg-light-green.b--dark-green.ba.bw1.br3.pl4[
Therefore, one major benefit of regularization is reducing overfitting.
]
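---
class: onecol
## Shrinkage in R: A Quick Sketch

To make shrinkage concrete, here is a minimal sketch comparing OLS slopes to penalized (ridge) slopes. The simulated data, the `glmnet` package, and the fixed `lambda` value are illustrative assumptions, not the workshop's recommended workflow (lecture 3B shows the full worked example in R).

```r
# Minimal sketch: compare OLS slopes to ridge-penalized slopes.
# Assumes the glmnet package is installed; data are simulated for illustration.
library(glmnet)

set.seed(123)
n <- 100
x <- matrix(rnorm(n * 5), ncol = 5)              # 5 predictors
y <- 2 * x[, 1] + 1 * x[, 2] + rnorm(n, sd = 2)  # only 2 predictors truly matter

# OLS: minimizes SSE only
ols_fit <- lm(y ~ x)
coef(ols_fit)

# Ridge: minimizes SSE plus a penalty on the squared slopes (alpha = 0)
ridge_fit <- glmnet(x, y, alpha = 0, lambda = 1)
coef(ridge_fit)  # slopes are pulled toward zero relative to OLS
```

Larger `lambda` values would shrink the ridge slopes even further toward zero.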
---
class: twocol
## Another Benefit of Regularization

.left-column.pv3[
<img src="../figs/feature_selection.png" width="100%" />
]

.right-column[
**Feature Selection**

We're often interested in finding a subset of "good" predictors.

A traditional approach is stepwise regression.

However, there are many problems with stepwise methods<sup>1</sup>.

Regularization shrinks slope estimates towards zero<sup>2</sup>.

Thus, it both reduces overfitting and performs feature selection.
]

.footnote[
[1] See [Harrell (2015)](https://link.springer.com/book/10.1007/978-3-319-19425-7), section 4.3, for more details and explanation about the problems with stepwise regression.

[2] In some cases, such as in lasso regression, some parameters are actually set to zero.
]

---
class: onecol
## Comprehension Check \#1

.pull-left[
### Question 1
**How is regularization different from OLS regression?**

a) It involves cross-validation.

b) It adds a penalty term to the loss function.

c) It adds variance into the model.

d) It is only for large datasets.
]

.pull-right[
### Question 2
**Which is .imp[not] a benefit of regularized models compared to nonregularized models?**

a) Feature selection

b) Improves out-of-sample prediction

c) Overcomes measurement errors

d) Limits overfitting
]

---
class: inverse, center, middle

# Ridge Regression

---
class: onecol
## Ridge Regression

Recall the loss function for OLS regression:

`$$SSE = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2$$`

--

The loss function for ridge regression **contains this same SSE term**.

--

The only difference is that we now have an additional term, known as the `\(L_2\)` penalty:

`$$SSE_{L2} = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2 + \lambda \sum\limits_{j = 1}^P \beta_j^2$$`

Here, `\(P\)` is the number of predictors and `\(\beta_j\)` is the slope of the `\(j\)`th predictor.

---
class: twocol
## Ridge Regression

.left-column.pv3[
<img src="../figs/ridge.png" width="100%" />
]

.right-column[
OLS regression aims to **minimize the sum of squared errors**.

Ridge also aims to **minimize the squared value of all slopes**.

This means that slopes can become large only if they produce a proportional reduction in the SSE.

`\(\lambda\)` is a .imp[hyperparameter] controlling the degree of regularization.

Higher values of `\(\lambda\)` **shrink slopes** closer to zero.

We can find the 'best' value of `\(\lambda\)` through **cross-validation tuning**.
]

---
class: inverse, center, middle

# Lasso Regression

---
class: onecol
## Lasso Regression

Lasso stands for the .imp[Least Absolute Shrinkage and Selection Operator].

Similar to ridge, lasso adds an additional penalty term to the OLS loss function:

`$$SSE_{L1} = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2 + \lambda \sum\limits_{j = 1}^P \lvert \beta_j \rvert$$`

--

This is also known as the `\(L_1\)` penalty.

Whereas ridge aims to minimize the squares of the slopes, lasso aims to minimize the **absolute value** of all slopes.

---
class: twocol
## Lasso Regression

.left-column.pv3[
<img src="../figs/lasso.png" width="100%" />
]

.right-column[
The differences between the ridge and lasso penalties may seem small.

However, they have some important effects.

If `\(\lambda\)` is set high enough in lasso, all slopes will be **shrunk to zero**.

On the other hand, a high ridge `\(\lambda\)` only shrinks slopes *towards* zero.

Ridge and lasso also differ in their handling of **multicollinearity**.

Whereas ridge tends to shrink the slopes of correlated predictors towards each other, lasso tends to pick one and ignore the rest.
]
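---
class: onecol
## Lasso in R: A Quick Sketch

Here is a minimal, hedged sketch of the lasso's feature selection and of tuning `\(\lambda\)` by cross-validation. The `glmnet()`/`cv.glmnet()` calls and the reuse of the simulated `x` and `y` from the earlier sketch are assumptions for illustration; lecture 3B demonstrates the full workflow.

```r
# Minimal sketch: lasso sets some slopes exactly to zero (feature selection).
# Assumes glmnet is installed and reuses the simulated x and y from before.
library(glmnet)

# Lasso: SSE plus a penalty on the absolute value of the slopes (alpha = 1)
lasso_fit <- glmnet(x, y, alpha = 1, lambda = 0.5)
coef(lasso_fit)  # some slopes are exactly zero (dropped from the model)

# Tune lambda with 10-fold cross-validation
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
cv_fit$lambda.min               # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.min")  # coefficients at that lambda
```

Setting `alpha = 0` in the same calls would fit ridge regression instead, where slopes shrink but are not set exactly to zero.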
---
class: inverse, center, middle

# Elastic Net Regression

---
class: onecol
## Elastic Net Regression

Elastic net regression combines the `\(L_1\)` (lasso) and `\(L_2\)` (ridge) penalty terms:

`$$SSE_{EN} = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2 + \lambda_1 \sum\limits_{j = 1}^P \lvert \beta_j \rvert + \lambda_2 \sum\limits_{j = 1}^P \beta_j^2$$`

--

<p style="padding-top:30px;">This loss function now includes three terms:

- The OLS loss function (sum of squared errors)
- The lasso penalty (sum of the absolute values of the coefficients)
- The ridge penalty (sum of the squared coefficients)

---
class: twocol
## Elastic Net Regression

.left-column.pv3[
<img src="../figs/elasticnet.png" width="100%" />
]

.right-column[
Elastic net provides a .imp[mix between ridge and lasso] regression.

It provides ridge-like regularization with lasso-like feature selection.

Elastic net is particularly good at handling correlated predictors.

It also adds another **hyperparameter** into the mix.

In practice, the two penalty weights are re-expressed as `\(\lambda\)` (penalty hyperparameter) and `\(\alpha\)` (mixing hyperparameter).

The `\(\alpha\)` mixing parameter ranges from 0 to 1.

At `\(\alpha = 0\)`, ridge is performed, and at `\(\alpha = 1\)`, lasso is performed.
]

---
class: onecol
## Some Notes on Regularization

As we have seen, regularized regression is **very similar to OLS regression**.

We can write all regularized regression models in the **same form** as OLS regression:

`$$y_i = b_0 + b_1x_{i1} + b_2x_{i2} + ... + b_Px_{iP} + e_i$$`

--

<p style="padding-top:30px;">Ridge, lasso, and elastic net are all still **linear models**<sup>1</sup>.

For this reason, regularized regression remains very **interpretable**.

.footnote[
[1] This means that each parameter (e.g., `\(b_1, b_2\)`) only appears with a power of 1 and is not multiplied or divided by another parameter. Nonlinear *variables* (e.g., `\(x_1^2\)`) can still be included as long as the *parameters* remain linear.
]

--

.bg-light-green.b--dark-green.ba.bw1.br3.pl4[
**Advice**: Regularization is a good ML option when working with smaller datasets.
]

---
class: onecol
## Regularization for Classification

Regularization works for .imp[both regression and classification problems].

Recall that **logistic regression** predicts the probability of a binary event, e.g.:

- Email is spam or not spam.
- This photo contains or does not contain a dog.
- A patient has or does not have a disease.

--

<p style="padding-top:30px;">We typically<sup>1</sup> classify observations where `\(P(Y = 1) \geq 0.5\)` into the `\(Y = 1\)` group.

Observations where `\(P(Y = 1) < 0.5\)` are typically classified into the `\(Y = 0\)` group.

.footnote[
[1] However, thresholds other than 0.5 can be chosen.
]

---
class: onecol
## Regularization for Classification

Logistic regression uses a different loss function, as the outcome is dichotomous.

.pull-left[
.center[**Regression**]
<img src="slides_3a_files/figure-html/regression_example-1.png" width="100%" />
]

--

.pull-right[
.center[**Classification**]
<img src="slides_3a_files/figure-html/classification_example-1.png" width="100%" />
]

--

.bg-light-green.b--dark-green.ba.bw1.br3.pl4[
Ridge, lasso, and elastic net penalties can also be added to the logistic loss function. These will have the same effect of .imp[shrinking coefficients towards zero].
]
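---
class: onecol
## Elastic Net and Classification in R: A Quick Sketch

As a final hedged sketch, the example below fits an elastic net to a continuous outcome and a penalized logistic regression to a binary outcome. The `glmnet` calls, the `alpha = 0.5` mix, and the simulated data are illustrative assumptions; lecture 3B walks through the full worked example.

```r
# Minimal sketch: elastic net (continuous outcome) and penalized logistic
# regression (binary outcome). Assumes glmnet; data are simulated.
library(glmnet)

set.seed(456)
n <- 200
x <- matrix(rnorm(n * 10), ncol = 10)
y_cont <- x[, 1] - x[, 2] + rnorm(n)             # continuous outcome
y_bin  <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))  # binary outcome

# Elastic net: alpha mixes the ridge (alpha = 0) and lasso (alpha = 1) penalties
enet_fit <- cv.glmnet(x, y_cont, alpha = 0.5)
coef(enet_fit, s = "lambda.min")

# Penalized logistic regression: same penalties applied to the logistic loss
logit_fit <- cv.glmnet(x, y_bin, family = "binomial", alpha = 0.5)
predict(logit_fit, newx = x[1:5, ], s = "lambda.min", type = "response")
```

Note that `cv.glmnet()` tunes only `\(\lambda\)`; in practice, `\(\alpha\)` is typically chosen by comparing cross-validated error across a small grid of `\(\alpha\)` values.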
---
class: onecol
## Comprehension Check \#2

.pull-left[
### Question 1
**Which model can shrink coefficients fully to zero?**

a) Ridge regression

b) Lasso regression

c) OLS regression

d) None of the above
]

.pull-right[
### Question 2
**What do the `\(\lambda\)` and `\(\alpha\)` hyperparameters correspond to?**

a) `\(\lambda\)` = penalty, `\(\alpha\)` = validation

b) `\(\lambda\)` = feature selection, `\(\alpha\)` = mixing

c) `\(\lambda\)` = penalty, `\(\alpha\)` = mixing

d) `\(\lambda\)` = feature selection, `\(\alpha\)` = penalty
]

---
class: inverse, center, middle

# Time for a Break!