class: center, middle, inverse, title-slide

.title[
# Regularization and Elastic Net
]

.subtitle[
## .big[ ⛰️ 🤠 🕸️ ]
]

.author[
### Machine Learning in R
SMaRT Workshops
]
.date[
### Day 3A Shirley Wang
]

---
class: inverse, center, middle

# Overview

---
class: onecol
## Lecture Topics

So far, we have learned all about machine learning **methods**.

Today we will take our first dive into machine learning .imp[algorithms]!

--

This lecture will cover **regularized regression**, including:

- Ridge regression
- Lasso regression
- Elastic net regression

We will review the **theory and rationale** for regularization.

The next lecture (3B) will show a worked example in R.

---
class: inverse, center, middle

# Linear Regression Review

---
class: onecol
## Linear Regression

Linear regression and closely related models (ridge, lasso, elastic net) can be written as:

`$$y_i = b_0 + b_1x_{i1} + b_2x_{i2} + ... + b_Px_{iP} + e_i$$`

where:

- `\(y_i\)`: value of the response for the `\(i\)`th observation
- `\(b_0\)`: estimated intercept
- `\(b_j\)`: estimated parameter for the `\(j\)`th predictor
- `\(x_{ij}\)`: value of the `\(j\)`th predictor for the `\(i\)`th observation
- `\(e_i\)`: random error unexplained by the model for the `\(i\)`th observation

---
class: onecol
## Ordinary Least Squares Regression

In OLS regression, the parameters are estimated to .imp[minimize model bias].

Unfortunately, this comes at the expense of .imp[increasing model variance]<sup>1</sup>.

.footnote[
[1] Remember that model bias is a lack of predictive accuracy in the original data, whereas model variance is a lack of predictive accuracy in new data.
]

--

<p style="padding-top:30px;">Specifically, OLS regression aims to minimize the **sum-of-squared errors (SSE)**:

`$$SSE = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2$$`

That is, it always attempts to **minimize error** between observed and predicted values.

---
class: onecol
## A Problem

Any dataset is influenced by the underlying data-generating process and sampling error.

By definition, sampling error varies between samples drawn from the same population.

Therefore, the sampling error in one dataset may not generalize to new data.

--

<p style="padding-top:30px;">Aiming to make our predictions as close to the observed data as possible can be risky.

We might be .imp[overfitting] to sampling error or other forms of noise.

---
class: onecol
## Pros and Cons

OLS regression is .imp[interpretable] and easy to compute, but it has important limitations:

- Risk of overfitting; poor predictive accuracy in new datasets
- Inflated parameter estimates
- Sensitivity to outliers<sup>1</sup>
- Difficulty handling datasets with high multicollinearity
- Difficulty handling datasets with more predictors than observations

.footnote[
[1] OLS adjusts parameter estimates to accommodate outlier observations with large residuals, in order to minimize SSE.
]

--

.bg-light-green.b--dark-green.ba.bw1.br3.pl4[
Regularization addresses many of these problems.
]

---
class: inverse, center, middle

# Regularization

---
class: onecol
## What is Regularization?

Regularization adds a .imp[penalty term] to the loss function<sup>1</sup>.

This has the effect of **shrinking slopes** towards zero.

Compared to OLS models, regularized models have .imp[higher bias but lower variance].

.footnote[
[1] Recall that the loss function for OLS regression is the sum-of-squared errors (SSE). This is what the model tries to minimize when it is being fit.
]

--

<p style="padding-top:30px;">In other words, regularization makes a model **less sensitive to the training data**.

This allows it to achieve **higher accuracy in the test set**.

--

.bg-light-green.b--dark-green.ba.bw1.br3.pl4[
Therefore, one major benefit of regularization is reducing overfitting.
]
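---
class: onecol
## Shrinkage in R: A Quick Sketch

To make shrinkage concrete, here is a minimal sketch comparing OLS slopes to penalized (ridge) slopes. The simulated data, the `glmnet` package, and the fixed `lambda` value are illustrative assumptions, not the workshop's recommended workflow (lecture 3B shows the full worked example in R).

```r
# Minimal sketch: compare OLS slopes to ridge-penalized slopes.
# Assumes the glmnet package is installed; data are simulated for illustration.
library(glmnet)

set.seed(123)
n <- 100
x <- matrix(rnorm(n * 5), ncol = 5)              # 5 predictors
y <- 2 * x[, 1] + 1 * x[, 2] + rnorm(n, sd = 2)  # only 2 predictors truly matter

# OLS: minimizes SSE only
ols_fit <- lm(y ~ x)
coef(ols_fit)

# Ridge: minimizes SSE plus a penalty on the squared slopes (alpha = 0)
ridge_fit <- glmnet(x, y, alpha = 0, lambda = 1)
coef(ridge_fit)  # slopes are pulled toward zero relative to OLS
```

Larger `lambda` values would shrink the ridge slopes even further toward zero.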
---
class: twocol
## Another Benefit of Regularization

.left-column.pv3[
<img src="../figs/feature_selection.png" width="100%" />
]

.right-column[
**Feature Selection**

We're often interested in finding a subset of "good" predictors.

A traditional approach is stepwise regression.

However, there are many problems with stepwise methods<sup>1</sup>.

Regularization shrinks slope estimates towards zero<sup>2</sup>.

Thus, it both reduces overfitting and performs feature selection.
]

.footnote[
[1] See [Harrell (2015)](https://link.springer.com/book/10.1007/978-3-319-19425-7), section 4.3, for more details and explanation about the problems with stepwise regression.

[2] In some cases, such as in lasso regression, some parameters are actually set to zero.
]

---
class: onecol
## Comprehension Check \#1

.pull-left[
### Question 1
**How is regularization different from OLS regression?**

a) It involves cross-validation.

b) It adds a penalty term to the loss function.

c) It adds variance into the model.

d) It is only for large datasets.
]

.pull-right[
### Question 2
**Which is .imp[not] a benefit of regularized models compared to nonregularized models?**

a) Feature selection

b) Improves out-of-sample prediction

c) Overcomes measurement errors

d) Limits overfitting
]

---
class: inverse, center, middle

# Ridge Regression

---
class: onecol
## Ridge Regression

Recall the loss function for OLS regression:

`$$SSE = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2$$`

--

The loss function for ridge regression **contains this same SSE term**.

--

The only difference is that we now have an additional term, known as the `\(L_2\)` penalty:

`$$SSE_{L2} = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2 + \lambda \sum\limits_{j = 1}^P \beta_j^2$$`

Here, `\(P\)` is the number of predictors and `\(\beta_j\)` is the slope of the `\(j\)`th predictor.

---
class: twocol
## Ridge Regression

.left-column.pv3[
<img src="../figs/ridge.png" width="100%" />
]

.right-column[
OLS regression aims to **minimize the sum of squared errors**.

Ridge also aims to **minimize the squared value of all slopes**.

This means that slopes can become large only if they produce a proportional reduction in the SSE.

`\(\lambda\)` is a .imp[hyperparameter] controlling the degree of regularization.

Higher values of `\(\lambda\)` **shrink slopes** closer to zero.

We can find the 'best' value of `\(\lambda\)` through **cross-validation tuning**.
]

---
class: inverse, center, middle

# Lasso Regression

---
class: onecol
## Lasso Regression

Lasso stands for the .imp[Least Absolute Shrinkage and Selection Operator].

Similar to ridge, lasso adds an additional penalty term to the OLS loss function:

`$$SSE_{L1} = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2 + \lambda \sum\limits_{j = 1}^P \lvert \beta_j \rvert$$`

--

This is also known as the `\(L_1\)` penalty.

Whereas ridge aims to minimize the squares of the slopes, lasso aims to minimize the **absolute value** of all slopes.

---
class: twocol
## Lasso Regression

.left-column.pv3[
<img src="../figs/lasso.png" width="100%" />
]

.right-column[
The differences between the ridge and lasso penalties may seem small.

However, they have some important effects.

If `\(\lambda\)` is set high enough in lasso, all slopes will be **shrunk to zero**.

On the other hand, a high ridge `\(\lambda\)` only shrinks slopes *towards* zero.

Ridge and lasso also differ in their handling of **multicollinearity**.

Whereas ridge tends to shrink the slopes of correlated predictors towards each other, lasso tends to pick one and ignore the rest.
]
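---
class: onecol
## Lasso in R: A Quick Sketch

Here is a minimal, hedged sketch of the lasso's feature selection and of tuning `\(\lambda\)` by cross-validation. The `glmnet()`/`cv.glmnet()` calls and the reuse of the simulated `x` and `y` from the earlier sketch are assumptions for illustration; lecture 3B demonstrates the full workflow.

```r
# Minimal sketch: lasso sets some slopes exactly to zero (feature selection).
# Assumes glmnet is installed and reuses the simulated x and y from before.
library(glmnet)

# Lasso: SSE plus a penalty on the absolute value of the slopes (alpha = 1)
lasso_fit <- glmnet(x, y, alpha = 1, lambda = 0.5)
coef(lasso_fit)  # some slopes are exactly zero (dropped from the model)

# Tune lambda with 10-fold cross-validation
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
cv_fit$lambda.min               # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.min")  # coefficients at that lambda
```

Setting `alpha = 0` in the same calls would fit ridge regression instead, where slopes shrink but are not set exactly to zero.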
---
class: inverse, center, middle

# Elastic Net Regression

---
class: onecol
## Elastic Net Regression

Elastic net regression combines the `\(L_1\)` (lasso) and `\(L_2\)` (ridge) penalty terms:

`$$SSE_{EN} = \sum\limits_{i = 1}^n(y_i - \hat{y_i})^2 + \lambda_1 \sum\limits_{j = 1}^P \lvert \beta_j \rvert + \lambda_2 \sum\limits_{j = 1}^P \beta_j^2$$`

--

<p style="padding-top:30px;">This loss function now includes three terms:

- The OLS loss function (sum of squared errors)
- The lasso penalty (sum of the absolute values of the coefficients)
- The ridge penalty (sum of the squared coefficients)

---
class: twocol
## Elastic Net Regression

.left-column.pv3[
<img src="../figs/elasticnet.png" width="100%" />
]

.right-column[
Elastic net provides a .imp[mix between ridge and lasso] regression.

It provides ridge-like regularization with lasso-like feature selection.

Elastic net is particularly good at handling correlated predictors.

It also adds another **hyperparameter** into the mix.

In practice, the two penalty weights are re-expressed as `\(\lambda\)` (penalty hyperparameter) and `\(\alpha\)` (mixing hyperparameter).

The `\(\alpha\)` mixing parameter ranges from 0 to 1.

At `\(\alpha = 0\)`, ridge is performed, and at `\(\alpha = 1\)`, lasso is performed.
]

---
class: onecol
## Some Notes on Regularization

As we have seen, regularized regression is **very similar to OLS regression**.

We can write all regularized regression models in the **same form** as OLS regression:

`$$y_i = b_0 + b_1x_{i1} + b_2x_{i2} + ... + b_Px_{iP} + e_i$$`

--

<p style="padding-top:30px;">Ridge, lasso, and elastic net are all still **linear models**<sup>1</sup>.

For this reason, regularized regression remains very **interpretable**.

.footnote[
[1] This means that each parameter (e.g., `\(b_1, b_2\)`) only appears with a power of 1 and is not multiplied or divided by another parameter. Nonlinear *variables* (e.g., `\(x_1^2\)`) can still be included as long as the *parameters* remain linear.
]

--

.bg-light-green.b--dark-green.ba.bw1.br3.pl4[
**Advice**: Regularization is a good ML option when working with smaller datasets.
]

---
class: onecol
## Regularization for Classification

Regularization works for .imp[both regression and classification problems].

Recall that **logistic regression** predicts the probability of a binary event, e.g.:

- Email is spam or not spam.
- This photo contains or does not contain a dog.
- A patient has or does not have a disease.

--

<p style="padding-top:30px;">We typically<sup>1</sup> classify observations where `\(P(Y = 1) \geq 0.5\)` into the `\(Y = 1\)` group.

Observations where `\(P(Y = 1) < 0.5\)` are typically classified into the `\(Y = 0\)` group.

.footnote[
[1] However, thresholds other than 0.5 can be chosen.
]

---
class: onecol
## Regularization for Classification

Logistic regression uses a different loss function, as the outcome is dichotomous.

.pull-left[
.center[**Regression**]
<img src="slides_3a_files/figure-html/regression_example-1.png" width="100%" />
]

--

.pull-right[
.center[**Classification**]
<img src="slides_3a_files/figure-html/classification_example-1.png" width="100%" />
]

--

.bg-light-green.b--dark-green.ba.bw1.br3.pl4[
Ridge, lasso, and elastic net penalties can also be added to the logistic loss function. These will have the same effect of .imp[shrinking coefficients towards zero].
]
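---
class: onecol
## Elastic Net and Classification in R: A Quick Sketch

As a final hedged sketch, the example below fits an elastic net to a continuous outcome and a penalized logistic regression to a binary outcome. The `glmnet` calls, the `alpha = 0.5` mix, and the simulated data are illustrative assumptions; lecture 3B walks through the full worked example.

```r
# Minimal sketch: elastic net (continuous outcome) and penalized logistic
# regression (binary outcome). Assumes glmnet; data are simulated.
library(glmnet)

set.seed(456)
n <- 200
x <- matrix(rnorm(n * 10), ncol = 10)
y_cont <- x[, 1] - x[, 2] + rnorm(n)             # continuous outcome
y_bin  <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))  # binary outcome

# Elastic net: alpha mixes the ridge (alpha = 0) and lasso (alpha = 1) penalties
enet_fit <- cv.glmnet(x, y_cont, alpha = 0.5)
coef(enet_fit, s = "lambda.min")

# Penalized logistic regression: same penalties applied to the logistic loss
logit_fit <- cv.glmnet(x, y_bin, family = "binomial", alpha = 0.5)
predict(logit_fit, newx = x[1:5, ], s = "lambda.min", type = "response")
```

Note that `cv.glmnet()` tunes only `\(\lambda\)`; in practice, `\(\alpha\)` is typically chosen by comparing cross-validated error across a small grid of `\(\alpha\)` values.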
---
class: onecol
## Comprehension Check \#2

.pull-left[
### Question 1
**Which model can shrink coefficients fully to zero?**

a) Ridge regression

b) Lasso regression

c) OLS regression

d) None of the above
]

.pull-right[
### Question 2
**What do the `\(\lambda\)` and `\(\alpha\)` hyperparameters correspond to?**

a) `\(\lambda\)` = penalty, `\(\alpha\)` = validation

b) `\(\lambda\)` = feature selection, `\(\alpha\)` = mixing

c) `\(\lambda\)` = penalty, `\(\alpha\)` = mixing

d) `\(\lambda\)` = feature selection, `\(\alpha\)` = penalty
]

---
class: inverse, center, middle

# Time for a Break!