class: center, middle, inverse, title-slide

.title[
# Practical Issues
]
.subtitle[
## .big[🎨 🔬 👷️️]
]
.author[
### Machine Learning in R
SMaRT Workshops
]
.date[
### Day 4B Shirley Wang
]

---
class: inverse, center, middle

# Overview

---
class: onecol

## Lecture Topics

This lecture will cover some **practical issues** involved in applied ML research.

We will also highlight **advanced topics** for future learning beyond this course.

--

<p style="padding-top:30px;">We will discuss:

- Considerations for study design
- Selecting and comparing algorithms
- Algorithmic bias, fairness, and representativeness
- Critically evaluating results from ML models
- Reading & reviewing ML papers

---
class: onecol

## Asking Predictive Questions

As social & behavioral scientists, we aim to **explain and predict** behavior.

An implicit assumption is that better explanation will lead to better prediction.

However, statistically, this is not always the case (e.g., due to **overfitting**).

Shifting from an **explanatory/inferential** mindset to a **predictive** mindset isn't easy!

.footnote[
I highly recommend Yarkoni & Westfall (2017), Choosing Prediction over Explanation in Psychology: Lessons from Machine Learning.
]

---
class: onecol

## Asking Predictive Questions

**Question: Can we infer people's personalities from their social media usage?**

- *Inferential mindset*: test for statistically significant relationships between personality dimensions and other variables (e.g., ratings of someone's Twitter profile).
- *Predictive mindset*: build an ML model with the goal of predicting someone's scores on a personality measure from social media data.

--

<p style="padding-top:30px;">**Question: How likely is someone to recover from an anxiety disorder?**

- *Inferential mindset*: identify variables at time 1 that have a statistically significant relationship with recovery at time 2.
- *Predictive mindset*: build an ML model with the goal of using time 1 data to accurately predict anxiety scores at time 2.
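---
class: onecol

## Asking Predictive Questions

The gap between explanation and prediction can be illustrated with a small simulation. This is a minimal sketch, not part of the course materials: the helper names (`simulate_data()`, `r_squared()`), the sample sizes, and the choice of 30 candidate predictors are all illustrative assumptions. Only the first predictor is truly related to the outcome.

```r
# Illustrative simulation; every name and setting here is an assumption
set.seed(2022)

n_train <- 60    # small training sample
n_test  <- 1000  # large sample of new observations
p       <- 30    # candidate predictors (only x1 is related to y)

simulate_data <- function(n, p) {
  x <- matrix(rnorm(n * p), nrow = n, ncol = p)
  colnames(x) <- paste0("x", seq_len(p))
  y <- 0.5 * x[, 1] + rnorm(n)
  data.frame(y = y, x)
}

train <- simulate_data(n_train, p)
test  <- simulate_data(n_test, p)

# Ordinary least squares using every candidate predictor
fit <- lm(y ~ ., data = train)

# R-squared = 1 - SSE / SST, relative to the mean of the observed outcome
r_squared <- function(observed, predicted) {
  1 - sum((observed - predicted)^2) / sum((observed - mean(observed))^2)
}

r2_in  <- r_squared(train$y, predict(fit, newdata = train))
r2_out <- r_squared(test$y, predict(fit, newdata = test))

c(in_sample = r2_in, out_of_sample = r2_out)
```

With many predictors and few observations, the in-sample R-squared typically looks much better than the R-squared in new data. The model "explains" the training sample well but predicts poorly.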
---
class: onecol

## Cause and Effect

Machine learning is a **data-driven method**.

However, this does not mean that it is **atheoretical**.

--

<p style="padding-top:30px;">A strong understanding of the underlying causal structure of our variables is still important.

Models are more accurate if features are **causes** rather than **effects** of the outcome<sup>1</sup>.

The design of ML studies should be driven by strong theory, including:

- Feature selection
- Measurement
- Time frame (for longitudinal prediction)

.footnote[
[1] See Piccininni et al. (2020) for theoretical explanation and simulation results: 
]

---
class: onecol

## Measurement

*"Throwing the same set of poorly measured variables that have been analyzed before into machine learning algorithms is highly unlikely to produce new insights or findings."<sup>1</sup>*

.footnote[
[1] Jacobucci & Grimm (2020); *Perspectives on Psychological Science* <br>
[2] Flake & Fried (2020); *Advances in Methods and Practices in Psychological Science*
]

--

<p style="padding-top:30px;">**Questionable measurement practices**<sup>2</sup> include:

- Unclear definitions of constructs
- Lack of reliability and validity of measures
- Using scales in ways they were not intended

--

<p style="padding-top:30px;">Measurement error can prevent ML from accurately modeling nonlinear relationships.

---
class: onecol

## Measurement

<img src="data:image/png;base64,#../figs/qmps.png" width="85%" />

.footnote[
Flake & Fried (2020); *Advances in Methods and Practices in Psychological Science*
]

---
class: onecol

## Sample Size

One of the most common questions we hear is .imp["how much data do I need for ML?"]

Unfortunately, there is no straightforward or universal answer.

However, here are some important principles and general guidelines.

--

<p style="padding-top:30px;">1) When working with smaller sample sizes, use simpler models.

2) Use k-fold CV on the full dataset for smaller samples rather than one held-out test set.

3) Better yet, use nested CV for smaller samples!

---
class: onecol

## Bias and Fairness in Algorithms

Algorithms are often heralded as 'objective'.

However, even ML algorithms reflect the nature of the data used to train them.

ML can .imp[detect, learn, and perpetuate societal injustices] if we are not careful.

--

<p style="padding-top:30px;">For example:

- Chatbots trained on internet data produce sexist & racist responses
- Recidivism algorithms are biased against Black defendants
- Facial recognition works better for White people than people of color
- STEM advertisements are less likely to be displayed to women than men

---
class: onecol

## Bias and Fairness in Algorithms

.pull-left[
<img src="data:image/png;base64,#../figs/obermeyer.png" width="96%" style="display: block; margin: auto 0 auto auto;" />
]

--

.pull-right[
<img src="data:image/png;base64,#../figs/chatgpt_bias.png" width="100%" style="display: block; margin: auto auto auto 0;" />
]

---
class: onecol

## Bias and Fairness in Algorithms

Training a model with .imp[biased data] will produce .imp[biased predictions].

Biased machine learning models can perpetuate existing societal disparities.

--

<p style="padding-top:30px;">You should critically evaluate the potential for bias at each step of the ML workflow:

- Defining your question: who benefits from this work?
- Data collection/curation: who is included in your sample, and who is left out?
- Label definition: any discrepancy in ideal vs. actual outcome? (e.g., healthcare needs)
- Measurement: are measures (labels, features) accurate and reliable for everyone?
- Implementation: how do model predictions interact with human decision-making? How might this model be (mis)used to cause harm?
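---
class: onecol

## Bias and Fairness in Algorithms

One simple way to probe the "accurate and reliable for everyone?" question is to break held-out performance down by subgroup. The sketch below uses a small made-up table of predictions; the object `preds`, the `group` column, and its values are hypothetical stand-ins, not data from this course.

```r
library(dplyr)

# Hypothetical held-out predictions; `group` stands in for any subgroup variable
preds <- tibble(
  group    = rep(c("A", "B"), each = 6),
  truth    = factor(rep(c("yes", "yes", "yes", "no", "no", "no"), times = 2),
                    levels = c("yes", "no")),
  estimate = factor(c("yes", "yes", "yes", "no", "no", "yes",
                      "yes", "no", "no", "no", "no", "no"),
                    levels = c("yes", "no"))
)

# Performance computed separately within each subgroup
by_group <- preds %>%
  group_by(group) %>%
  summarise(
    n           = n(),
    accuracy    = mean(estimate == truth),
    sensitivity = mean(estimate[truth == "yes"] == "yes"),
    specificity = mean(estimate[truth == "no"] == "no")
  )

by_group
```

In this toy example, overall accuracy hides a large difference in sensitivity between the two groups. Subgroup estimates based on few observations are noisy, so interpret them with care.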
---
class: inverse, center, middle

# Modeling Decisions

---
class: twocol

## Choosing an Algorithm

Algorithm | Benefits | Drawbacks
:------- | :-------- | :-------
Ridge | handles multicollinearity; shrinks correlated features together | does not perform feature selection; does not model nonlinearity
Lasso | handles multicollinearity; performs feature selection | tends to pick one correlated feature & ignore the rest; does not model nonlinearity
Elastic Net | Ridge-like regression with lasso-like feature selection | does not model nonlinearity
Decision Trees | easily interpretable; models nonlinearity | unstable; poor prediction in new datasets (not often used in practice)
Random Forests | models nonlinearity; good prediction in new data | not easily interpretable; requires larger sample sizes
SVM | can handle `\(p>n\)`; models nonlinearity | not easily interpretable; can be difficult to choose a 'good' kernel function

---
class: inverse, center, middle

# Critically Evaluating Results

---
class: onecol

## What is "Good" Performance?

Not all performance metrics are equally informative for all prediction problems.

Consider your ultimate use case - what is most important for your model?

- Is it more important to detect **true positives**?
- Is it more important to avoid **false positives**?
- Are your data imbalanced?

These goals may differ across different modeling problems.

---
class: onecol

## Can you Trust your Model?

An ML model can always provide a prediction, given input features.
However, in some situations it is not appropriate to make such predictions:

- If a new data point is outside the range of training data
- If a model was trained for a different context
- If the data-generating process changes from model training to deployment

--

.pull-left[
<img src="data:image/png;base64,#../figs/dogormuffin.png" width="96%" style="display: block; margin: auto 0 auto auto;" />
]

--

.pull-right[
<img src="data:image/png;base64,#../figs/muffin_test.png" width="100%" style="display: block; margin: auto auto auto 0;" />
]

---
class: onecol

## Can you Trust your Model?

In some cases, the amount of .imp[uncertainty] is also too high for us to trust a prediction.

Recall that many classification models use 0.50 as a decision boundary between classes.

--

<p style="padding-top:30px;">Imagine you take a COVID test and it comes back positive (predicted class = 1).

But imagine you then learn the *predicted probability* was only 0.51.

Would you trust the prediction from this COVID test?

--

<p style="padding-top:30px;">**Equivocal zones**<sup>1</sup> can help flag when uncertainty is too high for predictions to be trusted.

These can be set manually (e.g., 0.50 `\(\pm\)` 0.15) or based on standard errors.

.footnote[
[1] See more on the tidymodels website: 
]

---
class: onecol

## Importance of External Validation

Throughout this course, we have learned methods for **internal cross-validation**.

However, .imp[external validation] remains the gold standard for evaluating ML models.

--

.pull-left[
<img src="data:image/png;base64,#../figs/covid.png" width="100%" />
]

--

.pull-right[
ML models have been very popular in predicting COVID-19 outcomes.

Many papers made impressive claims about predictive accuracy.

However, when 22 published ML models were tested on new data, *none* beat simple univariate predictors.
]

---
class: inverse, center, middle

# Reading and Reviewing ML Papers

---
class: onecol

## Peer-Reviewing ML Papers

ML papers in psychology (and other social/behavioral sciences) are .imp[rapidly increasing].

Often, this feels like a double-edged sword.

- On one hand, ML methods can help solve many problems we care about.
- On the other, some papers may not be high-quality.
- People may rush to publish before fully understanding their models.

--

<p style="padding-top:30px;">Either way, this means there are **many papers that need to be reviewed!**

As responsible ML practitioners, you should be well-equipped to review these papers.

Here are some aspects to pay attention to when reviewing ML papers.

---
class: twocol

## Is the Sample Appropriate?

.left-column.pv3[
<img src="data:image/png;base64,#../figs/sample.jpg" width="100%" />
]

.right-column[
**As a reader/reviewer, consider:**

- Is the sample representative of the population?
- Will a model trained on this sample generalize to future use cases?
- Are any groups under- or over-represented in the sample?
- Is sample size adequate (given their specific modeling methods)?
- Is the sample well-suited for answering the research question?
]

---
class: twocol

## Measurement and Feature Engineering

.left-column.pv3[
<img src="data:image/png;base64,#../figs/engineer.jpg" width="100%" />
]

.right-column[
**As a reader/reviewer, consider:**

- Measurement (and justification) of features and labels
- Rationale for and clear description of feature engineering
- Appropriate feature engineering to match specific algorithms (e.g., normalizing features for regularized regression)
- Clear description of missing data and missing data handling (e.g., listwise deletion vs imputation)
]

---
class: twocol

## Resampling and Data Leakage

.left-column.pv3[
<img src="data:image/png;base64,#../figs/leak.jpg" width="100%" />
]

.right-column[
**As a reader/reviewer, consider:**

- Is there clear separation between model training vs.
evaluation?
- Is there any evidence of data leakage?
- Adequate size of test sets during resampling
- Understanding of limitations of resampling methods
- Appropriate handling of multilevel/nested data (if applicable)
- Appropriate handling of time series data (if applicable)
]

---
class: twocol

## Model Evaluation and Interpretation

.left-column.pv3[
<img src="data:image/png;base64,#../figs/target.jpg" width="100%" />
]

.right-column[
**As a reader/reviewer, consider:**

- Are interpretations and claims supported by the modeling methods?
- Are evaluation metrics differentiated appropriately (e.g., accuracy vs. AUROC vs. sensitivity vs. specificity)?
- Appropriate interpretation of performance
- Not overstating performance
- Limitations adequately stated
]

---
class: onecol

## Summary & Wrap-Up

Throughout this course, we have learned the **fundamentals of machine learning**.

We learned ML **methods** (feature engineering, cross-validation, performance evaluation).

We also learned ML **algorithms** (ridge, lasso, elastic net, trees, random forests, SVM).

--

<p style="padding-top:30px;">We used {tidymodels} and learned to **train, tune, test, and visualize ML models**.

You can now fit your own ML models, evaluate ML papers, and learn advanced methods!

To continue learning more, we recommend:

- *Applied Predictive Modeling* (Kuhn & Johnson, 2013)
- *Introduction to Statistical Learning* (James, Witten, Hastie, & Tibshirani, 2021)

---
class: inverse, center, middle

# Time for a Break!
---
class: inverse, center, middle

# Resampled Model Coefficients

---
class: onecol

## Prepare Data and Folds, Set up Workflow

.scroll.h-0l[

``` r
library(tidymodels)
library(tidyverse)

# Load data
titanic <- read_csv("")

# Create data splits, stratified by fare
set.seed(2022)
fare_split <- initial_split(data = titanic, prop = 0.8, strata = fare)
fare_train <- training(fare_split)
fare_test <- testing(fare_split)
fare_folds <- vfold_cv(data = fare_train, v = 10, repeats = 3, strata = fare)

# Specify model
lin_reg <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Feature engineering
fare_recipe <- recipe(titanic) %>%
  update_role(fare, new_role = "outcome") %>%
  update_role(pclass:parch, new_role = "predictor") %>%
  update_role(survived, new_role = "ignore") %>%
  step_naomit(fare) %>%
  step_mutate(
    pclass = factor(pclass),
    sex = factor(sex)
  ) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_impute_linear(age) %>%
  step_nzv(all_predictors()) %>%
  step_corr(all_numeric_predictors()) %>%
  step_lincomb(all_numeric_predictors()) %>%
  step_normalize(all_predictors())

# Prepare workflow
fare_wflow <- workflow() %>%
  add_model(lin_reg) %>%
  add_recipe(fare_recipe)
```

]

---
class: onecol

## Extract Coefficients

The goal of `fit_resamples()` is to .imp[measure model performance].

Therefore, by default, the fitted models are not saved or used later.

To save model coefficients, we need to .imp[extract] the underlying model object (engine fit).

We can do this with a custom function, and specify it in `control_resamples()`:

--

``` r
get_resample_coefs <- function(x) {
  x %>%
    # get the lm model object
    extract_fit_engine() %>%
    # transform its format
    tidy()
}

# save predictions and extract coefficients from each resample
get_coefs <- control_resamples(save_pred = TRUE, extract = get_resample_coefs)
```

---
class: onecol

## Fit Model with Resampling

Now that we've set the workflow & configured resampling, we're ready to fit the model!
--

``` r
# train the model using the recipe, data, and method
fare_fitr <- fare_wflow %>%
  fit_resamples(resamples = fare_folds, control = get_coefs)
```

---
class: onecol

## Model Coefficients

Extracting the coefficients we need is a bit messy, but let's walk through it.

The output of `get_resample_coefs()` is contained in the column `.extracts`.

--

.scroll.h-2l[

``` r
fare_fitr
#> # Resampling results
#> # 10-fold cross-validation repeated 3 times using stratification
#> # A tibble: 30 × 7
#>    splits            id      id2    .metrics .notes   .extracts .predictions
#>    <list>            <chr>   <chr>  <list>   <list>   <list>    <list>
#>  1 <split [940/106]> Repeat1 Fold01 <tibble> <tibble> <tibble>  <tibble>
#>  2 <split [940/106]> Repeat1 Fold02 <tibble> <tibble> <tibble>  <tibble>
#>  3 <split [941/105]> Repeat1 Fold03 <tibble> <tibble> <tibble>  <tibble>
#>  4 <split [941/105]> Repeat1 Fold04 <tibble> <tibble> <tibble>  <tibble>
#>  5 <split [941/105]> Repeat1 Fold05 <tibble> <tibble> <tibble>  <tibble>
#>  6 <split [941/105]> Repeat1 Fold06 <tibble> <tibble> <tibble>  <tibble>
#>  7 <split [942/104]> Repeat1 Fold07 <tibble> <tibble> <tibble>  <tibble>
#>  8 <split [942/104]> Repeat1 Fold08 <tibble> <tibble> <tibble>  <tibble>
#>  9 <split [942/104]> Repeat1 Fold09 <tibble> <tibble> <tibble>  <tibble>
#> 10 <split [944/102]> Repeat1 Fold10 <tibble> <tibble> <tibble>  <tibble>
#> # ℹ 20 more rows
```

]

---
class: onecol

## Model Coefficients

Inside `fare_fitr$.extracts` is *another* `.extracts` column.

``` r
fare_fitr$.extracts[[1]]
#> # A tibble: 1 × 2
#>   .extracts        .config
#>   <list>           <chr>
#> 1 <tibble [7 × 5]> Preprocessor1_Model1
```

---
class: onecol

## Model Coefficients

This nested column has the coefficients from each k-fold!
``` r
fare_fitr$.extracts[[1]]$.extracts
#> [[1]]
#> # A tibble: 7 × 5
#>   term        estimate std.error statistic   p.value
#>   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)   33.1        1.28    25.9   6.57e-112
#> 2 age           -0.804      1.56    -0.517 6.05e-  1
#> 3 sibsp          5.78       1.44     4.03  6.11e-  5
#> 4 parch          7.49       1.41     5.32  1.27e-  7
#> 5 pclass_X2    -26.5        1.62   -16.3   5.50e- 53
#> 6 pclass_X3    -36.7        1.80   -20.4   5.25e- 77
#> 7 sex_male      -4.37       1.34    -3.27  1.10e-  3
```

---
class: onecol

## Model Coefficients

We can tidy up these results with the {tidyr} `unnest()` function.

.scroll.h-1l[

``` r
fare_coefs <- fare_fitr %>%
  select(id, id2, .extracts) %>%
  unnest(.extracts) %>%
  unnest(.extracts) %>%
  mutate(resample = paste0(id, "_", id2))

fare_coefs
#> # A tibble: 210 × 9
#>    id      id2    term   estimate std.error statistic   p.value .config resample
#>    <chr>   <chr>  <chr>     <dbl>     <dbl>     <dbl>     <dbl> <chr>   <chr>
#>  1 Repeat1 Fold01 (Inte…   33.1        1.28    25.9   6.57e-112 Prepro… Repeat1…
#>  2 Repeat1 Fold01 age      -0.804      1.56    -0.517 6.05e-  1 Prepro… Repeat1…
#>  3 Repeat1 Fold01 sibsp     5.78       1.44     4.03  6.11e-  5 Prepro… Repeat1…
#>  4 Repeat1 Fold01 parch     7.49       1.41     5.32  1.27e-  7 Prepro… Repeat1…
#>  5 Repeat1 Fold01 pclas…  -26.5        1.62   -16.3   5.50e- 53 Prepro… Repeat1…
#>  6 Repeat1 Fold01 pclas…  -36.7        1.80   -20.4   5.25e- 77 Prepro… Repeat1…
#>  7 Repeat1 Fold01 sex_m…   -4.37       1.34    -3.27  1.10e-  3 Prepro… Repeat1…
#>  8 Repeat1 Fold02 (Inte…   33.1        1.28    25.9   3.76e-112 Prepro… Repeat1…
#>  9 Repeat1 Fold02 age      -0.388      1.52    -0.255 7.99e-  1 Prepro… Repeat1…
#> 10 Repeat1 Fold02 sibsp     4.80       1.43     3.35  8.53e-  4 Prepro… Repeat1…
#> # ℹ 200 more rows
```

]

---
class: onecol

## Model Coefficients

Now we can plot the model coefficients across each k-fold resample:

``` r
# plot coefficients across each resample
fare_coefs %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(x = term, y = estimate, group = resample, col = resample)) +
  geom_hline(yintercept = 0, lty = 3) +
  geom_line(alpha = 0.3, lwd = 0.5)
```

<img src="data:image/png;base64,#slides_4b_files/figure-html/unnamed-chunk-19-1.png" width="70%" />
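---
class: onecol

## Model Coefficients

A numeric summary can complement the plot. The sketch below assumes a long table shaped like `fare_coefs` (one row per term per resample). To keep it self-contained, it uses a small made-up stand-in, `toy_coefs`, with hypothetical values rather than the real resampling results.

```r
library(dplyr)

# Toy stand-in for `fare_coefs` (hypothetical values; three resamples, two terms)
toy_coefs <- tibble(
  resample = rep(c("Repeat1_Fold01", "Repeat1_Fold02", "Repeat1_Fold03"), each = 2),
  term     = rep(c("age", "sibsp"), times = 3),
  estimate = c(-1, 5, 0, 6, 1, 7)
)

# Summarize how stable each coefficient is across resamples
coef_summary <- toy_coefs %>%
  group_by(term) %>%
  summarise(
    mean_estimate = mean(estimate),
    sd_estimate   = sd(estimate),
    prop_positive = mean(estimate > 0)
  )

coef_summary
```

The same `group_by()` and `summarise()` steps should apply to `fare_coefs` directly. Treat the spread across resamples as a descriptive stability check rather than a formal standard error, because the training sets of different folds overlap.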