class: center, middle, inverse, title-slide

.title[
# Conceptual Introductions
]

.subtitle[
## 👩‍💻 🤖 👨‍🏫
]

.author[
### Machine Learning in R
SMaRT Workshops
]

.date[
### Day 1A Jeffrey Girard
]

---
class: twocol

## Jeffrey Girard

.pull-left[
Assistant Professor<br />
University of Kansas

**Research Areas**

- Affective Science
- Clinical Psychology
- Computer Science

**Machine Learning**

- Recognition of Facial Expressions
- Prediction of Emotional States
- Prediction of Mental Health Status
]

.pull-right.center.lh-copy[
<img src="data:image/png;base64,#../figs/jg_headshot.jpeg" width="240" height="240" />

[www.jmgirard.com](https://affcom.ku.edu/girard)<br />
[jmgirard@ku.edu](mailto:jmgirard@ku.edu)<br />
[@jeffreymgirard](https://twitter.com/jeffreymgirard)
]

---
class: twocol

## Shirley Wang

.pull-left[
Incoming Assistant Professor<br />
Yale University

**Research Areas**

- Clinical Psychology
- Computational Psychiatry
- Suicide, Self-Harm, Eating Disorders

**Machine Learning**

- Prediction of Suicide Risk
- Intensive Longitudinal Data
- Algorithmic Bias & Fairness
]

.pull-right.center.lh-copy[
<img src="data:image/png;base64,#../figs/sw_headshot.png" width="240" height="240" />

[ccslab.yale.edu](https://ccslab.yale.edu/)<br />
[shirley.wang@yale.edu](mailto:shirley.wang@yale.edu)<br />
[@ShirleyBWang](https://twitter.com/ShirleyBWang)
]

---
class: twocol

## Goals and Timeline

.pull-left[
**Build a foundation** of concepts and skills

**Describe every step** from start to finish

Emphasize **practical and applied** aspects

Provide intuitions rather than lots of theory

Dive deeper into a few algorithms

Highlight algorithms good for beginners

Communicate the pros and cons of choices
]

--

.pull-right.pv4[
<table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Day </th>
   <th style="text-align:left;"> Parts </th>
   <th style="text-align:left;"> Topics </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> 1 </td>
   <td style="text-align:left;"> A,B </td>
   <td style="text-align:left;"> Concepts, tidyverse, data </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:left;"> C,D </td>
   <td style="text-align:left;"> Validation, model fitting </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 2 </td>
   <td style="text-align:left;"> A,B </td>
   <td style="text-align:left;"> Workflows, metrics, recipes </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:left;"> C,D </td>
   <td style="text-align:left;"> Resampling, cross-validation </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 3 </td>
   <td style="text-align:left;"> A,B </td>
   <td style="text-align:left;"> GLMNET, tuning </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:left;"> C,D </td>
   <td style="text-align:left;"> Random forest, reporting </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 4 </td>
   <td style="text-align:left;"> A,B </td>
   <td style="text-align:left;"> SVM, practical matters </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:left;"> C,D </td>
   <td style="text-align:left;"> Q&A, consultation </td>
  </tr>
 </tbody>
</table>
]

---
class: twocol

## Format and Materials

.pull-left[
The course will span **four days** (M, Tu, W, Th)

Each day will be organized into **four parts** (Parts A–D, each running for 55 minutes)

We will take a 10-minute break between<br />Parts A and B and between Parts C and D

We will take a 60-minute break between<br />Parts B and C for lunch (12:30–1:30pm EDT)

Most parts will have **lecture** and **live coding**

We will provide optional **practice activities** to complete after each course day (or later)
]

--

.pull-right.lh-copy[
Course materials are on the course website:

- <https://smartworkshops.github.io/mlr/>

Video recordings are on the SMaRT website:

- <https://smart-workshops.com>

A few inspirations for this workshop include:

- [Applied Predictive Modeling](http://appliedpredictivemodeling.com/)
- [Tidy Modeling with R](https://www.tmwr.org/)
- [StatQuest](https://statquest.org/)
]

.footnote[*Note.* Breaks are meant to provide both a
rest and a time buffer (e.g., to allow questions and long-windedness).]

---
class: inverse, center, middle

# Conceptual Introduction

---
class: onecol

## What is machine learning?

The field of machine learning (ML) is a **branch of computer science**

ML researchers **develop algorithms** with the capacity to .imp[learn from data]

When algorithms learn from (i.e., are **trained on**) data, they create **models**<sup>1</sup>

<p style="padding-top:20px;">This workshop is all about applying ML algorithms to create .imp[predictive models]</p>

The goal will be to **predict unknown values** of important variables **in new data**

.footnote[
[1] ML models are commonly used for prediction, data mining, and data generation.
]

---
class: twocol

.pull-left[
## Labels / Outcomes

Labels are variables we .imp[want to predict]<br />the values of (because they are unknown)

Labels tend to be expensive or difficult to measure in new data (though they are known in some existing data that we can learn from)

AKA outcome, dependent, or `\(y\)` variables

<img src="data:image/png;base64,#../figs/label_icons.png" width="100%" />
]

--

.pull-right[
## Features / Predictors

Features are variables we .imp[use to predict]<br />the unknown values of the label variables

Features tend to be cheaper and easier to measure in new data than labels (and are also known in some existing data)

AKA predictor, independent, or `\(x\)` variables

<img src="data:image/png;base64,#../figs/feature_icons.png" width="100%" />
]

---
class: twocol

## Modes of Predictive Modeling

.pull-left[
When labels have continuous values, predicting them is called .imp[regression]

<img src="data:image/png;base64,#../figs/regression_diagram.png" width="100%" />

- *How much will a customer spend?*
- *What GPA will a student achieve?*
- *How long will a patient be hospitalized?*
]

--

.pull-right[
When labels have categorical values, predicting them is called .imp[classification]

<img src="data:image/png;base64,#../figs/classification_diagram.png" width="100%" />

- *Is an email spam or non-spam?*
- *Which candidate will a user vote for?*
- *Is a patient's glucose low, normal, or high?*
]

.footnote[*Unsupervised learning* (AKA data mining) has no explicit labels and just looks for patterns within the features.]

---
class: onecol

## Modes of Predictive Modeling

.pull-left[
.center[**Regression**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/regression_example-1.png" width="100%" />
]

--

.pull-right[
.center[**Classification**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/classification_example-1.png" width="100%" />
]

---

## Comprehension Check \#1

<span style="font-size:30px;">Ann has developed an ML system that looks at a patient's physiological signals and tries to determine whether they are having a micro-seizure.</span>

.pull-left[
### Question 1

**The features are <span style="text-decoration: underline; white-space: pre;"> </span> and the labels are <span style="text-decoration: underline; white-space: pre;"> </span>.**

a) physiological signals; physiological signals

b) physiological signals; micro-seizure (yes/no)

c) micro-seizure (yes/no); physiological signals

d) micro-seizure (yes/no); micro-seizure (yes/no)
]

.pull-right[
### Question 2

**Which "mode" of predictive modeling is this?**

a) Regression

b) Classification

c) Unsupervised learning

d) All of the above
]

---
class: inverse, center, middle

# Modeling Workflow

---

## Typical ML Workflow

.center.pv4[
<img src="data:image/png;base64,#../figs/workflow.png" width="100%" />
]

---
class: onecol

## Exploratory Analysis

.left-column.pv3[
<img src="data:image/png;base64,#../figs/explore.jpg" width="100%" />
]

.right-column[
**Verify the quality of your variables**

- Examine the distributions of feature and label variables
- Look for errors, outliers, missing data, etc.
**Gain inspiration for your model**

- Identify relevant features for a label
- Detect highly correlated features
- Determine the "shape" of relationships
]

---
class: onecol

## Feature Engineering

.left-column.pv3[
<img src="data:image/png;base64,#../figs/engineer.jpg" width="100%" />
]

.right-column[
**Prepare the features for analysis**

- *Extract* features
- *Transform* features
- *Re-encode* features
- *Combine* features
- *Reduce* feature dimensionality
- *Impute* missing feature values
- *Select* and drop features
]

---
class: onecol

## Model Development

.left-column.pv3[
<img src="data:image/png;base64,#../figs/develop.jpg" width="100%" />
]

.right-column[
**Choose algorithms, software, and architecture**

- Elastic Net and/or Random Forest
- `caret` or `tidymodels`, `elasticnet` or `glmnet`
- Regression or classification

**Train the model by estimating parameters**

- Learn the nature of the feature-label relationships
- For instance, estimate the intercept and slopes
]

---
class: onecol

## Model Tuning

.left-column.pv3[
<img src="data:image/png;base64,#../figs/tune.jpg" width="100%" />
]

.right-column[
**Determine how complex the model can become**

- How many features to include in the model
- How complex the shape of relationships can be
- How many features can interact together
- How much to penalize adding more complexity

**Make other decisions in a data-driven manner**

- Which of three algorithms should be preferred
- Which optimization method should be used
]

---
class: onecol

## Model Evaluation

.left-column.pv3[
<img src="data:image/png;base64,#../figs/target.jpg" width="100%" />
]

.right-column[
**Decide how to quantify predictive performance**

- In regression, performance is based on the errors/residuals
- In classification, performance is based on the confusion matrix

**Determine how successful your predictive model was**

- Compare predictions (i.e., predicted labels) to trusted labels
- Compare the performance of one model to another model
]

---
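class: onecol

## Model Evaluation: A Sketch in R

The two ways of quantifying performance on the previous slide can be sketched in a few lines of base R. This is only an illustrative sketch under stated assumptions, not the workshop's own code: the values are made-up toy data, and every object name (`truth_reg`, `pred_cls`, `conf_mat`, etc.) is hypothetical.

```r
# Regression: performance is based on the errors/residuals.
# Toy values, invented purely for illustration.
truth_reg <- c(3.0, 5.0, 2.5, 7.0)   # trusted labels
pred_reg  <- c(2.5, 5.0, 4.0, 8.0)   # predicted labels

residuals_reg <- truth_reg - pred_reg
rmse <- sqrt(mean(residuals_reg^2))  # root mean squared error (about 0.935)
mae  <- mean(abs(residuals_reg))     # mean absolute error (0.75)

# Classification: performance is based on the confusion matrix.
# Toy values, invented purely for illustration.
lvls <- c("non-spam", "spam")
truth_cls <- factor(c("spam", "non-spam", "spam", "non-spam", "spam"), levels = lvls)
pred_cls  <- factor(c("spam", "non-spam", "non-spam", "non-spam", "spam"), levels = lvls)

# Rows are predicted labels, columns are trusted labels
conf_mat <- table(Predicted = pred_cls, Truth = truth_cls)

# Accuracy: proportion of predictions that match the trusted labels (0.8)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(rmse)
print(mae)
print(conf_mat)
print(accuracy)
```

Computing `rmse` or `accuracy` for two candidate models against the same trusted labels is one simple way to compare the performance of one model to another.

---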
## Comprehension Check \#2

<span style="font-size:30px;">Yuki trained an algorithm to predict the number of "likes" a tweet will receive based on measures of the tweet's formatting and content.</span>

.pull-left[
### Question 1

**Calculating the length of each tweet is <span style="text-decoration: underline; white-space: pre;"> </span>?**

a) Feature Engineering

b) Model Development

c) Model Tuning

d) Model Evaluation
]

.pull-right[
### Question 2

**When should problems with the data be found?**

a) Model Evaluation

b) Model Tuning

c) Model Development

d) Exploratory Analysis
]

---
class: inverse, center, middle

# Signal and Noise

---
class: onecol

## A Delicate Balance

Any data we collect will contain a mixture of **signal** and **noise**

- The "signal" represents informative patterns that generalize to new data
- The "noise" represents distracting patterns specific to the original data

We want to capture as much signal and as little noise as possible

--

<p style="padding-top:30px;">More complex models will allow us to capture <b>more signal</b> but also <b>more noise</b></p>

.imp[Overfitting]: If our model is too complex, we will capture unwanted noise

.imp[Underfitting]: If our model is too simple, we will miss important signal

---

## Model Complexity

<img src="data:image/png;base64,#slides_1a_files/figure-html/complexity-1.png" width="100%" />

---
class: twocol

## A Super Metaphor

.left-column.pv3[
<img src="data:image/png;base64,#../figs/kryptonite.png" width="100%" />
]

.right-column[
What makes machine learning so amazing is<br />
its **ability to learn complex patterns**

However, with this great power and flexibility<br />
comes the looming **danger of overfitting**

Thus, much of ML research is about finding ways<br />
to **detect** and **counteract** overfitting

For detection, we need *at least* two sets of data:

.imp[Training set]: used to learn relationships

.imp[Testing set]: used to evaluate performance
]

---
layout: true

## An Example of Overfitting

---

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex1-1.png" width="100%" />

---

.pull-left[
.center[**Model A (Low Complexity)**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex2-1.png" width="100%" />
]

.pull-right[
.center[**Model B (High Complexity)**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex3-1.png" width="100%" />
]

---
count: false

.pull-left[
.center[**Model A (Low Complexity)**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex4-1.png" width="100%" />

Total error on training data = 31.2 (.imp[High Bias])
]

.pull-right[
.center[**Model B (High Complexity)**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex5-1.png" width="100%" />

Total error on training data = 0.0 (.imp[Low Bias])
]

---

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex6-1.png" width="100%" />

---

.pull-left[
.center[**Model A (Low Complexity)**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex7-1.png" width="100%" />
]

.pull-right[
.center[**Model B (High Complexity)**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex8-1.png" width="100%" />
]

---
count: false

.pull-left[
.center[**Model A (Low Complexity)**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex9-1.png" width="100%" />

Total error on testing set = 17.4 (.imp[Low Variance])
]

.pull-right[
.center[**Model B (High Complexity)**]

<img src="data:image/png;base64,#slides_1a_files/figure-html/overex10-1.png" width="100%" />

Total error on testing set = 41.6 (.imp[High Variance])
]

---
layout: false
class: onecol

## Conclusions from Example

In ML, .imp[bias] is a lack of predictive accuracy in the original data (the "training set")

In ML, .imp[variance] is a lack of predictive accuracy in new data (the "testing set")

--

.pt1[
An ideal predictive model would have both low bias and low variance

However, there is often an inherent **trade-off between bias and variance**<sup>1</sup>
]

.footnote[[1] To increase our testing set performance, we often need to worsen our performance in the training set.]

--

.pt1[
We want to find the model that is **as simple as possible** but **no simpler**
]

---

## A Graphical Explanation of Overfitting

<img src="data:image/png;base64,#../figs/overfitting.png" width="100%" />

---

## A Meme-based Explanation of Overfitting

<img src="data:image/png;base64,#../figs/overfitting_lay.png" width="100%" />

---

## Comprehension Check \#3

<span style="font-size:30px;">Sam used all emails in his inbox to create an ML model to classify emails as "work-related" or "personal." Its accuracy on these emails was 98%.</span>

.pull-left[
### Question 1

**Is Sam done with this ML project?**

a) Yes, he should sell this model right now!

b) No, he needs to create a training set

c) No, he needs to test the model on new data

d) No, his model needs to capture more noise
]

.pull-right[
### Question 2

**Which problems has Sam already addressed?**

a) Overfitting

b) Underfitting

c) Variance

d) All of the above
]

---
class: onecol

## Use Cases for Machine Learning

1) You are more interested in .imp[prediction] than understanding

2) Your goals are more .imp[practical] than scientific

--

.pv1[
3) You have a lot of observations (e.g., `\(n>625\)`)

4) You have a lot of predictor variables (e.g., `\(p>10\)`)
]

--

5) Many predictor variables are related to one another

6) The relationships between outcomes and predictors are **complex**

--

.pv3[
.bg-light-green.b--dark-green.ba.bw1.br3.ph4[
**Note:** You can still publish papers about prediction and practical efforts!
]
]

---
class: onecol

## But I want *both* prediction and understanding!

- We will learn some tools for explaining models and predictions

--

.pv1[
- Machine learning is really designed for prediction
- Statistical modeling is really designed for inference
]

--

- Advanced methods combine both prediction and inference
- Statistically compare performance of different models
- Statistically compare performance of different feature sets
- Statistically compare performance for participant groups

---

## Activity: Icebreakers

.f4[
In chat, post a brief introduction about yourself and react to at least one other attendee's post (e.g., with an emoji or by "replying in thread" with a message).
]

#### Possible Inclusions:

- Where are you joining us from?
- What discipline are you in and what topics do you study?
- What problems do you want to apply machine learning to?
- What do you do for fun?

#### Example:

> Hi! I am Jeff, one of the two instructors for this course. I am an Assistant Professor of quantitative and clinical psychology at the University of Kansas (in Lawrence, Kansas, USA). I study affective and interpersonal functioning and how technology can be used to improve research in the social and medical sciences. I am interested in using machine learning to recognize facial expressions from images and to predict emotional states and mental health status from clinical data. For fun, I bring my 100 lb Labrador/Pyrenees mix to the dog park or read sci-fi/fantasy novels.

---
class: inverse, center, middle

# Time for a Break!