IN PERSON – 5-day Statistics Short Course
Data Mining and Machine Learning
Seminar Overview:
An intermediate 5-day course introducing several popular machine learning approaches, such as regression-based methods (ridge and lasso regularized regression, regression splines), tree methods (random forests, boosted trees), support vector machines, and interpretable machine learning (IML), as well as their application to empirical data. The course combines lectures and hands-on practice using R.
Seminar Topics:
- Review of linear regression and the least squares criterion
- Regularization methods (ridge regression, lasso, elastic net)
- Regression splines
- Prediction error and k-fold cross validation
- Tree methods to predict categorical or continuous outcomes (CART, random forest, boosting)
- Interpretable Machine Learning (IML)
- Support vector machines for classification
Seminar Description:
Machine learning refers to leveraging data to build statistical models or algorithms. The objective is usually to gain knowledge about the structure in the data in order to make predictions or decisions.
This short course is based on:
- An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani)
- Hands-On Machine Learning with R (Boehmke & Greenwell)
- Interpretable Machine Learning (Molnar)
The course starts by briefly outlining the key differences and similarities between standard parametric modeling (e.g., linear regression, structural equation modeling) and machine learning (also known as statistical learning or data mining). It then provides basic insights into a number of popular methods, such as regression methods (ridge regression and the lasso, regression splines), tree methods (CART, random forests, boosting), interpretable machine learning (IML), and support vector machines. The emphasis is on a conceptual understanding of these methods and their appropriate application to empirical data. Importantly, these methods are useful not only for large data collections but also, more generally, for exploratory analyses when the substantive theory needed to design and fit parametric models (e.g., SEM) is lacking. Machine learning is used in a wide variety of fields, including but not limited to public health, education, biology, and the social sciences.
Participants are invited to discuss potential machine learning applications to their data during individual consultations with the instructor scheduled at the end of days 2-5.
Participants will receive an electronic copy of all course materials, including lecture slides, practice datasets, software scripts (R), relevant supporting documentation, and recommended readings. Participants will also have access to a video recording of the course.
Instructor: Gitta Lubke, Ph.D.
Gitta Lubke is a Professor Emerita in the Department of Psychology/Quantitative Area at the University of Notre Dame. Her research interests included machine learning and general latent variable modeling. Empirical applications were mainly in the field of psychiatric disorders and behavioral genetics. Other areas of expertise include mixture models, twin models, multi-group factor analysis and measurement invariance, longitudinal analyses, and the analysis of categorical data.
APA Continuing Education Credits:
This course offers 29 hours of Continuing Education Credits. Stats Camp Foundation is approved by the American Psychological Association to sponsor continuing education for psychologists. Stats Camp Foundation maintains responsibility for this program and its content.
Seminar Includes:
Materials, downloads, and a recorded course video viewable for up to one year.