COVID-19 UPDATE: The safety of our students and staff is our top priority. Therefore, Stats Camp will hold seminars online via live, interactive Zoom discussion groups. Our goal is to expand interactivity and provide one-on-one consulting time in virtual breakout rooms. As an added value, we are offering a discount code worth $200 off a future camp and 1 hour of post-camp consultation. Registrations will be accepted up to 12 hours before the seminar start date and time. All seminars will be conducted in Central Daylight Time (CDT) and will be recorded. The recordings will be made available to you within 3-5 business days of the live recording date. Access to the recorded videos will be granted for 1 year from the date of the seminar. Have questions? Contact us.

Introduction to Data Mining Seminar

Session 2: June 8 – 12, 2020
Albuquerque, NM – Embassy Suites

$1,895 Faculty/Professional or $1,145 Student/Post-Doc

Payment Options

Seminar fee includes all materials, downloads, software access, training, refreshments, and access to a recorded video of the seminar.

Enrollment is open to the public, students, graduates, and professionals. Save a seat today, pay later.

Seminar Syllabus

Monday, June 8, 2020
9:00-9:30  Welcome and introductions
9:30-10:45  Simple and Multiple Linear Regression
10:45-11:00  Snack and refreshment break
11:00-12:30  Ridge Regression and Lasso
12:30-1:30  Lunch break
1:30-3:00  Application of Ridge Regression and Lasso
3:00-3:15  Snack and refreshment break
3:15-5:00  Prediction Error and Cross Validation
5:30-7:30  Social “hour” for all Stats Campers

Tuesday, June 9, 2020
9:00-10:45  Regression Splines
10:45-11:00  Snack and refreshment break
11:00-12:30  Application of Regression Splines
12:30-1:30  Lunch break
1:30-3:00  Introduction to Tree Methods
3:00-3:15  Snack and refreshment break
3:15-5:00  Individual consultation with instructor

Wednesday, June 10, 2020
9:00-10:45  CART and Bagging
10:45-11:00  Snack and refreshment break
11:00-12:30  Random Forests
12:30-1:30  Lunch break
1:30-3:00  Application of Random Forests
3:00-3:15  Snack and refreshment break
3:15-5:00  Individual consultation with instructor

Thursday, June 11, 2020
9:00-10:45  Boosted Trees
10:45-11:00  Snack and refreshment break
11:00-12:30  Boosted Trees
12:30-1:30  Lunch break
1:30-3:00  Application of Boosted Trees
3:00-3:15  Snack and refreshment break
3:15-5:00  Individual consultation with instructor

Friday, June 12, 2020
9:00-10:45  Support Vector Machines
10:45-11:00  Snack and refreshment break
11:00-12:30  Application of Support Vector Machines
12:30-1:30  Lunch break
1:30-3:00  Individual consultation with instructor
3:00-3:15  Snack and refreshment break
3:15-5:00  Individual consultation with instructor

Why Should You Attend?

  • Get One-on-One Consultation With the Instructor
  • Professional Networking
  • Peer Socializing
  • Collaboration
  • All Seminar Resources
  • Breakfast (Embassy guests), Lunches, & Snacks Daily

Data Mining Seminar Description


An intermediate 5-day course introducing several popular data mining approaches, such as regression-based methods (ridge and lasso regularized regression, regression splines), tree methods (random forests, boosted trees), and support vector machines, and their application to empirical data. The course combines lectures and hands-on practice using R.

Instructor: Gitta Lubke, Ph.D.

Gitta Lubke is a Full Professor in the Department of Psychology/Quantitative Area at the University of Notre Dame. Her research interests are in data mining and general latent variable modeling. In addition to the challenges of analysing complex human behavior such as psychiatric disorders, she is interested in the analysis of genetic data. Related areas of expertise include mixture models, twin models, multi-group factor analysis and measurement invariance, longitudinal analyses, and the analysis of categorical data.

Course Description

In an age of rapidly increasing data collection, it has become more and more important to understand how to find structure in data, especially when substantive theory about structural relations between the collected variables is not yet fully developed. This short course starts by briefly outlining the key differences and similarities between standard parametric modeling (e.g., linear regression) and data mining approaches. The course provides basic insights into a number of popular methods such as regression methods (ridge regression and the lasso, regression splines), tree methods (CART, random forests, boosting), and support vector machines. The emphasis is on a conceptual understanding of these methods and their appropriate application to empirical data. Importantly, these methods are useful not only for large data collections, but also more generally for exploratory analyses when the substantive theory needed to design and fit suitable parametric models (e.g., SEM) is not available. Data mining (also known as statistical learning) is used in a wide variety of fields, including but not limited to public health, education, biology, and the social sciences.

Participants are invited to discuss potential data mining applications to their particular field of interest during individual consultations with the instructor scheduled at the end of the second and third days.

Participants will receive an electronic copy of all course materials, including lecture slides, practice datasets, software scripts, relevant supporting documentation, and recommended readings. Participants will also have access to a video recording of the course.

Course Topics

  • Review of linear regression and the least squares criterion
  • Regularization methods (ridge regression, lasso, elastic net)
  • Regression splines
  • Prediction error and k-fold cross validation
  • Tree methods to predict categorical or continuous outcomes (CART, random forests, boosting)
  • Support vector machines for classification

Course Learning Goals

After engaging in course lectures and discussions as well as completing the hands-on practice activities with empirical data, participants will be able to:

  • Understand some of the key differences and similarities between parametric modeling and data mining methods
  • Expand the acquired basic knowledge of several popular data mining methods and apply these methods to empirical data
  • Assess and interpret the results of empirical analyses through k-fold cross validation and computation of prediction errors
  • Utilize R packages for data mining
  • Understand and evaluate scientific papers covering data mining applications to empirical data

Course Prerequisites


  • Advanced proficiency in linear regression, including the estimation of regression coefficients using least squares
  • Intermediate familiarity with iterative optimization (e.g., how to use the Newton-Raphson algorithm to find a maximum)
  • Intermediate proficiency with R
  • Intermediate knowledge of exploratory data analysis

Not required but advantageous:

  • At least limited experience (e.g., graduate-level course) in calculus
  • Understanding the relation between multiple testing and Type I error, and, more generally, the challenges of finding relevant predictors in large data sets

No level of proficiency beyond basic awareness is assumed for skills related to:

  • Data mining methods
  • More advanced mathematical or statistical topics such as constrained estimation (e.g., using Lagrange multipliers)

Course Software and Computer Support

Participants need to bring a laptop computer with Wi-Fi capabilities.

All statistical software used at Stats Camp will be available, free to participants, on our SMORS (statistical modeling on remote servers) system for the duration of camp.

All instruction for this course will be based on the freely available software program R. Please make sure to have a recent version installed.

Seminar Files

Seminar files will be provided by the instructor on the first day of the seminar. You do not need to download anything prior to the event date. All materials will be provided during or after the class.
