Students Tackle Graduation Rates with Data Mining and JMP® Pro
JMP Discovery Conference 2012
Jim Grayson & Mary Filpus-Luyckx, Augusta State University
Agenda • Project: Origin and Context • Objective • Data Characteristics & Exploration • Analysis: Partitioning & Logistic Regression • Insights and Recommendations Discovery Summit 2012
Augusta State University
• Public state access university located in Augusta, Georgia
• Enrollment: 6,741 students
• Retention rate from first to second year: 67%
• Graduation rate (6 years): 22%
Georgia Health Sciences University
• Public state health research university also located in Augusta, Georgia
• Enrollment: 2,400 students (400 undergrads, upper-level only)
• Graduation rate (6 years): 96%
The New University
• Research university with access?
We need to find out why ASU’s retention and graduation rates are so low, and we need to do it NOW!
Hull College of Business • Business Analytics Class • Elective: Marketing, MIS • Project Orientation • Campus Client: Institutional Research
Student Preparation • Use JMP as Analytical Engine • Text: Data Mining for Business Intelligence, 2nd ed., Shmueli et al. • Primary Methods: Multiple Regression, Partitioning, Logistic Regression, Clustering
Business Analytics Project
• Purpose: This study is being undertaken for two purposes:
  • To better understand the characteristics common to students who do not graduate within six years, and
  • To develop and validate a model to predict whether a student will graduate in six years.
Deliverables
Your preliminary deliverable is a “technical review” to show (a) the results of your model development, and (b) your assessment of the model’s validity and usefulness.
The final deliverables of this project are (a) a project report describing your data mining process, including your recommendations, supporting data, and analysis, and (b) a project presentation that effectively communicates the background of the project and its insights and recommendations.
The project report should be organized with the following sections: Executive Summary, Insights and Recommendations, Model Development Process and Results. Supporting data and charts should be included within the body of the report if they are referenced in the narrative; otherwise, they should be organized in appendices.
Project Steps
• Translate the business problem into a data mining problem
  • Describe the problem opportunity and business benefit
  • Briefly describe other research that will leverage your efforts
• Select appropriate data
  • Explain the data identified
  • Explain the process of selecting data
• Get to know the data
  • Describe the data
  • Describe insights gained from exploring the data set
Steps (Cont’d)
• Create a model set [this is being done for you by Institutional Research]
• Fix problems with the data
• Transform data
  • Apply data transformations as necessary, such as normalizing the data
  • Convert variables into subsets to facilitate analysis
Steps (Cont’d)
• Build models
  • Choice of techniques and rationale
  • Model results
• Assess models
  • Usefulness for prediction
  • Performance measures
• Interpret the results
  • Implications of the results
  • Limitations (what you would have done if you could and what you want to do next)
  • Recommendations to the project sponsor
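The "normalizing the data" transformation named in the steps above can be illustrated with a z-score rescaling. This is only a minimal sketch: the function name `z_scores` and the sample SAT Math values are made up for illustration and do not come from the project data, and the students may have used a different normalization inside JMP.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1 (population form)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot normalize a constant column")
    return [(v - mu) / sigma for v in values]


# Hypothetical SAT Math scores, for illustration only.
sat_m = [420, 480, 510, 550, 640]
normalized = z_scores(sat_m)
```

After rescaling, variables measured on very different scales (for example GPA and SAT scores) can be compared on a common footing.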
Student Reports
Data Snapshot
Response Variable
Graphical Exploration of Relationships
Exploring 1-Way Relationships
• Categorical Variables
  • Race
  • Type of High School
• Continuous Variables
  • HS GPA
  • SAT V
  • SAT M
Many Relationships: Scatterplots (legend: Not Graduated, Graduated)
Partitioning
Partitioning Methods • Decision Tree • Bootstrap Forest • Boosted Tree * Includes the First Term GPA
Decision Tree • First Term GPA • Race • SAT M
Decision Tree Including First Term GPA
Decision Tree • HS GPA • Race • SAT M • FT/PT
Decision Tree Without First Term GPA
Bootstrap Forest (Including First Term GPA) • First Term GPA • HS GPA • Age • SAT M • Race
Bootstrap Forest (Without First Term GPA) • HS GPA • SAT V • SAT M • Race • Age
Boosted Tree (Including First Term GPA) • First Term GPA • SAT M • Race • Age
Boosted Tree (Without First Term GPA) • HS GPA • HS Type • SAT V • SAT M • Race
Conclusions: Partition Models
• After a student is enrolled, First Term GPA is the best factor to track when deciding whether to intervene
• At admission, the following factors could indicate that a support system will be necessary to facilitate success:
  • High School GPA
  • SAT Math and Verbal scores
  • Type of high school
  • Race, age, and gender of the student
Logistic Regression
Logistic Regression Model
Model Parameters
ROC Curve
Misclassification • Misclassification Rate = (46 + 182)/(29 + 182 + 46 + 603) = 26.5%
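The arithmetic on this slide can be reproduced from a 2×2 confusion matrix. The sketch below assumes the four counts are laid out in the order they appear in the denominator, with 29 and 603 on the diagonal (correct classifications) and 182 and 46 off the diagonal (errors). The slide text does not say which row or column corresponds to which class, so no class labels are attached here.

```python
def misclassification_rate(matrix):
    """Share of cases that fall off the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total


# Counts from the slide; the row/column layout is an assumption.
confusion = [
    [29, 182],
    [46, 603],
]

rate = misclassification_rate(confusion)
print(f"Misclassification rate: {rate:.1%}")  # 228 / 860, about 26.5%
```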
Logit Model
Model Implications
Model Scoring • Probability of graduating is: [equation image not reproduced] • Where Lin[Grad] is: [equation image not reproduced]
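The scoring step above can be sketched in code, assuming the usual logistic form in which the probability is 1 / (1 + exp(−Lin[Grad])) and Lin[Grad] is a linear combination of the predictors representing the log-odds of graduating. The intercept, coefficients, and variable names below are hypothetical placeholders, not the fitted values from the presentation, and JMP's own sign convention for the modeled level may differ.

```python
import math


def linear_predictor(intercept, coefficients, student):
    """Compute Lin[Grad] as intercept plus the sum of coefficient * value."""
    return intercept + sum(
        coefficients[name] * student[name] for name in coefficients
    )


def probability_from_linear(lin):
    """Map a linear predictor (log-odds) to a probability via the logistic function."""
    return 1.0 / (1.0 + math.exp(-lin))


# Hypothetical values for illustration only; not the presenters' fitted model.
intercept = -4.0
coefficients = {"first_term_gpa": 1.2, "sat_m": 0.002}
student = {"first_term_gpa": 3.0, "sat_m": 500}

lin = linear_predictor(intercept, coefficients, student)
p_grad = probability_from_linear(lin)
print(f"Lin[Grad] = {lin:.2f}, P(graduate) = {p_grad:.3f}")
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, so positive values of Lin[Grad] indicate a student more likely than not to graduate under these assumptions.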
Logit (Before Entering)
Model Scoring (Before Entering) • Probability of graduating is: [equation image not reproduced] • Where Lin[Grad] 2 is: [equation image not reproduced]
Model Implications (Before Entering)
Conclusions: Logistic Regression • Before Entering: Biggest “odds enhancer” is HS GPA • After Entering: Biggest “odds enhancer” is First Term GPA • In either case, pay special attention to students with “odds” classifiers below 1 and low First Term GPAs
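The "odds enhancer" language can be made concrete. In a logistic regression, raising a predictor by one unit multiplies the odds by exp(coefficient), and odds below 1 correspond to a probability below 0.5. The sketch below uses a hypothetical coefficient of 0.9; it is not a value reported in the presentation.

```python
import math


def odds_ratio(coefficient, delta=1.0):
    """Factor by which the odds change when a predictor increases by delta."""
    return math.exp(coefficient * delta)


def odds_to_probability(odds):
    """Convert odds (p / (1 - p)) back to a probability."""
    return odds / (1.0 + odds)


# Hypothetical coefficient for a GPA-like predictor, for illustration only.
gpa_coefficient = 0.9
ratio = odds_ratio(gpa_coefficient)
print(f"One extra GPA point multiplies the odds by {ratio:.2f}")

# Odds below 1 mean the student is more likely not to graduate than to graduate.
print(f"Odds of 0.5 correspond to P = {odds_to_probability(0.5):.3f}")
```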
Study Conclusions • At admission, students who match the qualifiers identified in our models should be flagged as “high risk” • In the first semester, students with low GPAs should be identified and provided with help and mentoring
Acknowledgements and References
• Galit Shmueli, Nitin R. Patel, and Peter C. Bruce, Data Mining for Business Intelligence, 2nd ed.
• We acknowledge the help of the following individuals: Kerrie Scott, Institutional Research Office