Mining Novice Programmer Errors Emily S. Tabanao MS Computer Science Ateneo de Manila University

Problem: Poor programming comprehension • First-year computer science students lack programming comprehension • The failure rate in an introductory programming class in Australia is as high as 35% • 30% of computer science students in the United Kingdom and the United States did not understand programming basics after their first programming class • Students have a fragile grasp of programming and were unable to read, analyze, and trace through short fragments of code

As a response to this problem • Research is conducted to: • Identify the characteristics of novice programmers • Identify the causes of their problems • Find possible solutions

The difficulties of programming may be caused by: • lack of a mental model • misconception of programming constructs • lack of programming strategies • lack or absence of debugging strategies

Factors affecting performance of novices: • Prior to entering CS1 • Gender • secondary school performance • dislike of programming • intrinsic motivation and comfort level • high school mathematics background • prior programming experience • attribution to luck for success/failure, and • perceived understanding of the material

Factors affecting performance of novices: • Behaviors that have a positive effect on performance: • perfectionism and self-esteem, and • high states of arousal or delight • Behaviors that have a negative effect on performance: • disliking programming • frustration • confusion • boredom, and • IDE-related on-task conversation

Goal of the Study • Determine whether analysis of online protocols can successfully identify/predict at-risk novice Java programmers

Online protocols • The sequence of program compilations a student makes while performing laboratory exercises • Gathered by enhancing the development environment used for programming so that it stores compilation data in a database

Research Questions • How do students with different achievement levels differ in terms of: • Error profiles? • Average time between compilations profiles? • EQ profiles? • What factors can predict the midterm score?

Methodology • Participants • 143 Introduction to Computing students • Tools for Data Collection • BlueJ • Web server • SQLite database • LAN

Methodology • Procedure • Laboratory Setup • Orientation • Data Gathering • Data Analysis • Data Cleaning • Data Extraction

Methodology • Data Analysis • Generate summaries • Errors encountered • Time between compilations • Compute EQ score • Use the statistical tool R • Perform one-way ANOVA to differentiate student groups • Correlate EQ score with midterm exam score • Use data mining tools (RapidMiner and Weka) to create linear regression models

Error Quotient (EQ) • Developed by Matthew Jadud • Quantifies students’ compilation behavior • Characterizes how much or how little a student struggles with syntax errors • EQ score ranges from 0.0 to 1.0, where 1.0 indicates that a student encountered the same error throughout all compilations

The EQ algorithm (flowchart) • Runs from Start to End for each pair of events • Decisions: Do both events end in errors? Same error type? Same edit location? Same error location? • Increments on the Y branches: Add 2, Add 2, Add 2, Add 3
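The pairwise scoring behind the EQ can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the study's implementation: the `CompileEvent` fields and function names are hypothetical, the slide does not say which check earns which increment (so the weights are passed in and the assignment used below is a guess), and it is assumed that a pair scores nothing unless both compilations fail. Normalizing each pair by the maximum possible score and averaging over pairs keeps the result between 0.0 and 1.0, as the EQ slide describes.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class CompileEvent:
    """One compilation; all field names here are hypothetical."""
    error_type: Optional[str]   # None means the compilation succeeded
    error_line: Optional[int]   # line reported by the compiler, if any
    edit_line: Optional[int]    # line the student last edited, if known


def pair_score(first: CompileEvent, second: CompileEvent,
               weights: Mapping[str, int]) -> int:
    # ASSUMPTION: a pair earns points only when both compilations failed.
    if first.error_type is None or second.error_type is None:
        return 0
    score = weights["both_errors"]
    if first.error_type == second.error_type:
        score += weights["same_error_type"]
    if first.error_line == second.error_line:
        score += weights["same_error_location"]
    if first.edit_line is not None and first.edit_line == second.edit_line:
        score += weights["same_edit_location"]
    return score


def error_quotient(events: Sequence[CompileEvent],
                   weights: Mapping[str, int]) -> float:
    # Score consecutive pairs, normalize by the maximum score, then average.
    pairs = list(zip(events, events[1:]))
    if not pairs:
        return 0.0
    max_score = sum(weights.values())
    return sum(pair_score(a, b, weights) / max_score for a, b in pairs) / len(pairs)


# ASSUMPTION: the slide shows increments of 2, 2, 2 and 3; which check
# receives the 3 is not recoverable, so this assignment is a placeholder.
ASSUMED_WEIGHTS = {
    "both_errors": 2,
    "same_error_type": 2,
    "same_error_location": 2,
    "same_edit_location": 3,
}

# A student who hits the same error at the same place on every compilation.
stuck = [CompileEvent("';' expected", 12, 12)] * 4
print(error_quotient(stuck, ASSUMED_WEIGHTS))  # 1.0
```

A session with no failed compilations scores 0.0 under the same sketch.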

Results: Midterm Score • Lowest score = 38 • Highest score = 96 • Mean = 75, Standard Deviation = 13 • Student Grouping: • AtRisk: scores 62 and below • Average: scores 63 to 88 • HighPerforming: scores 89 and above

1a. How do students with different achievement levels differ in terms of error profiles?

Using one-way ANOVA on Total Errors vs Groups • The HighPerforming group was significantly different from the AtRisk and Average groups at p < .001 and encountered fewer errors than either of them • The Average group was not significantly different from the AtRisk group
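For readers unfamiliar with the test, the F statistic of a one-way ANOVA compares the variation between group means with the variation inside each group. The sketch below computes it in plain Python; the function name is hypothetical and the error counts are invented toy values, not data from the study.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = k - 1
    df_within = n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# TOY DATA (hypothetical total-error counts, three students per group).
high_performing = [1, 2, 3]
average = [5, 6, 7]
at_risk = [6, 7, 8]

f_stat, df_b, df_w = one_way_anova_f([high_performing, average, at_risk])
print(f_stat, df_b, df_w)  # 21.0 2 6
```

Turning F into a p-value requires the F distribution, and the ANOVA alone only says that some group differs; identifying which groups differ needs a follow-up comparison, which a statistics package such as R provides.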

1b. How do students with different achievement levels differ in terms of average time between compilations profiles?

Using one-way ANOVA on Average Time Between Compilations vs Groups • The HighPerforming group was significantly different from the AtRisk and Average groups and had a higher average time between compilations than either of them • There was no significant difference between the Average and AtRisk groups

Using one-way ANOVA on the Time Between Compilations per 10-second bin vs Groups • The HighPerforming group was significantly different from the Average and AtRisk groups except in the time intervals • 21-30, 111-120 and >120 seconds for the Average group • 81-90 seconds for the AtRisk group • The HighPerforming group had a lower number of compilations • There was no significant difference between the Average and AtRisk groups in any time interval
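The time-between-compilations features can be derived from compilation timestamps roughly as sketched below. The function names are hypothetical, timestamps are assumed to be whole seconds, and the lower edge of the first bin (0-10) is an assumption; the slides only show bins such as 21-30, 81-90, 111-120 and >120.

```python
import math
from collections import Counter


def gaps_between(timestamps):
    """Seconds elapsed between each pair of consecutive compilations."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def bin_label(seconds):
    """Map a gap in whole seconds to its 10-second bin label."""
    if seconds > 120:
        return ">120"
    upper = max(10, math.ceil(seconds / 10) * 10)
    lower = 0 if upper == 10 else upper - 9   # ASSUMPTION: first bin is 0-10
    return f"{lower}-{upper}"


# TOY DATA: compilation times in seconds since the start of a lab session.
compile_times = [0, 8, 33, 160]

gaps = gaps_between(compile_times)               # [8, 25, 127]
average_tbc = sum(gaps) / len(gaps)              # average time between compilations
bin_counts = Counter(bin_label(g) for g in gaps)

print(gaps)
print(bin_counts["0-10"], bin_counts["21-30"], bin_counts[">120"])  # 1 1 1
```

Per-student averages feed the average-time comparison, and per-bin counts feed the 10-second-bin comparison.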

1c. How do students with different achievement levels differ in terms of EQ Profiles?

2. What factors can predict the midterm score? • Linear regression was performed to build models: a regression line of the form Y = aX + b • Two questions to ask about the model: • Does the model fit the observed data well? • Compute the correlation coefficient r, a measure of the relation between X and Y • Look at the scatterplot • Compute R2, the square of the correlation coefficient r, which measures the strength of the relationship between X and Y • Compute BIC', the Bayesian Information Criterion • Can the model generalize to other samples? • Can the model predict the same outcome from the same set of predictors in a different sample? • Adjusted R2 indicates the loss of predictive power of the model
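The quantities named on this slide can be illustrated for a single predictor with the plain-Python sketch below. The function names and toy data are hypothetical. The adjusted R2 formula is the standard one; the BIC' formula is Raftery's form, n·ln(1 - R2) + p·ln(n), which is an assumption here because the slides do not state which definition was used.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of Y = aX + b; returns (a, b, r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                     # slope
    b = mean_y - a * mean_x           # intercept
    r = sxy / math.sqrt(sxx * syy)    # correlation coefficient
    return a, b, r


def adjusted_r2(r2, n, p=1):
    """Shrink R2 to account for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def bic_prime(r2, n, p=1):
    """ASSUMPTION: Raftery's BIC'; more negative means a better model."""
    return n * math.log(1 - r2) + p * math.log(n)


# TOY DATA, not taken from the study.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

a, b, r = fit_line(xs, ys)
r2 = r ** 2
print(round(a, 3), round(r2, 3), round(adjusted_r2(r2, len(xs)), 3))
print(round(bic_prime(r2, len(xs)), 3))
```

Adjusted R2 is never larger than R2, and the gap between the two widens as predictors are added relative to the sample size.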

2a. Predicting the midterm score using the total errors encountered Model 1: MidtermScore = 83.63049 - 0.0919*TotalErrors p-value < .001, BIC' = -7.8, Adjusted R2 = 0.161

2a. Predicting the midterm score using the top ten errors encountered Model 2: MidtermScore = 83.50274 - 0.25632*UNKNOWN_VARIABLE - 0.42035*CLASS_INTERFACE_EXP - 0.75506*UNKNOWN_CLASS p-value < .001, r = [value missing], BIC' = -10.2635, Adjusted R2 = 0.1994

2b. Predicting the midterm score using Average Time between compilations Model 3: MidtermScore = 65.04788 + 0.12107*AverageTBC_seconds p-value < .01, BIC = -1.97243, Adjusted R2 = 0.06512

2b. Predicting the midterm score using Average Time between compilations in 10 sec bins Model 4: MidtermScore = 87.4381 - 2.0042*Twenty + 6.4780*Ninety + 7.4892*Hundred p-value < .01, BIC = -7.01032, Adjusted R2 = 0.1263

2c. Predicting the midterm score using EQ scores Model 5: MidtermScore = 92.918 - 64.396*EQ p-value < .001, BIC = -17.3303, Adjusted R2 = 0.2971

Combining all features in Models 1 to 5: Model 6: MidtermScore = 90.58643 - 43.33380*EQ p-value < .001, BIC = -20.8326, Adjusted R2 = 0.3073

Conclusions and Future Work • We found: • Students encounter similar error types • Total Errors Encountered • HighPerforming < Average <= AtRisk • Three of the top 10 errors may affect the midterm scores of the Average and AtRisk students • Average time between compilations is higher among HighPerforming students than among Average and AtRisk students • EQ is lower among HighPerforming students than among Average and AtRisk students

Conclusions and Future Work • Linear Models • Indicate which errors directly affect the midterm score, which implicitly points to the concepts with which AtRisk students need assistance • A high incidence of rapid-fire compiling may be a symptom of AtRisk students • EQ can significantly predict midterm scores

Conclusions and Future Work • Use the models to automatically detect AtRisk students while they are using an IDE • Implications for teaching: address the concepts that help students resolve the errors that directly affect performance
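One way such automatic detection might look is sketched below. It uses the Model 6 coefficients and the midterm score cut-offs reported in the results; the function names are hypothetical, and treating non-integer predictions between 62 and 63 (or 88 and 89) as Average is an assumption, since the groups were defined on whole-number scores.

```python
AT_RISK_MAX = 62          # AtRisk: scores 62 and below
HIGH_PERFORMING_MIN = 89  # HighPerforming: scores 89 and above


def predict_midterm(eq):
    """Model 6 from the slides: MidtermScore = 90.58643 - 43.33380*EQ."""
    return 90.58643 - 43.33380 * eq


def predicted_group(eq):
    """Label a student from an EQ score via the predicted midterm score."""
    score = predict_midterm(eq)
    if score <= AT_RISK_MAX:
        return "AtRisk"
    if score >= HIGH_PERFORMING_MIN:
        return "HighPerforming"
    return "Average"  # ASSUMPTION: everything in between counts as Average


for eq in (0.0, 0.2, 0.7):
    print(eq, round(predict_midterm(eq), 2), predicted_group(eq))
```

Under this sketch an EQ of roughly 0.66 or higher is flagged as AtRisk. Because Model 6 has an adjusted R2 of 0.3073, such a flag is better treated as a prompt for a closer look than as a verdict.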

Thank you... Questions?
