Local outlier detection in data forensics: data mining approach to flag unusual schools • Mayuko Simon • Data Recognition Corporation • May, 2012
Statistical methods for data forensics • Univariate distributional techniques: e.g., average wrong-to-right erasures. • Multivariate techniques • Simple regression, e.g., the 2011 Reading score is predicted by the 2010 Reading score. • A school is flagged if the observed dependent variable differs significantly from the model’s prediction. • A school is flagged when it is an outlier compared to ALL other schools: a global outlier.
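The global flagging rule above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: the function names, the use of a standardized residual, and the cutoff `z_cut=3.0` are all assumptions, since the slides say only that a school is flagged when it "differs significantly" from the prediction.

```python
from statistics import mean, pstdev

def fit_simple_regression(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def flag_global_outliers(x, y, z_cut=3.0):
    """Return indices of schools whose standardized residual exceeds z_cut.

    x: prior-year scores (e.g., 2010 Reading), y: current-year scores.
    z_cut is a hypothetical threshold chosen for illustration.
    """
    a, b = fit_simple_regression(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = pstdev(residuals)  # spread of residuals over ALL schools
    return [i for i, r in enumerate(residuals) if abs(r) / sd > z_cut]
```

Because the residual spread is computed over all schools, a school is flagged only when it is extreme relative to the whole population, which is exactly the limitation the following slides address.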
What if a school is suspicious but not extreme? • Schools with suspicious behavior may not display sufficient extremity to make them outliers in comparison to all schools. • Nevertheless, it is reasonable to assume that their scores will be higher than those of their peers, that is, schools that are very similar in many relevant aspects. Such a school is a local outlier.
Local vs. global outlier • Traditional statistical data forensic techniques lack the ability to detect local outliers. • Regression will miss the blue rectangle. • A univariate approach (e.g., using only variable a) will miss both the blue rectangle and the red triangle. • Cluster analysis is not designed for outlier detection.
The goal of RegLOD • A regression-based local outlier detection algorithm is introduced: RegLOD. • We wish to find schools that are very similar to their peers in most respects (in terms of most independent variables) but differ significantly in the current year’s score (the dependent variable).
Assumptions of RegLOD • When most independent variables are very similar, we expect the dependent variable to be similar, as well. • This assumption is very reasonable and is frequently exploited: this is the principle on which regression trees or nearest neighbor regression are built (Hastie, 2009).
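To make the assumption concrete, here is a minimal nearest neighbor regression sketch: the dependent variable of a new case is predicted as the average over the most similar cases. It illustrates the general principle only and is not part of RegLOD itself; `knn_predict`, the unweighted Euclidean distance, and `k=3` are assumptions made for this example.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict the dependent variable as the mean over the k nearest rows.

    train_X: list of independent-variable tuples, train_y: matching DV values.
    Similarity is plain Euclidean distance (an assumption for this sketch).
    """
    dists = sorted(
        (math.dist(row, query), yi) for row, yi in zip(train_X, train_y)
    )
    nearest = dists[:k]
    return sum(yi for _, yi in nearest) / len(nearest)
```

If cases with similar independent variables did not tend to have similar dependent variables, a predictor of this kind would not work at all; RegLOD relies on the same regularity to decide what score a school "should" have.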
Data and variables • Data: • A large scale standardized state assessment test • Variables: • School level Math and Reading scale score in 2010 and 2011. • School level Math and Reading cohort scale score in 2010 (for grade 4, scale score when they were in grade 3) • School level average wrong-to-right erasures in 2010 and 2011.
Data and variables • Scale scores were transformed into logit for a better sense of school level during the analysis • Dependent variable • 2011 Reading or 2011 Math • Five independent variables • 2011 Math or Reading, 2010 Math and Reading, and 2010 cohort Math and Reading. • Erasure counts were not used in the algorithm
RegLOD algorithm overview • Select a set of independent variables • Find local weights • Make a peer group for each school • Obtain empirical p-values and flag schools when the criteria are met.
RegLOD Example: Grade 4 Reading • Step 1. Select a set of independent variables • DV: 2011 Reading (G4) • IVs: 2011 Math (G4), 2010 Reading (G4), 2010 Math (G4), 2010 Cohort Reading (G3), 2010 Cohort Math (G3) • R² = 0.99
RegLOD Example: Grade 4 Reading Step 2. Determine the local weights
RegLOD Example: Grade 4 Reading • Step 3. Select peer schools • Compute pair-wise distance using the weights. • Select peer schools within +/- 0.03 (Dist value) from a school.
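Step 3 can be sketched as follows. The slides do not give the distance formula, so this example assumes a weighted Euclidean distance over the independent variables; `weighted_distance`, `find_peers`, and the data layout (a dict from school id to a tuple of IV values) are hypothetical. Only the 0.03 cutoff comes from the slide.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two schools' IV vectors.

    The exact distance used by RegLOD is not stated; this form is assumed.
    """
    return math.sqrt(sum(w * (ai - bi) ** 2 for w, ai, bi in zip(weights, a, b)))

def find_peers(schools, target, weights, radius=0.03):
    """Return ids of schools within `radius` of `target`.

    The target school is included in its own peer group, matching the
    peer counts reported in the slides ("including the school").
    """
    ref = schools[target]
    return [sid for sid, ivs in schools.items()
            if weighted_distance(ivs, ref, weights) <= radius]
```

With the local weights from Step 2 plugged in as `weights`, variables that matter more for predicting the dependent variable contribute more to the distance, so two schools count as peers only if they agree on the variables that matter.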
RegLOD Example: Grade 4 Reading • Step 4. Obtain empirical p-value • Bootstrap the 2011 Reading grade 4 scores of the peer schools. • Obtain an empirical p-value for each bootstrap replication and average them. • Flag a school if the empirical p-value is 0.05 or less. • Flag a school if the number of peer schools is 10 or fewer.
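A minimal sketch of Step 4, under several assumptions that the slides leave open: the p-value is taken to be one-sided (the share of resampled peer scores at or above the school's score), the number of replications `n_boot=1000` and the fixed `seed` are arbitrary, and whether the school's own score belongs in the resampled pool is left to the caller. The 0.05 and 10-peer thresholds are the ones stated on the slide.

```python
import random

def empirical_p_value(school_score, peer_scores, n_boot=1000, seed=0):
    """Average, over bootstrap replications, of the share of resampled
    peer scores that are at least as high as the school's score."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(peer_scores)
    p_values = []
    for _ in range(n_boot):
        sample = [rng.choice(peer_scores) for _ in range(n)]  # resample with replacement
        p_values.append(sum(s >= school_score for s in sample) / n)
    return sum(p_values) / n_boot

def flag_school(school_score, peer_scores, alpha=0.05, min_peers=10):
    """Flag when the peer group is too small or the p-value is small."""
    if len(peer_scores) <= min_peers:
        return True  # 10 or fewer peers: the school is isolated
    return empirical_p_value(school_score, peer_scores) <= alpha
```

The small-peer-group rule is checked first because an empirical p-value computed from a handful of peers would be too coarse to interpret.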
Comparison with the results of other statistical methods for data forensics • SS: scale score analysis, e.g., the 2011 scale score is predicted by the 2010 scale score. • PL: performance level analysis, e.g., the proportion of proficient or above in 2011 is predicted by the 2010 proportion of proficient or above. • Reg: regression analysis using two subjects, e.g., 2011 Reading is predicted by 2011 Math. • Rasch: use of the Rasch residual. • WR: wrong-to-right erasure count. • SSco: scale score analysis using cohort students, e.g., the 2011 scale score is predicted by the 2010 scale score using cohort students. • StdRes: standardized residual of a multiple regression using the same variables as the RegLOD analysis.
Grade 4 Reading: Comparison of local and global outlier detection
A school with 10 or fewer peers • E.g., school number 1 in the table • In 2011, its wrong-to-right erasure count was at the 96th percentile, which is rather high. • RegLOD found only three peer schools, including the school itself, indicating that this school is an outlier. • Large increase in percentile (26th to 95th), indicating a suspicious increase in score. • There is reasonable evidence that this school needs further scrutiny.
A school with many peer schools • School number 3 in the table • RegLOD found 93 peer schools, including the school itself. • Large increase in percentile (23rd to 76th) • Since there are many peers, we can plot the school's variables against those of the peer schools.
Comparison to peers for the IVs • School is within the peer distribution. • 2010 Reading and cohort 2010 Reading are around the 20th percentile.
Comparison to peers with 2011 Reading score (DV) • School is an outlier among the peers on 2011 Reading. • This school was at the 23rd percentile on 2010 cohort Reading and at the 76th percentile on 2011 Reading.
A lower achieving school with many peer schools • School number 12 in the table • RegLOD found 147 peer schools, including the school itself. • Moderate increase in percentile (13th to 42nd) • The 2010 cohort Math percentile (48th) seems a little strange given that 2011 Math is at the 14th percentile. • The erasure count was at the 96th percentile, which is rather high. • We can take a look at the histograms.
Comparison to peers for the IVs • School is within the peer distribution. • 2010 Math is in the right tail because of the oddly high percentile. • Other than that, the school is well within the peer distribution.
Comparison to peers with 2011 Reading score (DV) • School is an outlier among the peers on 2011 Reading. • This school had only a moderate increase in percentile, but since it is an outlier compared to its peers, it is a local outlier.
Did all flagged schools exhibit suspicious behavior? • 12 schools were potentially flagged incorrectly (extremely high/low achievement). • The majority of the flagged schools exhibited suspicious behavior. • Some schools flagged by RegLOD were also flagged by other statistical methods; these were both local and global outliers. • Other schools were flagged by RegLOD but not by the other statistical methods; these schools were local outliers but not global outliers.
Conclusion • RegLOD has shown great promise in data forensics and is a valuable addition to our data forensic tools. • Its applicability is not limited to cheating detection in educational testing. • RegLOD's robust, model-based design (the concept of dependent and independent variables in data mining) and its ability to adapt make it applicable to a wide range of outlier detection problems. • We continue to study its capabilities and to extend and apply it to other contexts and tasks.