A fuzzy clustering approach to improve the accuracy of Italian students’ data
An experimental procedure to correct the impact of the outliers on assessment test scores
Claudio Quintano, Rosalia Castellano, Sergio Longobardi
UNIVERSITY OF NAPLES “PARTHENOPE”
Contact: claudio.quintano@uniparthenope.it, lia.castellano@uniparthenope.it, sergio.longobardi@uniparthenope.it
OUTLINE
This work considers data on students’ performance assessments collected by the Italian National Evaluation Institute of the Ministry of Education (INVALSI).
THE INVALSI SURVEY
• 3 AREAS: reading, mathematics and science
• 5 SCHOOL LEVELS:
• 2nd and 4th year of primary school
• 1st year of lower secondary
• 1st and 3rd year of upper secondary
• OUTLIER UNITS, at class level, lead to biased distributions of the average scores by class
• The AIM is to MITIGATE THE PRESENCE of outliers and to correct the overestimation of children’s ability
DISTRIBUTIONS OF MEAN SCORES AT CLASS LEVEL (MATHEMATICS ASSESSMENT)
[Figure: mathematics class mean score, s.y. 2004/05. Panels: II class primary school, IV class primary school, I class lower secondary school, I class upper secondary school, III class upper secondary school]
CLASS MEAN SCORE, II CLASS PRIMARY SCHOOL
[Figure: class mean score in reading, mathematics and science, s.y. 2004/05 and s.y. 2005/06]
STEP I
Deletion of the micro units (students) considered as “PSEUDO NON-RESPONDENTS”: students who have not given the minimum number of answers needed to compute a performance score. The share of these units varies from 9% to 16%.
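The deletion step can be sketched as a simple filter. This is only an illustration: the slides do not state the minimum number of answers, so `MIN_VALID_ANSWERS`, the dictionary layout and the use of `None` for a missing or invalid answer are all assumptions.

```python
# Hypothetical threshold: the slides do not give the minimum number of answers.
MIN_VALID_ANSWERS = 10

def drop_pseudo_non_respondents(students, min_valid=MIN_VALID_ANSWERS):
    """Split students into kept units and pseudo non-respondents.

    Each student is a dict with an 'answers' list; None marks a missing
    or invalid answer (an assumed encoding, not the survey's own).
    """
    kept, dropped = [], []
    for student in students:
        n_valid = sum(a is not None for a in student["answers"])
        (kept if n_valid >= min_valid else dropped).append(student)
    return kept, dropped

# Invented toy data: three students, twelve items each.
students = [
    {"id": 1, "answers": ["A"] * 12},
    {"id": 2, "answers": ["B"] * 4 + [None] * 8},
    {"id": 3, "answers": ["C"] * 10 + [None] * 2},
]
kept, dropped = drop_pseudo_non_respondents(students)
share_dropped = len(dropped) / len(students)
```

In the survey data the analogue of `share_dropped` is the 9% to 16% reported above.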
SUMMARY COMPUTATION OF CLASS LEVEL INDICATORS
After the micro units considered as “pseudo non-respondents” have been dropped from the dataset, the following indexes are computed for each class of students:
• Class mean score: based on the score of the ith student of the jth class and on the number of respondent students of the jth class
• Standard deviation of mean score: variability of the scores of the students of the jth class around the class mean score
• Class non-response rate: based on the number of both item non-responses and invalid responses for the ith student of the jth class, the number of items administered to the jth class and the number of respondent students of the jth class
• Index of answers’ homogeneity: based on the Gini measure of heterogeneity computed for each sth test question administered to each student of the jth class
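A minimal sketch of the first three indicators for one class follows. The exact formulas are not legible in the slides, so two choices here are assumptions: the standard deviation uses divisor n, and the non-response rate divides the total number of missing or invalid answers by (respondent students × administered items). The function and argument names are illustrative.

```python
import math

def class_indicators(scores, missing_counts, n_items):
    """Return (mean score, standard deviation, non-response rate) for one class.

    scores         -- score of each respondent student of the class
    missing_counts -- per student, item non-responses plus invalid responses
    n_items        -- number of items administered to the class
    """
    n = len(scores)
    mean = sum(scores) / n
    # Assumption: population standard deviation (divisor n).
    sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / n)
    # Assumption: missing answers over all (student, item) pairs of the class.
    non_response_rate = sum(missing_counts) / (n * n_items)
    return mean, sd, non_response_rate

# Invented toy class: four respondent students, twenty items.
mean, sd, nr = class_indicators(
    scores=[60.0, 70.0, 80.0, 90.0],
    missing_counts=[0, 2, 1, 1],
    n_items=20,
)
```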
PRINCIPAL COMPONENT ANALYSIS (PCA)
Through the PCA, the answer behaviour of each class of students can be described by two variables:
• FIRST component: OUTLIERS IDENTIFICATION AXIS (axis of contraposition)
• SECOND component: INDEX OF CLASS COLLABORATION TO SURVEY (class non-response rate)
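As a rough illustration of this step, the sketch below extracts two principal components from the four class-level indicators using only the standard library (power iteration with deflation on the correlation matrix). The data are invented, the function names are illustrative, and working on the correlation matrix is an assumption; the sign of each axis is arbitrary, so the direction that flags outliers must be read from the loadings.

```python
import math

def standardize(rows):
    """Center each column and scale it to unit variance (divisor n)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    sds = [math.sqrt(sum((r[k] - means[k]) ** 2 for r in rows) / n) for k in range(p)]
    return [[(r[k] - means[k]) / sds[k] for k in range(p)] for r in rows]

def leading_eigenpair(matrix, iterations=1000):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    p = len(matrix)
    v = [float(k + 1) for k in range(p)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(p)) for a in range(p)]
    return sum(v[a] * mv[a] for a in range(p)), v  # Rayleigh quotient, unit vector

def pca(rows, n_components=2):
    """Return (eigenvalues, axes, scores) computed from the correlation matrix."""
    z = standardize(rows)
    n, p = len(z), len(z[0])
    corr = [[sum(r[a] * r[b] for r in z) / n for b in range(p)] for a in range(p)]
    eigenvalues, axes = [], []
    for _ in range(n_components):
        value, vector = leading_eigenpair(corr)
        eigenvalues.append(value)
        axes.append(vector)
        # Deflation: remove the component just found before extracting the next.
        corr = [[corr[a][b] - value * vector[a] * vector[b] for b in range(p)]
                for a in range(p)]
    scores = [[sum(r[k] * vec[k] for k in range(p)) for vec in axes] for r in z]
    return eigenvalues, axes, scores

# Invented data, one row per class:
# mean score, standard deviation, non-response rate, homogeneity index.
classes = [
    [62.0, 15.0, 0.08, 0.52],
    [65.0, 14.0, 0.02, 0.50],
    [58.0, 17.0, 0.05, 0.56],
    [70.0, 12.0, 0.10, 0.45],
    [95.0, 3.0, 0.01, 0.08],   # outlier-like: very high mean, very low variability
    [93.0, 4.0, 0.06, 0.10],   # outlier-like
]
eigenvalues, axes, scores = pca(classes)
```

In this toy example the two outlier-like classes fall on one side of the first axis and the remaining classes on the other.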
PRINCIPAL COMPONENT ANALYSIS (PCA)
The outlier classes of students can be detected graphically.
[Figure: projection of the second-year primary school classes on the plane of the first two factorial axes, with the OUTLIER CLASSES highlighted]
THE FUZZY K-MEANS APPROACH
On the basis of the two factorial dimensions, the students’ classes are classified into 8 clusters by a FUZZY K-MEANS algorithm. A fuzzy partition matrix is computed: for each students’ class (rows of the matrix) it gives the degree of belonging to each cluster (columns of the matrix).
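The partition matrix can be illustrated with a textbook fuzzy k-means (fuzzy c-means) loop. This is a sketch under assumptions, not the authors' implementation: the fuzzifier `m = 2`, the random initialisation, the fixed number of iterations and the toy points are all illustrative, and the example uses 2 clusters rather than the 8 of the study.

```python
import random

def fuzzy_k_means(points, n_clusters, m=2.0, iterations=100, seed=0):
    """Return (membership matrix, centroids) for a list of coordinate tuples.

    Row j of the membership matrix holds the degrees of belonging of
    point j to each cluster; every row sums to 1.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Random initial memberships, normalised so that each row sums to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([x / total for x in row])
    centroids = []
    for _ in range(iterations):
        # Centroids: means of the points weighted by membership ** m.
        centroids = []
        for c in range(n_clusters):
            w = [u[j][c] ** m for j in range(len(points))]
            total = sum(w)
            centroids.append(tuple(
                sum(w[j] * points[j][d] for j in range(len(points))) / total
                for d in range(dim)))
        # Memberships: proportional to squared distance ** (-1 / (m - 1)).
        for j, p in enumerate(points):
            d2 = [sum((p[d] - cen[d]) ** 2 for d in range(dim)) for cen in centroids]
            if min(d2) == 0.0:
                # The point coincides with a centroid: full membership there.
                hits = [1.0 if x == 0.0 else 0.0 for x in d2]
                u[j] = [h / sum(hits) for h in hits]
            else:
                inv = [x ** (-1.0 / (m - 1.0)) for x in d2]
                total = sum(inv)
                u[j] = [x / total for x in inv]
    return u, centroids

# Invented factorial scores (first axis, second axis) for seven classes.
points = [(-3.0, 0.1), (-3.2, -0.1), (-2.9, 0.0),
          (0.5, 0.2), (0.4, -0.3), (0.6, 0.1), (0.3, 0.0)]
memberships, centroids = fuzzy_k_means(points, n_clusters=2)
```

Unlike a crisp k-means assignment, every class keeps a graded membership in every cluster, which is what the later weighting relies on.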
DETECTION OF OUTLIERS
[Figure: projection of the centroids computed by fuzzy k-means, with the OUTLIER CLUSTER highlighted]
The centroid of the outlier cluster has:
• High negative scores on the “outliers identification axis” (x-axis), which indicate high class average scores and minimum within-class variability of scores and test answers
• Factorial scores close to zero on the “index of class collaboration to survey”
DETECTION OF OUTLIERS
Denoting the outlier cluster by “a”, the degree of belonging of the jth class to this cluster is µja. This measure is considered as the “outlier probability” of the jth class; alternatively, it can be interpreted as the “outlier level” of each class.
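Given the centroids and the partition matrix, extracting µja amounts to choosing a column. The sketch below selects the outlier cluster as the centroid with the most negative coordinate on the first axis; this selection rule is a simplifying assumption (the slides identify the cluster from the plot), and the numbers are invented.

```python
def outlier_cluster(centroids):
    """Index of the centroid with the most negative first-axis coordinate."""
    return min(range(len(centroids)), key=lambda c: centroids[c][0])

def outlier_levels(memberships, a):
    """Column a of the fuzzy partition matrix: one outlier level per class."""
    return [row[a] for row in memberships]

# Invented centroids (first axis, second axis) and memberships for three classes.
centroids = [(0.4, 0.1), (-3.0, 0.0), (0.9, -1.2)]
memberships = [
    [0.90, 0.03, 0.07],
    [0.05, 0.92, 0.03],
    [0.20, 0.10, 0.70],
]
a = outlier_cluster(centroids)
mu_a = outlier_levels(memberships, a)
```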
CORRECTION PROCEDURE
On the basis of the degree of belonging to the outlier cluster, a weighting factor is developed:
Wj = 1 - µja (weighting factor = 1 - outlier probability)
Wj varies from 0 to 1. A students’ class with a high probability of belonging to the outlier cluster will have a low weight, while a class very far from this cluster will have a weight close to 1.
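A short worked example of the weighting follows. The class means and outlier levels are invented, and applying Wj through a weighted mean of class scores is an assumption about how the weights enter the adjusted figures.

```python
def correction_weights(outlier_levels):
    """Wj = 1 - µja for each class."""
    return [1.0 - mu for mu in outlier_levels]

def weighted_mean(values, weights):
    """Mean of the values weighted by the given weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Invented example: the fourth class is a likely outlier.
class_means = [62.0, 65.0, 58.0, 95.0]
mu_a = [0.05, 0.10, 0.02, 0.90]

weights = correction_weights(mu_a)
raw_average = sum(class_means) / len(class_means)
adjusted_average = weighted_mean(class_means, weights)
```

Here the raw average is 70, while the adjusted average is pulled down towards the three ordinary classes because the fourth class carries a weight of about 0.1.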
EFFECTS OF THE CORRECTION PROCEDURE
[Figure: ORIGINAL DISTRIBUTION compared with ADJUSTED DISTRIBUTION]
THE INSPIRATION PRINCIPLE
Go beyond the dichotomous logic (OUTLIER / NOT OUTLIER).
FUZZY APPROACH: compute an “OUTLIER LEVEL” measure for each unit to calibrate the correction.
RELATIONSHIP BETWEEN THE SCHOOL LOCALIZATION AND THE PRESENCE OF OUTLIER CLASSES
[Figure: box plot of the outlier level µja, the degree of belonging to the outlier cluster (cluster n. 2)]
RELATIONSHIP BETWEEN THE SCHOOL LOCALIZATION AND THE PRESENCE OF OUTLIER CLASSES
[Figure: class average score distributions only for the northern and central regions]
REGIONAL SCORES
[Figure: regional scores, unweighted average compared with weighted average]
Index of answers’ homogeneity
The index is the mean of the Q Gini indexes (Esj) computed for each sth test question administered to each student of the jth class. Esj is a Gini measure of heterogeneity, based on the ratio of students of the jth class that have given the tth answer to the sth question. The Gini measure is equal to zero when all students of the jth class have given the same answer to the sth question. It reaches its maximum value, (h-1)/h (where h is the number of alternative answers to the sth question), when there is perfect heterogeneity of answers to the sth question in the jth class.
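The index can be sketched as follows, assuming the usual form of the Gini heterogeneity measure, 1 minus the sum of the squared answer shares, which matches the stated minimum of zero and maximum of (h-1)/h. The function names and the toy answers are illustrative.

```python
from collections import Counter

def gini_heterogeneity(answers):
    """Esj for one question: 1 minus the sum of squared answer shares."""
    n = len(answers)
    return 1.0 - sum((count / n) ** 2 for count in Counter(answers).values())

def homogeneity_index(class_answers):
    """Mean of the Q Gini measures of a class.

    class_answers[s] lists the answers given by the students of the
    class to the s-th question.
    """
    return sum(gini_heterogeneity(a) for a in class_answers) / len(class_answers)

# Invented class of four students, two questions with h = 4 alternatives.
same_answer = ["A", "A", "A", "A"]   # perfect homogeneity
all_different = ["A", "B", "C", "D"] # perfect heterogeneity
index = homogeneity_index([same_answer, all_different])
```

With h = 4 the second question reaches the maximum (h-1)/h = 0.75, so the class index is the mean of 0 and 0.75. A class whose index is close to zero on most questions gave nearly identical answers throughout.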