Q2008 - ROME, 09-11 JULY 2008
Implementation and evaluation of imputation strategies to improve the data accuracy
The case of Italian students data from the Programme for International Student Assessment (PISA 2003)
Claudio Quintano, Rosalia Castellano, Sergio Longobardi
University of Naples “Parthenope”
claudio.quintano@uniparthenope.it; lia.castellano@uniparthenope.it; sergio.longobardi@uniparthenope.it
OUTLINE
IMPROVING THE ACCURACY OF ITALIAN DATA FROM THE OECD’s “Programme for International Student Assessment” (PISA 2003) BY DEVELOPING IMPUTATION STRATEGIES TO REDUCE THE NON-SAMPLING ERROR DUE TO PARTIAL NON-RESPONSES
PISA 2003
The OECD’s “Programme for International Student Assessment” (PISA) survey is an internationally standardised assessment administered to 15-year-old students in 41 countries (20 European Union members).
The survey involves 276,165 students (11,639 in Italy) and 10,274 schools (406 in Italy).
PISA 2003
The survey assesses the students’ competencies in three areas:
• Scientific literacy
• Reading literacy
• Mathematical literacy
AVAILABLE DATA
The OECD collects data on:
• STUDENT DATASET: FAMILY ENVIRONMENT OF STUDENT
• SCHOOL DATASET: SCHOOL CHARACTERISTICS
Multilevel (school and student) model with 4 covariates
ITALY: 8% OF STUDENT UNITS ARE EXCLUDED BECAUSE ONE OR MORE STUDENT OR SCHOOL VARIABLES ARE MISSING
Multilevel (school and student) model with 29 covariates
ITALY: 81% OF STUDENT UNITS ARE EXCLUDED BECAUSE ONE OR MORE STUDENT OR SCHOOL VARIABLES ARE MISSING
STEPS OF ANALYSIS
• Missing data pattern
• Imputation strategies
• Evaluation of results
OECD’S PISA DATASET: TWO SUBSETS OF VARIABLES
• COLLECTED VARIABLES: data collected by the student and school questionnaires
• DERIVED VARIABLES: computed from the collected variables (by linear combination or factor analysis). This increases the potential of the survey
EXAMPLE OF DERIVED VARIABLES
• The PISA 2003 index of confidence in ICT internet tasks is derived from students’ responses to the five items. All items are inverted for IRT scaling and positive values on this index indicate high self-confidence in ICT internet tasks
• The PISA 2003 index of school size (SCHLSIZE) is derived from summing school principals’ responses to the number of girls and boys at a school
• The PISA 2003 index of availability of computers (RATCOMP) is derived from school principals’ responses to the items measuring the availability of computers. It is calculated by dividing the number of computers at school by the number of students at school
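The two school-level indices above are simple arithmetic on collected items, so they can be sketched in a few lines. The function names and the rule that a missing collected item makes the derived index missing are assumptions made for illustration; the slides do not state how PISA handles that case.

```python
from typing import Optional


def school_size(n_girls: Optional[int], n_boys: Optional[int]) -> Optional[int]:
    """SCHLSIZE: sum of the reported numbers of girls and boys."""
    if n_girls is None or n_boys is None:
        # Assumption: a missing collected item makes the derived index missing.
        return None
    return n_girls + n_boys


def computer_ratio(n_computers: Optional[int], n_students: Optional[int]) -> Optional[float]:
    """RATCOMP: number of computers divided by number of students."""
    if n_computers is None or not n_students:
        # Missing input or zero/missing enrolment: the ratio is undefined.
        return None
    return n_computers / n_students


# Hypothetical school: 310 girls, 290 boys, 60 computers.
size = school_size(310, 290)
ratio = computer_ratio(60, size)
print(size, ratio)
```

Under this assumption a single unanswered item in the school questionnaire is enough to lose a derived variable, which is one way partial non-response spreads through the dataset.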
FIVE IMPUTATION PROCEDURES
• PROCEDURE A: Iterative and sequential multiple regression applied to each section of the student questionnaire
• PROCEDURE B: Iterative and sequential multiple regression applied to imputation classes computed by a regression tree
• PROCEDURE C: Random selection of donors within imputation classes computed by a regression tree
• PROCEDURE D: Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire
• PROCEDURE E: Iterative and sequential multiple regression applied to the whole dataset
USUAL ASSOCIATIONS AND ANTINOMIES OF THE ADOPTED IMPUTATION PROCEDURES (A-E)
ALL PROCEDURES BELONG TO WELL-KNOWN CATEGORIES
• TWO CATEGORIES OF METHODS ARE INVOLVED: REGRESSION METHODS (A,B,E) AND DONOR METHODS (C,D)
• DIMENSION OF THE TREATED DATA MATRIX: THE IMPUTATION PROCEDURE IS (A,D) / IS NOT (B,C,E) APPLIED TO EACH SECTION OF THE QUESTIONNAIRE
• TWO SIDES OF THE DATA MATRIX ARE INVOLVED: UNITS (Classification And Regression Tree: B,C,D) AND VARIABLES (A,D)
• THE MISSING DATA MECHANISM IS (A,E) / IS NOT (B,C,D) CONSIDERED
PROCEDURE A
Iterative and sequential multiple regression (Raghunathan et al. 2001) on each section of the student questionnaire
The data matrix is partitioned into the seven sections of the student questionnaire
The features of each section, as a partition of the data matrix:
• Strong logical links between the questions
• Homogeneous structure of association and relationship
• Homogeneous presence of missing data
PROCEDURE B
Iterative and sequential multiple regression applied to imputation classes computed by a regression tree
STEP I: UNITS CLASSIFICATION
Computation of a regression tree (14 terminal nodes)
• DEPENDENT VARIABLE
  • Missing data for each student
• PREDICTORS
  • Selected from five categories of derived indicators θ:
    • Family background
    • Scholastic context
    • Approach to study
    • Attitudes toward ICT tools
    • Performance scores
STEP II: IMPUTATION
Each terminal node of the tree is considered as an imputation class. Its missing values are imputed by the iterative and sequential regression model (Raghunathan et al. 2001)
PROCEDURE C
Random selection of donors within imputation classes computed by a regression tree
STEP I: UNITS CLASSIFICATION
Computation of a regression tree (14 terminal nodes)
• DEPENDENT VARIABLE
  • Missing data for each student
• PREDICTORS
  • Selected from five categories of derived indicators θ:
    • Family background
    • Scholastic context
    • Approach to study
    • Attitudes toward ICT tools
    • Performance scores
STEP II: IMPUTATION
A different donor is selected to impute each missing value of each student. The donor is selected randomly from the same node
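The donor step of this procedure can be sketched as a random hot deck within classes. This is a minimal illustration rather than the authors' implementation: the class labels stand in for the terminal nodes of the tree, and the variable names, data and seed are hypothetical.

```python
import random
from collections import defaultdict


def hot_deck_within_classes(records, classes, seed=0):
    """Impute each missing value (None) with the value of a donor drawn
    at random from the same imputation class (tree terminal node)."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, node in enumerate(classes):
        members[node].append(idx)

    imputed = [dict(rec) for rec in records]  # leave the input untouched
    for idx, rec in enumerate(records):
        for var, value in rec.items():
            if value is not None:
                continue
            # Donors: units of the same node that actually observed `var`.
            donors = [d for d in members[classes[idx]]
                      if records[d][var] is not None]
            if donors:
                # A donor is drawn independently for every missing value,
                # so one student may receive values from several donors.
                imputed[idx][var] = records[rng.choice(donors)][var]
    return imputed


# Hypothetical data: two questionnaire items, two terminal nodes.
records = [
    {"books": 3, "pc": 1},
    {"books": None, "pc": 1},
    {"books": 2, "pc": None},
    {"books": 5, "pc": 0},
    {"books": None, "pc": 0},
]
classes = ["t4", "t4", "t4", "t5", "t5"]
completed = hot_deck_within_classes(records, classes)
print(completed)
```

Because only observed values are copied, every imputed value is one that really occurs in the node. A value stays missing only if no unit in the node observed that variable.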
PROCEDURE D
Random selection of donors within imputation classes computed by a regression tree for each section of the student questionnaire
STEP I: Matrix partition
THE DATA MATRIX IS PARTITIONED INTO THE SEVEN SECTIONS OF THE STUDENT QUESTIONNAIRE
STEP II: Units classification
A REGRESSION TREE IS PRODUCED WITHIN EACH PARTITION OF THE MATRIX (see the next slide)
STEP III: Imputation
WITHIN ALL LEAVES, A DIFFERENT DONOR IS SELECTED TO IMPUTE EACH MISSING VALUE OF EACH STUDENT. THE DONOR IS SELECTED RANDOMLY FROM THE SAME NODE
PROCEDURE E
ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (Raghunathan et al. 2001) ON THE WHOLE DATASET (without any partition of units and variables)
Classification And Regression Tree
Classification and Regression Tree (CART) creates a tree-based classification model. It classifies cases into groups or predicts values of a dependent (target) variable (Y) based on values of independent (predictor) variables (X).
The classification is obtained through the recursive binary partition of the measurement space: each PARENT NODE is split into two CHILD NODES until the TERMINAL NODES are reached. The nodes contain subgroups of target variable values that are internally homogeneous, and the terminal nodes correspond to the imputation cells.
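A single step of the recursive binary partition can be sketched as a search for the threshold on one predictor that makes the two child nodes as homogeneous as possible. This sketch uses the within-node sum of squared deviations as the homogeneity criterion, which is an assumption (the usual least-squares choice for regression trees). The data and function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (within-node heterogeneity)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, left_indices, right_indices) for the binary split
    x <= threshold that minimises the total SSE of the two child nodes,
    or None if x takes a single value and cannot be split."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [i for i, v in enumerate(x) if v <= threshold]
        right = [i for i, v in enumerate(x) if v > threshold]
        cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if best is None or cost < best[0]:
            best = (cost, threshold, left, right)
    return None if best is None else best[1:]


# Hypothetical predictor (a derived index) and target (missing items per student).
x = [1, 2, 3, 10, 11, 12]
y = [0, 1, 0, 5, 6, 5]
threshold, left, right = best_split(x, y)
print(threshold, left, right)
```

Applying the same search again inside each child node, until a stopping rule is met, yields the terminal nodes used as imputation cells.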
STRUCTURE OF A REGRESSION TREE
Example: a tree T composed of five nodes t_i, i = 1, 2, 3, 4, 5 (t1, t2, t3, t4, t5)
Impurity of a node t
For any split s of t into tL and tR, the best split s* is the one that yields the largest decrease in impurity
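The formulas of this slide did not survive extraction. As a hedged reference, the usual least-squares definitions for a regression tree are given below. The notation ($N_t$ for the number of units in node $t$, $\bar{y}_t$ for the node mean, $p_L$ and $p_R$ for the shares of units sent to each child) is an assumption and may differ from the one used by the authors.

```latex
\[
i(t) = \frac{1}{N_t} \sum_{n \in t} \left( y_n - \bar{y}_t \right)^2 ,
\qquad
\Delta i(s, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R) ,
\qquad
s^* = \arg\max_{s} \Delta i(s, t)
\]
\[
p_L = \frac{N_{t_L}}{N_t} , \qquad p_R = \frac{N_{t_R}}{N_t}
\]
```

With these definitions the impurity of a node is its within-node variance of the target, and the best split is the one whose two children reduce that variance the most.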
ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (1/2)
PARTITION OF THE VARIABLES
• Variables without missing data: X
• Variables with missing data: Y
STEP 1: The variable with the fewest missing values, Y1, is regressed on the subset of variables without missing data, U = X
STEP 2: Update U by appending Y1. Then the variable with the next fewest missing values, Y2, is regressed on U = (X, Y1), where Y1 has imputed values
STEP 3: ...
STEP N: Each variable is imputed by using all available variables (complete or imputed)
ITERATIVE AND SEQUENTIAL MULTIPLE REGRESSION (2/2)
NEXT ROUND
THE IMPUTATION PROCESS IS THEN REPEATED, MODIFYING THE PREDICTOR SET TO INCLUDE ALL X AND Y VARIABLES EXCEPT THE ONE USED AS THE DEPENDENT VARIABLE
IN THIS WAY ALL MISSING DATA ARE IMPUTED FOR EACH VARIABLE
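The two rounds described above can be sketched as follows. This is a deliberately simplified illustration, not the method of Raghunathan et al. (2001) as such. It assumes that all variables are continuous, uses ordinary least squares, and imputes the fitted value rather than a random draw from a predictive distribution. The function names, the number of rounds and the data are hypothetical.

```python
def least_squares(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    Assumes the design matrix has full column rank."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * v for r, v in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * c for a, c in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def sequential_regression_impute(data, rounds=5):
    """data maps variable name -> list of floats, with None for missing."""
    n = len(next(iter(data.values())))
    missing = {v: [i for i, x in enumerate(col) if x is None]
               for v, col in data.items()}
    complete = [v for v in data if not missing[v]]               # X
    incomplete = sorted((v for v in data if missing[v]),
                        key=lambda v: len(missing[v]))           # Y1, Y2, ...
    filled = {v: list(col) for v, col in data.items()}

    def impute(target, predictors):
        design = [[1.0] + [filled[p][i] for p in predictors] for i in range(n)]
        observed = [i for i in range(n) if data[target][i] is not None]
        beta = least_squares([design[i] for i in observed],
                             [data[target][i] for i in observed])
        for i in missing[target]:
            filled[target][i] = sum(b * x for b, x in zip(beta, design[i]))

    # First round: start from U = X and append each Y once it is imputed.
    U = list(complete)
    for y in incomplete:
        impute(y, U)
        U.append(y)
    # Next rounds: every other variable (X and Y) is used as a predictor.
    for _ in range(rounds):
        for y in incomplete:
            impute(y, [v for v in data if v != y])
    return filled


# Hypothetical data in which y = 2x + 1 and one y value is missing.
result = sequential_regression_impute({"x": [1.0, 2.0, 3.0, 4.0],
                                       "y": [3.0, 5.0, None, 9.0]})
print(result["y"])
```

Each regression is fitted only on the units where the target was actually observed, so the observed values are never overwritten. Only the imputed cells change from one round to the next.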
EVALUATION OF IMPUTATION PROCEDURES
• IMPACT ON UNIVARIATE DISTRIBUTIONS
• IMPACT ON THE RELATIONSHIP BETWEEN VARIABLES
IMPUTATION EFFECTS ON UNIVARIATE DISTRIBUTIONS
CATEGORICAL VARIABLES
• ABSOLUTE RELATIVE SQUARE DISSIMILARITIES INDEX (LETI 1983); N denotes the number of categorical variables
CONTINUOUS VARIABLES
• ABSOLUTE RELATIVE VARIATION INDEX (AMONG MEANS)
• ABSOLUTE RELATIVE VARIATION INDEX (AMONG STANDARD DEVIATIONS)
The education survey data have been analysed with multilevel models.
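The exact formulas of these indices were not preserved in the slide text. For the two continuous-variable indices, a plausible reading is the absolute change of a statistic after imputation relative to its pre-imputation value, |post - pre| / |pre|. The sketch below implements that assumed form on hypothetical data. It should not be taken as the authors' definition.

```python
from statistics import mean, pstdev


def abs_relative_variation(pre, post):
    """Assumed form of the index: |post - pre| / |pre|."""
    return abs(post - pre) / abs(pre)


# Hypothetical variable: four observed values, then two values imputed
# with the observed mean (13.0).
observed = [12.0, 15.0, 11.0, 14.0]
completed = observed + [13.0, 13.0]

mean_variation = abs_relative_variation(mean(observed), mean(completed))
sd_variation = abs_relative_variation(pstdev(observed), pstdev(completed))
print(mean_variation, sd_variation)
```

In this toy case the mean is left unchanged while the standard deviation shrinks. This is why an index on the means alone is not enough to judge an imputation procedure.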
IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (1/2)
Variation Association Index (categorical variables)
Mean difference, for each imputed variable (Yj), between the pre-imputation and post-imputation association of Yj with the remaining n-1 categorical variables
IMPUTATION EFFECTS ON RELATIONSHIP AMONG VARIABLES (2/2)
Variation Association Index (continuous variables)
Mean difference, for each imputed variable (Yj), between the pre-imputation and post-imputation correlation of Yj with the remaining n-1 continuous variables
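For the continuous case, the index can be sketched as the mean, over the other variables, of the change in Pearson correlation with Yj. Two choices here are assumptions not confirmed by the slides: pre-imputation correlations are computed on pairwise complete cases, and the differences are taken in absolute value. The data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation on the pairs where both values are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


def association_variation(pre, post, target):
    """Mean absolute change in the correlation of `target` with every
    other variable, before versus after imputation."""
    others = [v for v in pre if v != target]
    diffs = [abs(pearson(post[target], post[v]) - pearson(pre[target], pre[v]))
             for v in others]
    return sum(diffs) / len(diffs)


# Hypothetical data: y is missing for the third unit.
pre = {"y": [1.0, 2.0, None, 4.0], "x": [2.0, 4.0, 6.0, 8.0]}
good = {"y": [1.0, 2.0, 3.0, 4.0], "x": pre["x"]}   # imputation keeps the link
bad = {"y": [1.0, 2.0, 10.0, 4.0], "x": pre["x"]}   # imputation distorts it

print(association_variation(pre, good, "y"))
print(association_variation(pre, bad, "y"))
```

A value near zero indicates that the imputed values leave the correlation structure of Yj essentially unchanged.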
SCORES MATRICES
Each of the five matrices VPG (N×5), whose generic element is $g_{js}$, is transformed into a (0,1) score matrix SI (N×5) with generic element $a_{js}$:
$a_{js} = 1$ if $g_{js} = \min_{s}\{g_{js}\}$, i.e. if $g_{js}$ is the minimum value in row $j$
$a_{js} = 0$ otherwise
BUILDING A RANKING INDICATOR(1/3) The ranking indicators measure the relative performance of each procedure according to each evaluation index
BUILDING A RANKING INDICATOR (2/3)
The vector of (0,1) scores extracted from the S matrix (for each procedure and for each evaluation indicator) is reduced to a scalar by summing its elements
This sum is divided by the number of vector elements to obtain a ranking index R, whose range is [0,1]
BUILDING A RANKING INDICATOR (3/3)
R = 0: lowest performance of the s-th procedure compared to the other procedures for the generic evaluation index G
R = 1: highest performance of the s-th procedure compared to the other procedures for the generic evaluation index G
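The passage from an evaluation index to the ranking index R can be sketched as follows. The matrix values are hypothetical, with rows as imputed variables and columns as procedures A to E. The function names are illustrative only.

```python
def score_matrix(G):
    """a_js = 1 if g_js is the minimum of row j, 0 otherwise."""
    S = []
    for row in G:
        best = min(row)
        S.append([1 if g == best else 0 for g in row])
    return S


def ranking_index(S):
    """For each procedure (column), the sum of its scores divided by
    the number of rows, giving a value in [0, 1]."""
    n_rows = len(S)
    return [sum(row[s] for row in S) / n_rows for s in range(len(S[0]))]


# Hypothetical evaluation index for 4 variables under procedures A, B, C, D, E.
G = [
    [0.10, 0.30, 0.25, 0.40, 0.20],
    [0.05, 0.02, 0.07, 0.09, 0.03],
    [0.50, 0.45, 0.60, 0.70, 0.40],
    [0.12, 0.15, 0.11, 0.20, 0.18],
]
S = score_matrix(G)
R = ranking_index(S)
print(S)
print(R)
```

In this sketch a tie on the row minimum gives a score of 1 to every tied procedure, so the values of R need not sum to 1 across procedures.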
FROM AN EVALUATION INDICATOR TO A RANKING INDICATOR
EVALUATING THE IMPACT ON MARGINAL DISTRIBUTIONS AND ON SOME DISTRIBUTIVE PARAMETERS
EVALUATING THE IMPUTATION IMPACT ON THE ASSOCIATION AMONG VARIABLES
CONCLUDING REMARKS
• MISSING DATA IMPUTATION IS AN EXTREMELY COMPLEX PROCESS
• EACH METHOD SHOWS CRITICAL ASPECTS
• IT IS IMPORTANT TO DEVELOP A RECONSTRUCTION STRATEGY CONSIDERING SOME BASIC ASPECTS:
  • THE MISSING DATA PATTERN
  • THE IMPACT ON THE STATISTICAL DISTRIBUTIONS
  • THE IMPACT ON THE ASSOCIATIONS AMONG VARIABLES