SHARE data cleaning meeting
Frankfurt, December 6, 2007
Some suggestions from the Italian experience
Omar Paccagnella

How to proceed?
Data cleaning as a whole can be divided into two stages:
• The frame
  All about the identification of households and/or individuals (ids, demographic characteristics, household composition)
• The picture
  All about individual characteristics and answers (those that can be checked!)

The frame (1)
Are you sure that the interviewed household/individual is the one you want?
• Longitudinal sample
  - The IWER has to contact the same wave 1 household (wrong address? selection errors in the SMS by the IWER?)
    … look at the CV (name, gender, year of birth, children)
• Refresher sample
  - The IWER has to fill in the selected individual at the end of the CV (other 50+ in the household?)
    … look at the CV (name, gender, year of birth)
    … sample representation; oversampling 1955/1956

The frame (2)
Are you sure that ALL eligible individuals are reported in the CV of the household?
• Longitudinal sample
  - You need to know what happened to all wave 1 individuals:
    … w1 individuals not in the w2 CV: deceased? moved out?
    … w1 individuals both deceased and moved out in the w2 CV: check for linking errors (exit instead of longitudinal interview)
    … w1 individuals indicated in the w2 CV as moved in: check the id and the type of interview (baseline vs longitudinal)
    … w1 individuals indicated as “New hh members” after the w2 CV: check the id and the type of interview (baseline vs longitudinal)
    … w2 individuals not in the w1 CV: moved-in questions completed?
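
The longitudinal checks on this slide can be sketched in a few lines of Python. This is only a minimal illustration: the field names `deceased`, `moved_out` and `moved_in`, the person ids and the function name are hypothetical and do not come from the SHARE data files.

```python
def check_wave_linkage(w1_ids, w2_cv):
    """Flag person ids whose wave 1 / wave 2 linkage needs a manual review.

    w1_ids: set of person ids listed in the wave 1 CV.
    w2_cv:  dict person id -> dict with (hypothetical) boolean fields
            'deceased', 'moved_out', 'moved_in' taken from the wave 2 CV.
    """
    flags = {
        # w1 individuals that disappeared from the w2 CV
        "w1_missing_in_w2": sorted(w1_ids - set(w2_cv)),
        "deceased_and_moved_out": [],
        "w1_marked_moved_in": [],
        "new_without_moved_in": [],
    }
    for pid, rec in sorted(w2_cv.items()):
        in_w1 = pid in w1_ids
        if in_w1 and rec.get("deceased") and rec.get("moved_out"):
            flags["deceased_and_moved_out"].append(pid)   # possible linking error
        if in_w1 and rec.get("moved_in"):
            flags["w1_marked_moved_in"].append(pid)       # check id and interview type
        if not in_w1 and not rec.get("moved_in"):
            flags["new_without_moved_in"].append(pid)     # moved-in questions missing?
    return flags


# Made-up example household
w1_ids = {"A1", "A2", "A3", "A4"}
w2_cv = {
    "A1": {},
    "A2": {"deceased": True, "moved_out": True},
    "A3": {"moved_in": True},
    "B1": {},
    "B2": {"moved_in": True},
}
linkage_flags = check_wave_linkage(w1_ids, w2_cv)
```

Every flagged id is a candidate for inspection, not an automatic correction: the decision still requires looking at the CV.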

The frame (2)
Are you sure that ALL eligible individuals are reported in the CV of the household?
• Refresher sample
  - You need to know whether all household members are reported
    … this can be checked only when the sample selection is based on households instead of individuals, or when other household information is available

The frame (3)
Are you sure that demographic information matches correctly within and between waves?
• Within waves
  - Check for mixing up of respondents, e.g. the interview with the husband was done in the SMS row of the wife (refresher vs longitudinal)
    … gender & year of birth must be the same in the CV, DN and XT sections and in the drop-off (where available)
• Between waves
  - Check for mixing up of respondents, e.g. the name of the husband was linked with the name of his wife
    … gender & year of birth must be the same in the CV, DN and XT sections and in the drop-off (where available)
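
A possible way to automate the gender and year-of-birth comparison across sections is sketched below. The data layout (one `(gender, year_of_birth)` pair per person and section) and the function name are assumptions made for the example, not the structure of the actual SHARE files.

```python
def demographic_mismatches(records):
    """Find persons whose gender or year of birth differs between sections.

    records: dict person id -> dict section name -> (gender, year_of_birth).
             Sections that are not available are simply left out;
             a missing value inside a section is given as None.
    Returns dict person id -> dict field -> {section: value} for every
    field with more than one distinct reported value.
    """
    problems = {}
    for pid, sections in records.items():
        for idx, field in enumerate(("gender", "year_of_birth")):
            values = {sec: vals[idx] for sec, vals in sections.items()
                      if vals[idx] is not None}
            if len(set(values.values())) > 1:
                problems.setdefault(pid, {})[field] = values
    return problems


# Made-up example: P2 is reported as female in the CV and male in DN
records = {
    "P1": {"CV": ("M", 1940), "DN": ("M", 1940)},
    "P2": {"CV": ("F", 1945), "DN": ("M", 1945), "DROPOFF": ("F", 1945)},
}
mismatches = demographic_mismatches(records)
```

The same function can be used between waves by treating each wave and section combination (for instance `"w1_CV"`, `"w2_CV"`) as a separate key.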

The frame: summing up
• Check that no interviewed household differs from the selected one (this also means that at least one household member must have the same gender and year of birth as the selected individual in every wave)
• Check that wave 1 eligible individuals are not “forgotten” in wave 2
• Check that the ids of the eligible individuals are properly merged
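
The first check of the summary (the selected individual must be found, by gender and year of birth, in the household of every wave) can be expressed as a short sketch. The tuples and the function name are illustrative assumptions.

```python
def waves_without_selected(selected, cv_by_wave):
    """List the waves in which nobody in the household matches the selected person.

    selected:   (gender, year_of_birth) of the sampled individual.
    cv_by_wave: dict wave number -> list of (gender, year_of_birth),
                one pair per household member in that wave's CV.
    """
    return sorted(wave for wave, members in cv_by_wave.items()
                  if selected not in members)


# Made-up example: in wave 2 the woman's year of birth no longer matches
selected = ("F", 1942)
cv_by_wave = {
    1: [("F", 1942), ("M", 1939)],
    2: [("F", 1952), ("M", 1939)],
}
suspect_waves = waves_without_selected(selected, cv_by_wave)
```

A non-empty result does not prove that the wrong household was interviewed (it may be a typing error in the year of birth), but it tells you where to look.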

To complete the frame …
… check and clean interviewer characteristics!
In the CV there is the “org” variable, but the characteristics of the IWER who completes the interview are only in the IV section:
• Be sure that the same IWER has a unique id number (small/capital letters, spaces, numbers, etc.)
• Check age, gender and education for the same IWER (in wave 1 there were some interviews where the IWER reported the respondent's characteristics instead of his/her own)
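
Both interviewer checks can be sketched as follows. The normalization rule (drop all whitespace, convert to upper case) is an assumption about what counts as "the same" id, and the row layout `(raw_id, age, gender, education)` is invented for the example.

```python
from collections import defaultdict


def normalize_iwer_id(raw):
    """Remove all whitespace and ignore small/capital letter differences."""
    return "".join(str(raw).split()).upper()


def iwer_id_variants(raw_ids):
    """Return normalized id -> sorted raw spellings, when more than one exists."""
    groups = defaultdict(set)
    for raw in raw_ids:
        groups[normalize_iwer_id(raw)].add(raw)
    return {key: sorted(v) for key, v in groups.items() if len(v) > 1}


def inconsistent_characteristics(iv_rows):
    """Return normalized id -> distinct (age, gender, education) tuples,
    for interviewers reported with more than one combination."""
    seen = defaultdict(set)
    for raw_id, age, gender, education in iv_rows:
        seen[normalize_iwer_id(raw_id)].add((age, gender, education))
    return {key: sorted(v) for key, v in seen.items() if len(v) > 1}


# Made-up IV section rows: the third one looks like respondent characteristics
iv_rows = [
    ("it012", 45, "F", "high"),
    ("IT012", 45, "F", "high"),
    ("IT 012 ", 71, "M", "low"),
    ("IT020", 38, "M", "high"),
]
id_variants = iwer_id_variants(row[0] for row in iv_rows)
char_conflicts = inconsistent_characteristics(iv_rows)
```

An interviewer whose age changes by one year during the field period would also be listed by `inconsistent_characteristics`, so the output is a review list rather than a list of certain errors.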

The picture
Check outliers, DK, RF and all values that can be compared with other sources
• Amounts: too large/small values; “0” values; results by IWER
• Physical & cognitive tests: too large values; value of 1 in the “Ten words recall test” (total number of words instead of cited words); tests not completed; rounding of results; same results across trials; results by IWER
• Children: are their ages and the years of some events compatible with the age of the respondents?
• “Other” in answer categories: can the answer be recoded into a category already defined? A large number of “other”: are we missing something?
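
As an illustration of the check on children, the sketch below flags children whose year of birth implies an implausible age of the respondent at the time of birth. The bounds of 12 and 70 years are arbitrary assumptions chosen for the example, and the function name is hypothetical.

```python
def implausible_children(resp_year_of_birth, child_years_of_birth,
                         min_parent_age=12, max_parent_age=70):
    """Return (child year of birth, implied parent age at birth) pairs
    that fall outside the assumed plausible range."""
    flagged = []
    for child_yob in child_years_of_birth:
        age_at_birth = child_yob - resp_year_of_birth
        if not (min_parent_age <= age_at_birth <= max_parent_age):
            flagged.append((child_yob, age_at_birth))
    return flagged


# Made-up example: respondent born in 1940 with three reported children
flagged_children = implausible_children(1940, [1965, 1938, 1950])
```

Step-children and adopted children can legitimately fall outside these bounds, so flagged cases should be inspected before any value is changed.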

Some final thoughts
Data cleaning is not only the correction of some errors; it is also a way to check and evaluate the quality of our datasets: we can find the sections where data are less good (compared to other similar surveys) and the variables that need more attention (both when analyzing the data and when preparing the briefings).
Good data cleaning starts at the beginning of the fieldwork.