Migrating from SPSS to SIR Return from Anarchy Jon Johnson 11 May 2005
Introduction • CLS runs 3 of the 4 British Birth Cohort Studies • Multi-disciplinary study of the life-course of three generations born in 1958, 1970 and 2000 • Data collected in various ways: paper, CAPI, administrative data • Complex data: 100,000 variables, 18,000 participants per study
History • Punch cards, different data centres, SIR, SPSS • The data has been through the range of data storage fashions • Social science versus Medical data access models • Goal of increased accessibility and understanding of relationships within data • Development of social science meta-data standards
Current Data Collection • Data collection methods such as CAPI have both a positive and a negative side • Data is pre-punched • Data is pre-checked • Data is less understandable • Data is more complicated • Recent data supplied for one sweep had more than 100,000 variables
Taming data • Datasets are routinely supplied in SPSS format • SPSS is not an ideal environment to manage such data • SIR is an ideal environment to manage this data
Data Migration with minimum information loss • SPSS Data List: rarely used, high level of manual intervention • Visual Basic (a.k.a. SaxBasic): platform dependent; limited functionality, multi-step process • ODBC: flaky at best • Reverse engineer the SPSS file: SPSS Portable format is a stable, if poorly documented, format
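The portable format is plain text, and according to the PSPP developer documentation it writes numbers in base 30 (digits 0-9 followed by A-T), each terminated by a slash. The following is a minimal Python sketch of decoding one such field, assuming that description is accurate. It ignores exponents and the system-missing marker, and the name `decode_base30` is hypothetical rather than anything taken from the talk.

```python
def decode_base30(field):
    """Decode one base-30 numeric field such as '1.F/' (hypothetical helper)."""
    if not field.endswith("/"):
        raise ValueError("numeric fields are expected to end with '/'")
    body = field[:-1].strip()
    # An optional leading sign applies to the whole number.
    sign = -1 if body.startswith("-") else 1
    body = body.lstrip("+-")
    # Split into the part before and after the base-30 point.
    whole, _, frac = body.partition(".")
    value = int(whole, 30) if whole else 0
    # Each fractional digit is worth 30**-position.
    for position, digit in enumerate(frac, start=1):
        value += int(digit, 30) / 30 ** position
    return sign * value
```

For example, `decode_base30("10/")` gives 30, because the field is read as one thirty and zero units.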
Implementation • PQL, Perl or Python? • Stable across operating systems • Good text manipulation • Good XML support • Case based databases
How it works • Parses the SPSS file • Grabs variable names, value labels, data values etc. • Looks up a configuration file for DDI settings • Checks whether it is also setting up the database or just adding a new record • Does some conversions: time, date, scaled variables • Analyses the data to find the range of values • Writes out a warning if a variable has more than 3 missing values or a range of missing values • Writes out the schema • python spss_parser.py -f <input filename> -s <sir config file> -d <ddi config file>
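A few of the steps above can be sketched in Python. This is an illustrative sketch only, not the actual `spss_parser.py`: all function names and the option destinations are assumptions, and only the `-f`, `-s` and `-d` flags come from the command line shown in the slide. The date conversion relies on SPSS storing dates as seconds since 14 October 1582.

```python
import argparse
from datetime import datetime, timedelta

# SPSS counts dates in seconds from the start of the Gregorian calendar.
SPSS_EPOCH = datetime(1582, 10, 14)


def spss_to_datetime(seconds):
    """Convert an SPSS date value (seconds since 14 Oct 1582) to a datetime."""
    return SPSS_EPOCH + timedelta(seconds=seconds)


def value_range(values, missing=()):
    """Return (min, max) over the non-missing values, or None if there are none."""
    valid = [v for v in values if v is not None and v not in missing]
    return (min(valid), max(valid)) if valid else None


def missing_warnings(name, discrete=(), missing_range=None):
    """Build warnings for > 3 discrete missing values or a missing-value range."""
    warnings = []
    if len(discrete) > 3:
        warnings.append(f"{name}: {len(discrete)} missing values declared")
    if missing_range is not None:
        low, high = missing_range
        warnings.append(f"{name}: missing values given as range {low} to {high}")
    return warnings


def build_arg_parser():
    """Mirror the -f / -s / -d options shown on the slide (dest names assumed)."""
    parser = argparse.ArgumentParser(prog="spss_parser.py")
    parser.add_argument("-f", dest="input_file", required=True,
                        help="input filename")
    parser.add_argument("-s", dest="sir_config", help="SIR config file")
    parser.add_argument("-d", dest="ddi_config", help="DDI config file")
    return parser
```

A call such as `missing_warnings("income", discrete=(-9, -8, -7, -1))` returns a single warning, while a variable with three or fewer discrete missing values and no range returns an empty list.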
Use • Once in SIR the data can be restructured • Can be extended to datasets held in other statistical packages such as Stata or SAS, by going via Stat/Transfer to SPSS portable format and continuing from there • Also creates XML to add to a data store (superseded)
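As a rough illustration of the XML output mentioned above, one variable's metadata could be written with Python's standard `xml.etree.ElementTree` module. The element and attribute names here (`variable`, `label`, `valueLabels`, `value`, `code`) and the example variable are invented for the sketch. They are not the DDI schema and not the format the original tool produced.

```python
import xml.etree.ElementTree as ET


def variable_to_xml(name, label, value_labels):
    """Build an XML element for one variable (hypothetical element names)."""
    var = ET.Element("variable", {"name": name})
    ET.SubElement(var, "label").text = label
    values = ET.SubElement(var, "valueLabels")
    # One child element per coded value, in code order.
    for code, text in sorted(value_labels.items()):
        ET.SubElement(values, "value", {"code": str(code)}).text = text
    return var


element = variable_to_xml("sex", "Sex of cohort member", {1: "Male", 2: "Female"})
xml_text = ET.tostring(element, encoding="unicode")
```

Here `xml_text` holds a single `<variable>` element that a data store could ingest or that could be merged into a larger document.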