Jan Goebel. SOEP and DOI: Requirements and Challenges
Content • SOEP Overview • Problems • Conclusions
SOEP Overview • The Socio-Economic Panel Study (SOEP) is a representative longitudinal study of private households in Germany • Annual survey since 1984 of about 10,000 households (around 20,000 persons) • Topics include household composition, occupational biographies, employment, earnings, health, and indicators of subjective well-being
SOEP is an ongoing survey • In common with all panel surveys • Each year we distribute an enhanced version with new and changed data • Questions change, new topics are added, ... → We do a lot, but not just replication! • Even "archived data" can change, e.g. through a change in the ISCO (International Standard Classification of Occupations) coding scheme
SOEP is not one dataset but a complex data structure • The SOEP currently (User DVD) consists of: • More than 320 data files • About 40,000 variables • Which granularity to choose for citation? • The complete SOEP distribution of one year? • "Connected" SOEP parts, e.g. individual questionnaires, household (HH) questionnaires, generated datasets? • Each data file? • Each variable (for each year or only once; longitudinal concept)?
"The SOEP" is available in different versions • European users: 100% version (English, German; different formats for SAS/SPSS/Stata/ASCII) • Non-EU users: 95% version (of cases) • International comparative research: part of the CNEF (Cross-National Equivalent File) • SOEP Geocodes (supplementary CD): regional planning regions, community types, etc. • Country codes, community codes, zip codes, microm: only by remote execution or at the Research Data Center (RDC SOEP) • SOEP Pretests • SOEP Related Studies
SOEP can change within a distribution period because of updates • Updates of weighting schemes or even bug fixes (also possible for older waves) • Sometimes more than one update between distributions (cumulative updates?) • How can a user know which version she is using? • Message-Digest Algorithm (MD5) • Secure Hash Algorithm (SHA-2) • Universal Numeric Fingerprint (UNF) • Does rounding matter? • German/English labels, different formats (SPSS, Stata, …) • What if an update only fixes a label bug?
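The difference between the fingerprints listed above can be sketched in a few lines of Python. This is a minimal illustration, not SOEP tooling: `file_sha256` hashes the raw bytes of a file with SHA-256 (a member of the SHA-2 family), while `content_fingerprint` is a hypothetical, much simplified stand-in for the idea behind a content-based fingerprint. It is not the UNF algorithm. The function names, the tab-separated normalization, and the default of 7 significant digits are assumptions made for the example.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_fingerprint(rows, digits=7):
    """Hash the data values only, rounding floats to `digits` significant digits.

    Simplified illustration of a content-based fingerprint; NOT the UNF algorithm.
    Labels and file format never enter the hash, only the values in `rows`.
    """
    digest = hashlib.sha256()
    for row in rows:
        cells = []
        for value in row:
            if isinstance(value, float):
                # Scientific notation with a fixed number of significant digits.
                cells.append(format(value, f".{digits - 1}e"))
            else:
                cells.append(str(value))
        digest.update("\t".join(cells).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

In this sketch a byte-level hash changes with every change to the file, including a corrected label or a different export format, whereas the content-based fingerprint only changes when the rounded values change. Which behaviour is wanted depends on whether a label fix should count as a new version.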
Conclusions • Nesting of DOIs should be possible • It should be possible for a user to identify the data, including the version • The metadata of a DOI should include a hash (e.g. SHA-2) for each data file and format, which must also be persistent • A commitment to persistence by the data provider • Identifying the data source is not enough to make a scientific empirical analysis reproducible; you normally need the syntax as well
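One possible way to attach a hash for each data file and format to a versioned release is a manifest that maps every distributed file to its SHA-256 digest. The sketch below is an assumption about how this could look, not a description of how SOEP or any DOI registry stores metadata. The manifest layout, the function names `build_manifest`, `to_json`, and `verify`, and the idea of one sub-directory per format are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(root, version):
    """Map every file under `root` (relative path) to its SHA-256 hex digest."""
    root = Path(root)
    files = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # read_bytes() loads the whole file; fine for a sketch, chunk for large files.
        files[path.relative_to(root).as_posix()] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"version": version, "algorithm": "SHA-256", "files": files}


def to_json(manifest):
    """Serialize the manifest deterministically, e.g. to store alongside DOI metadata."""
    return json.dumps(manifest, indent=2, sort_keys=True)


def verify(root, manifest):
    """Return the relative paths whose current digest differs from the manifest."""
    current = build_manifest(root, manifest["version"])["files"]
    return sorted(
        name for name, digest in manifest["files"].items()
        if current.get(name) != digest
    )
```

A user who holds a local copy could call `verify(local_dir, manifest)` with the published manifest: an empty list means every file matches the cited version, and any returned path points to a file that was updated, is missing, or was altered.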