160 likes | 230 Views
Multiple Databases or Tabfiles?. Frances Williams ISER, University of Essex. Introduction. Overview of problem Previous solution Proposed solution Reluctance to Change Why we took the plunge How things went Are tabfiles redundant? Conclusion. Overview of problem.
E N D
Multiple Databases or Tabfiles? Frances Williams ISER, University of Essex SIR UK Conference London, June 2007
Introduction • Overview of problem • Previous solution • Proposed solution • Reluctance to Change • Why we took the plunge • How things went • Are tabfiles redundant? • Conclusion SIR UK Conference London, June 2007
Overview of problem • British Household Panel Study (BHPS) • Longitudinal study • Interview same people every year • 16th wave • Two separate database structures • Survey Database for cleaning • User database for release to research community • Used SIR2K SIR UK Conference London, June 2007
Overview of problem (cont) • Survey database (SDB) • One for each wave • Organised for cleaning purposes • Highly structured • Reflects questionnaire • Variable names reflect question numbers SIR UK Conference London, June 2007
Overview of problem (cont) • User database (UDB) • One database incorporating all waves • Flattened structure • Easier for users • Mnemonic variable names • Consistent across time • Year prefix indicates the wave • A – wave 1, B - wave 2 etc • Lots of non-questionnaire variables • Derived variables • Imputed variables • Weighting variables SIR UK Conference London, June 2007
Overview of problem (cont) • Need to convert Survey database into User database • Done on annual basis • When data are clean • Large number of variables to transpose • 40% variables change each year • Some variables merged • Numeric and DK/Ref variables • ~1500 variables on 40+ record types -> ~1300 variables on 7 record types • Some values changed • DK, Refused, Missing and Inapplicable given negative values SIR UK Conference London, June 2007
Previous solution • Developed 15 years ago in SIR2.2 • Uses intermediate tabfile • Stage 1 • Variables from SDB converted to UDB vars (minus wave prefix) • Saved in table equivalent to UDB record • Stage 2 • Tabfile rows converted to UDB database records SIR UK Conference London, June 2007
Previous solution (cont) • Disadvantages • Had to be done in 2 stages • Stage 1 using SDB • Stage 2 building UDB • Time consuming • Each process took over an hour to run • If SDB changed, both process had to be re-run • Error prone • Two places for errors to occur • Extra maintenance • Annual changes to be done in two places SIR UK Conference London, June 2007
Proposed solution • Change to SIR2002 • Allows multiple databases in a retrieval • Would allow one-stage transposition • Quicker • Easier • Less error prone • Less annual maintenance • Less space intensive SIR UK Conference London, June 2007
Reluctance to Change • Lots of work to convert system • 15 databases • 28 tabfiles • SIR 2K doesn’t export correctly • Numerics can get split over two lines • Need to hand-edit export files • Tedious process to fix • Would I be able to get the multiple database code to work? • Solaris • Caseless databases • Will existing code compile? • Conversion from SIR 2.2 to SIR 2K required lots of code revision SIR UK Conference London, June 2007
Why we took the plunge • SIR Oz would not fix SIR 2K export problem • Either stick with SIR 2K indefinitely or deal with problem • The longer we left it the more databases and tabfiles there would be to convert • Multiple databases and other new features in SIR2002 would be very useful in other areas of our systems SIR UK Conference London, June 2007
How things went • It was very long and tedious converting all the databases and tabfiles • Got there in the end • There were some compilation problems • Not as many as when we went from SIR 2.2 to SIR 2K • Do repeat var = A to B doesn’t work when A and B are the same SIR UK Conference London, June 2007
How things went (cont) • Some problems in getting multiple databases to work • Lots of help from SIR Oz • Resultant code much better • One stage process • More maintainable • Much faster • Well worth the effort • Wouldn’t want to do it again • But glad we did do it! SIR UK Conference London, June 2007
Are tabfiles redundant? • Three common situations where I have used tabfiles • Transposition of data from one database to another • Worth updating to SIR 2002 or SIR XS and rewriting code • Static file containing data to be accessed in several different ways by different applications • In the BHPS we have a large number of these • In a new system I would use databases • Not worth rewriting SIR UK Conference London, June 2007
Are tabfiles redundant? (cont) • Temporary file to hold external data • Read in from text file – e.g .csv file • Stored in tabfile and indexed as required • Dropped at end of application • Much easier to use tabfile than database • No schema required • Can be created and dropped ‘on the fly’ SIR UK Conference London, June 2007
Conclusion • Multiple databases are great • Remove need for intermediate tabfiles when accessing data from more than one database • Tabfiles have their uses • Temporary store for external data • Horses for courses • SIR Oz - don’t ever get rid of tabfiles! SIR UK Conference London, June 2007