This tool supports metadata collection and the storage and organisation of data resources in a generic way. It also supports merging and standardising variables through a fusion tool, integrates with e-Stat for model-building and pre-analysis adjustments, and links to the wider DAMES services.
For the e-Stat meeting of 27 Sept 2010 Paul Lambert / DAMES Node inputs
1) Progress updates • DAMES Node services of a hopefully generic/transferable nature • GESDE services on occupations, educational qualifications and ethnicity (www.dames.org.uk) • Data curation tool • Data fusion tool for merging data and recoding/standardising variables
GESDE: online services for data coordination/organisation. Tools for handling variables in social science data: recoding measures; standardisation/harmonisation; linking; curating. DIR workshop: Handling Social Science Data
The data curation tool The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way
Currently: Expected inputs to e-Stat, Autumn 2010 First applications in integrating DAMES data preparation tools with e-Stat model-building systems • {Coordination/planning on WP1.6 workflow tools for pre-analysis} (?De Roure, McDonald, Michaelides, Lambert, Goldstein, Southampton RA?) • Template construction with applications using variable recodes and other pre-analysis adjustments from DAMES systems, with a view to generating generic template facilities • Preparation of some ‘typical’ example survey data/models (e.g. 10k+ cases, 50+ variables) and their implementation in e-Stat, e.g. cross-national/longitudinal comparability examples • Possible e-Stat inputs to DAMES workshops (Nov 24-5/Jan 25-6)
7a) Links with DAMES • DAMES Node core funding period is Feb 2008 to Jan 2011 • Further discussion of integrating pre-analysis services from DAMES into e-Stat facilities and templates • Appetite for other application-oriented contributions? • Alternative measures for the ‘changing circumstances during childhood’ application? • ?Preparation of illustrative application(s) with complex survey data • Would need data spec. and broad analytical plan
Pre-analysis options associated with DAMES Things that could be facilitated by the fusion tool (R scripts) in combination with the curation tool and if relevant specialist data (e.g. from GESDE) • Alternative measures/derived data • [via deterministic matches/variable transformation routines] • Using GESDE: Occupations, educational quals, ethnicity • (?Health oriented measures using Obesity e-Lab dropbox facility?) • Generic routines: Arithmetic standardisation tools • Replicability of measurement construction (e.g. syntax log of tasks) • Other possible data/review possibilities • [new but easy] Routine for summarizing data (see wish list) • [new, probably not easy] Weighting data options; routine for identifying values with high leverage / high residuals • (?provided elsewhere) Probabilistic matching routines
Model for data locations? • ‘Curation tool’ can be used to attach variable names and metadata to facilitate variable processing • We then have a model of storing the data in a secure remote location (an iRODS server), from where jobs can be run on it (e.g. in R; see the sketch below) • Is this a suitable model for e-Stat? • Is there another data location model? • Or better to supply scripts to run on files in an unspecified location?
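A minimal sketch of that model from the R side, assuming the iRODS i-commands (iget) are available on the host running the job; the iRODS path shown is hypothetical:
## Sketch only: fetch a file from the iRODS store, then read it for processing
remote_file <- "/damesZone/home/pl3/wave1.dta"    # hypothetical iRODS path
local_file <- "wave1.dta"
system(paste("iget -f", remote_file, local_file)) # copy from the remote store
library(foreign)
wave1 <- read.dta(local_file, convert.factors = FALSE)
str(wave1)                                        # quick check of the variables received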
Mechanism 1: Deterministic link • Here information is joined on the basis of exact matching values • Example Condor job:
universe = vanilla
executable = /usr/bin/R
arguments = --slave --vanilla --file=bhps_test.R --args /home/pl3/condor/condor_5/wave1.dta /home/pl3/condor/condor_5/wave17.dta /home/pl3/condor/condor_5/bhps_combined.dat pid wave file pid wave file
notification = Never
log = test1.log
output = test1.out
error = test1.err
queue
The input files here are Stata format data • The output is plain text format data • There are 3 linking variables, which happen to have the same names on both files • i.e. ‘pid wave file’ on file 1, and also ‘pid wave file’ on file 2 • Different names would be fine, but the same number of linking variables on both files is essential • Different total numbers of linking variables are fine (most often there is only 1) • Different R templates can be used to read data in different formats (e.g. Stata, SPSS, plain text), though exported data can only be readily supplied in plain text
The R template being run in the above application is:
## parse the command line arguments: input file A, input file B, output file,
## then the linking variables for each file
args <- as.factor(commandArgs(trailingOnly = TRUE))
options(useFancyQuotes=TRUE)
fileAinp <- as.character(args[1])
fileBinp <- as.character(args[2])
fileCout <- as.character(args[3])
## read the two input files (Stata format)
library(foreign)
fileA <- read.dta(fileAinp, convert.factors=F)
fileB <- read.dta(fileBinp, convert.factors=F)
## split the remaining arguments into the linking variables for file A and file B
nargs <- sum(!is.na(args))
allvars <- args[4:nargs]
nargs2 <- (sum(!is.na(allvars)))
first_vars <- as.character(allvars[1:(nargs2/2)])
second_vars <- as.character(allvars[((nargs2/2)+1):nargs2])
## merge on the linking variables (keeping all cases from file A)
combined2 <- merge(fileA, fileB, by.x=c(first_vars), by.y=c(second_vars),
                   all.x=T, all.y=F, sort=F, suffixes = c(".x",".y"))
## write the combined data out as comma-separated plain text
write.table(combined2, file=fileCout, col.names=TRUE, sep=",")
Mechanism 2: Probabilistic link • This is when data from different files are linked on criteria which are not just an exact match of values, but include some probabilistic algorithm • E.g. for each person in data 1, select a random person from the pool of people in data 2 who share the same characteristics (e.g. aged 35-40, male, education = high, marital status = married), and link their voting preference data to the person in data 1 • Other implementation requirements are equivalent to deterministic matching, so long as the criteria for the matching algorithm are determined • Status: we don’t yet have a pool of probabilistic matching algorithms; we have one so far, which is random matching as in the above example
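A minimal sketch of that one algorithm (random matching within cells of shared characteristics); the data frames data1/data2 and the variable names age_band, sex, educ, marstat and vote are illustrative, not from any particular dataset:
## For each case in data1, draw one donor at random from the cases in data2
## that share the matching characteristics, and take the donor's vote value
match_vars <- c("age_band", "sex", "educ", "marstat")
set.seed(2010)
pick_donor <- function(row) {
  pool <- data2
  for (v in match_vars) pool <- pool[pool[[v]] == row[[v]], , drop = FALSE]
  if (nrow(pool) == 0) return(NA)          # no donor with these characteristics
  pool$vote[sample(nrow(pool), 1)]         # random donor's voting preference
}
data1$vote_linked <- sapply(seq_len(nrow(data1)), function(i) pick_donor(data1[i, ]))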
Mechanism 3: Recoding/Transforming • Here the scenario is the application of an externally provided data recode, or other externally instructed arithmetic operation, onto a variable within data 1 • E.g. take the educational qualifications measure which is coded 1 to 20 in data 1; recode 1 thru 5 to the value 1, 6 thru 10 to the value 2, and all others to the value 3 (this is statistically equivalent to a deterministic match, but some recode inputs may not list every possible value) • E.g. take the measure of income and calculate its mean standardised values within subgroups defined by regions (i.e. subtract the regional mean and divide by the regional standard deviation; see the sketch below) • Status/Requirement: we need to develop a suitable mechanism to take recode-style information/instructions from relevant external sources, and convert it into a suitable format for applying either a ‘recode’ or ‘merge’ routine in R • We’d like to support: • Recode information supplied via SPSS and Stata syntax specifications; data file matrices; and, potentially, manual specifications • Other transformation procedures supplied in advance from a small range of possibilities (e.g. mean standardisation; log transformation; cropping of extreme values) plus a small set of related arguments (e.g. category variables)
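For the within-region standardisation example, a minimal sketch in R (the data frame data1 and the income/region variable names are illustrative):
## Mean standardisation of income within regions:
## (income minus regional mean) divided by regional standard deviation
region_mean <- ave(data1$income, data1$region, FUN = mean)
region_sd   <- ave(data1$income, data1$region, FUN = sd)
data1$income_std <- (data1$income - region_mean) / region_sd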
Recode examples:
Stata syntax: recode var1 1/5=1 6/10=2 *=3, generate(var2)
SPSS syntax: recode var1 (1 thru 5=1) (6 thru 10=2) (else=3) into var2.
Data matrix format: (example not reproduced here)
Manual entry interface (SPSS example): (screenshot not reproduced here)
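One way the fusion tool’s R scripts might apply such an instruction, sketched on the assumption that the recode has first been converted into a simple lookup of old values to new values (matching the examples above: 1-5 to 1, 6-10 to 2, else 3); data1 and var1 are illustrative names:
## Apply a recode supplied as an old -> new lookup, with unlisted values
## falling back to a default ('else') category
recode_map <- data.frame(old = 1:10, new = rep(1:2, each = 5))
default <- 3
idx <- match(data1$var1, recode_map$old)
data1$var2 <- ifelse(is.na(idx), default, recode_map$new[idx])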
=> Linking data management services into the e-Stat template Add ‘data review’ and ‘data construction’ elements, plus possible additional requests for modelling options • Data review: single script with minor variations on data • Data construction: as above, these involve variable operations and linkages with other files/resources • Derive measures on occupations, educational qualifications or ethnicity given information on the character of existing data • Collected via the curation tool, or, more realistically, from a short range of pre-supplied alternatives? • Distributional transformations including standardisation; numeric transformation; review variable distribution • Model extensions: weight cases options; leverage review
8d) Wish lists/Suggestions • Include tools for describing/summarizing data • Outputs from generic ‘summarize’ commands in R linked to all templates (sketched below) • Tool for reviewing model results / leverage, feeding back into model respecification • Tools for applying survey weight variables to analysis(?) • User notes for models constructed (‘What was that?’) • Of benefit to novice and advanced practitioners • Potentially a part of the e-notebook, but could be a linked online guide (static) • e-Stat commands to provide documentation for replication • Terminologies used for the model/other user notes • Software equivalents or near equivalents (including estimator specs) • Algebraic expression and model abstract • Tools for storing/compiling multiple model results • (mentioned previously, cf. ‘est table’ in Stata)
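As a sketch of the generic ‘summarize’ idea, a small R routine of the kind that could be linked to templates (the function name and choice of statistics are illustrative, not a fixed design):
## One row per variable: valid n, missing count, mean and sd where numeric
summarise_data <- function(df) {
  data.frame(variable = names(df),
             n       = sapply(df, function(x) sum(!is.na(x))),
             missing = sapply(df, function(x) sum(is.na(x))),
             mean    = sapply(df, function(x) if (is.numeric(x)) mean(x, na.rm = TRUE) else NA),
             sd      = sapply(df, function(x) if (is.numeric(x)) sd(x, na.rm = TRUE) else NA),
             row.names = NULL)
}
## e.g. summarise_data(wave1)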