Explore the intersection of e-Science and social research, focusing on computational challenges like missing data, endogenous effects, and cluster effects in longitudinal studies. Learn about existing tools and the need for a Virtual Research Environment (VRE).
E-Social Science • What is e-Science? • E-Science and e-Social Science • E-Social Science and Longitudinal Data • Examples of the Computational Problems we Currently Face (BHPS, YCS) • Existing Web Based Tools and Possible New Tools • Need for a VRE • The e-Social Science Program
What is e-Science? The technology: the three exponentials • Computer speed doubles every 18 months. Our ability to model and simulate complex systems increases at the same rate; • Storage density doubles every 12 months. Some groups are talking about data sets that are a Petabyte in size and which will be 10 Petabytes/year in 5 years' time; • Network bandwidth doubles every 9 months.
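The three doubling times can be compared on a common horizon. A minimal sketch, assuming steady exponential growth at the quoted rates; the function name and the 5-year horizon are illustrative choices, not part of the original slides.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` if a quantity doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Doubling times (in months) as quoted on the slide.
doubling_times = {
    "computer speed": 18,
    "storage density": 12,
    "network bandwidth": 9,
}

for name, months in doubling_times.items():
    print(f"{name}: x{growth_factor(5, months):.1f} over 5 years")
```

Under these assumptions bandwidth grows roughly ten times faster than processor speed over five years, which is one way to read the case for moving data and computation across a GRID.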
What is e-Science? The GRID concept puts all three components together and makes them even more important. There will be several different types of GRID, e.g. • Computational GRIDs for high-performance computation; • Access GRIDs for collaborative visualization involving distant researchers; • Data GRIDs for moving large volumes of data; • Sensor GRIDs for real-time monitoring (e.g. traffic and pedestrian flows, electronic transactions).
UK e-Science Grid (map of sites): Edinburgh (NeSC), Glasgow NeSC, DL, Newcastle, Belfast, Manchester, Cambridge, Oxford, Hinxton, RAL, Cardiff, London, Southampton
Some Features of Social Science Research • We want to develop evidence-based substantive theory. We want to know “what determines what”, e.g. long-term unemployment and social exclusion • We want to explore the consequences of policy changes on individual behaviour, e.g. encouragement to stay on at school on educational attainment, truancy, and social exclusion • Data may be small (<10GB) but they are complex
Cluster Effects (CE) • Most large-scale longitudinal surveys use multi-stage sample designs to obtain 'representative' samples; this procedure often creates cluster effects, e.g. BHPS (households), YCS (schools). • Elaborate procedures have been developed to take cluster effects into account by means of shared random effects in the model, e.g. MLwiN, Stata (gllamm). • The estimation of non-identity link CE models, e.g. probit, is computationally demanding. The quick approximations, e.g. conditional estimators, do not work in the presence of endogenous variables.
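The computational cost of a non-identity link model comes from the integral over the shared random effect, which has no closed form. A minimal sketch, assuming a random-intercept probit with a standard normal random effect scaled by `sigma`; it uses a plain midpoint rule for the integral, not the estimation methods of MLwiN or gllamm, and all function names and grid settings are illustrative.

```python
import math


def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(u: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def cluster_likelihood(y, eta, sigma, n_points=400, limit=8.0):
    """Marginal likelihood of one cluster under a random-intercept probit.

    y     : 0/1 responses for the members of the cluster
    eta   : fixed-effect linear predictors, one per member
    sigma : standard deviation of the shared random effect
    The random effect u is integrated out with a midpoint rule on [-limit, limit].
    """
    step = 2.0 * limit / n_points
    total = 0.0
    for k in range(n_points):
        u = -limit + (k + 0.5) * step
        prod = 1.0
        for y_i, eta_i in zip(y, eta):
            sign = 1.0 if y_i == 1 else -1.0
            prod *= norm_cdf(sign * (eta_i + sigma * u))
        total += prod * norm_pdf(u) * step
    return total


# One hypothetical cluster of three members.
print(cluster_likelihood([1, 0, 1], [0.2, -0.1, 0.5], sigma=0.8))
```

Every cluster needs one such integral at every function evaluation, and a model with three correlated responses needs a trivariate version of it, which is where the cost multiplies.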
Measurement Errors (ME) • Ignoring ME can seriously mislead the quantification of the link between explanatory and response variables; • In observational studies, it is rarely possible to measure all relevant covariates accurately, e.g. age, educational attainment; • ME in one covariate can bias the association between other covariates and the response variable, even if those other covariates are measured without error; • Repeated measures and longitudinal data provide the opportunity to deal with ME in explanatory variables, but this adds to the computational demands of the analysis.
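The first bullet can be illustrated with a small simulation. A minimal sketch, assuming classical measurement error (independent noise added to a single covariate) in a linear model; the sample size, variances, and seed are made-up values chosen only for illustration.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (single covariate, with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)
n = 20000
beta = 1.0  # assumed true effect of the covariate on the response

x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [beta * x + rng.gauss(0, 1) for x in x_true]
# Observed covariate = true value + independent error with the same variance.
x_obs = [x + rng.gauss(0, 1) for x in x_true]

slope_true = ols_slope(x_true, y)
slope_obs = ols_slope(x_obs, y)
print(f"slope using true covariate:     {slope_true:.3f}")
print(f"slope using mismeasured values: {slope_obs:.3f}")
```

With equal signal and error variances the estimated slope is pulled towards about half of the true value, the familiar attenuation effect.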
Missing Data, Dropout and Selection • All of the major data sets available to the British social science community, such as the YCS, BHPS and NCDS, contain missing data and dropout. • It is mostly non-ignorable, and non-ignorable missing data and dropout are a source of bias. • When there is missing data it is important to try to model, as realistically as possible, the process by which the observed subjects have been retained in the sample; otherwise we will not know whether the selection process has only retained subjects with certain characteristics. • Some sample designs create selection effects. • These features add to the computational demands of the analysis.
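Why non-ignorable dropout biases results can be seen in a toy simulation. A minimal sketch, assuming a standard normal outcome and a logistic retention probability that rises with the outcome itself; both assumptions, the sample size, and the seed are illustrative and not taken from the YCS, BHPS or NCDS.

```python
import math
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


rng = random.Random(1)
n = 50000
outcome = [rng.gauss(0, 1) for _ in range(n)]  # true population mean is 0

# Ignorable dropout: every subject is retained with probability 0.5.
kept_at_random = [v for v in outcome if rng.random() < 0.5]

# Non-ignorable dropout: retention probability depends on the outcome.
kept_selectively = [
    v for v in outcome if rng.random() < 1.0 / (1.0 + math.exp(-v))
]

print(f"full sample mean:        {mean(outcome):.3f}")
print(f"random dropout mean:     {mean(kept_at_random):.3f}")
print(f"selective dropout mean:  {mean(kept_selectively):.3f}")
```

Both dropout schemes retain about half the sample, but only the selective one shifts the mean, and nothing in the retained data alone reveals which scheme operated.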
Endogenous effects • The curse of endogenous effects: everything seems to depend on everything else. • We need multiprocess models to disentangle this complexity, which adds to the computation. • Longitudinal data can provide the opportunity to disentangle endogenous effects from correlated errors.
Take Just One of These Complications (Endogenous effects) • The YCS is a multi-stage stratified random sample of individuals aged 16-17. • These individuals were contacted by post three times at annual intervals, at age 16-17, 17-18 and 18-19 (sweeps 1, 2 and 3, respectively). • I use YCS6, which covers young people eligible to leave school in 1990-91, who are then observed over the 1992-94 period.
Part-time work and truancy are potential determinants of educational attainment • A comprehensive model will allow us to disentangle the observable, direct, effects of truancy on educational attainment from any effects that arise from correlation in the errors (unobserved effects).
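The difference between a direct effect and an effect that arises from correlated errors can be shown in a linear analogue. A minimal sketch, assuming a single explanatory variable `t` (standing in for truancy) and a response `q` (standing in for attainment); this is not the ordered probit model of the slides, and the effect size, error correlation, and seed are made-up values.

```python
import math
import random


def ols_slope(x, y):
    """Least-squares slope of y on x (single covariate, with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(2)
n = 20000
beta_direct = 0.3  # assumed true direct effect of t on q
rho = -0.6         # assumed correlation between the unobserved errors

t, q = [], []
for _ in range(n):
    e_t = rng.gauss(0, 1)
    # e_q has unit variance and correlation rho with e_t.
    e_q = rho * e_t + math.sqrt(1.0 - rho ** 2) * rng.gauss(0, 1)
    t.append(e_t)
    q.append(beta_direct * e_t + e_q)

naive = ols_slope(t, q)
print(f"true direct effect:       {beta_direct:+.3f}")
print(f"naive single-equation fit: {naive:+.3f}")
```

The naive fit absorbs the error correlation and estimates roughly `beta_direct + rho`, here with the opposite sign to the direct effect, so a model that ignores the correlation cannot separate the two.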
Trivariate Ordered Probit Model (Path Diagram) Independent Errors (ep, et, eq)
Comprehensive Model Results • The direct effect of part-time work on attainment changes sign. • The correlation between part-time work and attainment has a different sign to the direct effect of part-time work on attainment; the direct effect has also become significant. • The correlation between truancy and attainment has the same sign as the direct effect of truancy on attainment, and the direct effect is reduced.
Problems and Model Extensions • The model takes up to a month to estimate on a P4: 3 linear predictors, 169 parameters, 8,496 trivariate integrals for each function evaluation. • Results change as our model becomes more comprehensive. • Need to explore other directions for the endogenous effects.
Going Parallel • Farm out the calculations for the integrals to different processors and we get linear improvements in speed; • e.g. if it takes T on one processor it takes T/200 on 200, i.e. 1 month goes to less than 4 hours; • This improvement is present all the way up to a number of processors equal to the sample size, at which 1 month goes to 6 minutes.
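The arithmetic and the farming-out pattern can be sketched together. A minimal sketch, assuming a 30-day month, perfectly linear speed-up with no communication overhead, and one processor per integral at the upper limit (8,496, the number of integrals quoted earlier); `case_contribution` is a hypothetical stand-in for one expensive integral. Threads are used only to show the decomposition: in Python they do not speed up CPU-bound work, for which separate processes or cluster nodes would be needed.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def ideal_hours(serial_hours: float, n_processors: int) -> float:
    """Wall-clock hours under perfectly linear speed-up."""
    return serial_hours / n_processors


def case_contribution(i: int) -> float:
    """Hypothetical stand-in for one per-case integral."""
    return math.log1p(i)


def farmed_total(n_cases: int, n_workers: int) -> float:
    """Split the independent per-case terms across workers and add the partial sums."""
    chunks = [range(w, n_cases, n_workers) for w in range(n_workers)]

    def work(chunk):
        return sum(case_contribution(i) for i in chunk)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(work, chunks))


MONTH_HOURS = 30 * 24  # assumed 30-day month

print(f"200 processors:   {ideal_hours(MONTH_HOURS, 200):.1f} hours")
print(f"8,496 processors: {ideal_hours(MONTH_HOURS, 8496) * 60:.1f} minutes")
print(f"farmed total:     {farmed_total(8496, 8):.3f}")
```

The decomposition works because each integral depends only on the current parameter values and one case's data, so the partial sums can be computed independently and combined at the end of each function evaluation.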
(Existing web based tools) • Allows users to submit R jobs and get output back to their web session; • Rweb needs more menus; the extensive R statistical library is not used.
(Existing web based tools) • Allows 66 major datasets to be explored online; • Only uses one data set at a time; • Has very limited facilities for sub-setting and none for fusing; • Restricted statistical facilities, e.g. descriptive analysis, linear regression; • No facilities for handling missing data.
New Tools: Joining Up in the Analysis Cycle (diagram) • Main ESDS Data Sets • TTWA Data, NOMIS • Select Data Set and Appropriate Variables • Merge Files: Add Variables • Contextual Data • Working Data • Results
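The "Merge Files: Add Variables" step amounts to attaching area-level contextual variables to individual survey records. A minimal sketch, assuming each survey record carries an area code that also keys the contextual data; the field names (`person_id`, `ttwa`, `unemployment_rate`) and all values are hypothetical toy data, not drawn from ESDS or NOMIS.

```python
def add_contextual(records, context, key):
    """Return copies of `records` with matching contextual variables attached.

    records : list of dicts, one per survey respondent
    context : dict mapping an area code to a dict of area-level variables
    key     : name of the field in each record that holds the area code
    Records whose area code has no contextual entry are kept unchanged.
    """
    merged = []
    for record in records:
        extra = context.get(record[key], {})
        merged.append({**record, **extra})
    return merged


# Toy survey extract (hypothetical values).
survey = [
    {"person_id": 1, "ttwa": "A", "attainment": 3},
    {"person_id": 2, "ttwa": "B", "attainment": 1},
    {"person_id": 3, "ttwa": "Z", "attainment": 2},  # no contextual match
]

# Toy area-level data keyed by the same area code (hypothetical values).
area_data = {
    "A": {"unemployment_rate": 6.2},
    "B": {"unemployment_rate": 9.8},
}

working_data = add_contextual(survey, area_data, key="ttwa")
for row in working_data:
    print(row)
```

Keeping unmatched records rather than dropping them avoids introducing a new selection effect at the merge step; the missing contextual values then have to be handled explicitly in the analysis.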
New Tools: Linking Components (diagram) • Data Management A, Data Management B, Data Management C • Middleware • Analysis A, Analysis B, Analysis C
Check Step We need to keep our focus on the priority ordering: Scientific challenge > software > hardware • Software is more important than hardware. • Software lasts longer than hardware. • Software development is lagging behind that of hardware: the 'software gap'.
Problems with using the GRID • Currently requires heroic effort to use it; • GT2 is very complicated and difficult to install; • Can make other University services vulnerable if not properly managed; • User requirements not fully articulated; • Human factors not addressed: needs a familiar GUI, pull-down menus, etc.
We need a Virtual Research Environment “to make the use of e-Science technologies, methodologies and resources easier and more transparent for than simply developing bespoke applications on an infrastructure toolkit (such as GT2). ” JISC interested in funding the VRE
UK E-Social Science Programme There is a growing body of work and projects in this area: • Centre of Excellence in e-Social Science – DTI Core Programme; • Pilot projects – ESRC; • ReDReSS: Resource Discovery for Researchers in e-Social Science – JISC/ ESRC; • Agenda Setting Workshops – JISC/ ESRC; • UK National Grid Service + e-Science Grid - JCSR and DTI Core Programme; • NCeSS: National Centre for e-Social Science – ESRC; • QeSSSS: Quantitative e-Social Science Support Service - ESRC (+ future NCeSS nodes).