Geodemographic modelling collaboration. Alex Voss, Andy Turner. Presentation to Academia Sinica Centre for Survey Research, 2009-04-21.
Overview • Background • MoSeS • Population Reconstruction • Dynamic Simulation • Summary of on-going projects/effort • Future Work • Next Steps • Acknowledgements
Background • Social science has not traditionally used advanced ICTs, but the emergence of new analytical methods is driven by: • Increased availability of data about social phenomena • Issues with data management and integration • Challenges in analysing social phenomena at scale • Challenges in informing practical policy and decision making (e.g., evidence-based policy making) • The National Centre for e-Social Science (NCeSS) in the UK is investigating ways to respond to these challenges • EUAsiaGrid is supporting e-Social Science amongst other application domains
MoSeS • MoSeS develops modelling and simulation approaches for social science • First-phase research node of NCeSS, now continued through the second-round node GENeSIS • Contemporary demographic modelling of the UK based on UK census data and other datasets • Uses agent-based simulation to project the population forward in time by 25 years • Simulates the impact of distinct demographic processes such as mortality, fertility, health status, household formation and migration • E.g., to inform policy making: what impact do policy decisions have?
Population Reconstruction • Generation of individual-level population data for the UK • Based on 2001 census data • Works with ‘public release’ versions of the census, which are restricted: • Census Aggregate Statistics (CAS) at Output Area level • Samples of Anonymised Records (SAR): 1% of the population, anonymised • Reconstructed data has the same attributes as the real population and the same number of individuals, but is still anonymised • Uses a genetic algorithm to select a well-fitting set of sampled anonymised records to assign to an output area • Attributes in the SAR need to be matched with those in the CAS • This is often complicated because the two use different categories • Solved by aggregation to a lowest common categorisation
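The record-selection step on this slide can be illustrated with a minimal sketch. It is not the MoSeS implementation: the function names, the sum-of-absolute-errors fitness, the crossover and mutation scheme, and the choice to sample records with replacement are all assumptions made for illustration. Records are plain dictionaries, `targets` holds the aggregate counts for one output area, and `size` (at least 2) is the number of individuals in that area.

```python
import random
from collections import Counter

def fitness(selection, sample, targets):
    """Sum of absolute differences between selected-record counts and
    the aggregate target counts (0 means a perfect fit)."""
    chosen = [sample[i] for i in selection]
    error = 0
    for attribute, target_counts in targets.items():
        counts = Counter(record[attribute] for record in chosen)
        for category in set(counts) | set(target_counts):
            error += abs(counts.get(category, 0) - target_counts.get(category, 0))
    return error

def reconstruct_area(sample, targets, size, generations=200, pop_size=30, seed=0):
    """Hypothetical genetic algorithm: each candidate is a list of `size`
    indices into `sample` (records may be reused)."""
    rng = random.Random(seed)
    population = [[rng.randrange(len(sample)) for _ in range(size)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda s: fitness(s, sample, targets))
        if fitness(population[0], sample, targets) == 0:
            break  # perfect fit found
        parents = population[: pop_size // 2]  # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, size)           # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(size)] = rng.randrange(len(sample))  # mutation
            children.append(child)
        population = parents + children
    return min(population, key=lambda s: fitness(s, sample, targets))
```

Because the better half of each generation is carried over unchanged, the best fitness never gets worse from one generation to the next. A real run would first map SAR and CAS attributes onto the common categorisation mentioned above before computing the fitness.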
Population Reconstruction (II) • [Diagram; surviving labels: Disclosure Control, Validation*]
Population Reconstruction (III) • If [error term; symbol missing in source] were 0, we would have a dataset that happens to match the raw census • We are looking for a synthetic population with a ‘reasonable’ [error term] • How can we check whether [error term] is ‘reasonable’?
Input Data • Example: Individual-level Samples of Anonymised Records (ISARs) • Downloadable for UK academics from http://www.ccsr.ac.uk/sars/2001/indiv/ • Click-through End-User License • Disclosure control through: • Selection • Aggregation • Permutation • No additional output checking required
Output Variables • Demographics • Employment • Health status • Household composition • Two types of constraints: • Control constraints (have to be met) • Optimisation constraints (used in fitness function)
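The distinction between the two constraint types can be sketched as follows. This is an illustrative assumption, not the MoSeS code: control constraints are treated here as hard requirements that reject a candidate outright, while optimisation constraints contribute to an error score. The function names and the `(attribute, category) -> target count` layout are hypothetical.

```python
import math

def count(records, attribute, category):
    """Number of records whose `attribute` equals `category`."""
    return sum(1 for record in records if record[attribute] == category)

def constrained_fitness(records, control, optimisation):
    """Return math.inf if any control constraint is violated; otherwise
    the summed absolute error over the optimisation constraints.

    Both constraint dicts map (attribute, category) -> target count.
    """
    for (attribute, category), target in control.items():
        if count(records, attribute, category) != target:
            return math.inf  # control constraints have to be met
    return sum(
        abs(count(records, attribute, category) - target)
        for (attribute, category), target in optimisation.items()
    )
```

Returning infinity is only one way to enforce a hard constraint; generating candidates that satisfy the control constraints by construction would avoid wasting evaluations on rejected candidates.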
Running Pop. Reconstruction • Can use geographic segmentation to compute each output area in parallel • Good example of high-throughput computing using a master-slave architecture • Each output area takes about 5 minutes to compute (on average) • The UK contains about 223,060 output areas, so a complete run on a single processor would take about 775 days • This can be reduced by applying, e.g., 128 processors, which brings the ideal runtime down to about 6 days • Different control constraints lead to different runtime behaviour for each output area • Errors are more than likely in such a large compute job, so robust error recovery is needed • Realistically, it takes 2 weeks to compute a complete output
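The runtime figures on this slide follow from simple arithmetic, and the parallel pattern with error recovery can be sketched in a few lines. The sketch below is an assumption-laden illustration: it uses local threads rather than grid jobs, and `run_with_retries`, `run_all` and the retry limit are hypothetical names and values, not part of the project's code.

```python
from concurrent.futures import ThreadPoolExecutor

OUTPUT_AREAS = 223_060      # figure from the slide
MINUTES_PER_AREA = 5        # average from the slide

def runtime_days(processors):
    """Ideal runtime in days, assuming perfect load balance and no failures."""
    return OUTPUT_AREAS * MINUTES_PER_AREA / processors / (60 * 24)

def run_with_retries(task, area_id, max_attempts=3):
    """Run task(area_id), retrying on any exception.
    Returns (area_id, succeeded, value)."""
    for _ in range(max_attempts):
        try:
            return area_id, True, task(area_id)
        except Exception:
            continue  # treat as transient and try again
    return area_id, False, None

def run_all(task, area_ids, workers=4, max_attempts=3):
    """Process every output area independently; collect results and failures."""
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = pool.map(
            lambda area_id: run_with_retries(task, area_id, max_attempts),
            area_ids,
        )
        for area_id, succeeded, value in outcomes:
            if succeeded:
                results[area_id] = value
            else:
                failed.append(area_id)
    return results, failed
```

Here `runtime_days(1)` is roughly 774.5 and `runtime_days(128)` is roughly 6.05, matching the 775 and 6 days quoted. The gap between the ideal 6 days and the realistic 2 weeks is consistent with uneven per-area runtimes and the cost of recovering from failures.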
Application Porting • Currently porting population reconstruction code to EGEE • Investigating Taiwanese (TW) data assets • Exploring other links, e.g., with healthcare research • Setting up an e-Social Science VO
Experiences • Integrating existing code into a grid environment required some changes to the source code: • management of input arguments • code scalability • log management • error handling • Finding the right input size and parameters for testing, to keep execution times low • Making sense of execution failures • Lack of ways to debug code in distributed environments
Experiences II • Step-wise process works well: • ensures we encounter problems piece by piece • allows us to comply with data protection / licensing • Population reconstruction is resource intensive • may run up against limits on wall clock time • Importance of ‘at elbow’ support • but hindered by data protection / licensing issues • Licensing means we need to limit execution to UK resources • No e-Social Science VO for EGEE porting (yet) • Needed to get support from other VOs in the meantime
Dynamic Modelling • Daily activity modelling • Commuting • Retail modelling • Transportation • Population Forecasting • Annual time step • Birth • Death • Migration
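The population forecasting loop with an annual time step can be sketched as a small agent-based simulation. Everything specific here is an assumption for illustration: the flat per-person probabilities, the fixed number of in-migrants, the in-migrant age range, and the names `step_year` and `project` are hypothetical, and a real model would condition rates on age, sex, location and household.

```python
import random

def step_year(population, rates, rng):
    """Advance the population by one year: death, out-migration,
    ageing, birth, then in-migration."""
    next_population = []
    for person in population:
        if rng.random() < rates["death"]:
            continue  # person dies this year
        if rng.random() < rates["out_migration"]:
            continue  # person leaves the study area
        next_population.append({"age": person["age"] + 1})
        if rng.random() < rates["birth"]:
            next_population.append({"age": 0})  # newborn joins the population
    # Fixed number of arrivals per year (illustrative assumption).
    next_population.extend(
        {"age": rng.randint(18, 40)} for _ in range(rates["in_migrants"])
    )
    return next_population

def project(population, rates, years=25, seed=0):
    """Run the annual step `years` times; return the final population
    and the population size recorded at each step."""
    rng = random.Random(seed)
    sizes = [len(population)]
    for _ in range(years):
        population = step_year(population, rates, rng)
        sizes.append(len(population))
    return population, sizes
```

The default of 25 years mirrors the projection horizon mentioned for MoSeS. Fixing the seed makes a run reproducible, which helps when comparing policy scenarios that differ only in their rates.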
Future Work • Next steps until code runs in Taiwan with Taiwanese data • Proof of concept execution on Quanta cluster at ASGC • Definition of data outputs from • Modularising computation so it can exploit multiple NGS nodes or EGEE CEs • Improving data and code staging • Moving from population reconstruction to supporting the simulation process • Integration into ‘science gateway’ for the social sciences and developing a repository for models
Acknowledgements • National Centre for e-Social Science • MoSeS Node: Mark Birkin (PI) • GENeSIS Node: Mike Batty (PI) • NCeSS Hub: Peter Halfpenny and Rob Procter • EUAsiaGrid Consortium • Marco Paganoni (Project Director) • CPC at Westminster University • Gabor Szmetanko • Gabor Terstyanszky • Tamas Kiss • GridPP • Jens Jensen and Jeremy Coles • National Grid Service • Jason Lander and Shiv Kaushal (Leeds), Steven Young (Oxford), Mike Jones (Manchester)