A Metadata Binding Store for Distributed Scientific Data
Yin Chen, Malcolm Atkinson, Stuart Aitken
UK e-Science All Hands Meeting 2009, Oxford, 08 Dec. 2009
MOTIVATION
• Scientific data and metadata are generated at great speed and in high volume
• Metadata are the key to data access, discovery, preservation, provenance, and interpretation
• Data and metadata are often created independently
• We view the relationship between data and metadata as a binding
• Hypothesis: a binding service is useful for serving distributed scientific data at various scales
IS BINDING A PROBLEM?
• EurExpress project, EU-funded under FP6, 2005-2009
• Aims to capture >20,000 genes via RNA in situ hybridization (ISH)
• Generates a digital ‘transcriptome atlas’
• Nov. 2009: 19,411 assays, 15,715 annotations, ~5 TB of data
[Figure: EurExpress data pipeline. Labels: Genepaint robotics; 14.5-day mouse embryo; section slides; automatic ISHs (8 EU bio labs); high-resolution gene expression images; ISH management (LIME system); gene expression data repository; template; metadata; images; annotation (FIATAS); Alicante]
REAL-WORLD OBSERVATIONS
• Information inconsistency
• Significant human operating errors
• Consistency checking became more difficult as the data grew
• The bindings have to be managed efficiently!
[Figures: number of gene expression images without metadata; number of probe genes mismatched with the template design]
DESIGN PRINCIPLES
• A binding system manages bindings
• Federate references to data and metadata
• The data warehousing approach is no longer feasible
  • Data become too large, too dynamic, and too unwieldy to copy
  • Copying is often not permitted
  • Copies are hard to keep fresh
• Generic approach, independent of the data resources
• Can be combined with other services
• Allows bindings to be shared among user communities; scalable
• Design principle: simplicity
  • Minimize internal complexity: no conflicts
  • Maximize external integrity: less overlap
A SIMPLE BINDING STORE
• Binding data model
  • Binding ID: UUID; needs no central registration authority; unlimited
  • Binding subject/object: URIs, used by most web-accessible data resources
  • Binding description: tags; efficient and flexible
• Binding APIs
  • Manipulation operations
  • Discovery operations
  • Delivery operations
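The data model and the three groups of operations listed above can be illustrated with a minimal in-memory sketch. It is not the OGSA-DAI-based implementation described in the talk: the class names (`Binding`, `BindingStore`), the method names, and the `example.org` URIs are all hypothetical and chosen only for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Binding:
    """One binding between a data resource and a metadata resource (hypothetical model)."""
    subject_uri: str                      # URI of the bound subject, e.g. a data file
    object_uri: str                       # URI of the bound object, e.g. a metadata record
    tags: frozenset = frozenset()         # free-form description of the binding
    # A random (version 4) UUID needs no central registration authority.
    binding_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class BindingStore:
    """Toy in-memory store; a real service would sit in front of a database."""

    def __init__(self):
        self._bindings = {}

    # Manipulation operations
    def create(self, subject_uri, object_uri, tags=()):
        binding = Binding(subject_uri, object_uri, frozenset(tags))
        self._bindings[binding.binding_id] = binding
        return binding.binding_id

    def delete(self, binding_id):
        return self._bindings.pop(binding_id, None) is not None

    # Discovery operations
    def find_by_subject(self, subject_uri):
        return [b for b in self._bindings.values() if b.subject_uri == subject_uri]

    def find_by_tag(self, tag):
        return [b for b in self._bindings.values() if tag in b.tags]

    # Delivery operation
    def get(self, binding_id):
        return self._bindings.get(binding_id)


# Usage with made-up URIs: bind an image to its metadata record.
store = BindingStore()
bid = store.create(
    "http://example.org/images/assay-0001.tif",
    "http://example.org/metadata/assay-0001.xml",
    tags={"ISH", "embryo"},
)
```

Because the store holds only references (URIs) and tags, the data and metadata themselves stay at their original locations, which matches the federation principle stated earlier.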
IMPLEMENTATION
• Grid technology: OGSA-DAI
  • OGSA-DAI server activities
  • OGSA-DAI client activities
  • OGSA-DAI client toolkits
• Service proxy APIs: a programmable interface for users
• Command-line UI
  • Not included in the current work
EVALUATION
• Uses workload modelling and simulation
• No real binding data are available
• Observations taken from wwPDB, BADC, EurExpress, NanoCMOS, and Flickr
• Creation patterns, access patterns, and content patterns are observed
• The real-world observations are then simulated
WORKLOAD MODELLING
[Figures, creation workloads: number of annotations per day; new PDB structures per month; number of data files per day]
[Figures, access workloads: number of accesses per day; tag behaviours]
WORKLOAD SIMULATION
[Figure labels: probability of interval occurrence; hidden Markov model; two Poisson processes and two uniform distributions; Poisson process; uniform distribution; trend: Weibull distribution; Zipf's distribution with α=0.2, α=0.9, and α=0.4]
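Two of the building blocks named on this slide, a Poisson process and a Zipf distribution, can be sketched as follows. This is an illustration only, not the authors' simulator (which the setup slide says used SSJ and Colt): the function names, the arrival rate, the number of items, and the idea of applying the Zipf ranks to tags are assumptions. Because the α values on the slide are below 1, the Zipf distribution is defined over a finite number of ranks.

```python
import random


def poisson_arrivals(rate_per_day, days, rng):
    """Arrival times (in days) of a homogeneous Poisson process.

    Inter-arrival times of a Poisson process are exponentially distributed.
    """
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_day)
        if t >= days:
            return times
        times.append(t)


def zipf_weights(n_items, alpha):
    """Unnormalised Zipf weights: the item of rank r gets weight 1 / r**alpha."""
    return [1.0 / rank ** alpha for rank in range(1, n_items + 1)]


def sample_ranks(n_samples, n_items, alpha, rng):
    """Draw item ranks (1 = most popular) from a finite Zipf distribution."""
    weights = zipf_weights(n_items, alpha)
    return rng.choices(range(1, n_items + 1), weights=weights, k=n_samples)


# Assumed parameters: 50 creations per day over one week, 100 distinct tags.
rng = random.Random(42)
arrivals = poisson_arrivals(rate_per_day=50.0, days=7, rng=rng)
tag_ranks = sample_ranks(len(arrivals), n_items=100, alpha=0.9, rng=rng)
```

A larger α concentrates the draws on the top-ranked items, while a small α such as 0.2 spreads them almost uniformly.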
EXPERIMENT SETUP
• Intel(R) Core 2 2.66 GHz, 7 GB RAM, 144 GB HD, 100 Mbps network connection, Red Hat 4.1, Tomcat 5.5, OGSA-DAI (OD) 3.1, MySQL 6.0, R 2.9
• SSJ, Colt, benchmark script
• 10 runs per configuration; means, standard errors (SEs), and 95% confidence intervals (CIs) collected
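The summary statistics for 10 runs can be computed as in the sketch below. The slide does not say how the confidence intervals were derived, so the use of a Student-t interval is an assumption, as are the function name and the sample values, which are placeholders rather than measurements from the experiments.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 runs).
T_CRIT_95_DF9 = 2.262


def summarise(samples, t_crit=T_CRIT_95_DF9):
    """Return (mean, standard error, 95% confidence interval) for one configuration."""
    n = len(samples)
    mean = statistics.mean(samples)
    se = statistics.stdev(samples) / math.sqrt(n)  # sample std. dev. / sqrt(n)
    half_width = t_crit * se
    return mean, se, (mean - half_width, mean + half_width)


# Placeholder values standing in for 10 measured runs (not real results).
runs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
mean, se, (ci_low, ci_high) = summarise(runs)
```

For a different number of runs the critical value has to be replaced with the one for n - 1 degrees of freedom.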
EXPERIMENT RESULTS
• Robust to different types of workloads
• Robust across small- to large-scale workloads
• Robust to both independent and combined workloads
• Stressed by ultra-scale workloads
FUTURE WORK: A SCALABLE BINDING STORE
• Cloud computing promises to be scalable
• Our evaluation of Hadoop
BINDING APPLICATIONS
• The Web is moving towards Web 3.0
• Binding index
• Combination with metadata management tools
• Mashup applications
ACKNOWLEDGEMENT
• National e-Science Centre: research group, support team, middleware team
• MRC HGU Biomedical Statistical Analysis Section: Prof Richard Baldock, Dr Duncan Davidson
• Newcastle HDBR: Prof Susan Lindsay, Steven N. Lisgo
• EDINA Geo Research & Data Library: Chris Higgins, Dr David Medyckyj-Scott
• Data resources: DGEMap, EurExpress (Prof Richard Baldock, Lalit Kumar); NanoCMOS (Dr Clive Davenhall, Prof Richard Sinnott)
• Technical support: OGSA-DAI team
• Research materials: COBrA-CT, OntoGrid (Prof Carole Goble, Dr Oscar Corcho); MyGrid (Dr Phillip Lord)