The Grid Observatory: goals and challenges
C. Germain-Renaud (CNRS/LRI & LAL)
EGEE’07 Conference, Budapest, Hungary, 1-5 October 2007
Overview
• NA4 cluster in EGEE-III proposal
• Integrate the collection of data on the behaviour of the EGEE grid and users with the development of models and of an ontology for the domain knowledge
Application Track - Grid Observatory
Some immediate questions
• Resource allocation
  • Performance of the gLite scheduling hierarchy
  • Published waiting time
  • Reactive grids – Everybody's grid
• Dimensioning
  • Patterns and trends in requests and usage
  • Anticipate peaks
• On-line fault management
  • Detection
  • Diagnosis
  • Prevention
The big picture
• Considering current technologies, we expect that the total number of device administrators will exceed 220 million by 2010 – Gartner, June 2001
• No more Moore’s Law free lunch: much more complex software & applications
• The Virtual Organization concept creates common goods
Autonomic Computing
Computing systems that manage themselves in accordance with high-level objectives from humans. Kephart & Chess, "The Vision of Autonomic Computing", IEEE Computer, 2003
• Self-*: configuration, optimization, healing, protection
• Of open, non-steady-state dynamic systems
• Both academia and industry are involved
Autonomic Grids
[Figure: control loop with the labels monitor, analyze, plan, execute, knowledge]
• Statistical analysis
• Data mining
• Machine learning
DATA REQUIRED
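The loop named on this slide (monitor, analyze, plan, execute, with shared knowledge) can be illustrated with a toy sketch. Everything concrete below is an assumption made for illustration only: the queue-length metric, the `max_mean_queue` threshold and the `add_worker` action do not come from the slides or from gLite.

```python
from statistics import mean


def monitor(queue_lengths):
    """Collect the latest observations (here: a list of queue lengths)."""
    return list(queue_lengths)


def analyze(observations, knowledge):
    """Flag an overload when the mean queue length exceeds the known limit."""
    return mean(observations) > knowledge["max_mean_queue"]


def plan(overloaded):
    """Choose an action name; both names are hypothetical."""
    return "add_worker" if overloaded else "no_op"


def execute(action, knowledge):
    """Apply the action and record its effect in the shared knowledge."""
    if action == "add_worker":
        knowledge["workers"] += 1
    return knowledge


def control_step(queue_lengths, knowledge):
    """Run one monitor -> analyze -> plan -> execute pass."""
    observations = monitor(queue_lengths)
    action = plan(analyze(observations, knowledge))
    return action, execute(action, knowledge)
```

The "DATA REQUIRED" point shows up in the `analyze` step: without recorded observations and a model in `knowledge`, there is nothing to decide on.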
Data Collection and Publication
• Acquisition, consolidation, long-term conservation of traces of EGEE activities
• Permanent storage of reliable, exhaustive, filtered information
  • Exhaustive: added value in snapshots of the inputs and grid state, e.g. workload and available services during a relevant time range
  • Filtered: from operational to structured
[Figure: L&B schema, annotated "No join!"]
Data Collection and Publication
• Acquisition, consolidation, long-term conservation of traces of EGEE activities
• Permanent storage of reliable, exhaustive, filtered information: from operational to structured
• No monitoring development: rich ecosystem of sources, with very different scopes, deployment and institutional status
• Centralized
Sources: CIC tools (GOCDB, SAM, SFT,…), core gLite (L&B, BDII,…), sites (Maui/PBS logs), gLite integrators (R-GMA, Job Provenance), experiment integrators (Dashboard), external software (MonALISA)
Data Collection and Publication
• Acquisition, consolidation, long-term conservation of traces of EGEE activities
• Permanent storage of reliable, exhaustive, filtered information: from operational to structured
• No monitoring development: rich ecosystem of sources, with very different scopes, deployment and institutional status
• The major challenge is exhaustiveness
  • Some data are outside the scope: external traffic on shared resources
  • Inside the scope, we need snapshots of the grid state and inputs
• Privacy-related legal constraints
  • Scientific usage will help
  • Interaction with EGI
  • Long-term: privacy-preserving data mining
Data Collection and Publication
• Publication service: navigation and querying
  • Integration of independent sources
  • Indexing according to the needs of the user communities
  • Scheduling: ongoing work with CoreGrid
  • Jobs: ongoing work with KD-Ubiq
• Ontology
  • The Glue Information Model: an ontology of the resources
  • Concepts for the grid dynamics, e.g. job lifecycle or user relations
  • Expert concepts as prior knowledge of non-trivial correlations: workflows, failure modes,…
[Figure labels: Job, Resource]
Models
• Intrinsic characterizations of «grid traffic»: (distribution of) e.g. job arrival rate, running time, application data locality
  • Likely to be similar to IP traffic: many short jobs and a significant number of long ones, at all scales
  • Long-range dependencies
• Characterizations of middleware-dependent metrics, e.g. queuing delays, overhead, SE load
• Inference of models for middleware components and applications, users and usage profiles, user interactions
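One way to quantify "many short, and a significant number of long" is to estimate the tail index of the running-time distribution. The sketch below uses the Hill estimator, which is a choice made here for illustration and is not named in the slides. The data are synthetic Pareto samples, not EGEE traces, and the function name and the choice of `k` are assumptions.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index alpha from the k largest samples.

    Assumes strictly positive samples. A small alpha (roughly below 2)
    indicates a heavy tail: rare but very long jobs dominate the variance.
    """
    if not 0 < k < len(samples):
        raise ValueError("k must satisfy 0 < k < len(samples)")
    ordered = sorted(samples, reverse=True)
    reference = ordered[k]  # the (k+1)-th largest value
    log_excess = sum(math.log(x / reference) for x in ordered[:k])
    return k / log_excess


# Synthetic stand-in for job running times: Pareto with true alpha = 1.5.
rng = random.Random(0)
running_times = [rng.paretovariate(1.5) for _ in range(20000)]
alpha_hat = hill_tail_index(running_times, k=2000)
```

On real traces the estimate depends on `k`, so it is usually inspected over a range of `k` values rather than read off a single one.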
Autonomic dependability
• On-line failure detection and anticipation
• Passive vs active probing: a lot of information is available from user work
  • Black-box
  • On-line statistics from «similar» actions (executions, data access, middleware modules)
Evaluation
• Assessing performance at the grid scale is a challenge
• Need a snapshot of the inputs and grid state, e.g. workload and available services during a relevant time range
• Classical optimization does not scale
• Advanced optimization: anytime algorithms
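An anytime algorithm always holds a valid answer and only improves it as it is given more time. The sketch below illustrates that property on a toy problem, assigning jobs to sites so as to minimize the largest site load. The problem, the random local search and all names are assumptions made for illustration and do not describe gLite scheduling. An iteration budget stands in for a wall-clock deadline so that the example is reproducible.

```python
import itertools
import random


def makespan(assignment, durations, n_sites):
    """Largest total load over all sites for a job -> site assignment."""
    load = [0.0] * n_sites
    for job, site in enumerate(assignment):
        load[site] += durations[job]
    return max(load)


def anytime_schedule(durations, n_sites, seed=0):
    """Yield (cost, assignment) forever; the cost never increases.

    The caller may stop consuming at any point and keep the last pair.
    """
    if not durations or n_sites < 1:
        raise ValueError("need at least one job and one site")
    rng = random.Random(seed)
    best = [job % n_sites for job in range(len(durations))]  # round-robin start
    best_cost = makespan(best, durations, n_sites)
    while True:
        yield best_cost, list(best)
        # Move one randomly chosen job to a randomly chosen site.
        candidate = list(best)
        candidate[rng.randrange(len(candidate))] = rng.randrange(n_sites)
        cost = makespan(candidate, durations, n_sites)
        if cost < best_cost:
            best, best_cost = candidate, cost


def best_within(durations, n_sites, budget):
    """Return the best (cost, assignment) after `budget` >= 1 steps."""
    *_, last = itertools.islice(anytime_schedule(durations, n_sites), budget)
    return last
```

A call such as `best_within([5, 3, 8, 2, 7, 4, 6, 1], 3, 500)` returns whatever has been found after 500 steps. A smaller budget gives a result that is still valid and never better.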
Abrupt changepoint detection
• Page-Hinkley statistics
  • Time-sequential version of Wald’s statistics, also known as CUSUM
  • «Intelligent threshold» test which minimizes the expected time before a change detection for a fixed false positive rate
  • Routine in quality control, clinical trials
[Figure labels: VO software bug, Blackhole]
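A minimal sketch of the Page-Hinkley test, in a common formulation that detects an upward shift in the mean of a stream: accumulate the deviations from the running mean, track the minimum of that sum, and raise an alarm when the gap exceeds a threshold. The parameter values (`delta`, `threshold`) and the synthetic series are assumptions. The slides do not say which variant or settings were applied to the EGEE data.

```python
from dataclasses import dataclass


@dataclass
class PageHinkley:
    delta: float = 0.05      # tolerated drift magnitude (assumed value)
    threshold: float = 5.0   # alarm threshold, often written lambda (assumed)
    n: int = 0               # number of observations seen
    mean: float = 0.0        # running mean of the observations
    cum: float = 0.0         # cumulative deviation m_t
    cum_min: float = 0.0     # running minimum M_t of the cumulative deviation

    def update(self, x: float) -> bool:
        """Consume one observation; return True when a change is flagged."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def first_alarm(values, **params):
    """Index of the first observation that triggers an alarm, or None."""
    detector = PageHinkley(**params)
    for index, x in enumerate(values):
        if detector.update(x):
            return index
    return None


# Synthetic example: an error indicator that jumps from 0 to 1 at index 50.
series = [0.0] * 50 + [1.0] * 50
alarm = first_alarm(series)
```

Raising `threshold` lowers the false positive rate at the cost of a longer detection delay, which is the trade-off the slide refers to.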
Autonomic dependability
• On-line failure detection and anticipation
• Passive vs active probing: a lot of information is available from user work
  • Black-box
  • On-line statistics from «similar» actions (executions, data access, middleware modules)
  • Supervised and unsupervised learning
Mining the L&B logs
• Constructive induction
• Double clustering
Autonomic dependability
• On-line failure detection and anticipation
• Passive vs active probing: a lot of information is available from user work
  • Black-box
  • On-line statistics from «similar» actions (executions, data access, middleware modules)
  • Supervised and unsupervised learning
• Active probing
  • Adaptive on-line test selection for best coverage of possibly faulty components
  • Experiment planning
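Test selection for coverage can be sketched as a greedy set-cover step: repeatedly pick the probe that exercises the most suspect components not yet covered. This is one plausible reading of the bullet, not the method used in the project. The probe names and component names are hypothetical, and the adaptive part (updating the suspect set after each probe result) is left out.

```python
def select_probes(probes, suspects, budget):
    """Greedily choose up to `budget` probes covering the suspect components.

    probes   : dict mapping a probe name to the set of components it exercises
    suspects : iterable of components that may be faulty
    Returns (chosen probe names in order, suspects left uncovered).
    """
    uncovered = set(suspects)
    chosen = []
    while uncovered and len(chosen) < budget:
        # Sorting the names makes tie-breaking deterministic.
        name = max(sorted(probes), key=lambda p: len(probes[p] & uncovered))
        gain = probes[name] & uncovered
        if not gain:
            break  # no remaining probe reaches an uncovered suspect
        chosen.append(name)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical probes and components, for illustration only.
probes = {
    "ping_ce": {"CE1", "CE2"},
    "copy_se": {"SE1"},
    "full_job": {"CE1", "SE1", "WMS"},
    "bdii_query": {"BDII"},
}
chosen, uncovered = select_probes(probes, {"CE1", "CE2", "SE1", "WMS"}, budget=2)
```

Greedy selection is not guaranteed to be optimal, but it is cheap enough to rerun whenever the suspect set changes.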
Goals & Challenges
• Contributions to a quantitative approach to grid middleware and architecture, in the RISC sense
• Operational impacts on EGEE: evaluation, autonomic dependability
• Basic research in autonomic computing
• Collaboration between EGEE and national research initiatives and other EU projects: DEMAIN, PASCAL, KD-Ubiq, CoreGrid, and hopefully more
• Adequate tradeoff between productivity and sustainability