The Way Things Go • e-Science is a complex activity • Scientific knowledge is comprehensible only in the context of those activities • Adopt the Rube Goldberg view Rube Goldberg
Grand challenge: systems-scale science • Observation and modeling of multiple systems at multiple scales • Linking data and tools from different disciplines • to get a valid global result! “... modeling complex systems will be a major research challenge for the 21st century” - National Science Foundation
Building up current practices isn't working • Heterogeneous tools, data formats • Little global coordination of research • Little funding for sustained stewardship of tools and data M.C. Escher, “Tower of Babel” (1928)
Proposed solutions aren't working • e-Journals – not machine-interpretable • Collaboration tools • scientists just use email like everyone else • Portals and digital libraries – typically: • centralized • domain-specific • The Grid – can orchestrate complex processing jobs, but that's not science
Only networks work at scale • Desktop: single researcher • Ad hoc data management, single-user apps • Workgroup: community • Community tools, resources, control • Network: global • No global practice, tools, control
How do we get there? • e-Science means managing • Process, and • Data • Current approaches favor one or the other • Information is getting lost (diagram: cycle of model, refine, predict, observe; critical interface with data)
Trends: process and data (diagram: process axis runs Batch, Interactive, Workflow; data axis runs Data, Metadata, Semantics; items plotted: mainframes, e-notebooks, desktop apps, digital libraries, rules, formats, ontologies, provenance, the grid, portals)
Key technologies • Semantic web: data/metadata • Provides means of merging descriptive information even if it only partially agrees (e.g., comes from two different communities) • Workflow: process • Describes complex procedures independently of how they are executed • Provenance: process + data/metadata • Links workflow, data, and any ancillary descriptive information (e.g., attribution)
Semantics: data to knowledge • Knowledge (abstract): ontologies, rules, models, etc. (a.k.a. semantics) • reached from information by learning, inference • Information: collections, tags, attributes, etc. (a.k.a. metadata) • reached from data by aggregation, annotation • Data (concrete): streams, arrays, swaths, etc. (a.k.a. files) (cf. Reagan Moore)
Semantic web: RDF triple (subject, predicate, object) • Declarative: asserts a fact • Subject and object URIs identify arbitrary entities (things, people, concepts, events) • Predicate identifies the relationship between them
Triples form an open network • Subject nodes aren't “owned” by any single agent or container • Any actor can add arcs to the implicit, total, world graph • Any two graphs can be joined (diagram: example graph with a “hasBreed” arc)
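The open-network idea can be sketched in a few lines of Python. This is a minimal illustration, not an RDF library: triples are plain tuples, a graph is a set of them, and every URI under `http://example.org/` is invented for the example (the `hasBreed` predicate echoes the arc label on the slide). Two communities describe the same subject, agree on one fact, and each adds facts the other lacks; joining the graphs is just set union.

```python
# Triples are plain (subject, predicate, object) tuples; a graph is a set of them.
# All URIs below are made-up examples, not real vocabularies.
EX = "http://example.org/"

kennel_club = {
    (EX + "rex", EX + "hasBreed", EX + "Beagle"),
    (EX + "rex", EX + "hasOwner", EX + "alice"),
}
vet_clinic = {
    (EX + "rex", EX + "hasOwner", EX + "alice"),      # agrees with the kennel club
    (EX + "rex", EX + "vaccinatedOn", "2008-03-01"),  # known only to the clinic
}

def merge(*graphs):
    """Join graphs by set union; shared URIs make the nodes line up."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged

def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(t for t in graph
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

world = merge(kennel_club, vet_clinic)
about_rex = match(world, s=EX + "rex")
```

Because the shared triple is stored once, the merged graph holds three facts about the subject, and neither source had to know about the other in advance.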
Non satis scire (to know is not enough) • Semantic web “layer cake” • Where do we manage process? • User interface? • Applications? • “Semantic Grid” (D. DeRoure, C. Goble) (source: World Wide Web Consortium)
Workflow: process description • Describe complex operations as networks of simpler operations • Abstract operation execution from description • Can be shared (but may not be portable) (Taverna) (Kepler)
Anatomy of a workflow • Declarative: says what to do • Modules identify arbitrary procedures • Arcs identify flow of control and/or data (data flow is usually implicit) (diagram labels: execution model, usually implicit; “module”; control flow)
Workflow systems • Modules representing units of computation • Language for specifying WF • modules • control flow • Engine for executing WF D2K (source: NCSA)
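The three parts of a workflow system (modules, a description language, an engine) can be sketched as follows. This is a toy under stated assumptions, not how Taverna, Kepler, or D2K work: the module names and the dictionary-based description format are invented, and the engine simply runs modules in dependency order using the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Modules: named units of computation (hypothetical examples).
modules = {
    "load":   lambda: [3, 1, 2],
    "sort":   lambda xs: sorted(xs),
    "total":  lambda xs: sum(xs),
    "report": lambda xs, t: f"{xs} sums to {t}",
}

# Description: each module lists the modules whose outputs it consumes.
# This is plain data, separate from how or where the modules run.
workflow = {
    "load": [],
    "sort": ["load"],
    "total": ["load"],
    "report": ["sort", "total"],
}

def run(workflow, modules):
    """A toy engine: run modules in dependency order, passing outputs along."""
    results = {}
    for name in TopologicalSorter(workflow).static_order():
        inputs = [results[dep] for dep in workflow[name]]
        results[name] = modules[name](*inputs)
    return results

results = run(workflow, modules)
```

Keeping `workflow` as data is what lets the same description be handed to a different engine, for example one that runs independent modules in parallel.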
Work vs. workflow systems • Scientists are not WF modules • Science work also involves • social organization incl. funding • field and “wet lab” manual work • discourse: review, validation (source: CNRS/UCSD)
Provenance: what happened • Answers critical questions • What led to this result? • When and how were observations made, conclusions reached? • Is a causal network of events
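Treating provenance as a causal network makes "what led to this result?" a graph traversal. The sketch below assumes a very simple representation (a dictionary from each node to its direct causes) and invented node names; real provenance stores are richer, but the backward walk is the same idea.

```python
# Each entry maps an event or artifact to the things that directly caused it.
# Names are invented for illustration.
caused_by = {
    "figure_3":         ["run_model"],
    "run_model":        ["calibrated_data", "model_v2"],
    "calibrated_data":  ["calibrate"],
    "calibrate":        ["raw_observations"],
    "raw_observations": [],
    "model_v2":         [],
}

def what_led_to(node, caused_by):
    """Collect every upstream cause of `node` by walking the network backwards."""
    seen = set()
    stack = list(caused_by.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(caused_by.get(current, []))
    return seen

history = what_led_to("figure_3", caused_by)
```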
Complementary incomplete notions of provenance • Process-centric (e.g., workflow) • computational events (e.g., service invocations) • control flow • artifacts are either not mentioned or opaque (tool-specific) • Artifact-centric (e.g., digital libraries) • “lineage” = events in lifecycle of artifact, e.g., custody • IRs focus on curation events (not antecedent processes)
Provenance Challenges 1 & 2 • IPAW 2006, HPDC 2007 • 20 teams, 1 workflow, 9 queries • major players • Interoperability? • lots of manual work required • call for standards (source: gridprovenance.org)
Artifact + process provenance = “open provenance” • Can describe any process, not just WF execution (e.g., science!) • Allows alternate accounts by different observers • Rules for inferring transitive causal relationships (source: Luc Moreau et al)
Open Provenance Model (source: Luc Moreau et al) • 3 node types – artifact, process, agent • 5 arc types – used, generated, triggered, derived, controlled – and inference rules • Generic – extensibility via annotation • Choice of granularity and focus (e.g., artifact or process-centric)
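A small sketch can make the node types, arc types, and inference rules concrete. It uses the slide's short arc names, an invented `Arc` class, and invented file and process names; it is not the OPM toolkit's API. The two rules shown (a process that used an artifact was triggered by the process that generated it, and derivation chains compose transitively) are written in the spirit of OPM's inference rules, so consult the OPM specification for the exact definitions.

```python
from dataclasses import dataclass

NODE_TYPES = {"artifact", "process", "agent"}
ARC_TYPES = {"used", "generated", "triggered", "derived", "controlled"}

@dataclass(frozen=True)
class Arc:
    kind: str    # one of ARC_TYPES
    effect: str  # the later node (where the arc starts)
    cause: str   # the earlier node (where the arc points)

# Hypothetical example: clean a file, then plot it.
nodes = {
    "raw.csv": "artifact", "clean.csv": "artifact", "plot.png": "artifact",
    "clean": "process", "plot": "process", "alice": "agent",
}
arcs = {
    Arc("used", "clean", "raw.csv"),
    Arc("generated", "clean.csv", "clean"),
    Arc("derived", "clean.csv", "raw.csv"),
    Arc("used", "plot", "clean.csv"),
    Arc("generated", "plot.png", "plot"),
    Arc("derived", "plot.png", "clean.csv"),
    Arc("controlled", "clean", "alice"),
}
assert set(nodes.values()) <= NODE_TYPES
assert all(a.kind in ARC_TYPES for a in arcs)

def infer_triggered(arcs):
    """If P2 used A and A was generated by P1, infer that P2 was triggered by P1."""
    made_by = {a.effect: a.cause for a in arcs if a.kind == "generated"}
    return {Arc("triggered", a.effect, made_by[a.cause])
            for a in arcs if a.kind == "used" and a.cause in made_by}

def derived_closure(arcs):
    """Multi-step derivation: transitive closure over 'derived' arcs."""
    pairs = {(a.effect, a.cause) for a in arcs if a.kind == "derived"}
    changed = True
    while changed:
        new = {(x, z) for (x, y) in pairs for (y2, z) in pairs if y == y2}
        changed = not new <= pairs
        pairs |= new
    return pairs

triggered = infer_triggered(arcs)
lineage = derived_closure(arcs)
```

Here the inferred arcs say that `plot` was triggered by `clean`, and that `plot.png` derives (in two steps) from `raw.csv`, even though neither fact was asserted directly.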
NCSA Provenance Infrastructure • Visualization, interaction: desktop, portal, etc. • Tracking, modeling, presentation: OPM toolkit, Open Provenance Model • Abstraction, inference, storage: Tupelo Semantic Content Repository (contexts layered over stores)
Tupelo: semantic content • Abstracts content from storage impls (e.g., Sesame, Mulgara) • Provides location-independent addressing of content and metadata • Supports transparent mirroring, caching, failover, etc. (tupeloproject.org)
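The idea of abstracting content from storage can be illustrated with a toy "context" that fronts several interchangeable stores. This is only a sketch of the pattern: `DictStore` and `MirroredContext` are hypothetical names, not Tupelo classes, and the in-memory store stands in for a real backend such as Sesame or Mulgara.

```python
class DictStore:
    """Stand-in for a real backend; holds content in memory."""
    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.data = {}

    def put(self, uri, content):
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        self.data[uri] = content

    def get(self, uri):
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        return self.data[uri]

class MirroredContext:
    """Writes to every reachable store; reads from the first one that answers."""
    def __init__(self, stores):
        self.stores = stores

    def put(self, uri, content):
        written = 0
        for store in self.stores:
            try:
                store.put(uri, content)
                written += 1
            except ConnectionError:
                continue
        if written == 0:
            raise ConnectionError("no store accepted the write")

    def get(self, uri):
        for store in self.stores:
            try:
                return store.get(uri)
            except (ConnectionError, KeyError):
                continue
        raise KeyError(uri)

primary, backup = DictStore("primary"), DictStore("backup")
context = MirroredContext([primary, backup])
context.put("urn:example:dataset-1", b"observations")
primary.available = False  # simulate an outage of the first store
recovered = context.get("urn:example:dataset-1")
```

Callers address content by URI only, so the outage is invisible to them: the read falls through to the mirror.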
CyberIntegrator: workflow by example • Records what users do as provenance • source, intermediate, and final artifacts • steps and parameters • Can re-enact interaction as a workflow
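"Workflow by example" can be sketched as a recorder that logs each interactive step and later replays the log. This is an illustration of the idea only: `Recorder`, `scale`, and `clip` are hypothetical names, not CyberIntegrator's API, and the sketch assumes each step takes one input value plus keyword parameters.

```python
class Recorder:
    """Logs each interactive step so the session can be replayed later."""
    def __init__(self):
        self.steps = []  # (function, parameters) in the order performed
        self.trace = []  # provenance: step name, parameters, input, output

    def do(self, func, value, **params):
        result = func(value, **params)
        self.steps.append((func, params))
        self.trace.append({"step": func.__name__, "params": params,
                           "input": value, "output": result})
        return result

    def as_workflow(self):
        """Turn the recorded steps into a function that re-enacts them."""
        steps = list(self.steps)
        def workflow(value):
            for func, params in steps:
                value = func(value, **params)
            return value
        return workflow

def scale(xs, factor=1.0):
    return [x * factor for x in xs]

def clip(xs, upper=10.0):
    return [min(x, upper) for x in xs]

# The user works interactively; the recorder captures what they did.
session = Recorder()
step1 = session.do(scale, [1.0, 4.0, 8.0], factor=2.0)
step2 = session.do(clip, step1, upper=10.0)

# The same steps, re-enacted on new data.
replay = session.as_workflow()
again = replay([3.0, 6.0])
```

The `trace` list holds the source, intermediate, and final artifacts along with the steps and parameters, which is the provenance from which the workflow is reconstructed.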
MAEviz: analysis/viz app, workflow “behind the scenes” • GIS app. platform • Earthquake hazard analysis plug-in • Data catalog • built environment • fragility/hazard models • Driven by workflow -> provenance
CyberCollaboratory: collaboration + provenance • User interaction with tools generates events • Events are captured using the OPM and published to Tupelo • Non-portal apps can browse / use provenance
Summary • “The way things go” is critical to e-Science at scale • Provenance is an open causal network • New infrastructure supports provenance
Resources / acknowledgements • Grid Provenance Challenge • http://twiki.gridprovenance.org/ • NCSA technologies • Tupelo: http://tupeloproject.org/ • CyberIntegrator: http://isda.ncsa.uiuc.edu/ • MAEviz: http://maeviz.cee.uiuc.edu/ • CyberCollaboratory: http://ecid.ncsa.uiuc.edu/cybercollab/ • Acknowledgements: • Jim Myers, Luc Moreau, Juliana Freire, Patrick Paulson, Simon Miles, Bob McGrath, and more ...