Provenance (for Earth science data). DKRZ-Seminar, Oct 15 2012
Agenda (DKRZ seminar: Provenance) • What is provenance? • Why do we care? • Alternative provenance definitions • Gathering and representing provenance information • Further resources
What is provenance? And what does it mean in our context?
Provenance: Definition • For data produced by computer systems: • "The provenance of a piece of data is the process that led to that piece of data." (Moreau 2010) • This is a generic base definition. • Other terms: data lineage, history • (provenance also applies to food ingredients, works of art, ...)
Our context… • What is our context? • digital-born ESM output data • observational data, e.g. remote sensing imagery • various processed derivatives • What are its characteristics? • complex, non-standardized toolchain • various processing steps by various actors • no single infrastructure
Use Cases (1) • Quality of scientific data • The processing history of a data object forms an important part of its scientific context. • Users who did not create a data product must be able to understand what went into its creation and what that implies. • Data may be reused many years after creation.
Use Cases (2) • Reproducibility • If processing steps are recorded in detail, a future user may reproduce them to get exactly the same results • This may be impossible for ESM output data in all its depth • We can't archive the supercomputer itself • Yet, try to capture as much as possible
Use Cases (3) • Attribution • Give credit to the original data producer • Citing a DataCite DOI may not be enough • Who is using data that is generated with DKRZ's resources? • Provenance can enable anyone to trace back to the original source and producer
Data-intensive science (The Fourth Paradigm, 2009) • These scenarios grow more important with data-intensive science • Data is shared across scientific communities • Focus shifts from data production to data analysis
Provenance and the data life cycle • Provenance may cover the whole data life cycle • Here: focus on the earlier parts • data generation • data processing
Alternative provenance definitions: There's more than one!
The task • The task here: develop an understanding of provenance that is specific and pragmatic in our context.
Provenance definitions (Moreau 2010) • Database context • Why-Provenance • Where-Provenance • Provenance as a process • Provenance as a Directed Acyclic Graph • (there are more…)
Why-Provenance (Moreau 2010, Buneman et al. 2001) • Context of database queries • Why-Provenance: "tuples whose presence justifies a query result" • "Why is X part of the result?" • "Because the queried input data contains tuple A"
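A minimal sketch of the why-provenance idea, assuming a toy in-memory table and a hypothetical query; the table contents and the function name are made up for illustration and do not come from the talk or from Buneman et al.

```python
# Toy illustration of why-provenance: for every result value, keep the
# input tuples whose presence justifies it. All data here is made up.
from typing import NamedTuple


class Dataset(NamedTuple):
    name: str
    variable: str
    year: int


TABLE = [
    Dataset("A", "tas", 2001),
    Dataset("B", "pr", 2001),
    Dataset("C", "tas", 2005),
]


def variables_after(table, year):
    """Return {variable: witnesses} for all variables with data after `year`.

    Each witness is an input tuple that, on its own, is enough to put the
    variable into the query result.
    """
    witnesses = {}
    for row in table:
        if row.year > year:
            witnesses.setdefault(row.variable, []).append(row)
    return witnesses


result = variables_after(TABLE, 2000)
# "Why is 'tas' part of the result?" Because the input contains tuple A
# (and, independently, tuple C).
for variable, why in result.items():
    print(variable, "<-", [row.name for row in why])
```

The query result is the set of keys; the attached lists answer the "why" question without re-running the query.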
Where-Provenance (Moreau 2010, Buneman et al. 2001) • Example: a website displays a typo in a menu entry • Which database field does this string come from? • This may not be the database directly connected to the website, but e.g. a citation database maintained elsewhere and queried by the site • Helps to illuminate the copying of information across databases.
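The website example can be sketched as follows, assuming that every copied value carries a pointer to the field it was read from. The class, the database names and the field layout are hypothetical.

```python
# Toy illustration of where-provenance: a copied value remembers the
# database field it originally came from. All names are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Located:
    value: str
    source: str  # "database/table/row/field" of the original field


# A citation database maintained elsewhere; the typo lives in this field.
citation_db = {("papers", 7, "title"): "Provenence on the Web"}


def fetch(db_name, db, table, row, field):
    """Read one field and record where it was read from."""
    value = db[(table, row, field)]
    return Located(value, f"{db_name}/{table}/{row}/{field}")


# The website copies the value into its own store and renders it later.
site_db = {"menu_entry_3": fetch("citation_db", citation_db, "papers", 7, "title")}

entry = site_db["menu_entry_3"]
print(f"{entry.value!r} comes from {entry.source}")
```

Because the source pointer travels with the copy, the typo can be fixed in the citation database rather than in the website's local copy.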
Provenance as a process (1) (Moreau 2010) • The computation that resulted in the data • Any data, event or user action that can be connected to the data through a computational process potentially belongs to its provenance
Provenance as a process (2) (Moreau 2010) • ESM execution: the context can get vast. • model source code • all parameters, model conditions, forcings • user name, libraries, OS version, parallel architecture • ...
Provenance as a DAG • What is a Directed Acyclic Graph? • ... What is a graph?
What is not a graph? • These are not (mathematical) graphs.
What is a graph? (Wikipedia) • This is a graph. • Graphs are everywhere.
What is a graph? • Graph theory: a graph consists of a set of nodes (vertices) and a set of edges • G = (V, E) • any e ∈ E is an unordered pair {v1, v2}; v1, v2 ∈ V
What is a directed graph? • Directed graph • directed edges • the pair (v1, v2) is ordered; the edge is directed from v1 to v2
What is a Directed Acyclic Graph? • Directed Acyclic Graph (DAG) • directed edges • no cycles allowed!
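The three definitions can be made concrete with a short sketch. It represents G = (V, E) as a set of vertices and a set of ordered pairs and checks the "no cycles" condition with Kahn's algorithm; the choice of algorithm and the function name are assumptions for illustration, not part of the talk.

```python
# G = (V, E) with directed edges as ordered pairs (v1, v2).
from collections import deque


def is_acyclic(vertices, edges):
    """Kahn's algorithm: repeatedly remove vertices without incoming edges.

    The directed graph is acyclic exactly if every vertex can be removed.
    """
    indegree = {v: 0 for v in vertices}
    successors = {v: [] for v in vertices}
    for v1, v2 in edges:  # edge directed from v1 to v2
        successors[v1].append(v2)
        indegree[v2] += 1

    queue = deque(v for v in vertices if indegree[v] == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return removed == len(vertices)


V = {"a", "b", "c"}
dag_edges = {("a", "b"), ("b", "c"), ("a", "c")}
cyclic_edges = dag_edges | {("c", "a")}  # closes the cycle a -> b -> c -> a

print(is_acyclic(V, dag_edges))     # True
print(is_acyclic(V, cyclic_edges))  # False
```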
Provenance as a Directed Acyclic Graph (Moreau 2010) • Simplified, data-centric view • Nodes represent data items • Edges represent derivation operations • "predecessor", "successor", "derived-from", ... • uni- or bidirectional • Level of detail depends on the use cases
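A minimal sketch of this data-centric view, assuming "derived-from" edges stored as a plain dictionary. The file names are hypothetical examples.

```python
# Data-centric provenance DAG: each data item maps to the items it was
# directly derived from. File names are hypothetical.
derived_from = {
    "anomaly.nc": ["monthly_mean.nc", "climatology.nc"],
    "monthly_mean.nc": ["daily_output.nc"],
    "daily_output.nc": [],
    "climatology.nc": [],
}


def ancestors(item, graph):
    """Return every data item that `item` was directly or indirectly derived from."""
    seen = set()
    stack = list(graph.get(item, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


print(sorted(ancestors("anomaly.nc", derived_from)))
```

Following the edges in the opposite direction ("successor") would answer the attribution question of who has used a given data item.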
Provenance information is part of the metadata • Curating this metadata by hand is tedious and practically impossible • It is generally agreed that provenance gathering must be automated • View a provenance record as something created on the fly, rather than a stored document • provenance is the result of a query over process assertions (Moreau 2010)
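A minimal sketch of "provenance as the result of a query", assuming process assertions are stored as flat (process, relation, object) tuples. The relation names, run identifiers and file names are hypothetical; no provenance record is stored, it is assembled when asked for.

```python
# Process assertions as they might be collected from different tools.
# Relation names and identifiers are hypothetical.
assertions = [
    ("run42", "used", "daily_output.nc"),
    ("run42", "generated", "monthly_mean.nc"),
    ("run42", "executed_by", "cdo"),
    ("run43", "used", "monthly_mean.nc"),
    ("run43", "generated", "anomaly.nc"),
]


def provenance_of(item, store):
    """Assemble the provenance record of `item` on the fly from the store."""
    record = []
    pending = [item]
    visited = set()
    while pending:
        current = pending.pop()
        if current in visited:
            continue
        visited.add(current)
        for process, relation, obj in store:
            if relation == "generated" and obj == current:
                # Collect every assertion about the generating process ...
                for assertion in store:
                    if assertion[0] == process and assertion not in record:
                        record.append(assertion)
                    # ... and continue with the inputs that process used.
                    if assertion[0] == process and assertion[1] == "used":
                        pending.append(assertion[2])
    return record


for assertion in provenance_of("anomaly.nc", assertions):
    print(assertion)
```

New assertions can be appended to the store at any time; the next query simply returns a larger record.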
Applications: What is out there to gather, represent and exploit provenance?
Gathering provenance information • Many tools exist to capture provenance through an encompassing system (particularly workflow systems) • Lots of research and academic prototypes • A list is available at http://www.openprovenance.org • Provenance information may be aggregated in a dedicated database (provenance store)
Gathering provenance: workflow systems • Scientific workflow systems • e.g. Taverna, Kepler, VisTrails, ... • Advantages • potentially good coverage • improve collaboration and knowledge transfer • Disadvantages • all or nothing • high migration costs • do not match the user's traditional workflow
Gathering provenance information: alternative • For us: no overarching system possible in the mid-term • Alternative idea: capture provenance in small pieces by enhancing the existing tools of the research environments
Gathering provenance: in small steps • Advantages • results scale well with implementation effort • potentially small change to user workflow • Disadvantages • fragmentary coverage • incoherent, potentially chaotic information • mandates strict standardization
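One way to picture "small pieces" is a helper that an existing tool or wrapper script calls after each processing step. The sketch below assumes a JSON-lines log file; the function name, the field names and the example step are hypothetical, and no external tool is actually executed.

```python
# Append one small provenance record per processing step to a JSON-lines
# log. Field names and the example step are hypothetical.
import json
import os
import platform
import tempfile
from datetime import datetime, timezone


def record_step(log_path, tool, inputs, outputs):
    """Append a single provenance record and return it."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "inputs": list(inputs),
        "outputs": list(outputs),
        "user": os.environ.get("USER", "unknown"),
        "platform": platform.platform(),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


# Demo: write to a temporary directory instead of a real archive.
log_path = os.path.join(tempfile.mkdtemp(), "provenance.jsonl")
record_step(log_path, "cdo monmean", ["daily_output.nc"], ["monthly_mean.nc"])

with open(log_path, encoding="utf-8") as log:
    entries = [json.loads(line) for line in log]
print(entries[0]["tool"], entries[0]["inputs"], "->", entries[0]["outputs"])
```

The fixed set of field names hints at the standardization issue named above: records from different tools are only useful together if they agree on such a schema.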
Representation of provenance information • Provenance information can be represented in many formats • human-interpretable: human-addressed log files, free text • machine-interpretable: traversable graphs • simple (A derived from B) • complex (semantic graph, Open Provenance Model)
Machine-interpretable representation required? • Machine-interpretable representation of provenance: what is the desired level of detail? • more detail requires a more sophisticated representation language • So the core question is: does your use case require a sophisticated machine-interpretable representation? • remember: machine-interpretability is for tools, not for humans
Representation formats • Machine-interpretable representation of gathered information • Designed to span across systems • Standardized representation formats • 2010: Open Provenance Model (OPM) • 2012: W3C PROV (draft status) • The same set of people is involved in both
OPM: Agents, Processes and Artifacts • OPM and W3C PROV are both graph-based representations • In the following: the Open Provenance Model (OPM) in brief
Open Provenance Model: base elements (Moreau et al. 2010b) • Agents • cdo, user • Processes • calculate monthly means • Artifacts / Entities • input and output data, log file • These are modelled in the past tense: they describe what has already happened.
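The slide's example can be written down as a small typed graph. The sketch below uses the OPM edge names (used, wasGeneratedBy, wasControlledBy, wasTriggeredBy, wasDerivedFrom); the dictionary layout and the checking function are assumptions for illustration and not an OPM serialization.

```python
# The slide's example as OPM-style nodes and edges.
# Node kinds: agent, process, artifact. The data layout is hypothetical.
NODES = {
    "cdo": "agent",
    "user": "agent",
    "calculate monthly means": "process",
    "input data": "artifact",
    "output data": "artifact",
    "log file": "artifact",
}

# Edges point from effect to cause, as in OPM: (effect, relation, cause).
EDGES = [
    ("calculate monthly means", "used", "input data"),
    ("output data", "wasGeneratedBy", "calculate monthly means"),
    ("log file", "wasGeneratedBy", "calculate monthly means"),
    ("calculate monthly means", "wasControlledBy", "cdo"),
    ("calculate monthly means", "wasControlledBy", "user"),
    ("output data", "wasDerivedFrom", "input data"),
]

# Which node kinds each OPM edge type connects: (effect kind, cause kind).
EDGE_TYPES = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


def invalid_edges(nodes, edges):
    """Return the edges whose endpoint kinds do not match their edge type."""
    bad = []
    for effect, relation, cause in edges:
        expected = EDGE_TYPES.get(relation)
        actual = (nodes.get(effect), nodes.get(cause))
        if expected != actual:
            bad.append((effect, relation, cause))
    return bad


print(invalid_edges(NODES, EDGES))  # []
```

All edge names are in the past tense, which matches the remark that OPM describes executions that have already taken place.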
Motivations for W3C PROV • W3C PROV continues the work of OPM • Roughly: align it with RDF/OWL and other Semantic Web standards
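To illustrate what alignment with RDF means in practice, the sketch below writes a monthly-means example as RDF triples in N-Triples syntax. It assumes the term names of the PROV-O vocabulary as later standardized by W3C (PROV was still a draft at the time of this talk); the http://example.org/ identifiers are placeholders.

```python
# A provenance statement as RDF triples using PROV-O terms.
# The example.org identifiers are placeholders, not real resources.
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"

triples = [
    (EX + "monthly_mean", RDF_TYPE, PROV + "Entity"),
    (EX + "daily_output", RDF_TYPE, PROV + "Entity"),
    (EX + "monmean_run", RDF_TYPE, PROV + "Activity"),
    (EX + "cdo", RDF_TYPE, PROV + "Agent"),
    (EX + "monmean_run", PROV + "used", EX + "daily_output"),
    (EX + "monthly_mean", PROV + "wasGeneratedBy", EX + "monmean_run"),
    (EX + "monmean_run", PROV + "wasAssociatedWith", EX + "cdo"),
    (EX + "monthly_mean", PROV + "wasDerivedFrom", EX + "daily_output"),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) IRIs as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


ntriples = to_ntriples(triples)
print(ntriples)
```

Compared with OPM, the process becomes an Activity, the artifact an Entity, and wasControlledBy roughly corresponds to wasAssociatedWith.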
Querying and viewing provenance • How to exploit encoded provenance information? • visualization • querying
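Both kinds of exploitation can be sketched on a simple list of "derived-from" edges: a one-line query, and an export to Graphviz DOT text for visualization. The edge list and function names are hypothetical; rendering the DOT text would need Graphviz, which this sketch does not call.

```python
# Hypothetical "derived-from" edges: (derived item, source item).
edges = [
    ("monthly_mean.nc", "daily_output.nc"),
    ("anomaly.nc", "monthly_mean.nc"),
    ("anomaly.nc", "climatology.nc"),
]


def direct_sources(item, edges):
    """Query: which items was `item` directly derived from?"""
    return sorted(source for derived, source in edges if derived == item)


def to_dot(edges, name="provenance"):
    """Visualization: serialize the edges as a Graphviz DOT digraph."""
    lines = [f"digraph {name} {{"]
    for derived, source in edges:
        lines.append(f'  "{derived}" -> "{source}" [label="derived-from"];')
    lines.append("}")
    return "\n".join(lines)


print(direct_sources("anomaly.nc", edges))
dot = to_dot(edges)
print(dot)
```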
Summary: what is provenance? • To summarize this particular view: • Provenance is the result of a query over process assertions. • In their simplest form, such assertions can be represented through an (ever-growing) DAG. • Including more details requires a process view that embraces a larger context. • Provenance information is subject to long-term archiving (LTA)
Bottom-up approach • Suggestion: start small and simple. • Collect small pieces of information • automatically, as an infrastructure task; do not burden the data producer • Provide tools to gather intelligence from this heap of information • The DAG view has an obvious, simple querying model (tree) and is easy to understand and explain • Build the DAG as a base layer, then attach richer context to the nodes or edges
And then some... • Construct a provenance graph using Persistent Identifiers? • PhD topic • DKRZ-Seminar on Persistent Identifiers • Wednesday, 17 Oct • 14-16h • Same place (R34)
Further reading • Luc Moreau: The Foundations for Provenance on the Web (2010) • main influence is Web science • summarizes the research field very well • includes an extensive bibliography • OPM specification: http://www.openprovenance.org • W3C PROV: http://www.w3.org/TR/prov-primer/
The End. Thank you for your attention.
References • Moreau (2010): The Foundations for Provenance on the Web, doi:10.1561/1800000010 • pre-print: http://eprints.soton.ac.uk/268176/ • Moreau et al. (2010b): The Open Provenance Model Core Specification (v1.1), doi:10.1016/j.future.2010.07.005 • The Fourth Paradigm, 2009, Microsoft Research • Buneman et al. (2001): Why and Where: A Characterization of Data Provenance, doi:10.1007/3-540-44503-X_20