Putting Lipstick on Pig: Enabling Database-style Workflow Provenance
Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, Val Tannen
Using slides by Guozhang Wang
Provenance in context of Workflows
• Data-intensive, complex computational processes often generate many final and intermediate data products
• Scientists and engineers need to expend substantial effort managing data and recording provenance information so that basic questions can be answered
Provenance in context of Workflows
• Who created this data product and when?
• When was it modified and by whom?
• What was the process used to create the data product?
• Were two data products derived from the same raw data?
Provenance in context of Workflows
• Provenance is an essential component to allow for:
  ◦ result reproducibility
  ◦ verifiability
  ◦ sharing and knowledge re-use in the scientific community
Workflow Provenance
• Motivated by scientific workflows
  ◦ Community: IPAW
  ◦ Interests: process documentation, data derivation and annotation, etc.
  ◦ Model: OPM
OPM Model
• Annotated directed acyclic graph
  ◦ Artifact: immutable piece of state
  ◦ Process: actions performed on artifacts, resulting in new artifacts
  ◦ Agents: execute and control processes
• Aims to capture causal dependencies between agents/processes
• Each process is treated as a “black box”
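
The OPM vocabulary above can be sketched as a small labeled graph. The edge names `used`, `wasGeneratedBy`, and `wasControlledBy` come from OPM; the class names, the helper `artifacts_depending_on`, and the sample data are assumptions made only for this illustration.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    kind: str   # "artifact", "process", or "agent"
    name: str

@dataclass
class OPMGraph:
    # Each edge is (source, relation, target) and points from effect to cause.
    edges: list = field(default_factory=list)

    def add(self, source, relation, target):
        self.edges.append((source, relation, target))

    def artifacts_depending_on(self, artifact):
        """Artifacts generated by any process that used `artifact`.

        Because a process is a black box, every output of the process is
        reported as depending on every one of its inputs.
        """
        procs = {s for s, r, t in self.edges if r == "used" and t == artifact}
        return {s for s, r, t in self.edges
                if r == "wasGeneratedBy" and t in procs}

# Hypothetical example data.
raw = Node("artifact", "raw_readings")
ref = Node("artifact", "calibration_table")
clean = Node("process", "clean_data")
out = Node("artifact", "clean_readings")
alice = Node("agent", "alice")

g = OPMGraph()
g.add(clean, "used", raw)
g.add(clean, "used", ref)
g.add(out, "wasGeneratedBy", clean)
g.add(clean, "wasControlledBy", alice)
```

Querying either input returns the same output artifact, which is the coarse, process-level dependency that OPM records.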
Data Provenance (for Relational DB and XML)
• Motivated by Prob. DB, data warehousing, ...
  ◦ Community: SIGMOD/PODS
  ◦ Interests: data auditing, data sharing, etc.
  ◦ Model: Semiring (etc.)
Semiring
• K-relations
  ◦ Each tuple is uniquely labeled with a provenance “token”
• Operations:
  ◦ “·”: join
  ◦ “+”: projection
  ◦ 0 and 1: selection predicates
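
As a rough illustration of K-relations, the sketch below annotates tuples with values from the counting semiring (natural numbers with ordinary addition and multiplication). The choice of semiring, the function names `join`, `project`, and `select`, and the sample relations are assumptions made for this example; they do not come from the slides.

```python
from collections import defaultdict

# A K-relation maps each tuple to an annotation drawn from a semiring K.
# Here K is the counting semiring (N, +, *, 0, 1), chosen only for
# illustration; symbolic provenance tokens would follow the same pattern.

def join(r, s, on_r, on_s):
    """Join two K-relations: annotations of matching tuples are multiplied."""
    out = defaultdict(int)
    for t1, k1 in r.items():
        for t2, k2 in s.items():
            if t1[on_r] == t2[on_s]:
                out[t1 + t2] += k1 * k2
    return dict(out)

def project(r, cols):
    """Project onto cols: annotations of tuples that collapse are added."""
    out = defaultdict(int)
    for t, k in r.items():
        out[tuple(t[c] for c in cols)] += k
    return dict(out)

def select(r, pred):
    """Selection multiplies each annotation by 1 (keep) or 0 (drop)."""
    out = {}
    for t, k in r.items():
        k2 = k * (1 if pred(t) else 0)
        if k2 != 0:
            out[t] = k2
    return out

# Hypothetical data: (car id, model) and (model, bid), with annotations.
cars = {("c1", "Civic"): 1, ("c2", "Civic"): 1, ("c3", "Golf"): 1}
bids = {("Civic", 20000): 2, ("Golf", 18000): 1}

joined = join(cars, bids, on_r=1, on_s=0)
models = project(joined, cols=[1])
golf_only = select(models, lambda t: t[0] == "Golf")
```

With these inputs, `models` annotates `("Civic",)` with 4 (two joined tuples, each annotated 1 · 2, then added) and `("Golf",)` with 1.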
Workflow Provenance Researchers
Data Provenance Researchers
OPM’s Drawbacks in Semiring People’s Eyes
• The black-box assumption: each output of the module depends solely on all of its inputs
  ◦ Cannot leverage the common fact that some outputs depend only on a small subset of the inputs
  ◦ Does not capture the internal state of a module
• So: replace it with semirings!
The Idea
• General workflow modules are complicated, so their internal logic is hard to capture with annotations
• However, modules written in Pig Latin are very similar to Nested Relational Calculus (NRC), so capturing their logic is much more feasible
Pig Latin
• Data: unordered (nested) bag of tuples
• Operators:
  ◦ FOREACH t GENERATE f1, f2, … OP(f0)
  ◦ FILTER BY condition
  ◦ GROUP / COGROUP
  ◦ UNION, JOIN, FLATTEN, DISTINCT …
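
To make the operator list more concrete, here is a small Python emulation of FILTER, GROUP, and FOREACH … GENERATE over bags represented as lists of tuples. This is not Pig Latin and not how Pig executes queries; the helper names `filter_by`, `group_by`, and `foreach_generate` and the sample bag are hypothetical.

```python
from collections import defaultdict

def filter_by(bag, cond):
    """FILTER ... BY cond: keep the tuples for which cond holds."""
    return [t for t in bag if cond(t)]

def group_by(bag, key):
    """GROUP ... BY field: one output tuple per key, holding a nested bag."""
    groups = defaultdict(list)
    for t in bag:
        groups[t[key]].append(t)
    return [(k, members) for k, members in groups.items()]

def foreach_generate(bag, fn):
    """FOREACH ... GENERATE: apply fn to every tuple of the bag."""
    return [fn(t) for t in bag]

# Hypothetical bag of (dealer, model, price) tuples.
bids = [("d1", "Civic", 20000), ("d2", "Civic", 19000), ("d1", "Golf", 18000)]

cheap = filter_by(bids, lambda t: t[2] < 20000)
by_model = group_by(cheap, key=1)
# Aggregation over the nested bag, playing the role of OP(f0).
best = foreach_generate(by_model, lambda g: (g[0], min(t[2] for t in g[1])))
```

The grouped result is a bag whose tuples each contain a nested bag, which is the nesting the data model above refers to.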
Provenance Annotation 1.1
• Provenance nodes and value nodes
  ◦ Workflow input nodes
  ◦ Module invocation nodes
  ◦ Module input/output nodes
Provenance Annotation 1.2
• State nodes
  ◦ P-node for the tuple
  ◦ P-node for the state
Provenance Annotation 2.1
• FOREACH (projection, no OP)
  ◦ P-node with “+”
Provenance Annotation 2.2
• JOIN
  ◦ P-node with “*”
Provenance Annotation 2.3
• GROUP
  ◦ P-node with “δ”
Provenance Annotation 2.4
• FOREACH (aggregation, OP)
  ◦ V-node with the OP name
Provenance Annotation 2.5
• COGROUP
  ◦ P-node with “δ”
Provenance Annotation 2.6
• FOREACH (UDF black box)
  ◦ P-node/V-node with the UDF name
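
The per-operator annotation rules can be sketched as a tiny graph builder. This is a simplified illustration, not the Lipstick implementation: the class `ProvGraph`, the edge layout for aggregation, and the label `"group"` (standing in for the grouping symbol used on the slides) are all assumptions.

```python
class ProvGraph:
    """Minimal provenance graph: labeled p-nodes and v-nodes with in-edges."""

    def __init__(self):
        self.kind = {}      # node id -> "p" (provenance) or "v" (value)
        self.label = {}     # node id -> "tuple", "+", "*", "group", or an OP name
        self.in_edges = {}  # node id -> list of source node ids
        self._next = 0

    def add(self, kind, label, sources=()):
        nid = self._next
        self._next += 1
        self.kind[nid] = kind
        self.label[nid] = label
        self.in_edges[nid] = list(sources)
        return nid

g = ProvGraph()

# Three input tuples, each with its own p-node.
a = g.add("p", "tuple")
b = g.add("p", "tuple")
c = g.add("p", "tuple")

# JOIN: each joined tuple gets a "*" p-node fed by both inputs.
ab = g.add("p", "*", [a, b])
cb = g.add("p", "*", [c, b])

# FOREACH (projection): tuples that collapse together share a "+" p-node.
proj = g.add("p", "+", [ab, cb])

# GROUP: the grouped tuple gets a p-node with the grouping label.
grp = g.add("p", "group", [proj])

# FOREACH (aggregation): a v-node labeled with the OP name (edge is assumed).
cnt = g.add("v", "COUNT", [grp])
```

Reading the in-edges backward from `cnt` recovers which input tuples contributed to the aggregate and through which operators.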
Query Provenance Graph
• Zoom-In vs. Zoom-Out
  ◦ Coarse-grained
  ◦ Fine-grained
Query Provenance Graph
• Deletion propagation
  ◦ Delete the tuple's P-node and its out-edges
  ◦ Repeatedly delete P-nodes if (a) all of their in-edges are deleted, or (b) they have label “·” (a join node) and one of their in-edges is deleted
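
The two deletion rules can be sketched as a worklist traversal. The function name `propagate_deletion`, the graph encoding (a label map plus an edge list), and the use of `"*"` as the join label are assumptions for this example.

```python
from collections import deque

def propagate_deletion(labels, edges, deleted_tuple):
    """Return the set of nodes deleted when `deleted_tuple` is removed.

    labels: node -> label ("tuple", "+", "*", ...); "*" marks a join node.
    edges:  list of (src, dst) pairs, pointing from input to derived node.
    """
    out_edges = {n: [] for n in labels}
    in_total = {n: 0 for n in labels}
    for src, dst in edges:
        out_edges[src].append(dst)
        in_total[dst] += 1

    in_deleted = {n: 0 for n in labels}
    deleted = {deleted_tuple}
    queue = deque([deleted_tuple])
    while queue:
        n = queue.popleft()
        for dst in out_edges[n]:          # deleting n removes its out-edges
            in_deleted[dst] += 1
            if dst in deleted:
                continue
            all_gone = in_deleted[dst] == in_total[dst]   # rule (a)
            join_hit = labels[dst] == "*"                 # rule (b)
            if all_gone or join_hit:
                deleted.add(dst)
                queue.append(dst)
    return deleted

# Hypothetical graph: two joins sharing input b, merged by a "+" node.
labels = {"a": "tuple", "b": "tuple", "c": "tuple",
          "ab": "*", "cb": "*", "out": "+"}
edges = [("a", "ab"), ("b", "ab"), ("c", "cb"), ("b", "cb"),
         ("ab", "out"), ("cb", "out")]

after_a = propagate_deletion(labels, edges, "a")
after_b = propagate_deletion(labels, edges, "b")
```

Deleting `a` removes only the join `ab`, because `out` still has a surviving in-edge from `cb`. Deleting `b` removes both joins, so `out` loses all of its in-edges and is deleted as well.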
Implementation and Experiments
• Lipstick prototype
  ◦ Provenance annotation coded in Pig Latin, with the graph written to files
  ◦ Query processing coded in Java and run in memory
• Benchmark data
  ◦ Car dealership: fixed workflow and number of dealers
  ◦ Arctic Station: varied workflow structure and size
Annotation Overhead
• Overhead increases with execution time
Annotation Overhead
• Parallelism helps, up to the number of modules
Loading Graph Overhead
• Increases with graph size (comp. time < 8 sec)
Loading Graph Overhead
• Feasible with various sizes (comp. time < 3 sec)
Subgraph Query Time
• Queries run efficiently, in sub-second time
Conclusions
• Studied fine-grained provenance for workflows
• Individual modules implemented in Pig Latin
• Provenance model for Pig Latin queries
• Data provenance ideas such as semirings can be brought to workflow provenance
Thank You!