
Putting Lipstick on Pig:

Putting Lipstick on Pig: Enabling Database-style Workflow Provenance. Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, Val Tannen. Using slides by Guozhang Wang. Provenance in context of Workflows.


Presentation Transcript


  1. Putting Lipstick on Pig: Enabling Database-style Workflow Provenance. Yael Amsterdamer, Susan B. Davidson, Daniel Deutch, Tova Milo, Julia Stoyanovich, Val Tannen. Using slides by Guozhang Wang

  2. Provenance in context of Workflows • Data-intensive, complex computational processes often generate many final and intermediate data products • Scientists and engineers must expend substantial effort managing data and recording provenance information so that basic questions can be answered.

  3. Provenance in context of Workflows • Who created this data product and when? • When was it modified and by whom? • What was the process used to create the data product? • Were two data products derived from the same raw data?

  4. Provenance in context of Workflows • Provenance is an essential component to allow for: ◦ result reproducibility ◦ verifiability ◦ sharing and knowledge re-use in the scientific community

  5. Workflow Provenance • Motivated by Scientific Workflows ◦ Community: IPAW ◦ Interests: process documentation, data derivation and annotation, etc. ◦ Model: OPM

  6. OPM Model • Annotated directed acyclic graph ◦ Artifact: immutable piece of state ◦ Process: actions performed on artifacts, resulting in new artifacts ◦ Agents: execute and control processes • Aims to capture causal dependencies between agents/processes • Each process is treated as a "black-box"
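The OPM graph described on this slide can be sketched in Python roughly as follows. This is a minimal illustration, not the OPM reference implementation: the class names `Node` and `OPMGraph`, the method `causes_of`, and the car-dealership node names are hypothetical. The relation names (`used`, `wasGeneratedBy`, `wasControlledBy`) follow the standard OPM edge vocabulary, which the slide itself does not list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    kind: str  # "artifact", "process", or "agent"
    name: str


@dataclass
class OPMGraph:
    # Each edge is (effect, relation, cause), pointing from effect to cause.
    edges: list = field(default_factory=list)

    def add(self, effect, relation, cause):
        self.edges.append((effect, relation, cause))

    def causes_of(self, node):
        """Return every node reachable from `node` along causal edges."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for effect, _, cause in self.edges:
                if effect == current and cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


# Hypothetical example: a dealer computes a bid from a bid request.
request = Node("artifact", "bid_request")
bid = Node("artifact", "bid")
compute = Node("process", "compute_bid")
dealer = Node("agent", "dealer1")

g = OPMGraph()
g.add(bid, "wasGeneratedBy", compute)
g.add(compute, "used", request)
g.add(compute, "wasControlledBy", dealer)

causes = g.causes_of(bid)
```

Because the process is a black box, the graph records only that `bid` came out of `compute_bid`; nothing says which parts of the request actually mattered.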

  7. Example: Car Dealership

  8. Data Provenance (for Relational DB and XML) • Motivated by Prob. DB, data warehousing, ... ◦ Community: SIGMOD/PODS ◦ Interests: data auditing, data sharing, etc. ◦ Model: Semiring (etc.)

  9. Semiring • K-relations ◦ Each tuple is uniquely labeled with a provenance "token" • Operations: ◦ "•": join ◦ "+": projection ◦ "0" and "1": selection predicates
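The semiring operations on this slide can be made concrete with a small Python sketch. It assumes provenance polynomials stored as a `Counter` from monomials (sorted tuples of tokens) to coefficients; the helper names (`token`, `times`, `plus`, `select`, `join`, `project`, `evaluate`) and the sample relations are hypothetical and only illustrate the idea.

```python
from collections import Counter
from itertools import product


def token(name):
    """Annotation of a base tuple: a polynomial with a single token."""
    return Counter({(name,): 1})


def times(a, b):
    """Semiring product, applied when two tuples are joined."""
    out = Counter()
    for (m1, c1), (m2, c2) in product(a.items(), b.items()):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return out


def plus(a, b):
    """Semiring sum, applied when projection merges several tuples."""
    return a + b


def select(rel, pred):
    """Selection multiplies an annotation by 1 (keep) or 0 (drop)."""
    return {t: p for t, p in rel.items() if pred(t)}


def join(r, s, col_r, col_s):
    """Join two K-relations ({tuple: polynomial}) on one column each."""
    out = {}
    for (t1, p1), (t2, p2) in product(r.items(), s.items()):
        if t1[col_r] == t2[col_s]:
            t = t1 + t2
            out[t] = plus(out.get(t, Counter()), times(p1, p2))
    return out


def project(rel, cols):
    """Project onto `cols`, summing annotations of tuples that collapse."""
    out = {}
    for t, p in rel.items():
        key = tuple(t[c] for c in cols)
        out[key] = plus(out.get(key, Counter()), p)
    return out


def evaluate(poly, valuation):
    """Evaluate a polynomial in the counting semiring (N, +, *, 0, 1)."""
    total = 0
    for mono, coeff in poly.items():
        term = coeff
        for tok in mono:
            term *= valuation[tok]
        total += term
    return total


# Hypothetical data: two cars (tokens p, q) and one bid request (token r).
cars = {("c1", "Civic"): token("p"), ("c2", "Civic"): token("q")}
requests = {("Civic", 20000): token("r")}

joined = join(cars, requests, 1, 0)
models = project(joined, [1])
civic = models[("Civic",)]  # the polynomial p*r + q*r
```

Setting a token to 0 in `evaluate` simulates deleting that input tuple: with `q = 0` the result tuple still has one derivation, while with `r = 0` it has none.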

  10. Workflow Provenance Researchers / Data Provenance Researchers

  11. Semiring Comes to Meet OPM

  12. OPM's Drawbacks in Semiring People's Eyes • The black-box assumption: each output of a module depends on all of its inputs ◦ Cannot leverage the common fact that some outputs depend only on a small subset of the inputs ◦ Does not capture the internal state of a module • So: replace it with Semirings!
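The gap between the two views can be illustrated with a short Python sketch. It assumes a hypothetical module that summarizes bids per car model; the function names `black_box_deps` and `fine_grained_deps` and the sample bids are invented for illustration.

```python
def black_box_deps(bids):
    """Black-box view: every output depends on every input."""
    all_ids = {bid_id for bid_id, _, _ in bids}
    models = {model for _, model, _ in bids}
    return {model: set(all_ids) for model in models}


def fine_grained_deps(bids):
    """Fine-grained view: an output depends only on the bids for its model."""
    deps = {}
    for bid_id, model, _ in bids:
        deps.setdefault(model, set()).add(bid_id)
    return deps


# Hypothetical input: (bid id, car model, price).
bids = [
    ("b1", "Civic", 20000),
    ("b2", "Civic", 19000),
    ("b3", "Accord", 25000),
]

coarse = black_box_deps(bids)
fine = fine_grained_deps(bids)
```

Here the black-box view claims the Accord result depends on the two Civic bids as well, which is exactly the over-approximation the slide objects to.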

  13. The Idea • General workflow modules are complicated, so their internal logic is hard to capture with annotations • However, modules written in Pig Latin are very similar to Nested Relational Calculus (NRC), so annotating them is much more feasible

  14. Pig Latin • Data: unordered (nested) bag of tuples • Operators: ◦ FOREACH t GENERATE f1, f2, … OP(f0) ◦ FILTER BY condition ◦ GROUP/COGROUP ◦ UNION, JOIN, FLATTEN, DISTINCT …
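For readers who have not used Pig Latin, the operators listed here behave roughly like the following Python analogue over bags of tuples. This is not Pig itself: the functions `filter_by`, `group_by`, and `foreach_generate` and the sample data are hypothetical, and Python lists stand in for unordered bags.

```python
from collections import defaultdict


def filter_by(bag, cond):
    """FILTER bag BY cond: keep the tuples that satisfy the condition."""
    return [t for t in bag if cond(t)]


def group_by(bag, key):
    """GROUP bag BY key: one tuple per key value, holding a nested bag."""
    groups = defaultdict(list)
    for t in bag:
        groups[t[key]].append(t)
    return [{"group": k, "bag": v} for k, v in groups.items()]


def foreach_generate(bag, fn):
    """FOREACH bag GENERATE ...: build one output tuple per input tuple."""
    return [fn(t) for t in bag]


# Hypothetical car inventory.
cars = [
    {"model": "Civic", "price": 20000},
    {"model": "Civic", "price": 22000},
    {"model": "Accord", "price": 31000},
    {"model": "Accord", "price": 25000},
]

affordable = filter_by(cars, lambda t: t["price"] < 30000)
grouped = group_by(affordable, "model")
# Plays the role of FOREACH ... GENERATE group, COUNT(bag).
counts = foreach_generate(grouped, lambda g: (g["group"], len(g["bag"])))
```

The nested bag produced by `group_by` is what makes the data model "nested": each output tuple carries a whole bag of input tuples.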

  15. Example: Car Dealership

  16. Bid Request Handling in Pig Latin

  17. Provenance Annotation

  18. Provenance Annotation 1.1 • Provenance nodes (P-nodes) and value nodes (V-nodes) ◦ Workflow input nodes ◦ Module invocation nodes ◦ Module input/output nodes

  19. Provenance Annotation 1.2 • State nodes ◦ P-node for the tuple ◦ P-node for the state

  20. Provenance Annotation 2.1 • FOREACH (projection, no OP) ◦ P-node with "+"

  21. Provenance Annotation 2.2 • JOIN ◦ P-node with "*"

  22. Provenance Annotation 2.3 • GROUP ◦ P-node with "∂"

  23. Provenance Annotation 2.4 • FOREACH (aggregation, OP) ◦ V-node with the OP name

  24. Provenance Annotation 2.5 • COGROUP ◦ P-node with "∂"

  25. Provenance Annotation 2.6 • FOREACH (UDF Black Box) ◦ P-node/V-node with the UDF name
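The per-operator rules on slides 20 to 25 can be summarized with a small Python sketch that adds one labeled node per operator result. It is an assumption-laden simplification, not the Lipstick implementation: the class `ProvGraph`, its `add` method, the edge layout, and the UDF name `CalcBid` are all hypothetical.

```python
class ProvGraph:
    """Toy provenance graph: P-nodes and V-nodes joined by directed edges."""

    def __init__(self):
        self.nodes = {}  # node id -> (kind, label); kind is "p" or "v"
        self.edges = []  # (source id, target id)

    def add(self, kind, label, inputs=()):
        """Create a node and draw an edge from each input node to it."""
        node_id = len(self.nodes)
        self.nodes[node_id] = (kind, label)
        for src in inputs:
            self.edges.append((src, node_id))
        return node_id


g = ProvGraph()

# Input tuples each get their own P-node (token).
car1 = g.add("p", "car1")
car2 = g.add("p", "car2")
req = g.add("p", "request")

# JOIN: P-node labeled "*" over the two joined tuples.
j1 = g.add("p", "*", [car1, req])
j2 = g.add("p", "*", [car2, req])

# FOREACH projection without an operator: P-node labeled "+".
proj = g.add("p", "+", [j1, j2])

# GROUP (and COGROUP): P-node labeled "∂" over the group members.
grp = g.add("p", "∂", [j1, j2])

# FOREACH with aggregation: V-node labeled with the operator name.
cnt = g.add("v", "COUNT", [grp])

# FOREACH with a black-box UDF: node labeled with the UDF name.
udf = g.add("v", "CalcBid", [cnt])
```

Only the UDF remains opaque; every other operator contributes a node whose label says how its inputs were combined.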

  26. Query Provenance Graph • Zoom-In vs. Zoom-Out ◦ Coarse-grained ◦ Fine-grained

  27. Query Provenance Graph • Deletion Propagation ◦ Delete the tuple P-node and its out-edges ◦ Repeatedly delete a P-node if either: all its in-edges are deleted, or it has label "•" and one of its in-edges is deleted
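The deletion-propagation rule can be sketched in Python as a fixpoint loop. This is a minimal illustration under stated assumptions: the function name `propagate_deletion` and the example graph are hypothetical, the join label is written `"*"` as on the JOIN slide, and the graph is assumed to have no parallel edges.

```python
def propagate_deletion(nodes, edges, start):
    """Return the set of nodes deleted when `start` is removed.

    nodes: {node id: label}; edges: list of (source, target) pairs.
    """
    in_edges = {n: [] for n in nodes}
    out_edges = {n: [] for n in nodes}
    for edge in edges:
        src, dst = edge
        out_edges[src].append(edge)
        in_edges[dst].append(edge)

    # Step 1: delete the chosen tuple node and its out-edges.
    deleted_nodes = {start}
    deleted_edges = set(out_edges[start])

    # Step 2: repeat until no further node qualifies for deletion.
    changed = True
    while changed:
        changed = False
        for n, label in nodes.items():
            if n in deleted_nodes or not in_edges[n]:
                continue  # already gone, or an input node with no in-edges
            gone = [e for e in in_edges[n] if e in deleted_edges]
            all_inputs_gone = len(gone) == len(in_edges[n])
            join_input_gone = label == "*" and len(gone) > 0
            if all_inputs_gone or join_input_gone:
                deleted_nodes.add(n)
                deleted_edges.update(out_edges[n])
                changed = True
    return deleted_nodes


# Hypothetical graph: two joins feeding one projection.
nodes = {"c1": "tuple", "c2": "tuple", "r": "tuple",
         "j1": "*", "j2": "*", "proj": "+"}
edges = [("c1", "j1"), ("r", "j1"), ("c2", "j2"), ("r", "j2"),
         ("j1", "proj"), ("j2", "proj")]

after_c1 = propagate_deletion(nodes, edges, "c1")
after_r = propagate_deletion(nodes, edges, "r")
```

Deleting `c1` removes only the join that used it, because the projection still has another derivation; deleting `r` removes both joins and therefore the projection as well.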

  28. Implementation and Experiments • Lipstick prototype ◦ Provenance annotation coded in Pig Latin, with the graph written to files ◦ Query processing coded in Java and run in memory • Benchmark data ◦ Car dealership: fixed workflow and #dealers ◦ Arctic Station: varied workflow structure and size

  29. Annotation Overhead • Overhead increases with execution time

  30. Annotation Overhead • Parallelism helps with up to #modules

  31. Loading Graph Overhead • Increases with graph size (comp. time < 8 sec)

  32. Loading Graph Overhead • Feasible with various sizes (comp. time < 3 sec)

  33. Subgraph Query Time • Queries run efficiently, in sub-second time

  34. Conclusions • Studied fine-grained provenance for workflows • Individual modules implemented in Pig Latin • Provenance model for Pig Latin queries • Data provenance ideas such as Semirings can be brought to workflow provenance • Thank You!
