
The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration



Presentation Transcript


  1. The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration • Summer Grid 2004, UT Brownsville South Padre Island Center, 24 June 2004 • Mike Wilde, Argonne National Laboratory, Mathematics and Computer Science Division

  2. GriPhyN: Grid Physics Network Mission • Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation • Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance • GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together

  3. Acknowledgements: Virtual Data is a Large Team Effort • The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao • The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi • Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, Marge Bardeen, and their wonderful teams

  4. Virtual Data Scenario • Manage workflow; update workflow following changes • On-demand data generation • Explain provenance, e.g. for file8: psearch –t 10 –i file3 file4 file5 –o file8; summarize –t 10 –i file6 –o file7; reformat –f fz –i file2 –o file3 file4 file5; conv –l esd –o aod –i file2 –o file6; simulate –t 10 –o file1 file2 • (Slide diagram: a workflow graph in which simulate, reformat, conv, summarize, and psearch produce file1 through file8)

  5. Virtual Data Describes Analysis Workflow • The requested dataset is file8; the workflow graph runs simulate, reformat, conv, summarize, and psearch over file1–file7 to produce it • The recorded virtual data “recipe” here is: Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2 • Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate

  6. Virtual Data Describes Analysis Workflow • To recreate file8, step 1: simulate > file1, file2

  7. Virtual Data Describes Analysis Workflow • To recreate file8, step 2: files 3, 4, 5, 6 are derived from file2 • reformat > file3, file4, file5 • conv > file6

  8. Virtual Data Describes Analysis Workflow • To recreate file8, step 3: file7 depends on file6 • summarize > file7

  9. Virtual Data Describes Analysis Workflow • To recreate file8, final step: file8 depends on files 1, 3, 4, 5, 7 • psearch < file1, file3, file4, file5, file7 > file8
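
To make the step-by-step regeneration above concrete, here is a minimal Python sketch (not the Chimera implementation) of how the recorded recipe lets a system rebuild file8 on demand. The program names and the dependency table come from the slides; the function itself is purely illustrative.

# derived file -> (program, input files), as recorded in the virtual data catalog
recipe = {
    "file1": ("simulate",  []),
    "file2": ("simulate",  []),
    "file3": ("reformat",  ["file2"]),
    "file4": ("reformat",  ["file2"]),
    "file5": ("reformat",  ["file2"]),
    "file6": ("conv",      ["file2"]),
    "file7": ("summarize", ["file6"]),
    "file8": ("psearch",   ["file1", "file3", "file4", "file5", "file7"]),
}

def regenerate(target, existing, plan):
    """Depth-first walk: derive missing inputs first, then the target."""
    if target in existing or target in plan:
        return
    program, inputs = recipe[target]
    for f in inputs:
        regenerate(f, existing, plan)
    plan.append(target)          # run `program` to produce `target`

plan = []
regenerate("file8", existing=set(), plan=plan)
print(plan)   # file1, file2, file3, file4, file5, file6, file7, file8 (inputs before outputs)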

  10. Grid3 – The Laboratory • Supported by the National Science Foundation and the Department of Energy

  11. VDL: Virtual Data Language Describes Data Transformations • Transformation: abstract template of a program invocation, similar to a "function definition" • Derivation: a "function call" to a Transformation; it stores past and future: a record of how data products were generated, and a recipe for how data products can be generated • Invocation: record of a Derivation execution • These XML documents reside in a "virtual data catalog" (VDC), a relational database
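
As a rough illustration of the three record kinds on this slide, the Python sketch below models Transformation, Derivation, and Invocation entries with dataclasses. The field names are assumptions chosen for clarity, not the actual Chimera VDC schema.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Transformation:          # abstract template ("function definition")
    name: str
    formal_args: List[str]     # e.g. ["in a1", "out a2"]
    argument_template: str     # how the command line is built from the args

@dataclass
class Derivation:              # "function call" binding actual files and values
    name: str
    transformation: str
    bindings: Dict[str, str]   # formal arg -> logical file name or parameter value

@dataclass
class Invocation:              # record of one execution of a derivation
    derivation: str
    exit_status: int
    resource_usage: Dict[str, float] = field(default_factory=dict)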

  12. VDL Describes Workflow via Data Dependencies • TR tr1(in a1, out a2) { argument stdin = ${a1}; argument stdout = ${a2}; } • TR tr2(in a1, out a2) { argument stdin = ${a1}; argument stdout = ${a2}; } • DV x1->tr1(a1=@{in:file1}, a2=@{out:file2}); • DV x2->tr2(a1=@{in:file2}, a2=@{out:file3}); • (Slide diagram: the dependency chain file1 → x1 → file2 → x2 → file3)

  13. Workflow example • Graph structure with fan-in and fan-out • "left" and "right" can run in parallel • Needs an external input file, located via the replica catalog • Data file dependencies form the graph structure • (Slide diagram: preprocess feeding two findrange jobs, which feed analyze)
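
To make the parallelism explicit, here is a small Python sketch of the diamond-shaped graph on this slide: "left" and "right" sit at the same depth and neither depends on the other, so they may run concurrently. The job names follow the slide; the graph code itself is illustrative and not part of the VDL tools.

deps = {                       # job -> jobs it depends on
    "preprocess": [],
    "left":  ["preprocess"],   # findrange
    "right": ["preprocess"],   # findrange
    "analyze": ["left", "right"],
}

def depth(job):
    """Length of the longest dependency chain leading to this job."""
    return 0 if not deps[job] else 1 + max(depth(d) for d in deps[job])

levels = {}
for job in deps:
    levels.setdefault(depth(job), []).append(job)

for lvl in sorted(levels):
    print(lvl, levels[lvl])    # jobs in the same level may run concurrently
# 0 ['preprocess']
# 1 ['left', 'right']
# 2 ['analyze']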

  14. Complete VDL workflow • Generate appropriate derivations: DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} ); DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" ); DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" ); DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );

  15. Compound Transformations Enable Functional Abstractions • A compound TR encapsulates an entire sub-graph: TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2 ) { call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] ); call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} ); call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} ); call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} ); }

  16. Derivation scripts • Representation of virtual data provenance: DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" ); DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" ); ... DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
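
The 70 derivations above read like a parameter sweep over the compound "diamond" transformation. The sketch below shows how such DV statements could be generated programmatically; the hex file-numbering rule follows the slide, but the loop bounds and p1/p2 values here are illustrative guesses, not the script actually used.

def fname(n):
    return f"f.{n:05X}"        # hex-numbered logical file names, as on the slide

for i in range(3):             # the real run generated 70 derivations
    base = 6 * i               # six logical files per diamond instance
    print(
        f'DV d{i + 1}->diamond( fd=@{{out:"{fname(base + 5)}"}}, '
        f'fc1=@{{io:"{fname(base + 4)}"}}, fc2=@{{io:"{fname(base + 3)}"}}, '
        f'fb1=@{{io:"{fname(base + 2)}"}}, fb2=@{{io:"{fname(base + 1)}"}}, '
        f'fa=@{{io:"{fname(base)}"}}, p2="{100 * (i + 1)}", p1="0" );'
    )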

  17. Invocation Provenance • Completion status and resource usage • Attributes of the executable transformation • Attributes of input and output files
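
As an illustration only (this is not the actual invocation record schema), the snippet below shows the kinds of facts such a record might hold. The file and transformation names are taken from the provenance excerpt later in the talk; the numeric values and executable path are hypothetical.

invocation_record = {
    "derivation": "Quarknet.HEPSRCH::run1aesum",
    "exit_status": 0,                                  # completion status
    "resource_usage": {"wall_time_s": 312.4, "cpu_time_s": 298.7, "max_rss_mb": 210},
    "transformation": {"executable": "/usr/local/bin/ECalEnergySum", "version": "1.2"},
    "files": {
        "input":  [{"lfn": "run1a.event", "size_bytes": 1_048_576}],
        "output": [{"lfn": "run1a.esm",   "size_bytes": 65_536}],
    },
}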

  18. Executing VDL Workflows • (Slide diagram) The abstract workflow is mapped by the "Pegasus" global planner, using Grid information, into a concrete DAG, which the DAGMan / Condor-G local planner executes • A "jit" (just-in-time) planner is a research topic
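
For intuition only, here is a toy sketch of that planning step: map each abstract job onto a concrete site using resource information, producing entries a local planner could execute. The site names and the greedy selection rule are invented for illustration; Pegasus's real algorithms are far more sophisticated.

abstract_workflow = ["preprocess", "left", "right", "analyze"]

grid_info = {                      # site -> free CPUs (stand-in for Grid information)
    "site-A.example.org": 12,
    "site-B.example.org": 64,
}

def plan(jobs, sites):
    """Greedy placement: send each job to the site with the most free CPUs."""
    concrete = []
    for job in jobs:
        site = max(sites, key=sites.get)
        concrete.append({"job": job, "site": site, "stage_in": True})
        sites[site] -= 1           # crude bookkeeping of the site's remaining capacity
    return concrete

for entry in plan(abstract_workflow, dict(grid_info)):
    print(entry)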

  19. GriPhyN-iVDGL Applications to date • ATLAS, BTeV, CMS – HEP event simulation • Argonne Computational Biology – sequence comparison and result capture • LIGO – pulsar search • Sloan Digital Sky Survey – cluster finding; near-earth object search planned • QuarkNet – science education – cosmic rays, HEP analysis

  20. Genome Analysis Database Update • Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS • Described in a GGF10 workshop paper

  21. Virtual Data Example: Galaxy Cluster Search DAG • Sloan data in, galaxy cluster size distribution out • Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago • Described in an SC2002 paper

  22. Cluster Search Workflow Graph and Execution Trace • (Plot: workflow jobs vs. time)

  23. Virtual Data Application: High Energy Physics Data Analysis • (Slide diagram: a tree of derived datasets, each annotated with metadata such as mass = 200; decay = bb, ZZ, or WW; stability = 1 or 3; event = 8; plot = 1; LowPt = 20; HighPt = 10000) • Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida • Ref: CHEP 2002 paper

  24. Using Virtual Data for Science Education • The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education • It's an experiment to give students the means to: discover and apply datasets, algorithms, and data analysis methods; collaborate by developing new ones and sharing results and observations; learn data analysis methods that will ready and excite them for a scientific career • And in later steps, we may actually use the Grid!

  25. QuarkNet Virtual Data Project • (Slide diagram) Student/teacher teams at several schools – Central High School, Reston, Virginia; Foothills High School, Great Falls, Montana; the Yale / Middletown High collaboration, Hartford, Connecticut – each run a cosmic ray detector and collect data locally • Through standard Web access they reach the QuarkNet Virtual Data Portal, backed by the Virtual Data Toolkit and a Virtual Data Catalog holding student data, algorithms, results, notes, and communications • Student/teacher teams share data, methods, programs, and knowledge, enabling collaboration-intensive science discovery with virtual data tools and methods

  26. Detector Performance Study

  27. Example: BTeV Event Simulation

  28. Support for Search and Discovery • Goal: make it as easy to use as Google • More advanced capabilities lie below the surface (as with Google) • Understand the structure and meaning of the datasets and their fields. • Advanced search, using SQL-like queries • Find both DATA and TRANSFORMATIONS • Create datasets from queries • Perform calculations on datasets, filtering results to look for patterns
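
To illustrate the "SQL-like queries" bullet, the sketch below runs a toy metadata query with SQLite. The table layout, column names, and rows are invented for this example and do not reflect the actual virtual data catalog schema or query interface.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE catalog (name TEXT, kind TEXT, mass REAL, decay TEXT)")
con.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        ("higgs_ww_200",  "DATA",           200, "WW"),
        ("higgs_zz_200",  "DATA",           200, "ZZ"),
        ("ECalEnergySum", "TRANSFORMATION", None, None),
    ],
)

# Find both data and transformations matching a search, as the slide suggests.
rows = con.execute(
    "SELECT name, kind FROM catalog "
    "WHERE (mass = 200 AND decay = 'WW') OR kind = 'TRANSFORMATION'"
).fetchall()
print(rows)   # [('higgs_ww_200', 'DATA'), ('ECalEnergySum', 'TRANSFORMATION')]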

  29. Search by Metadata

  30. Deriving a new dataset… to find the mass of the "z" particle:

  31. Workflow for missing energy calculations

  32. Virtual Provenance: list of derivations and files • <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum"> <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument> <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/> <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/> </job> <job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7" dv-namespace="Quarknet.HEPSRCH" … <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum"/></argument>… </job> <job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"… <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> … <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/> <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/> <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/> <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/> <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/> </job> <!-- list of all files used --> <filename file="ecal.pct" link="inout"/> <filename file="electron10GeV.avg" link="inout"/> <filename file="electron10GeV.sum" link="inout"/> <filename file="hcal.pct" link="inout"/> … (excerpted for display)
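
As a hedged example of consuming provenance documents like the excerpt above, the snippet below parses a tiny job element and lists its input and output logical files. It assumes the excerpt is wrapped in a single root element and uses only the <job>/<uses> structure shown on the slide.

import xml.etree.ElementTree as ET

xml_text = """
<adag>
  <job id="ID000001" name="ECalEnergySum">
    <uses file="run1a.event" link="input"/>
    <uses file="run1a.esm"   link="output"/>
  </job>
</adag>
"""

root = ET.fromstring(xml_text)
for job in root.findall("job"):
    ins  = [u.get("file") for u in job.findall("uses") if u.get("link") == "input"]
    outs = [u.get("file") for u in job.findall("uses") if u.get("link") == "output"]
    print(job.get("id"), job.get("name"), "inputs:", ins, "outputs:", outs)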

  33. Virtual Provenance in XML: control-flow graph • <child ref="ID000003"> <parent ref="ID000002"/> </child> <child ref="ID000004"> <parent ref="ID000003"/> </child> <child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001"/>… <child ref="ID000009"> <parent ref="ID000008"/> </child> <child ref="ID000010"> <parent ref="ID000009"/> <parent ref="ID000006"/>… <child ref="ID000012"> <parent ref="ID000011"/> </child> <child ref="ID000013"> <parent ref="ID000011"/> </child> <child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012"/>… <parent ref="ID000013"/>… </child>… (excerpted for display)
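
The <child>/<parent> elements above define the control-flow edges. The sketch below recovers an execution order from a small subset of those edges using the standard-library topological sorter (Python 3.9+); it is illustrative and is not the DAGMan scheduling algorithm.

from graphlib import TopologicalSorter

parents = {                               # child job -> its parents (must run first)
    "ID000003": ["ID000002"],
    "ID000004": ["ID000003"],
    "ID000005": ["ID000004", "ID000001"],
    "ID000014": ["ID000010", "ID000012", "ID000013"],
}

order = list(TopologicalSorter(parents).static_order())
print(order)   # every parent appears before its children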

  34. And writing the results up in a “poster”

  35. Poster describing analysis

  36. Using active data from Web Services

  37. Levels of Interaction • “Skins” – use it like a calculator, experiment with scenarios and settings, use virtual data like a log book to document, assess, and share parameter values. • “Blocks” – re-assemble workflow pipelines using existing ones as patterns and pre-developed transforms as building blocks • “Code” – write new transforms in a variety of languages and data models

  38. Observations • A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity • Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation • The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder

  39. Vision for Provenance in the Large • Universal knowledge management and production systems • Vendors integrate the provenance tracking protocol into data processing products • Ability to run anywhere “in the Grid”

  40. Virtual Data Grid Vision

  41. Planned Dataset Model • Dataset kinds include: a file; a set of files; an object closure; an XML element (e.g. a <FORM> fragment with a title); a relational query or spreadsheet range • New user-defined dataset type: a set of files with a relational index • Speculative model described in the CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao

  42. Planned Dataset Type Model • (Slide diagram: a dataset type hierarchy; non-leaf types are superclasses) • Representational types: File, FileSet • Logical types: MultiFileSet, TarFileSet, EventCollection, RawEventSet, SimulatedEventSet, MonteCarloSimulation, DiscreteEventSimulation

  43. Provenance Server Plans • OGSA-based Grid services • Discovery, security, resource management • Supports code and data discovery and workflow management • Object names (TR, DS, TY, DV, IV) can be used as global cross-server links • Derivations can reference remote transformations and datasets • Structured object namespaces & object-level access control enable large VO collaboration • Generalize transforms to describe service calls, database queries and language interpreters

  44. Provenance Hyperlinks

  45. Indexing Servers to Support Discovery

  46. For Information and Software • Virtual Data System • www.griphyn.org/chimera - Chimera Virtual Data System: Overview, papers, software • Grids and Grid Software • www.ivdgl.org/grid2003 - Using Grid3 • www.griphyn.org/vdt - Virtual Data Toolkit • www.globus.org – The Globus Toolkit • www.cs.wisc.edu/condor - The Condor Project • www.ppdg.net – Particle Physics Data Grid

  47. Acknowledgements • GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation • The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM
