The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration. Summer Grid 2004 UT Brownsville South Padre Island Center 24 June 2004 Mike Wilde Argonne National Laboratory Mathematics and Computer Science Division. GriPhyN: Grid Physics Network Mission.
The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration
Summer Grid 2004, UT Brownsville South Padre Island Center, 24 June 2004
Mike Wilde, Argonne National Laboratory, Mathematics and Computer Science Division
GriPhyN: Grid Physics Network Mission
• Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation
• Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance
• GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together
Acknowledgements: Virtual Data is a Large Team Effort
• The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao
• The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi
• Applications described are the work of many people, including: James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, Marge Bardeen, and their wonderful teams
Virtual Data Scenario
[Figure: workflow graph. simulate -t 10 … produces file1 and file2; reformat -f fz … produces file3, file4, file5; conv -I esd -o aod produces file6; summarize -t 10 … produces file7; psearch -t 10 … produces file8]
• Manage workflow; update workflow following changes
• On-demand data generation
• Explain provenance, e.g. for file8:
  psearch -t 10 -i file3 file4 file5 -o file8
  summarize -t 10 -i file6 -o file7
  reformat -f fz -i file2 -o file3 file4 file5
  conv -I esd -o aod -i file2 -o file6
  simulate -t 10 -o file1 file2
Virtual Data Describes analysis workflow
[Figure: the same workflow graph, with file8 marked as the requested dataset]
• The recorded virtual data “recipe” here is:
  • Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
  • Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate
Virtual Data Describes analysis workflow
• To re-create file8: step 1
  • simulate > file1, file2

Virtual Data Describes analysis workflow
• To re-create file8: step 2
  • files 3, 4, 5, 6 derived from file2
  • reformat > file3, file4, file5
  • conv > file6

Virtual Data Describes analysis workflow
• To re-create file8: step 3
  • file7 depends on file6
  • summarize > file7

Virtual Data Describes analysis workflow
• To re-create file8: final step
  • file8 depends on files 1, 3, 4, 5, 7
  • psearch < file1, file3, file4, file5, file7 > file8
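The re-creation steps above amount to walking the recorded recipe backwards from the requested file. The following is a minimal Python sketch of that idea, not the Chimera or Pegasus planner: the `RECIPE` table transcribes the file and program dependencies from the slides, while the `plan` function and its `existing` parameter are illustrative names introduced here.

```python
# Recipe from the slides: program -> (input files, output files).
RECIPE = {
    "simulate":  ([], ["file1", "file2"]),
    "reformat":  (["file2"], ["file3", "file4", "file5"]),
    "conv":      (["file2"], ["file6"]),
    "summarize": (["file6"], ["file7"]),
    "psearch":   (["file1", "file3", "file4", "file5", "file7"], ["file8"]),
}

def plan(target, existing=frozenset()):
    """Return the programs to run, in order, to materialize `target`.

    Files listed in `existing` are treated as already present, so the
    programs that would produce them are skipped (on-demand generation).
    """
    # Map every file to the program that produces it.
    producer = {out: prog for prog, (_, outs) in RECIPE.items() for out in outs}
    steps = []
    have = set(existing)

    def need(name):
        if name in have:
            return
        prog = producer[name]
        inputs, outputs = RECIPE[prog]
        for dep in inputs:          # materialize inputs first
            need(dep)
        steps.append(prog)
        have.update(outputs)        # one run yields all of the program's outputs

    need(target)
    return steps

print(plan("file8"))
print(plan("file8", existing={"file1", "file2", "file6"}))
```

With nothing on disk the plan follows the four steps on the slides; if file1, file2 and file6 already exist, only reformat, summarize and psearch remain.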
Grid3 – The Laboratory Supported by the National Science Foundation and the Department of Energy.
VDL: Virtual Data Language Describes Data Transformations
• Transformation
  • Abstract template of program invocation
  • Similar to "function definition"
• Derivation
  • “Function call” to a Transformation
  • Store past and future:
    • A record of how data products were generated
    • A recipe of how data products can be generated
• Invocation
  • Record of a Derivation execution
• These XML documents reside in a “virtual data catalog” (VDC), a relational database
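One way to picture the three record types is as plain data classes. This is a hedged sketch only: the class and field names (`inputs`, `bindings`, `exit_code`, `wall_seconds`, `unbound`) are assumptions made for illustration and are not the actual VDC schema.

```python
from dataclasses import dataclass

@dataclass
class Transformation:
    """Abstract template of a program invocation (a 'function definition')."""
    name: str
    inputs: list
    outputs: list

@dataclass
class Derivation:
    """A 'function call': binds a transformation's formal arguments to logical files."""
    name: str
    tr: Transformation
    bindings: dict

    def unbound(self):
        """Formal arguments that this derivation has not bound yet."""
        formals = self.tr.inputs + self.tr.outputs
        return [arg for arg in formals if arg not in self.bindings]

@dataclass
class Invocation:
    """Record of one execution of a derivation (fields are hypothetical)."""
    dv: Derivation
    exit_code: int
    wall_seconds: float

tr1 = Transformation("tr1", inputs=["a1"], outputs=["a2"])
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
run = Invocation(x1, exit_code=0, wall_seconds=1.5)
print(x1.unbound(), run.dv.tr.name)
```

A derivation can exist without any invocation (a recipe for the future) or with one or more invocations (a record of the past), which is the "past and future" distinction on the slide.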
VDL Describes Workflow via Data Dependencies
TR tr1(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
TR tr2(in a1, out a2) {
  argument stdin = ${a1};
  argument stdout = ${a2};
}
DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
[Figure: file1 → x1 → file2 → x2 → file3]
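No ordering between x1 and x2 is written down anywhere; it follows from x2 reading the file that x1 writes. A small Python sketch of that inference is below. The dictionary layout and the function names `infer_edges` and `external_inputs` are assumptions for illustration, not part of VDL.

```python
# Each derivation reduced to the logical files it reads and writes.
DERIVATIONS = {
    "x1": {"in": ["file1"], "out": ["file2"]},
    "x2": {"in": ["file2"], "out": ["file3"]},
}

def infer_edges(derivations):
    """Return (producer, consumer) pairs implied by shared logical files."""
    producer = {f: dv for dv, io in derivations.items() for f in io["out"]}
    edges = set()
    for dv, io in derivations.items():
        for f in io["in"]:
            if f in producer and producer[f] != dv:
                edges.add((producer[f], dv))
    return sorted(edges)

def external_inputs(derivations):
    """Files that are read but never produced, so they must already exist."""
    produced = {f for io in derivations.values() for f in io["out"]}
    consumed = {f for io in derivations.values() for f in io["in"]}
    return sorted(consumed - produced)

print(infer_edges(DERIVATIONS))
print(external_inputs(DERIVATIONS))
```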
Workflow example
• Graph structure
  • Fan-in
  • Fan-out
  • "left" and "right" can run in parallel
• Needs external input file
  • Located via replica catalog
• Data file dependencies
  • Form graph structure
[Figure: diamond-shaped graph with nodes preprocess, findrange, findrange, analyze]
Complete VDL workflow
• Generate appropriate derivations
DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
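The claim that "left" and "right" can run in parallel can be checked mechanically from these four derivations. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to group derivations into batches whose members have no mutual dependencies. The `DEPENDS_ON` table is transcribed by hand from the file names above (f.b1 and f.b2 link top to left and right; f.c1 and f.c2 link them to bottom); `parallel_batches` is an illustrative name, and this is not how Pegasus or DAGMan schedule work.

```python
from graphlib import TopologicalSorter

# derivation -> set of derivations whose outputs it reads
DEPENDS_ON = {
    "top": set(),
    "left": {"top"},
    "right": {"top"},
    "bottom": {"left", "right"},
}

def parallel_batches(depends_on):
    """Group nodes into successive batches; nodes in one batch are independent."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch as finished
    return batches

print(parallel_batches(DEPENDS_ON))
```

The middle batch contains both findrange derivations, which is the fan-out and fan-in of the diamond.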
Compound Transformations Enable Functional Abstractions
• Compound TR encapsulates an entire sub-graph:
TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2) {
  call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
  call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
  call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
}
Derivation scripts
• Representation of virtual data provenance:
DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"}, fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"}, fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
...
DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"}, fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
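Scripts like this are typically generated rather than typed. The sketch below reproduces the file numbering visible in d1, d2 and d70: each derivation takes six consecutive hexadecimal file numbers, assigned in the order fa, fb2, fb1, fc2, fc1, fd. That rule is inferred from the three derivations shown and is an assumption about the rest; the `derivation` function is hypothetical, and the p1 and p2 values are passed in because the slide does not say how they were chosen.

```python
# Order in which consecutive file numbers are assigned within one derivation.
ARGS = ("fa", "fb2", "fb1", "fc2", "fc1", "fd")

def derivation(n, p1, p2):
    """Render the n-th 'diamond' derivation line in the style of the slide."""
    base = (n - 1) * len(ARGS)
    # Five-digit upper-case hex names, e.g. f.0019E
    files = {arg: f"f.{base + k:05X}" for k, arg in enumerate(ARGS)}
    parts = [f'fd=@{{out:"{files["fd"]}"}}']
    parts += [f'{a}=@{{io:"{files[a]}"}}' for a in ("fc1", "fc2", "fb1", "fb2", "fa")]
    parts += [f'p2="{p2}"', f'p1="{p1}"']
    return f"DV d{n}->diamond( {', '.join(parts)} );"

print(derivation(1, "0", "100"))
print(derivation(70, "18", "800"))
```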
Invocation Provenance
• Completion status and resource usage
• Attributes of executable transformation
• Attributes of input and output files
Executing VDL Workflows
[Figure: planning and execution diagram. Labels: Abstract workflow; Grid Info; Global planner “Pegasus”; Concrete DAG; “jit” planner (research); local planner; DAGman / Condor-G]
GriPhyN-iVDGL Applications to date
• ATLAS, BTeV, CMS – HEP event simulation
• Argonne Computational Biology – sequence comparison and result capture
• LIGO – Pulsar search
• Sloan Digital Sky Survey – cluster finding; near-earth object search planned
• Quarknet – science education – cosmic rays, HEP analysis
Genome Analysis Database Update
Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS. Described in GGF10 workshop paper.
Virtual Data Example: Galaxy Cluster Search
[Figure panels: DAG; Sloan Data; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago. Described in SC2002 paper.
Cluster Search Workflow Graph and Execution Trace
[Figure: workflow jobs vs time]
Virtual Data Application: High Energy Physics Data Analysis
[Figure: graph of derived datasets, each labeled with its parameter set. Labels combine mass = 200 with decay = bb, ZZ or WW; stability = 1 or 3; event = 8; plot = 1; the most specific node reads mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000]
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida. Ref: CHEP 2002 paper
Using Virtual Data for Science Education
• The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
• It's an experiment to give students the means to:
  • discover and apply datasets, algorithms, and data analysis methods
  • collaborate by developing new ones and sharing results and observations
  • learn data analysis methods that will ready and excite them for a scientific career
• And in later steps, we may actually use the Grid!
Quarknet Virtual Data Project
[Figure: three student/teacher teams connect through standard Web access to the Quarknet Virtual Data Portal, which sits on the Virtual Data Toolkit and a Virtual Data Catalog holding student data, algorithms, results, notes, and communications. Each site has a cosmic ray detector and locally collected data:
  • Central High School, Reston, Virginia
  • Foothills High School, Great Falls, Montana
  • Yale / Middletown High Collaboration, Hartford, Connecticut]
• Student/teacher teams sharing data, methods, programs, and knowledge
• Enabling collaboration-intensive science discovery with virtual data tools and methods
Support for Search and Discovery
• Goal: make it as easy to use as Google
• More advanced capabilities lie below the surface (as with Google)
  • Understand the structure and meaning of the datasets and their fields
  • Advanced search, using SQL-like queries
  • Find both DATA and TRANSFORMATIONS
  • Create datasets from queries
  • Perform calculations on datasets, filtering results to look for patterns
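As a rough picture of what "SQL-like queries" over both data and transformations could look like, here is a sketch using Python's built-in `sqlite3` module. Everything about the storage layout is hypothetical: the `catalog` table, its columns, and the `detector` attribute values are invented for illustration and do not reflect the real virtual data catalog schema. Only the object names are borrowed from the Quarknet provenance excerpt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per (object, attribute, value).
conn.execute("CREATE TABLE catalog (kind TEXT, name TEXT, attr TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        # Attribute values below are made up for the example.
        ("DATA", "run1a.event", "detector", "ecal"),
        ("DATA", "run1a.esm", "detector", "ecal"),
        ("TRANSFORMATION", "ECalEnergySum", "detector", "ecal"),
        ("TRANSFORMATION", "ReconTotalEnergy", "detector", "all"),
    ],
)

def search(attr, value):
    """Find data AND transformations whose attribute matches the given value."""
    rows = conn.execute(
        "SELECT kind, name FROM catalog "
        "WHERE attr = ? AND value = ? ORDER BY kind, name",
        (attr, value),
    )
    return rows.fetchall()

print(search("detector", "ecal"))
```

A single query returns datasets and the transformations that apply to them side by side, which is the point of the "find both DATA and TRANSFORMATIONS" bullet.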
Virtual Provenance: list of derivations and files
<job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
  <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
  <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7" dv-namespace="Quarknet.HEPSRCH" …
  <argument><filename file="electron10GeV.event"/> <filename file="electron10GeV.sum"/></argument>…
</job>
<job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3"…
  <argument><filename file="run1a.mis"/> <filename file="run1a.ecal"/> …
  <uses file="run1a.muon" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.total" link="output" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.ecal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.hcal" link="input" dontRegister="false" dontTransfer="false"/>
  <uses file="run1a.mis" link="input" dontRegister="false" dontTransfer="false"/>
</job>
<!--list of all files used -->
<filename file="ecal.pct" link="inout"/>
<filename file="electron10GeV.avg" link="inout"/>
<filename file="electron10GeV.sum" link="inout"/>
<filename file="hcal.pct" link="inout"/>….
(excerpted for display)
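Because the provenance is plain XML, the inputs and outputs of each job can be pulled out with a few lines of standard-library Python. The sketch below parses only the first job from the excerpt, which is the one shown in full. The enclosing `<adag>` root element is an assumption added so the fragment parses on its own, and `job_files` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# First job from the excerpt, wrapped in an assumed root element.
DAX = """<adag>
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
       dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
    <argument><filename file="run1a.event"/> <filename file="run1a.esm"/></argument>
    <uses file="run1a.esm" link="output" dontRegister="false" dontTransfer="false"/>
    <uses file="run1a.event" link="input" dontRegister="false" dontTransfer="false"/>
  </job>
</adag>"""

def job_files(xml_text):
    """Map each job id to the files it uses, grouped by link direction."""
    root = ET.fromstring(xml_text)
    result = {}
    for job in root.iter("job"):
        io = {"input": [], "output": []}
        for uses in job.findall("uses"):
            io.setdefault(uses.get("link"), []).append(uses.get("file"))
        result[job.get("id")] = io
    return result

print(job_files(DAX))
```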
Virtual Provenance in XML: control flow graph
<child ref="ID000003"> <parent ref="ID000002"/> </child>
<child ref="ID000004"> <parent ref="ID000003"/> </child>
<child ref="ID000005"> <parent ref="ID000004"/> <parent ref="ID000001"/>…
<child ref="ID000009"> <parent ref="ID000008"/> </child>
<child ref="ID000010"> <parent ref="ID000009"/> <parent ref="ID000006"/>…
<child ref="ID000012"> <parent ref="ID000011"/> </child>
<child ref="ID000013"> <parent ref="ID000011"/> </child>
<child ref="ID000014"> <parent ref="ID000010"/> <parent ref="ID000012"/>… <parent ref="ID000013"/>… </child>…
(excerpted for display…)
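The child/parent elements are enough to answer "which jobs had to run before this one?". The sketch below parses the first three entries of the excerpt and walks the parent links transitively. The `<adag>` wrapper is an assumption, ID000005 is closed here even though the slide elides the rest of that element, and `parents_of` and `ancestors` are illustrative names, so the result reflects only the visible part of the graph.

```python
import xml.etree.ElementTree as ET

# First three entries of the excerpt; elided content is simply left out.
GRAPH = """<adag>
  <child ref="ID000003"><parent ref="ID000002"/></child>
  <child ref="ID000004"><parent ref="ID000003"/></child>
  <child ref="ID000005"><parent ref="ID000004"/><parent ref="ID000001"/></child>
</adag>"""

def parents_of(xml_text):
    """Map each child job id to the list of its direct parents."""
    root = ET.fromstring(xml_text)
    return {
        child.get("ref"): [p.get("ref") for p in child.findall("parent")]
        for child in root.iter("child")
    }

def ancestors(job, parents):
    """All jobs that must complete before `job`, found by following parent links."""
    seen = set()
    stack = [job]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(seen)

print(ancestors("ID000005", parents_of(GRAPH)))
```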
Levels of Interaction
• “Skins” – use it like a calculator, experiment with scenarios and settings, use virtual data like a log book to document, assess, and share parameter values
• “Blocks” – re-assemble workflow pipelines using existing ones as patterns and pre-developed transforms as building blocks
• “Code” – write new transforms in a variety of languages and data models
Observations
• A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
• Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
• The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder
Vision for Provenance in the Large
• Universal knowledge management and production systems
• Vendors integrate the provenance tracking protocol into data processing products
• Ability to run anywhere “in the Grid”
Planned Dataset Model
• File
• Set of files
• Object closure
• XML Element (shown on the slide as <FORM <Title…> /FORM>)
• Relational query or spreadsheet range
• New user-defined dataset type: Set of files with relational index
Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao
Planned Dataset Type Model
[Figure: dataset type hierarchy (nonleaf types are superclasses). Type names shown: FileDataset, File, FileSet, MultiFileSet, TarFileSet, EventCollection, RawEventSet, SimulatedEventSet, MonteCarloSimulation, DiscreteEventSimulation; further labels: Representational, Logical]
Provenance Server Plans
• OGSA-based Grid services
  • Discovery, security, resource management
• Supports code and data discovery and workflow management
• Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
  • Derivations can reference remote transformations and datasets
• Structured object namespaces & object-level access control enable large VO collaboration
• Generalize transforms to describe service calls, database queries and language interpreters
For Information and Software
• Virtual Data System
  • www.griphyn.org/chimera - Chimera Virtual Data System: Overview, papers, software
• Grids and Grid Software
  • www.ivdgl.org/grid2003 - Using Grid3
  • www.griphyn.org/vdt - Virtual Data Toolkit
  • www.globus.org - The Globus Toolkit
  • www.cs.wisc.edu/condor - The Condor Project
  • www.ppdg.net - Particle Physics Data Grid
Acknowledgements
• GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation
• The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM