The GriPhyN Virtual Data System
Ian Foster for the VDS team
Science as “Workflow”: E.g., Galaxy Cluster Search
[Figure labels: Sloan Data; DAG; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago
Requirements
• Express complex multi-step “workflows”
  • Perhaps 100,000s of individual tasks
• Operate on heterogeneous distributed data
  • Different formats & access protocols
• Harness many computing resources
  • Parallel computers &/or distributed Grids
• Execute workflows reliably
  • Despite diverse failure conditions
• Enable reuse of data & workflows
  • Discovery & composition
• Support many users, workflows, resources
  • Policy specification & enforcement
Virtual Data System
• Express complex multi-step “workflows”
  • Perhaps 100,000s of individual tasks
• Operate on heterogeneous distributed data
  • Different formats & access protocols
• Harness many computing resources
  • Parallel computers &/or distributed Grids
• Execute workflows reliably & efficiently
  • Despite diverse failure conditions
• Enable reuse of data & workflows
  • Discovery & composition
• Support many users, workflows, resources
  • Policy specification & enforcement
VDL, XDTM | Pegasus, DAGMan, Globus | VDC | TBD
Virtual Data System
[Figure: VDS architecture.
 Workflow spec: VDL Program; Virtual Data Catalog; Virtual Data Workflow Generator; Abstract workflow.
 Create Execution Plan: Statically Partitioned DAG; Dynamically Planned DAG; Local planner; Job Planner; Job Cleanup.
 Grid Workflow Execution: DAGMan DAG; DAGMan & Condor-G.]
Genome Analysis & DB Update (GADU)
600-1000+ CPUs
The Rest of the Talk
• Express complex multi-step “workflows”
  • Perhaps 100,000s of individual tasks
• Operate on heterogeneous distributed data
  • Different formats & access protocols
• Harness many computing resources
  • Parallel computers &/or distributed Grids
• Execute workflows reliably & efficiently
  • Despite diverse failure conditions
• Enable reuse of data & workflows
  • Discovery & composition
• Support many users, workflows, resources
  • Policy specification & enforcement
VDL, XDTM | Pegasus, DAGMan, Globus | Ewa | VDC | TBD
“Messy” Scientific Data
• Diverse storage formats & access protocols
  • A logically identical dataset can be stored in a text file (e.g. CSV), binary file, or spreadsheet
  • Data available from filesystem, database, HTTP, WebDAV, etc.
• Metadata encoded in directory & file names
  • E.g.: “fMRI volume is composed of an image file & header file with same prefix”
• Format dependency hinders program and workflow reuse
But... Data is Often Logically Structured
• Scientific data often maintains a hierarchical structure
• A common practice is to select a set of data items and apply a transformation to each individual item
• Nesting such iterations can scale up to millions of objects
Introducing a Typing System
• Describe logical data structures as types …
  • … & physical representations as mappings
• Define procedures in terms of typed datasets
  • … & apply procedures to different physical data
• Compose workflows from typed procedures
• Benefits
  • Type checking
  • Dataset selection and iteration
  • Discovery by types
  • Dynamic binding
  • Type conversion
XDTM (Moreau, Zhao, Wilde, Foster)
• XML Dataset Typing and Mapping
• Separates logical structure from physical representations
• Logical structure described by XML Schema
  • Primitive scalar types: int, float, string, date …
  • Complex types (structs and arrays)
• Mapping descriptor
  • How logical elements map to physical
  • External parameters (e.g. location)
• XPath for dataset selection
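To make the last bullet concrete, here is a minimal sketch of XPath-style dataset selection over a logical view. It is not XDTM's own representation or API: the XML layout, the element and attribute names, and the helper `select_images` are all assumptions made for illustration, and the code relies only on the limited XPath subset in Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical view of one run. Element and attribute names are
# illustrative only; they are not taken from XDTM.
LOGICAL_XML = """
<run id="1">
  <v index="1"><img>bold1_001.img</img><hdr>bold1_001.hdr</hdr></v>
  <v index="2"><img>bold1_002.img</img><hdr>bold1_002.hdr</hdr></v>
  <v index="3"><img>bold1_003.img</img><hdr>bold1_003.hdr</hdr></v>
</run>
"""


def select_images(xml_text, path):
    """Return the text of every element matched by an XPath-like expression."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(path)]


# Select every image in the run, then only the image of volume 2.
all_images = select_images(LOGICAL_XML, "./v/img")
second_image = select_images(LOGICAL_XML, "./v[@index='2']/img")
```

The point of the sketch is that selection is phrased against the logical structure (run, volume, image) and never mentions how the files are laid out on disk.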
Mapping
• Define a common mapping interface
  • Initialize, read, create, write, close
• Data providers implement the interface
  • Responsible for data access details
• XView maintains cached logical datasets
[Figure labels: VDS; XViewMgr; XView; Mapper; Data Source]
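The five operations named above can be sketched as an abstract interface with one data provider behind it. This is a hypothetical Python rendering, not the VDS mapper API: the class names `Mapper` and `DirectoryMapper`, the method signatures, and the `location` parameter are assumptions chosen to mirror the bullet list.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Mapper(ABC):
    """Hypothetical rendering of the common mapping interface."""

    @abstractmethod
    def initialize(self, **params): ...

    @abstractmethod
    def read(self, name): ...

    @abstractmethod
    def create(self, name): ...

    @abstractmethod
    def write(self, name, data): ...

    @abstractmethod
    def close(self): ...


class DirectoryMapper(Mapper):
    """Data provider that keeps each logical element as a file in a directory."""

    def initialize(self, **params):
        # "location" plays the role of an external mapping parameter.
        self.root = params["location"]
        self.is_open = True

    def _path(self, name):
        return os.path.join(self.root, name)

    def read(self, name):
        with open(self._path(name), "rb") as handle:
            return handle.read()

    def create(self, name):
        # Create an empty physical file for a new logical element.
        open(self._path(name), "wb").close()

    def write(self, name, data):
        with open(self._path(name), "wb") as handle:
            handle.write(data)

    def close(self):
        self.is_open = False


def round_trip(payload):
    """Create, write, and read back one element through the interface."""
    with tempfile.TemporaryDirectory() as tmp:
        mapper = DirectoryMapper()
        mapper.initialize(location=tmp)
        mapper.create("volume_anat.hdr")
        mapper.write("volume_anat.hdr", payload)
        data = mapper.read("volume_anat.hdr")
        mapper.close()
        return data
```

A database or HTTP provider would implement the same five methods, so callers such as `round_trip` would not need to change.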
Use Case: Functional MRI

Logical Structure
DBIC Archive
  Study #1
    Group #1
      Subject #1
        Anatomy
          high-res volume
        Functional Runs
          run #1
            volume #001
            ...
            volume #275
          ...
          run #5
            volume #001
            ...
          snrun #...
            …
    Group #5
      ...
  Study #...

Physical Representation
DBIC Archive
  Study_2004.0521.hgd
    Group_1
      Subject_2004.e024
        volume_anat.img
        volume_anat.hdr
        bold1_001.img
        bold1_001.hdr
        ...
        bold1_275.img
        bold1_275.hdr
        ...
        bold5_001.img
        ...
        snrbold*_*
        air*
        ...
    Group_5
      ...
  Study ...
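As a rough illustration of what a mapper does for this use case, the sketch below rebuilds the logical runs and volumes from the physical file names. The pattern `bold<run>_<volume>.img|hdr` is inferred from the listing above, and the function `group_runs` and its output shape are assumptions, not part of VDS.

```python
import re
from collections import defaultdict

# File-name pattern inferred from the listing: bold<run>_<volume>.<img|hdr>
BOLD_NAME = re.compile(r"^bold(\d+)_(\d+)\.(img|hdr)$")


def group_runs(filenames):
    """Group physical file names into {run number: [volume, ...]}.

    Each volume is a dict pairing the image file with its header file.
    Names that do not match the pattern (e.g. volume_anat.img) are skipped.
    """
    runs = defaultdict(dict)
    for name in filenames:
        match = BOLD_NAME.match(name)
        if match is None:
            continue
        run_no, vol_no, ext = int(match.group(1)), int(match.group(2)), match.group(3)
        runs[run_no].setdefault(vol_no, {})[ext] = name
    return {
        run_no: [volumes[i] for i in sorted(volumes)]
        for run_no, volumes in sorted(runs.items())
    }


files = [
    "volume_anat.img", "volume_anat.hdr",
    "bold1_001.img", "bold1_001.hdr",
    "bold1_002.hdr", "bold1_002.img",
    "bold5_001.img", "bold5_001.hdr",
]
runs = group_runs(files)
```

Here the metadata that the directory listing hides in file names (run number, volume number, image versus header) becomes explicit structure that a workflow can iterate over.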
Type Definitions in VDL

    type Image {};
    type Header {};
    type Volume {
        Image img;
        Header hdr;
    }
    type Anat Volume;
    type Warp {};
    type NormAnat {
        Anat aVol;
        Warp aWarp;
        Volume nHires;
    }
    type Run {
        Volume v [ ];
    }
    type Subject {
        Anat anat;
        Run run [ ];
        Run snrun [ ];
    }
    type Group {
        Subject s[ ];
    }
    type Study {
        Group g[ ];
    }

Part of fMRI AIRSN (Spatial Normalization) Workflow
Type Definitions in XML Schema

    <xs:schema targetNamespace="http://www.fmri.org/schema/airsn.xsd"
               xmlns="http://www.fmri.org/schema/airsn.xsd"
               xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:simpleType name="Image"/>
      <xs:simpleType name="Header"/>
      <xs:complexType name="Volume">
        <xs:sequence>
          <xs:element name="img" type="Image"/>
          <xs:element name="hdr" type="Header"/>
        </xs:sequence>
      </xs:complexType>
      <xs:complexType name="Run">
        <xs:sequence minOccurs="0" maxOccurs="unbounded">
          <xs:element name="v" type="Volume"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>
Procedure Definition in VDL

    (Run snr) functional( Run r, NormAnat a, Air shrink ) {
        Run yroRun = reorientRun( r , "y" );
        Run roRun = reorientRun( yroRun , "x" );
        Volume std = roRun[0];
        Run rndr = random_select( roRun, .1 ); // 10% sample
        AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, [81,3,3] );
        Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
        Volume meanRand = softmean( reslicedRndr, "y", null );
        Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, [81,3,3] );
        Volume mnQA = reslice( meanRand, mnQAAir, "o", "k" );
        Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
        Run nr = reslice_warp_run( boldNormWarp, roRun );
        Volume meanAll = strictmean( nr, "y", null );
        Volume boldMask = binarize( meanAll, "y" );
        snr = gsmoothRun( nr, boldMask, 6, 6, 6 );
    }
Dataset Iteration
• Functional analysis expressed in typed datasets
• Iterate over each volume in a run
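The idea of a run-level procedure that iterates over each volume can be sketched in Python. The `Volume` and `Run` classes mirror the VDL type definitions; the body of `reorient` is a stand-in that only tags file names, since the slides do not show what the real transformation does, and `reorient_run` is an assumed rendering of `reorientRun`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Volume:
    img: str  # image file
    hdr: str  # header file


@dataclass(frozen=True)
class Run:
    v: List[Volume]


def reorient(vol: Volume, direction: str) -> Volume:
    """Stand-in for the per-volume transformation: it only tags the names."""
    return Volume(img=f"{direction}_{vol.img}", hdr=f"{direction}_{vol.hdr}")


def reorient_run(run: Run, direction: str) -> Run:
    """Apply the per-volume procedure to every volume in a run."""
    return Run(v=[reorient(vol, direction) for vol in run.v])


run = Run(v=[Volume("bold1_001.img", "bold1_001.hdr"),
             Volume("bold1_002.img", "bold1_002.hdr")])

# Same composition as the first two steps of the VDL procedure:
# reorient along "y", then along "x".
ro_run = reorient_run(reorient_run(run, "y"), "x")
```

Because each volume is processed independently, every element of the list comprehension corresponds to a task that a planner could schedule in parallel.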
Expanded Execution Plan
• Datasets dynamically instantiated from data sources by mappers
Code Size Comparison
Lines of code with different workflow encodings
The Rest of the Talk
• Express complex multi-step “workflows”
  • Perhaps 100,000s of individual tasks
• Operate on heterogeneous distributed data
  • Different formats & access protocols
• Harness many computing resources
  • Parallel computers &/or distributed Grids
• Execute workflows reliably & efficiently
  • Despite diverse failure conditions
• Enable reuse of data & workflows
  • Discovery & composition
• Support many users, workflows, resources
  • Policy specification & enforcement
VDL, XDTM | Pegasus, DAGMan, Globus | VDC | TBD
fMRI Virtual Data Queries

Which transformations can process a “subject image”?
• Q: xsearchvdc -q tr_meta dataType subject_image input
• A: fMRIDC.AIR::align_warp

List anonymized subject-images for young subjects:
• Q: xsearchvdc -q lfn_meta dataType subject_image privacy anonymized subjectType young
• A: 3472-4_anonymized.img

Show files that were derived from patient image 3472-3:
• Q: xsearchvdc -q lfn_tree 3472-3_anonymized.img
• A: 3472-3_anonymized.img
     3472-3_anonymized.sliced.hdr
     atlas.hdr
     atlas.img
     …
     atlas_z.jpg
     3472-3_anonymized.sliced.img
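The third query asks for everything derived from a given file. A minimal sketch of that kind of lineage lookup is a breadth-first walk over derivation records. This is not how `xsearchvdc` is implemented; the record format, the function `derived_from`, and the file names in `DERIVATIONS` are all hypothetical.

```python
from collections import deque


def derived_from(derivations, start):
    """Return every file transitively derived from `start`, in BFS order.

    `derivations` is a list of (input files, output files) pairs, one pair
    per recorded transformation invocation.
    """
    children = {}
    for inputs, outputs in derivations:
        for src in inputs:
            children.setdefault(src, []).extend(outputs)

    seen, order, queue = {start}, [], deque([start])
    while queue:
        current = queue.popleft()
        for out in children.get(current, []):
            if out not in seen:
                seen.add(out)
                order.append(out)
                queue.append(out)
    return order


# Hypothetical derivation records; names are illustrative only.
DERIVATIONS = [
    (["raw.img"], ["raw.sliced.img", "raw.sliced.hdr"]),
    (["raw.sliced.img", "raw.sliced.hdr"], ["atlas.img"]),
    (["atlas.img"], ["atlas_z.jpg"]),
    (["other.img"], ["other.sliced.img"]),
]
lineage = derived_from(DERIVATIONS, "raw.img")
```

The `seen` set keeps a file that is reachable through two inputs (here `atlas.img`) from being reported twice.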
Provenance for ATLAS DC2 (High Energy Physics)

How much compute time was delivered?

    +-------+-----+------+
    | years | mon | year |
    +-------+-----+------+
    | .45   | 6   | 2004 |
    | 20    | 7   | 2004 |
    | 34    | 8   | 2004 |
    | 40    | 9   | 2004 |
    | 15    | 10  | 2004 |
    | 15    | 11  | 2004 |
    | 8.9   | 12  | 2004 |
    +-------+-----+------+

Selected statistics for one of these jobs:

    start:    2004-09-30 18:33:56
    duration: 76103.33
    pid:      6123
    exitcode: 0
    args:     8.0.5 JobTransforms-08-00-05-09/share/dc2.g4sim.filter.trf CPE_6785_556 ... -6 6 2000 4000 8923 dc2_B4_filter_frag.txt
    utime:    75335.86
    stime:    28.88
    minflt:   862341
    majflt:   96386

Which Linux kernel releases were used?
How many jobs were run on a Linux 2.4.28 kernel?
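The figures above can be aggregated with a few lines of arithmetic. The sketch assumes the "years" column means CPU-years delivered in that month and that `duration`, `utime`, and `stime` are all in seconds; the slide states neither unit explicitly, and the variable names are illustrative.

```python
# Compute time per month, copied from the table (June to December 2004).
# Assumption: the "years" column is CPU-years delivered in that month.
CPU_YEARS = {6: 0.45, 7: 20, 8: 34, 9: 40, 10: 15, 11: 15, 12: 8.9}

total_cpu_years = sum(CPU_YEARS.values())        # about 133.35
peak_month = max(CPU_YEARS, key=CPU_YEARS.get)   # month with the most compute

# Per-job statistics from the sample record.
# Assumption: all three values are in seconds.
duration, utime, stime = 76103.33, 75335.86, 28.88
cpu_fraction = (utime + stime) / duration        # share of wall time spent on CPU
```

Under those assumptions the seven months add up to roughly 133 CPU-years, with September the busiest month, and the sample job spent about 99% of its wall-clock time on the CPU.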
LIGO Inspiral Search Application
• Describe…
Inspiral workflow application is the work of Duncan Brown, Caltech; Scott Koranda, UW Milwaukee; and the LSC Inspiral group
FOAM: Fast Ocean/Atmosphere Model
250-Member Ensemble Run on TeraGrid under VDS
[Figure: workflow stages, repeated for Ensemble Member 1, 2, … N:
 Remote Directory Creation; FOAM run; Atmos Postprocessing; Ocean Postprocessing; Coupl Postprocessing.
 Final stage: Results transferred to archival storage.]
Work of: Rob Jacob (FOAM), Veronica Nefedova (workflow design and execution)
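The shape of this ensemble workflow can be sketched as a dependency graph. The edges are an assumption read off the figure labels (directory creation, then the FOAM run, then three postprocessing steps per member, then one archival transfer), the node names are hypothetical, and the code uses Python's standard `graphlib` (Python 3.9+) rather than DAGMan.

```python
from graphlib import TopologicalSorter


def ensemble_dag(n_members):
    """Map each task to the set of tasks it depends on.

    Assumed structure: per member, mkdir -> FOAM run -> three postprocessing
    steps; a single archive step waits for every postprocessing step.
    """
    deps = {"archive": set()}
    for i in range(1, n_members + 1):
        mkdir, run = f"mkdir_{i}", f"foam_{i}"
        deps[mkdir] = set()
        deps[run] = {mkdir}
        for kind in ("atmos", "ocean", "coupl"):
            post = f"{kind}_post_{i}"
            deps[post] = {run}
            deps["archive"].add(post)
    return deps


dag = ensemble_dag(250)
# One valid execution order; independent members could run in parallel.
order = list(TopologicalSorter(dag).static_order())
```

With 250 members the graph has five tasks per member plus the archive step, and no task of one member depends on any task of another, which is what lets the ensemble spread across many Grid resources.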
FOAM and VDS
• Climate supercomputer and grad student: 160 ensemble members in 75 days
• TeraGrid and VDS: 250 ensemble members in 4 days
Visualization courtesy Pat Behling and Yun Liu, UW Madison
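The two figures can be compared as throughput. This is plain arithmetic on the numbers quoted on the slide; the variable names are illustrative, and the comparison ignores any difference in hardware between the two setups.

```python
# Ensemble members completed per day in each setup.
baseline_rate = 160 / 75   # climate supercomputer and grad student
vds_rate = 250 / 4         # TeraGrid and VDS

# Ratio of the two throughputs.
speedup = vds_rate / baseline_rate
```

That works out to about 2.1 members per day versus 62.5 members per day, a throughput ratio of roughly 29.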
Summary: Science as Workflow
[Figure labels: Executed, Executing, Executable, Not yet executable; What I Did, What I Am Doing, What I Want to Do; Query, Edit, Schedule; Execution environment]
Acknowledgements
• The Virtual Data System group is:
  • ISI/USC: Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi
  • U of Chicago: Ben Clifford, Ian Foster, Mike Wilde, Yong Zhao
• GriPhyN is supported by the NSF
• Many research efforts involved in this work are supported by the US Department of Energy, Office of Science