Large-Scale Science Through Workflow Management Ewa Deelman Center for Grid Technologies USC Information Sciences Institute
Acknowledgements • Ewa Deelman, Carl Kesselman, Gaurang Mehta, Gurmeet Singh, Mei-Hui Su, Karan Vahi (Center for Grid Technologies, ISI) • James Blythe, Yolanda Gil (Intelligent Systems Division, ISI) • http://pegasus.isi.edu • Research funded as part of the NSF GriPhyN, NVO and SCEC projects and EU-funded GridLab Ewa Deelman, deelman@isi.edu www.isi.edu/~deelman pegasus.isi.edu
Today’s Scientific Applications • Increasing in the level of complexity • Use of individual application components • Reuse of individual intermediate data products (files) • Description of Data Products using Metadata Attributes • Execution environment is complex and very dynamic • Resources come and go • Data is replicated • Components can be found at various locations or staged in on demand • Separation between • the application description • the actual execution description
Workflow Definitions • Workflow template: shows the main steps in the scientific analysis and their dependencies without specifying particular data products • Abstract workflow: depicts the scientific analysis including the data used and generated, but does not include information about the resources needed for execution • Concrete workflow: an executable workflow that includes details of the execution environment
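The three definitions above can be read as successive refinements of the same step. The following is a minimal sketch of that reading, not the Pegasus data model: the class names, field names, and the helper functions `bind_data` and `bind_resources` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateStep:
    """Workflow template level: a component name only, no data products."""
    component: str


@dataclass(frozen=True)
class AbstractTask:
    """Abstract workflow level: logical file names, but no resources."""
    component: str
    inputs: tuple
    outputs: tuple


@dataclass(frozen=True)
class ConcreteJob:
    """Concrete workflow level: adds execution-environment details."""
    component: str
    inputs: tuple
    outputs: tuple
    site: str        # hypothetical: name of the execution site
    executable: str  # hypothetical: path of the installed executable


def bind_data(step, inputs, outputs):
    """Template -> abstract: attach the data used and generated."""
    return AbstractTask(step.component, tuple(inputs), tuple(outputs))


def bind_resources(task, site, executable):
    """Abstract -> concrete: attach where and how the task will run."""
    return ConcreteJob(task.component, task.inputs, task.outputs, site, executable)
```

For example, `bind_resources(bind_data(TemplateStep("extract"), ["a"], ["b"]), "siteA", "/opt/bin/extract")` yields a job that carries both the logical files and the chosen site.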
Concrete Workflow Generation and Mapping
Pegasus: Planning for Execution in Grids • Maps from abstract to concrete workflow • Algorithmic and AI-based techniques • Automatically locates physical locations for both workflow components and data • Finds appropriate resources to execute • Reuses existing data products where applicable • Publishes newly derived data products • Provides provenance information
Generating a Concrete Workflow • Information • Location of files and component instances • State of the Grid resources • Select specific • Resources • Files • Add jobs required to form a concrete workflow that can be executed in the Grid environment • Data movement • Data registration • Each component in the abstract workflow is turned into an executable job
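The steps on this slide can be sketched for a single abstract task as follows. This is an illustrative sketch under stated assumptions, not the Pegasus implementation: the function `concretize`, the tuple-based job records, and the plain dictionaries standing in for the replica and executable lookups are all hypothetical.

```python
def concretize(task, site, replicas, executables):
    """Turn one abstract task into an ordered list of concrete jobs.

    task:        dict with 'component', 'inputs', 'outputs' (logical names)
    site:        the execution site already selected for this task
    replicas:    dict logical file name -> list of sites holding a copy
                 (stand-in for a replica catalog lookup)
    executables: dict (component, site) -> executable path
                 (stand-in for a transformation catalog lookup)
    """
    jobs = []
    # Data movement: stage in every input that is not already at the site.
    for lfn in task["inputs"]:
        locations = replicas.get(lfn, [])
        if not locations:
            raise LookupError(f"no replica known for input {lfn!r}")
        if site not in locations:
            jobs.append(("transfer", lfn, locations[0], site))
    # The component itself becomes an executable job at the selected site.
    path = executables[(task["component"], site)]
    jobs.append(("compute", task["component"], path, site))
    # Data registration: record each new data product and where it lives.
    for lfn in task["outputs"]:
        jobs.append(("register", lfn, site))
    return jobs
```

Picking the first known replica as the transfer source is a simplification made here for brevity; the slides do not say how a source replica is chosen.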
Information Components used by Pegasus • Globus Monitoring and Discovery Service (MDS) • Locates available resources • Finds resource properties • Dynamic: load, queue length • Static: location of GridFTP server, RLS, etc • Globus Replica Location Service • Locates data that may be replicated • Registers new data products • Transformation Catalog • Locates installed executables
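One way the information from these services could feed resource selection is sketched below. The ranking policy (shortest queue first, then lowest load) is an assumption made for illustration; the slides do not state how Pegasus ranks sites. The function `select_site` and the dictionary and set stand-ins for MDS and the Transformation Catalog are hypothetical and do not use any Globus API.

```python
def select_site(component, resources, transformation_catalog):
    """Pick an execution site for a component.

    resources:              dict site -> {'queue_length': int, 'load': float}
                            (stand-in for dynamic properties reported by MDS)
    transformation_catalog: set of (component, site) pairs where the
                            executable is installed
    """
    # Only sites where the executable is installed are candidates.
    candidates = [s for s in resources if (component, s) in transformation_catalog]
    if not candidates:
        raise LookupError(f"{component!r} is not installed on any known site")
    # Assumed policy: shortest queue, then lowest load, then site name.
    return min(
        candidates,
        key=lambda s: (resources[s]["queue_length"], resources[s]["load"], s),
    )
```

A site with an empty queue is ignored if the component is not installed there, which is why the catalog filter runs before the ranking.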
Example Workflow Reduction • Original abstract workflow • If “b” already exists (as determined by query to the RLS), the workflow can be reduced
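The reduction idea can be sketched as a backward walk from the requested data products: a task is kept only if it produces a file that is needed and not already available. This is a minimal sketch, not the Pegasus algorithm itself; the function `reduce_workflow`, the task names, and the set standing in for the RLS query result are hypothetical.

```python
def reduce_workflow(tasks, requested, existing):
    """Return the names of tasks that still have to run.

    tasks:     dict task name -> (input files, output files)
    requested: files the user ultimately wants
    existing:  set of files already available (stand-in for an RLS query)
    """
    # Map every file to the task that produces it.
    producer = {f: name for name, (_, outs) in tasks.items() for f in outs}
    needed_tasks = set()
    seen = set()
    stack = list(requested)
    while stack:
        f = stack.pop()
        if f in seen or f in existing:
            continue  # already handled, or no need to regenerate it
        seen.add(f)
        if f not in producer:
            raise ValueError(f"{f!r} does not exist and no task produces it")
        name = producer[f]
        if name not in needed_tasks:
            needed_tasks.add(name)
            stack.extend(tasks[name][0])  # its inputs are now needed too
    return needed_tasks
```

With a two-step chain where `t1` turns `a` into `b` and `t2` turns `b` into `c`, requesting `c` when only `a` exists keeps both tasks, while requesting `c` when `b` already exists keeps only `t2`.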
Mapping from abstract to concrete • Query RLS, MDS, and TC, schedule computation and data movement
Pegasus Research • resource discovery and assessment • resource selection • resource provisioning • workflow restructuring • tasks merged together or reordered to improve overall performance • adaptive computing • Workflow refinement adapts to changing execution environment
Benefits of the workflow & Pegasus approach • The workflow exposes • the structure of the application • maximum parallelism of the application • Pegasus can take advantage of the structure to • Set a planning horizon (how far into the workflow to plan) • Cluster a set of workflow nodes to be executed as one (for performance) • Pegasus shields the user from the Grid details
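Both uses of workflow structure named above can be illustrated with the depth of each node in the dependency graph. The sketch below assumes an acyclic workflow and uses level-based grouping as one possible clustering strategy; the slides do not specify which strategy Pegasus applies, and the function names are hypothetical.

```python
from collections import defaultdict


def levels(deps):
    """deps: dict node -> list of parent nodes (assumed acyclic).
    Returns dict node -> depth, where nodes without parents have depth 0."""
    memo = {}

    def level(n):
        if n not in memo:
            parents = deps[n]
            memo[n] = 0 if not parents else 1 + max(level(p) for p in parents)
        return memo[n]

    for n in deps:
        level(n)
    return memo


def cluster_by_level(deps, size):
    """Group nodes of equal depth into clusters of at most `size` nodes.
    Nodes at the same depth never depend on each other, so each cluster
    can be submitted as a single job."""
    by_level = defaultdict(list)
    for node, depth in sorted(levels(deps).items()):
        by_level[depth].append(node)
    clusters = []
    for depth in sorted(by_level):
        nodes = by_level[depth]
        for i in range(0, len(nodes), size):
            clusters.append(nodes[i:i + size])
    return clusters


def within_horizon(deps, horizon):
    """Nodes whose depth is below the planning horizon, i.e. the part of
    the workflow that would be planned now."""
    return sorted(n for n, depth in levels(deps).items() if depth < horizon)
```

In a fan-out workflow with one root, three parallel middle nodes, and one final node, a cluster size of 2 merges two of the middle nodes into one job, and a horizon of 2 plans everything except the final node.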
Benefits of the workflow & Pegasus approach • Pegasus can run the workflow on a variety of resources • Pegasus can run a single workflow across multiple resources • Pegasus can opportunistically take advantage of available resources (through dynamic workflow mapping) • Pegasus can take advantage of pre-existing intermediate data products • Pegasus can improve the performance of the application.
Mosaic of M42 created on TeraGrid resources using Pegasus. Pegasus improved the runtime of this application by 90% over the baseline case. Bruce Berriman, John Good (Caltech); Joe Jacob, Dan Katz (JPL)
Future Directions Support for workflows with real-time feedback to scientists. Providing intermediate analysis results so that the experimental setup can be adjusted while the short-lived samples or human subjects are available.
Cognitive Grids: Distributed Intelligent Reasoners that Incrementally Generate the Workflow [Figure: levels of abstraction plotted against time. Labels: user’s request; application-level knowledge; relevant components; logical tasks; full abstract workflow; tasks bound to resources and sent for execution; workflow refinement; workflow repair; policy reasoner; onto-based matchmaker; not yet executed; partial execution; executed.]
BLAST: set of sequence comparison algorithms that are used to search sequence databases for optimal local alignments to a query • 2 major runs were performed using Chimera and Pegasus: • 1) 60 genomes (4,000 sequences each) processed in 24 hours • Genomes selected from DOE-sponsored sequencing projects • 67 CPU-days of processing time delivered • ~10,000 Grid jobs • >200,000 BLAST executions • 50 GB of data generated • 2) 450 genomes processed • Speedups of 5-20 times were achieved because the compute nodes were used efficiently by keeping the submission of jobs to the compute cluster constant. Led by Veronika Nefedova (ANL) as part of the PACI Data Quest Expedition program
Tomography (NIH-funded project) • Derivation of 3D structure from a series of 2D electron microscopic projection images • Reconstruction and detailed structural analysis • complex structures like synapses • large structures like dendritic spines • Acquisition and generation of huge amounts of data • Large amount of state-of-the-art image processing required to segment structures from extraneous background. [Figure: dendrite structure to be rendered by tomography.] Work performed with Mark Ellisman, Steve Peltier, Abel Lin, Thomas Molina (SDSC)
LIGO’s pulsar search at SC 2002 • The pulsar search conducted at SC 2002 • Used LIGO’s data collected during the first scientific run of the instrument • Targeted a set of 1000 locations of known pulsars as well as random locations in the sky • Results of the analysis were published via LDAS (LIGO Data Analysis System) to the LIGO Scientific Collaboration • Performed using LDAS and compute and storage resources at Caltech, University of Southern California, and University of Wisconsin Milwaukee. ISI people involved: Gaurang Mehta, Sonal Patil, Srividya Rao, Gurmeet Singh, Karan Vahi. Visualization by Marcus Thiebaux
Southern California Earthquake Center • Southern California Earthquake Center (SCEC), in collaboration with the USC Information Sciences Institute, San Diego Supercomputer Center, the Incorporated Research Institutions for Seismology, and the U.S. Geological Survey, is developing the Southern California Earthquake Center Community Modeling Environment (SCEC/CME). • Create fully three-dimensional (3D) simulations of fault-system dynamics. • Physics-based simulations can potentially provide enormous practical benefits for assessing and mitigating earthquake risks through Seismic Hazard Analysis (SHA). • The SCEC/CME system is an integrated geophysical simulation modeling framework that automates the process of selecting, configuring, and executing models of earthquake systems. Acknowledgments: Philip Maechling and Vipin Gupta, University of Southern California