From Grid to Global Computing: Deploying Parameter Sweep Applications Henri Casanova Grid Research And Innovation Laboratory (GRAIL) http://grail.sdsc.edu/ San Diego Supercomputer Center (SDSC) Computer Science and Engineering Dept. (CSE) University of California, San Diego (UCSD)
Parameter Sweep Applications • [Figure: Input data → Tasks → Raw output → Post-processing → Final output] • Many compute tasks • No or simple dependencies • Several output post-processing stages • Potentially large datasets
Relevance • Arise in virtually every field of science and engineering • Monte Carlo, Parameter Space Searches, Parameter Studies, etc. • Biology, Astrophysics, Physics, Bioinformatics, Economics, etc. • Primary candidate for Grid computing • Latency-tolerant, amenable to simple fault-tolerance • Need a huge amount of resources
Outline of the Presentation • Parameter Sweep Applications (PSAs) • APST • The Virtual Instrument • BIO@Home
Scheduling of PSAs • [Figure: PSA tasks → ? → Grid]
Grid Scheduling Practice • Ad-hoc solutions: • specific to one application • hand-tuned to the environment (e.g. SF-Express demo) • Large body of work on Scheduling • What can we re-use on the Grid? • Heterogeneous resources • Dynamic performance characteristics • Resource downtimes • Complex network topologies • Performance prediction errors
“DataGrid” Scheduling Goal: Co-locate/replicate data and computation • Dynamic Priority List-Scheduling • Built on heuristics described in [Ibarra77, Siegel99] • Added adaptivity • Simulation results • List-scheduling works, adaptivity should make it practical • Experimental results (Demo at SC’00 and SC’01) [HCW’00] H. Casanova, A. Legrand, et al.
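As a rough illustration of list scheduling with dynamic priorities, here is a minimal sketch of the Min-min heuristic. The slide does not say which heuristic from the cited work was used, so Min-min is picked here only as an example; the function name and the `exec_time` matrix are hypothetical, and data transfer costs are assumed to be already folded into the estimates.

```python
def min_min_schedule(exec_time, num_hosts):
    """Illustrative Min-min list scheduling (not the APST implementation).

    exec_time[t][h] is the estimated time to run task t on host h
    (assumed to already include any data transfer cost).
    Returns (schedule, makespan) where schedule holds
    (task, host, start, finish) tuples in the order they were decided.
    """
    ready = [0.0] * num_hosts            # time at which each host becomes free
    unscheduled = set(range(len(exec_time)))
    schedule = []
    while unscheduled:
        # Priorities are dynamic: completion times depend on `ready`,
        # so they are recomputed after every scheduling decision.
        best = None
        for t in sorted(unscheduled):
            for h in range(num_hosts):
                finish = ready[h] + exec_time[t][h]
                if best is None or finish < best[0]:
                    best = (finish, t, h)
        finish, t, h = best
        schedule.append((t, h, ready[h], finish))
        ready[h] = finish
        unscheduled.remove(t)
    return schedule, max(ready)


schedule, makespan = min_min_schedule([[3, 6], [2, 4], [5, 1]], num_hosts=2)
```

An adaptive variant would refresh `exec_time` from monitoring data before each round instead of trusting the initial estimates.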
Lessons • Much scheduling work to re-use • List-scheduling with Dynamic Priorities seems effective • Simulation • Experimental • Let’s build software that uses it • Let’s target scientific communities
Motivation for APST • Started as scheduling research • Evolved into a tool that provides • Transparency of Grid execution • Data movements • Remote job management • Multiple Grid middleware back-ends • Scheduling • Self-scheduling • List scheduling w/ dynamic priorities
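Self-scheduling can be pictured as a greedy work queue: whichever host becomes idle first takes the next task, with no performance prediction. The sketch below simulates that policy under assumed task costs and host speeds; all names are hypothetical and it is not taken from APST.

```python
import heapq


def self_schedule(task_costs, host_speeds):
    """Simulate a greedy work queue (self-scheduling).

    task_costs[i] is the work in task i; host_speeds[h] is work per time
    unit on host h. Returns (assignment, makespan), where assignment is a
    list of (task, host) pairs in dispatch order.
    """
    # Heap of (time the host becomes idle, host index).
    idle_at = [(0.0, h) for h in range(len(host_speeds))]
    heapq.heapify(idle_at)
    assignment = []
    for task, cost in enumerate(task_costs):
        now, h = heapq.heappop(idle_at)      # first host to become idle
        finish = now + cost / host_speeds[h]
        assignment.append((task, h))
        heapq.heappush(idle_at, (finish, h))
    makespan = max(t for t, _ in idle_at)
    return assignment, makespan


assignment, makespan = self_schedule([4, 4, 4, 4], [2.0, 1.0])
```

In this example the faster host ends up with three of the four tasks without the scheduler knowing anything about host speeds in advance.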
APST Designs • The AppLeS Parameter Sweep Template: An Application Execution Environment • [Architecture figure. Labels: APST client; XML application and resource descriptions; Scheduler; Metadata Bookkeeper; Decisions; Information; Actions; Compute; Transport; APST Grid Services; Grid]
APST: Lessons • The Grid is difficult to use • APST provides a simple software layer that does one thing well • Minimal user interface (XML, command-line) • Used as a building block for domain-specific applications • E.g. multi-cluster bio-informatics (Singapore) • Ssh? • Default mechanism • Critical for gaining user buy-in • Natural way to lead to using the Grid
APST Status • Version 1.1 released 2 weeks ago • Available for public download • Used for 10+ applications • Bioinformatics (BLAST, HMM, …) • Computational Neuroscience • Globus, NetSolve, Ssh, Condor • GASS, IBP, Scp, GridFTP, SRB • NWS, MDS, Ganglia, … http://grail.sdsc.edu/projects/apst
APST Research Directions • APST is a research platform • Maintained by one staff member • Several graduate student contributors • Partitionable Workload • Bioinformatics (database splitting) • Factoring: Decrease chunk size • Pipelining: Increase chunk size • Combined? • Create APST-BLAST (Mario Lauria, OSU; Yang Yang, UCSD)
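To make "Factoring: Decrease chunk size" concrete, the sketch below generates a chunk sequence under the usual halving rule, where each round hands out half of the remaining work in equal chunks, one per worker. This is an assumption about which factoring rule is meant; the function and parameter names are hypothetical.

```python
import math


def factoring_chunks(total_units, num_workers, min_chunk=1):
    """Return chunk sizes for a partitionable workload using factoring.

    Each round splits roughly half of the remaining work into
    num_workers equal chunks, so chunk sizes shrink as the run proceeds.
    """
    chunks = []
    remaining = total_units
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / (2 * num_workers)))
        for _ in range(num_workers):
            if remaining == 0:
                break
            take = min(size, remaining)
            chunks.append(take)
            remaining -= take
    return chunks


chunks = factoring_chunks(100, num_workers=2)
# chunks == [25, 25, 13, 13, 6, 6, 3, 3, 2, 2, 1, 1]
```

Large early chunks keep overhead low, while small late chunks limit the damage when a slow host receives the last piece of work. A pipelining strategy would run the sequence the other way, starting small so that transfers and computation overlap early.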
Outline of the Presentation • Parameter Sweep Applications (PSAs) • APST • Virtual Instrument • BIO@home
Computational Neuroscience • MCell: Monte Carlo Cell simulator • Developed at Salk and PSC • Gain knowledge about neuro-transmission mechanisms • Fundamental for drug design (psychiatry) • Large user base (yearly MCell workshop) • Parallel MC simulations at the molecular level
Traditional MCell usage • “By hand” • No automatic project management • No transparent resource access • No automated data management • Consequences • No interactive simulations • No fault-tolerance, scheduling, … • MCell limited to resources in the lab
MCell and APST • APST alleviates some of the limitations • Large-scale simulations • Fault-tolerance and scheduling • Data retrieval from distributed storage • XML application descriptions • No interactivity • MCell is exploratory • User interaction is fundamental for many users
The Virtual Instrument • $2.5M funding from the NSF • Salk, PSC, UCSB, UTK, UCSD • A running MCell simulation should behave as a lab instrument • Computational steering for MCell • User interface • Grid software • Application software • Scheduling research (how does one schedule an application that’s being steered interactively?)
VI Software • [Architecture figure. Components: VI User; VI Interface; OpenDX; VI Daemon; VI Database; Grid Services; Grid Storage and Compute Resources. Edge labels: control; data; control + data; compute; storage; process]
Scheduling Goals • Reduce the “search” time • Let the user assign levels of importance to regions of the parameter space • Assign fractions of resources according to the importance levels • Assign priorities to tasks • Interesting questions • Job control is limited on Grid resources • Cannot assign exact fractions • Interesting trade-offs between control overhead and accuracy of priorities
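One way to picture "assign fraction of resources with respect to the importance levels" is a proportional split of hosts across parameter-space regions. The sketch below uses largest-remainder rounding, which also shows why exact fractions cannot always be honored. It is only an assumed illustration, not the Virtual Instrument scheduler; the region names and weights are hypothetical.

```python
def allocate_hosts(importance, num_hosts):
    """Split num_hosts across regions in proportion to importance weights.

    importance maps a region name to a positive weight. Whole hosts are
    assigned by largest-remainder rounding, so the result only
    approximates the requested fractions.
    """
    total = sum(importance.values())
    exact = {r: num_hosts * w / total for r, w in importance.items()}
    alloc = {r: int(x) for r, x in exact.items()}   # round down first
    leftover = num_hosts - sum(alloc.values())
    # Give the remaining hosts to the regions with the largest fractional part.
    by_remainder = sorted(importance, key=lambda r: exact[r] - alloc[r], reverse=True)
    for r in by_remainder[:leftover]:
        alloc[r] += 1
    return alloc


alloc = allocate_hosts({"A": 3, "B": 2, "C": 1}, num_hosts=10)
# alloc == {"A": 5, "B": 3, "C": 2}
```

Region B asked for a third of the hosts but receives 3 of 10; closing that kind of gap would require finer-grained control, such as task priorities, at the cost of more control overhead.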
Current Status • First software prototype released in Feb 2002 • Globus and Ssh • MySQL • OpenDX • priority-based scheduling • 20,000 lines of C++ • Upcoming papers • JPDC submission • Scheduling paper (SC submission)
Outline of the Presentation • Parameter Sweep Applications (PSAs) • PSAs on the Grid with APST • MCell Virtual Instrument • Global Computing
SETI@home • Over 500,000 active participants, most of whom run a screensaver on a home PC • Over a cumulative 20 TeraFlop/sec • Versus 12.3 TeraFlop/sec for IBM’s ASCI White • Cost: $500,000 + $200,000 in donated hardware • Less than 1% of the $110 million required for ASCI White
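The "less than 1%" figure and the throughput comparison follow directly from the numbers on the slide:

```latex
\[
\frac{\$500{,}000 + \$200{,}000}{\$110{,}000{,}000}
  = \frac{0.7}{110} \approx 0.0064 = 0.64\% ,
\qquad
\frac{20\ \text{TeraFlop/sec}}{12.3\ \text{TeraFlop/sec}} \approx 1.63 .
\]
```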
Global vs. Grid Computing • Nature of resources • Home desktops run Windows and are completely autonomous • Machines powered on and off by user • Behind firewalls, dynamic IP, transient network connections • Programming model • Server cannot “push” tasks to clients • Server has little or no means for remote job control • Server has incomplete information about resources and availability
Goal • SETI@home limitations: • Embarrassingly parallel • Infinite amount of input data • Pure throughput • Can we do something more? • Short-lived applications? • Parallel applications? • Compute service? • BIO@Home • Smith-Waterman for short/long sequences • No real software yet (build on XtremWeb?)
Scheduling? • Sophisticated scheduling algorithms need information and control • At the moment: Simple mechanisms • Work unit duplication: specifies the max number of times a work unit can be resent • Timeouts: time that must elapse before a work unit is resent
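The two mechanisms can be sketched as a small server-side dispatcher in a pull model, where clients ask for work and the server never pushes. This is an assumed illustration, not code from SETI@home, XtremWeb, or BIO@Home; the class name and the parameters `timeout` and `max_resends` are hypothetical.

```python
class WorkUnitServer:
    """Hands out work units on request, with timeouts and bounded resends."""

    def __init__(self, unit_ids, timeout, max_resends):
        self.timeout = timeout            # time that must elapse before a resend
        self.max_resends = max_resends    # max number of times a unit is resent
        self.sends = {u: 0 for u in unit_ids}
        self.last_sent = {u: None for u in unit_ids}
        self.done = set()

    def request_work(self, now):
        """Called when a client asks for work. Returns a unit id or None."""
        # Prefer units that have never been sent.
        for u, n in self.sends.items():
            if u not in self.done and n == 0:
                return self._send(u, now)
        # Otherwise resend a timed-out unit that is still under the resend cap.
        for u, n in self.sends.items():
            if u in self.done:
                continue
            timed_out = now - self.last_sent[u] >= self.timeout
            if timed_out and n - 1 < self.max_resends:
                return self._send(u, now)
        return None

    def _send(self, unit, now):
        self.sends[unit] += 1
        self.last_sent[unit] = now
        return unit

    def report_result(self, unit):
        """Called when any client returns a result for the unit."""
        self.done.add(unit)


server = WorkUnitServer(["a", "b"], timeout=10, max_resends=1)
first = server.request_work(now=0)    # "a"
second = server.request_work(now=1)   # "b"
```

A short timeout or a high resend cap creates more duplicate work in exchange for faster turn-around, which is the trade-off the simulation study examines.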
Simulation • Built a simulation model • Using statistics/surveys/extrapolations • Next: logs from real systems (XtremWeb?, Entropia?) • Evaluated the impact of both mechanisms on performance and throughput
Early Lessons • Trade-off between throughput and turn-around time • Duplication: • aggressively decreases turn-around time • wastes resources • there is an optimal value • Timeouts: • moderately lower turn-around time • preserve good throughput • an infinite timeout is of course not a good idea
Future work • Two knobs • Question: A compute service? • Mix of applications (SETI, short-lived, …) • Singapore Bio-informatics institute • Notion of fairness? • How do we implement policy with many volatile resources? • Software • Re-use existing platforms: • XtremWeb • Entropia
Conclusion • APST, Virtual Instrument, BIO@Home • Other GRAIL activities I didn’t talk about • Scientific Computing • Simulation • Adaptive Scheduling • Networking http://grail.sdsc.edu
Experimental Results • Self-scheduling • XSufferage • [Testbed figure. Labels: TITECH, Tokyo, UCSD, UTK; Globus, NetSolve, Ssh; GASS, IBP; NWS]