This paper discusses the challenges of implementing virtual grid abstractions in resource management optimization, including simplifying resource abstraction, enabling better optimization, scaling to larger resource environments, and providing useful and current information for dynamic decisions.
VGrADS Runtime System Architecture and Research Andrew Chien, Henri Casanova, Rich Wolski, Carl Kesselman, Fran Berman, Dan Reed, Jack Dongarra Feb 2004 Kickoff Meeting Rice University
Runtime Challenges • Simplify Resource Abstraction for PPS • Enable Simpler and Better Optimization • Scale to Larger and More Complex Resource Environments • Grid resource pools are large and growing • Provide Better Information • Useful and Current Information for Dynamic Decisions
What is a Virtual Grid? “PPS/Application” • A Resource Management Oriented Abstraction • Provides Simple Resource Performance Model for the PPS/Application • Structures Information Collection by the Execution System • Accelerate And Improve Decision Making About Resources • Reduced Scope Of Resource Monitoring And Scheduling • Improved Scalability And Quality • Enables Proactive And Reactive Resource Monitoring, Acquisition To Improve Properties Of Performance, Reliability, Stability, Security, Etc. “Virtual Grids” “The Grid”
How are Virtual Grid Abstractions Defined? [Figure 5. TeraGrid topology: SDSC, Caltech, NCSA, PSC, and ANL connected through the TeraGrid Backplane; link labels include 10 Gb/s at 5 ms, 30 Gb/s at 20 ms, and 1 Gb/s] [Diagram labels: App; Resources] • Top-down (Application => Virtualization) • Bottom-up (Resource Properties, Individual & Aggregate => Virtualization) • Better Aggregate Resource Capabilities (Rapid Selection and Binding) • Attributes: Resource Properties, Communication Structure, Aggregates Over These Such as Reliability, Quality of Service (Into Picture)
Application-Driven Example • Dataset and Computations on Data Elements • EOL: Genomic and Proteomic Databases • Annotation Pipeline Employs Myriad Applications and Heterogeneous Workers • Computations Operate on Parts of the Dataset and Compute New Elements of New Datasets
Resource-Driven Example [Figure 5. TeraGrid topology: SDSC, Caltech, NCSA, PSC, and ANL connected through the TeraGrid Backplane; link labels include 10 Gb/s at 5 ms, 30 Gb/s at 20 ms, and 1 Gb/s] • TeraGrid: Five Clusters, Fast Network • Direct Access to Individual Clusters and Parts • Virtualized as Clusters (Multiple Choices) and as a Single Cluster • One/Part of a Cluster Dedicated as Data Servers [Diagram labels: Single Cluster; Data Server + Clusters; Singleton Clusters; TeraGrid]
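The virtualizations named on this slide (singleton clusters, a single cluster, and a data server plus clusters) can be read as different views over one resource inventory. The following is a minimal Python sketch under stated assumptions: the site names come from the slide, but the node counts, the `Cluster` type, and all function names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    site: str
    nodes: int  # hypothetical node count; the slides give none


# The five TeraGrid sites named on the slide; node counts are made up.
TERAGRID = [
    Cluster("SDSC", 128),
    Cluster("Caltech", 32),
    Cluster("NCSA", 256),
    Cluster("PSC", 64),
    Cluster("ANL", 64),
]


def as_singleton_clusters(clusters):
    """Expose each physical cluster as its own virtual cluster."""
    return [[c] for c in clusters]


def as_single_cluster(clusters):
    """Expose all sites together as one virtual cluster."""
    return [list(clusters)]


def with_data_server(clusters, server_site):
    """Dedicate one cluster as data server; the rest remain compute clusters."""
    server = [c for c in clusters if c.site == server_site]
    if not server:
        raise ValueError(f"unknown site: {server_site}")
    compute = [c for c in clusters if c.site != server_site]
    return {"data_server": server[0], "compute": compute}


def total_nodes(view):
    """Count nodes across every virtual cluster in a view."""
    return sum(c.nodes for group in view for c in group)
```

An application would then request one of these views instead of reasoning about the full topology; which view fits best depends on the application and is not specified by the slides.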
Example: Uniform Parallel Grid • N Uniform Performance Nodes; Rich connectivity • Range of “close approximations”
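One way to read "close approximations" is that the N nodes need not be identical, only within some tolerance of each other. The sketch below is an assumption about how such a selection might work, not the VGrADS method: `pick_uniform_nodes`, the `speeds` mapping, and the relative `tolerance` are all hypothetical names introduced for illustration.

```python
def pick_uniform_nodes(speeds, n, tolerance):
    """Pick n nodes whose speeds are as close to each other as possible.

    speeds:    dict mapping node name -> positive speed (any consistent unit)
    tolerance: largest allowed relative gap between slowest and fastest pick
    Returns a list of n node names, or None if no group fits the tolerance.
    """
    if n <= 0 or n > len(speeds):
        return None
    ranked = sorted(speeds.items(), key=lambda kv: kv[1])
    best = None
    # The tightest group of n values is always contiguous in sorted order,
    # so sliding a window of size n over the ranking is sufficient.
    for i in range(len(ranked) - n + 1):
        window = ranked[i:i + n]
        slow, fast = window[0][1], window[-1][1]
        spread = (fast - slow) / fast
        if spread <= tolerance and (best is None or spread < best[0]):
            best = (spread, [name for name, _ in window])
    return None if best is None else best[1]
```

Loosening `tolerance` widens the range of acceptable approximations at the cost of less uniform performance.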
Heterogeneous Collection • Oracle Grid: Database and Cluster of Workers • Abstract Component Machine?
And many more… • Cluster of clusters • Big bag of processors • Big bag of disks • Humongous bag of processors • Humongous bag of disks • …
Virtual Grid Architecture • Application/PPS/Libraries • VGrid RSpec: (security, reliability, communication, performance, location, QoS, etc.) • Compute: (inst set, special operations, libraries, etc.) • Comm: (multicast, reduce, P2P, Lambdas, etc.) • Dynamic RM: (add rsc, release, inquire, swap, overallocate, reserve, etc.) • Virtual Grid Abstractions: Uniform Cluster, Heterogeneous Cluster, … • Generic or Custom Grid Services: Rsc Info/Perf Monitor, Rsc Access, Selection, Checkpointing & Fault Toler., Schedule/Reschedule, Proactive Rsc Reserve/Bind • Narrowed Scope • Resource Classes: Classification, Composition, Virtualization • Secure Clusters, x86 Clusters, IA64 Clusters, Desktop Grid, …
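The VGrid RSpec and Dynamic RM entries on this slide suggest a specification object plus a small set of operations on a live virtual grid. The following Python sketch is purely illustrative: the field names in `VGridRSpec`, the `VirtualGrid` class, and the method signatures are assumptions, and only the operation names (add, release, swap, inquire) are taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class VGridRSpec:
    """Hypothetical resource specification; fields are illustrative only."""
    min_nodes: int
    min_bandwidth_gbps: float = 0.0
    secure: bool = False


@dataclass
class VirtualGrid:
    """A virtual grid bound to a spec, with a few Dynamic RM operations."""
    spec: VGridRSpec
    nodes: set = field(default_factory=set)

    def add(self, node):
        self.nodes.add(node)

    def release(self, node):
        self.nodes.discard(node)

    def swap(self, old, new):
        """Replace one bound node with another in a single step."""
        if old not in self.nodes:
            raise KeyError(old)
        self.nodes.remove(old)
        self.nodes.add(new)

    def inquire(self):
        """Report current membership and whether the node count meets the spec."""
        return {
            "nodes": sorted(self.nodes),
            "satisfied": len(self.nodes) >= self.spec.min_nodes,
        }
```

Operations such as overallocate and reserve are left out here because the slides do not say what they should mean.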
Virtual Grid Runtime Challenges: Implement the Abstractions • Custom Abstractions: Intelligence for Each • Scheduling, Composition, Proactive techniques • Resource Characterization and Classification • Monitoring and Detecting Failures and Performance Violations of VG Abstractions • Rapid Rescheduling in Response to Failures and Performance Violations • Scaling and Information and Decision Quality • Others?
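Detection and rescheduling can be pictured as two small steps: compare measurements against what the abstraction promised, then move work off the offending nodes. This Python sketch assumes a single contracted minimum rate per node and a list of spare nodes; `find_violations`, `reschedule`, and their parameters are hypothetical names, not part of any VGrADS interface.

```python
def find_violations(measured, contract_min_rate, failed=frozenset()):
    """Return {node: reason} for nodes that failed or run below contract.

    measured:          dict node -> observed rate (any consistent unit)
    contract_min_rate: lowest rate the virtual grid abstraction promised
    failed:            nodes known to be down
    """
    violations = {node: "failure" for node in failed}
    for node, rate in measured.items():
        if node not in violations and rate < contract_min_rate:
            violations[node] = "performance"
    return violations


def reschedule(assignment, violations, spares):
    """Move tasks from violating nodes onto spare nodes, in order."""
    spares = list(spares)
    new_assignment = {}
    for task, node in assignment.items():
        if node in violations:
            if not spares:
                raise RuntimeError(f"no spare node left for task {task}")
            node = spares.pop(0)
        new_assignment[task] = node
    return new_assignment
```

Because the check covers only the nodes inside one virtual grid, the monitoring scope stays small, which is the scalability argument the slides make.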
Resource Classes • Characterization and Organization of Resources • => Short and Long-term Monitoring and Analysis of Resources • Many Open Questions • What are the Meaningful and Useful Resource Classes? • How do We Support Both a Large Scale of Resources and a Refined Classification? • Is this a Multi-classification? • Is this Centralized or Decentralized or Both?
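To make the multi-classification question concrete, the sketch below lets one resource belong to several classes at once. It is a hypothetical Python illustration: the class names echo those on the architecture slide (Secure Clusters, x86 Clusters, IA64 Clusters, Desktop Grid), while the attribute keys `arch`, `kind`, and `secure` and the rule table are assumptions.

```python
# Each rule is a predicate over a resource description (a plain dict).
CLASS_RULES = {
    "x86 Clusters": lambda r: r["kind"] == "cluster" and r["arch"] == "x86",
    "IA64 Clusters": lambda r: r["kind"] == "cluster" and r["arch"] == "ia64",
    "Secure Clusters": lambda r: r["kind"] == "cluster" and r["secure"],
    "Desktop Grid": lambda r: r["kind"] == "desktop",
}


def classify(resource):
    """Return every class the resource belongs to (possibly more than one)."""
    return {name for name, rule in CLASS_RULES.items() if rule(resource)}


def build_index(resources):
    """Map each class name to the sorted names of its member resources."""
    index = {name: [] for name in CLASS_RULES}
    for resource in resources:
        for name in classify(resource):
            index[name].append(resource["name"])
    return {name: sorted(members) for name, members in index.items()}
```

A centralized service could hold the whole index, while a decentralized design might keep one index per site and merge on demand; the slides leave that choice open.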
GrADSoft vs. Virtual Grids and GrADS [Diagram labels: Virtual Grid; Conceptual GrADSoft] • Broad, General-Purpose Model vs. • Narrow / Specialized Model per Abstraction
Virtual Grids and GrADS Architecture • Small Set of Virtual Grid Abstractions = Performance Models = PPS View • Decouples the Optimization Problems • BUT, coordination on adaptation still required • Customized Information Collection (per PPS view) • Customized Resource Management / Scheduling (per PPS view) • Customized monitoring (per PPS view)
Initial Steps and Activities • Take Familiar and Important Application/Workloads and Explore Issues • What Type of Virtual Grid Might an Application Specify? How Might We Exploit These Attributes for Better Selection/Scheduling, Etc.? • Initial Work On EOL and Speeding Critical Phases • Take Typical Resource Configurations And Elicit Structure • What are Grid Resource Configurations? • What are Their Characteristics (Static, Dynamic)? • Do These Naturally Fall into a Structured Classification? • Can We Reduce the Scope Needed through the Virtual Grid Mechanism? • Explore Future Grid Information Systems • What Information Can be Provided with What Resolution and Accuracy, Scaling? • Can Virtual Grid Improve and Organize the Quality of Information for Adaptation? • Techniques To Make The Gathering And Distribution Of Different Types Of Information More Efficient And Scalable
Runtime Deliverables • September 2004, end Year 1 Virtual Grid: • Prototype Resource Virtualization and Abstraction Classes [V1] • Virtual Scheduling requirements study [V2] Performance Provisioning: • Initial time-space reasoning for contracts and signatures [PP1] Grid Economy: • Develop rudimentary simulation of VGrADS resource allocation mechanisms. [GE1] • Begin the exploration of Tâtonnement, Smale's method, and Continuous-Price Double auctions using simulation. [GE2] Fault Tolerance: • Experimental measurement of Grid & cluster reliability [FT1]
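As background for the Grid Economy items, tâtonnement is the textbook price-adjustment process in which a price rises when demand exceeds supply and falls otherwise. The Python sketch below shows that generic process for a single resource; it is not the VGrADS simulator, and the function name, the `step` size, and the stopping rule are assumptions chosen for illustration.

```python
def tatonnement(demand, supply, price, step=0.01, tol=1e-6, max_iter=10_000):
    """Adjust a single price until excess demand is (nearly) zero.

    demand, supply: functions mapping a price to a quantity
    price:          starting price (positive)
    step:           how strongly the price reacts to excess demand
    Returns the approximate market-clearing price.
    """
    for _ in range(max_iter):
        excess = demand(price) - supply(price)
        if abs(excess) < tol:
            return price
        # Raise the price when demand exceeds supply, lower it otherwise,
        # and keep it strictly positive.
        price = max(price + step * excess, 1e-9)
    raise RuntimeError("price did not converge; try a smaller step")
```

With linear demand and supply the iteration converges only when `step` is small enough relative to their slopes; too large a step makes the price oscillate, which is one reason stability has to be studied by simulation.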
Runtime Deliverables (cont.) • September 2005, end Year 2 Virtual Grid: • Prototype Virtual Grid examples defined [V3] • Prototype virtual scheduler [V4] Performance Provisioning: • Extended time-space reasoning for contracts and signatures [PP2] Grid Economy: • Determine initial pricing conditions and pricing methods that prevent multiple equilibria. [GE3] • Verify stability results using simulation environment. [GE4] Fault Tolerance: • Prototype fault tolerant library [FT2]
And beyond… • September 2006, end Year 3 Virtual Grid: • Novel resource selection and virtual scheduling strategy experiments with application kernels on virtual grid environments [V5] Performance Provisioning: • Limited tunable performance/fault-tolerance capabilities [PP3] Grid Economy: • Begin designing experiments to test pricing techniques using VGrADS framework. [GE5] • Continue simulation experiments to evaluate resource allocation efficiency. [GE6] Fault Tolerance: • Consider novel techniques [FT3] • September 2007, end Year 4 • September 2008, end Year 5
Research Questions I • How General and Precise a Description Language for Virtual Grid Abstractions do We NEED? Or do Applications/PPS WANT? • VGrid is an SOA, All Can Be Used Electively – No Layering • Can We Meaningfully Support Use of the System and Modification at Multiple Levels of Abstraction? • What are the New Scheduling, Adaptation, Monitoring Capabilities and Opportunities of Virtual Grid? Can We Prove Properties Relative to Global Grid Views? • What is a Minimal and Expressive Set of Resource Management Services for Virtual Grid Abstractions? Pass Appropriate Information to Allow Lower Level Optimization; Higher Level Control
Research Questions II • Where Do Ideas Of Transparent Fault-tolerance Fit? Embedded, For Example, In Many Reliable Virtual Grid Abstractions? • Classification: What Are The Meaningful/Useful Resource Classes? • How Do We Support Both A Large Scale Of Resources And A Refined Classification? • Is This A Multi-classification? Is This Centralized Or Decentralized Or Both? • How Do These Affect The Interfaces To The Program Preparation System? • What Interfaces Might be Preserved? • Move Towards A Limited Set Of Descriptions, Conversion, Basic Resource Selection • SOA Based on Java or WS-Resource • Potentially A Separate Implementation Of Each VG Abstraction; Significant Sharing Expected • How Do They Affect The Functionality Needed In The Program Preparation System? • Incremental Adaptation? Checkpointing? Fault Tolerance?