Virtualization in MetaSystems Vaidy Sunderam Emory University, Atlanta, USA vss@emory.edu
Credits and Acknowledgements • Distributed Computing Laboratory, Emory University • Dawid Kurzyniec, Piotr Wendykier, David DeWolfs, Dirk Gorissen, Maciej Malawski, Vaidy Sunderam • Collaborators • Oak Ridge Labs (A. Geist, C. Engelmann, J. Kohl) • Univ. Tennessee (J. Dongarra, G. Fagg, E. Gabriel) • Sponsors • U. S. Department of Energy • National Science Foundation • Emory University
Virtualization • Fundamental and universal concept in CS, but receiving renewed, explicit recognition • Machine level • Single OS image: Virtuozzo, VServers, Zones • Full virtualization: VMware, VirtualPC, QEMU • Para-virtualization: UML, Xen (Ian Pratt et al., cl.cam.ac.uk) • “Consolidate under-utilized resources, avoid downtime, balance load, enforce security policy” • Parallel distributed computing • Software systems: PVM, MPICH, grid toolkits and systems • Same goals (consolidate under-utilized resources, avoid downtime, balance load, enforce security policy), plus aggregate resources
Virtualization in PVM • Historical perspective – PVM 1.0, 1989
Key PVM Abstractions • Programming model • Timeshared, multiprogrammed virtual machine • Two-level process space • Functional name + ordinal number • Flat, open, reliable messaging substrate • Heterogeneous messages and data representation • Multiprocessor emulation • Processor/process decoupling • Dynamic addition/deletion of processors • Raw nodes projected • Transparently • Or with exposure of heterogeneous attributes
Parallel Distributed Computing • Multiprocessor systems • Parallel distributed-memory computing • Stable and mainstream: SPMD, MPI • Issues relatively clear: performance • Platforms and applications correspondingly tightly coupled
Parallel Distributed Computing • Metacomputing and grids • Platforms • Parallelism • Possibly within components, but mostly loose concurrency or pipelining between components (PVM: 2-level model) • Grids: resource virtualization across multiple admin domains • Moved to an explicit focus on service orientation • “Wrap applications as services, compose applications into workflows”; deploy on service-oriented infrastructure • Motivation: service/resource coupling • Provider provides resource and service; virtualized access
Virtualization in PDC • What can/should be virtualized? • Raw resource • CPU: process/task instantiation => staging, security, etc. • Storage: e.g. a network file system over GMail • Data: value-added or processed • Service • Define interface and input-output behavior • Service provider must operate the service • Communication • Interaction paradigm with strong/adequate semantics • Key capability: • Configurable/reconfigurable resources, services, and communication
The Harness II Project • Theme • Virtualized abstractions for critical aspects of parallel distributed computing, implemented as pluggable modules (including programming systems) • Major project components • Fault-tolerant MPI: specification, libraries • Container/component infrastructure: C-kernel, H2O • Communication framework: RMIX • Programming systems: • FT-MPI + H2O, MOCCA (CCA + H2O), PVM
Harness II [Figure: layered architecture with cooperating users and applications (App 1, App 2) on top, programming models (PVM, FT-MPI, components, active objects) below them, a virtual layer of DVM-enabling components beneath, and resources from Providers A, B, and C at the bottom] • Aggregation for concurrent high-performance computing • Hosting layer • Collection of H2O kernels • Flexible/lightweight middleware • Equivalent to a Distributed Virtual Machine • But only on the client side • DVM pluglets responsible for • (Co-)allocation/brokering • Naming/discovery • Failures/migration/persistence • Programming environments: FT-MPI, CCA, paradigm frameworks, distributed numerical libraries
H2O Middleware Abstraction [Figure: providers and clients connected by a network] • Providers own resources • Independently make them available over the network • Clients discover, locate, and utilize resources • Resource sharing occurs between a single provider and a single client • Relationships may be tailored as appropriate • Including identity formats, resource allocation, compensation agreements • Clients can themselves be providers • Cascading pairwise relationships may be formed
H2O Framework [Figure: in the traditional model, the provider deploys components into its own container and clients look them up and use them; in the H2O model, the deploying party may be the provider, a client, or a reseller] • Resources provided as services • Service = active software component exposing functionality of the resource • May represent “added value” • Runs within a provider’s container (execution context) • May be deployed by any authorized party: provider, client, or third-party reseller (see the deployment sketch below) • Provider specifies policies • Authentication/authorization • Actors: kernel/pluglet • Decoupling • Providers/providers/clients
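As a rough illustration of the H2O model above, the sketch below shows a client (or reseller) deploying a pluglet into a provider's kernel. The KernelGateway and PlugletRef types, the connect() helper, the h2o:// endpoint, and the repository URL are all invented for this sketch; they are not the real H2O API, which is not reproduced here.

import java.net.URI;

// Illustrative client-side view of deploying a pluglet into a provider's kernel.
// KernelGateway and PlugletRef are stand-ins declared here for the sketch, not the H2O API.
interface PlugletRef { }

interface KernelGateway {
    void login(String credentials);                      // authenticate under the provider's policy
    PlugletRef deployPluglet(URI codebase, String cls);  // fetch code and instantiate a pluglet
}

public class DeploySketch {

    // Stand-in for the middleware: a local dummy kernel so the sketch runs end to end.
    static KernelGateway connect(URI kernelEndpoint) {
        System.out.println("connecting to " + kernelEndpoint);
        return new KernelGateway() {
            public void login(String credentials) {
                System.out.println("authenticated as " + credentials);
            }
            public PlugletRef deployPluglet(URI codebase, String cls) {
                System.out.println("would fetch " + codebase + " and instantiate " + cls);
                return new PlugletRef() { };
            }
        };
    }

    public static void main(String[] args) throws Exception {
        KernelGateway kernel = connect(new URI("h2o://provider.example.org/kernel")); // hypothetical endpoint
        kernel.login("reseller-credentials");            // the provider decides who may deploy
        PlugletRef svc = kernel.deployPluglet(
                new URI("http://repo.example.org/pluglets/stockquote.jar"),
                "org.example.StockQuotePluglet");        // code staged from a third-party repository
        System.out.println("deployed service reference: " + svc);
    }
}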
Example Usage Scenarios [Figure: registration and discovery through registries such as UDDI, JNDI, LDAP, DNS, GIS, or simply e-mail/phone; providers, resellers, developers, and clients publish, find, and deploy components A, B, C from repositories, including wrapped legacy applications and native code] • Resource = legacy application • Provider deploys the service • Provider stores information about the service in a registry • Client discovers the service • Client accesses the legacy application through the service • Resource = computational service • Reseller deploys a software component into the provider’s container • Reseller notifies the client about the offered computational service • Client utilizes the service • Resource = raw CPU power • Client gathers application components • Client deploys the components into providers’ containers • Client executes a distributed application utilizing the providers’ CPU power (a publish/find sketch follows below)
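As a concrete instance of the publish/find steps in these scenarios, the sketch below uses JNDI (one of the registry options listed) to bind a service endpoint under a well-known name and to look it up again. The LDAP provider URL, the naming conventions, and the bound endpoint string are assumptions made for the sketch.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

// Minimal publish/find sketch over JNDI; the registry backend and names are illustrative.
public class RegistrySketch {

    private static Context connect() throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        // Any JNDI service provider could back this; an LDAP server is assumed here.
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://registry.example.org:389/dc=example,dc=org");
        return new InitialContext(env);
    }

    // Provider (or reseller) side: publish where the deployed service can be reached.
    static void publish(String name, String endpoint) throws NamingException {
        Context ctx = connect();
        ctx.rebind(name, endpoint);   // overwrite any stale registration
        ctx.close();
    }

    // Client side: find the endpoint, then access the service over RMIX, SOAP, etc.
    static String find(String name) throws NamingException {
        Context ctx = connect();
        String endpoint = (String) ctx.lookup(name);
        ctx.close();
        return endpoint;
    }

    public static void main(String[] args) throws NamingException {
        publish("cn=StockQuote", "h2o://provider.example.org/kernel/StockQuote");
        System.out.println("found at: " + find("cn=StockQuote"));
    }
}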
Model and Implementation [Figure: clients invoke a pluglet hosted in an H2O kernel through its functional interfaces (e.g. StockQuote) and through the Pluglet and optional Suspendible lifecycle interfaces] • H2O nomenclature • container = kernel • component = pluglet • Object-oriented model, Java- and C-based implementations • Pluglet = remotely accessible object • Must implement the Pluglet interface, may implement the Suspendible interface • Used by the kernel to signal/trigger pluglet state changes • Model • Implement (or wrap) the service as a pluglet to be deployed on kernel(s) • Interfaces shown on the slide: interface StockQuote { double getStockQuote(); } • interface Pluglet { void init(ExecutionContext cxt); void start(); void stop(); void destroy(); } • interface Suspendible { void suspend(); void resume(); } (a sample pluglet combining these interfaces is sketched below)
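Putting the interfaces above together, a service pluglet might look like the class below: it exposes its functional interface (StockQuote) to clients and implements the Pluglet lifecycle so the kernel can manage it. The interfaces are re-declared from the slide so the sketch compiles on its own; ExecutionContext is left as an opaque placeholder, and the method bodies are invented for illustration.

// Interfaces re-declared from the slide so this sketch is self-contained.
interface StockQuote { double getStockQuote(); }
interface Pluglet { void init(ExecutionContext cxt); void start(); void stop(); void destroy(); }
interface Suspendible { void suspend(); void resume(); }
interface ExecutionContext { }   // opaque placeholder; the real type is supplied by the H2O kernel

// Illustrative pluglet combining the functional and lifecycle interfaces.
public class StockQuotePluglet implements Pluglet, Suspendible, StockQuote {

    private volatile boolean suspended = false;

    // Functional interface exposed to remote clients (e.g. over RMIX).
    public double getStockQuote() {
        if (suspended) {
            throw new IllegalStateException("service is suspended");
        }
        return 42.0 + Math.random();   // placeholder for a real data source
    }

    // Lifecycle methods: the kernel calls these to signal/trigger state changes.
    public void init(ExecutionContext cxt) { /* read deployment parameters from cxt */ }
    public void start()   { System.out.println("StockQuote service started"); }
    public void stop()    { System.out.println("StockQuote service stopped"); }
    public void destroy() { /* release any resources held by the pluglet */ }

    // Optional Suspendible interface.
    public void suspend() { suspended = true; }
    public void resume()  { suspended = false; }
}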
Accessing Virtualized Services • Request-response is ideally suited, but • Stateful service access must be supported • Efficiency issues, concurrent access • Asynchronous access needed for compute-intensive services • Semantics of cancellation and error handling • Many approaches focus on performance alone and ignore semantic issues • Solution • Enhanced procedure call/method invocation • A well-understood paradigm, extended to be more appropriate for accessing metacomputing services (see the asynchronous-call sketch below)
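To illustrate the "enhanced method invocation" idea, the sketch below wraps a slow service call in a future so the client can keep working while the call is in flight and collect the result (or the exception) later. It uses plain java.util.concurrent rather than the RMIX API; the StockQuote interface is re-declared from the pluglet slide, and the slow in-process implementation stands in for a remote proxy.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Generic illustration of asynchronous access to a compute-intensive service.
interface StockQuote { double getStockQuote(); }   // re-declared from the pluglet slide

public class AsyncAccessSketch {

    public static void main(String[] args) throws Exception {
        StockQuote service = () -> {               // stand-in for a remote service proxy
            long end = System.currentTimeMillis() + 2000;
            while (System.currentTimeMillis() < end) { /* simulate a long-running call */ }
            return 101.5;
        };

        ExecutorService pool = Executors.newSingleThreadExecutor();
        Callable<Double> call = service::getStockQuote;
        Future<Double> pending = pool.submit(call);

        // The client overlaps its own work with the outstanding call...
        System.out.println("working locally while the call is in flight...");

        // ...and later collects the result, or the exception raised by the service.
        System.out.println("quote = " + pending.get());
        pool.shutdown();
    }
}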
The RMIX Layer [Figure: H2O kernels sit on RMIX, which sits on the networking layer and serves Java clients, Web Services/SOAP clients, and RPC clients over RPC, IIOP, JRMP, SOAP, …] • H2O is built on top of the RMIX communication substrate • Provides a flexible p2p communication layer for H2O applications • Enables various message-layer protocols within a single, provider-based framework library • Adopting common RMI semantics • Enables high performance and interoperability • Easy porting between protocols, dynamic protocol negotiation • Offers a flexible communication model, but retains RMI simplicity • Extended with asynchronous and one-way calls • Issues: consistency, ordering, exceptions, cancellation
RMIX Overview [Figure: service access through RMIX via pluggable providers (RMIX JRMPX, RMIX XSOAP, RMIX RPCX, RMIX Myri/GM) serving Java, Web Services/SOAP, and ONC-RPC clients] • Extensible RMI framework • Client and provider APIs • Uniform access to communication capabilities • Supplied by pluggable provider implementations • Multiple protocols supported • JRMPX, ONC-RPC, SOAP • Configurable and flexible • Protocol switching (see the sketch below) • Asynchronous invocation
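To make "protocol switching" concrete, the sketch below exports one service object over two different protocol providers selected by name, so the service code itself never changes. ProtocolProvider and ExportManager are simplified stand-ins declared here for illustration; they are not the real RMIX client/provider APIs.

import java.util.HashMap;
import java.util.Map;

// Conceptual protocol-switching sketch: one service object, several wire protocols.
interface ProtocolProvider {
    String export(Object service, int port);   // returns an endpoint address for this protocol
}

class ExportManager {
    private final Map<String, ProtocolProvider> providers = new HashMap<>();

    void register(String protocol, ProtocolProvider provider) {
        providers.put(protocol, provider);      // pluggable provider implementations
    }

    // The caller selects the protocol by name; the service itself is unchanged.
    String export(String protocol, Object service, int port) {
        ProtocolProvider p = providers.get(protocol);
        if (p == null) throw new IllegalArgumentException("no provider for " + protocol);
        return p.export(service, port);
    }
}

public class ProtocolSwitchSketch {
    public static void main(String[] args) {
        ExportManager mgr = new ExportManager();
        mgr.register("jrmpx", (svc, port) -> "jrmpx://host:" + port + "/" + svc.getClass().getSimpleName());
        mgr.register("xsoap", (svc, port) -> "http://host:" + port + "/soap/" + svc.getClass().getSimpleName());

        Object stockQuoteService = new Object();   // stand-in for a real pluglet/service object
        // Same object, two endpoints; clients use whichever protocol suits them.
        System.out.println(mgr.export("jrmpx", stockQuoteService, 9001));
        System.out.println(mgr.export("xsoap", stockQuoteService, 8080));
    }
}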
RMIX Abstractions [Figure: H2O pluglets, clients, and servers communicating across the Internet and through firewalls, choosing protocol stacks for security or efficiency as each link requires] • Uniform interface and API • Protocol switching • Protocol negotiation • Various protocol stacks for different situations • SOAP: interoperability • SSL: security • ARPC, custom (Myrinet, Quadrics): efficiency • Asynchronous access to virtualized remote resources
Asynchronous RMIX [Figure: sequence diagram of an asynchronous call from the client stub to the server target, through call initiation, parameter marshalling/unmarshalling, method execution, and result marshalling/delivery, with "started" and "completed" events; cancellation is possible at each stage: disregard at the client side, interrupt client I/O, interrupt server I/O, interrupt the server thread, ignore the result, or reset server state] • Parameter marshalling • Data consistency • Also an issue in PVM, MPI, etc. • Exceptions/cancellation • Critical for stateful servers • Conservative vs. best-effort cancellation (see the sketch below) • Other issues • Execution order • Security • Virtualizing communications • Performance/familiarity vs. semantic issues
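The conservative vs. best-effort distinction can be shown with plain futures standing in for RMIX async call handles: cancelling without interruption corresponds to "disregard at client side" in the diagram (the call keeps running and its result is ignored), while cancelling with interruption corresponds to "interrupt server thread" and may leave a stateful server needing to reset its state. This uses java.util.concurrent only; it is not the RMIX cancellation API.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustration of conservative vs. best-effort cancellation of asynchronous calls.
public class CancellationSketch {

    // Stand-in for the remote method; it may modify server state while running.
    static String remoteCall() throws InterruptedException {
        TimeUnit.SECONDS.sleep(5);
        return "result";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<String> first  = pool.submit(CancellationSketch::remoteCall);
        Future<String> second = pool.submit(CancellationSketch::remoteCall);

        // Conservative ("disregard at client side"): stop waiting and ignore the result,
        // but do not interrupt the running call, so server-side state stays consistent.
        System.out.println("cancelled without interrupt: " + first.cancel(false));

        // Best effort ("interrupt server thread"): also interrupt the executing call.
        // Frees resources sooner, but a stateful server may then have to reset its state.
        System.out.println("cancelled with interrupt:    " + second.cancel(true));

        pool.shutdownNow();
    }
}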
Programming Models: CCA and H2O • Common Component Architecture • Component standard for HPC • Uses and provides ports described in SIDL • Support for scientific data types • Existing tightly coupled (CCAFFEINE) and loosely coupled, distributed (XCAT) frameworks • H2O • Well matched to CCA model
MOCCA implementation in H2O • Each component runs in a separate pluglet • Thanks to H2O kernel security mechanisms, multiple components may run without interfering • Two-level builder hierarchy • ComponentID: pluglet URI • MOCCA_Light: pure Java implementation (no SIDL) • (A CCA-style uses/provides sketch follows below)
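In CCA terms, a MOCCA component provides ports (interfaces it implements) and uses ports (interfaces it calls). The sketch below mimics that wiring with locally declared stand-ins for Port, Services, and Component rather than the real gov.cca interfaces, purely to show the shape of a component that would run inside its own pluglet.

// Simplified uses/provides sketch in the spirit of CCA/MOCCA; the Port, Services, and
// Component declarations below are stand-ins, not the real gov.cca interfaces.
interface Port { }

interface Services {
    void addProvidesPort(Port port, String name, String type);  // advertise what we offer
    Port getPort(String usesPortName);                           // obtain what we depend on
}

interface Component {
    void setServices(Services services);   // the framework hands the component its Services object
}

// A "provides" port type that other components can call.
interface IntegratorPort extends Port {
    double integrate(double lo, double hi);
}

// The component itself; in MOCCA it would be instantiated inside its own pluglet.
public class MidpointIntegrator implements Component, IntegratorPort {

    private Services services;

    public void setServices(Services services) {
        this.services = services;
        // Register the provides port under a well-known name and type.
        services.addProvidesPort(this, "IntegratorPort", "IntegratorPort");
        // A fuller component would also register its uses ports here and fetch
        // them later with services.getPort(...).
    }

    public double integrate(double lo, double hi) {
        double mid = 0.5 * (lo + hi);       // one-point midpoint rule as a placeholder
        return (hi - lo) * (mid * mid);     // integrates f(x) = x^2 as an example
    }
}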
Performance: Small Data Packets Factors: • SOAP header overhead in XCAT • Connection pools in RMIX
Large Data Packets • Encoding (binary vs. base64) • CPU saturation on Gigabit LAN (serialization) • Variance caused by Java garbage collection
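The binary vs. base64 factor is easy to quantify: base64, which SOAP-based transports use for binary payloads, inflates data by roughly a third before it reaches the wire, on top of the serialization cost. A quick sketch using the JDK encoder:

import java.util.Base64;

// Size comparison of raw binary vs. base64 text encoding for a 1 MiB payload.
public class EncodingOverhead {
    public static void main(String[] args) {
        byte[] payload = new byte[1 << 20];                     // 1 MiB of raw data
        String base64 = Base64.getEncoder().encodeToString(payload);

        System.out.println("binary bytes: " + payload.length);
        System.out.println("base64 chars: " + base64.length());
        System.out.printf("expansion:    %.2fx%n", (double) base64.length() / payload.length);
    }
}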
Use Case 2: H2O + FT-MPI • Overall scheme: • H2O framework installed on computational nodes or cluster front-ends • Pluglet for startup, event notification, node discovery • FT-MPI native communication (also MPICH) • Major value added • FT-MPI need not be pre-installed on the computing nodes • It is staged just-in-time before program execution • Likewise, application binaries and data need not be present on the computing nodes • The system must be able to stage them in a secure manner
Staging FT-MPI runtime with H2O • FT-MPI runtime library and daemons • Staged from a repository (e.g. a Web server) to the computational node upon the user's request • Automatic platform type detection; the appropriate binary files are downloaded from the repository as needed (see the sketch below) • Allows users to run fault-tolerant MPI programs on machines where FT-MPI is not pre-installed • No login account is needed to do so: H2O credentials are used instead
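The platform-detection-and-download step might look roughly like the sketch below, which picks a binary matching the local OS and architecture and stages it into a temporary directory. The repository URL, its os-arch directory layout, and the startup_d file name are assumptions made for the sketch; the real startup pluglet would also verify the download and run it under the user's H2O credentials.

import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of staging a platform-specific FT-MPI daemon binary from a repository.
public class StagingSketch {

    public static void main(String[] args) throws Exception {
        // Detect the local platform so the matching binary can be selected.
        String os   = System.getProperty("os.name").toLowerCase().replaceAll("\\s+", "");
        String arch = System.getProperty("os.arch").toLowerCase();

        // e.g. http://repo.example.org/ftmpi/linux-amd64/startup_d  (hypothetical layout)
        URI binary = URI.create("http://repo.example.org/ftmpi/" + os + "-" + arch + "/startup_d");

        Path target = Files.createTempDirectory("ftmpi-stage").resolve("startup_d");
        try (InputStream in = binary.toURL().openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        target.toFile().setExecutable(true);

        System.out.println("staged " + binary + " to " + target);
        // The startup pluglet would now launch the daemon on the user's behalf.
    }
}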
Launching FT-MPI applications with H2O • Staging applications from a network repository • Uses URL code base to refer to a remotely stored application • Platform-specific binary transparently uploaded to a computational node upon client request • Separation of roles • Application developer bundles the application and puts it into a repository • The end-user launches the application, unaware of heterogeneity
Interconnecting heterogeneous clusters [Figure: two clusters, each with application processes and startup_d daemons managed by a startup pluglet, and an H2O proxy on each front-end; intra-cluster communication stays inside the cluster, while inter-cluster communication is routed through the proxies] • Private, non-routable networks • Communication proxies on cluster front-ends route data streams • Local (intra-cluster) channels are not affected • Nodes use virtual addresses at the IP level, resolved by the proxy (a minimal relay sketch follows below)
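At its core, the front-end proxy relays byte streams between an external connection and a node on the private network. The sketch below shows that relay idea with plain sockets and a single fixed inside address; the real H2O proxy additionally resolves virtual IP addresses to cluster-private ones and multiplexes many concurrent channels. The port numbers and the 10.0.0.12 address are illustrative.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal stream-relay sketch for a cluster front-end proxy.
public class RelaySketch {

    public static void main(String[] args) throws Exception {
        try (ServerSocket listener = new ServerSocket(9000)) {     // public-facing port
            while (true) {
                Socket outside = listener.accept();                // inter-cluster connection
                Socket inside  = new Socket("10.0.0.12", 9000);    // private cluster node (assumed)
                pump(outside, inside);                             // outside -> inside
                pump(inside, outside);                             // inside  -> outside
            }
        }
    }

    // Copy one direction of the byte stream on its own thread.
    private static void pump(Socket from, Socket to) {
        new Thread(() -> {
            try (InputStream in = from.getInputStream(); OutputStream out = to.getOutputStream()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                    out.flush();
                }
            } catch (Exception closed) {
                // connection torn down; a real proxy would close both sides cleanly
            }
        }).start();
    }
}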
Initial experimental results • Proxied connection versus direct connection • The standard FT-MPI throughput benchmark was used • Within a Gigabit Ethernet cluster, proxies retain 65% of the direct-connection throughput
Summary • Virtualization in PDC • Devising appropriate abstractions • Balancing pragmatics and performance against model cleanliness • The Harness II Project • H2O kernel • Reconfigurability by clients/third-party resellers is very valuable • RMIX communications framework • High-level abstractions for control communications (native data communications) • Multiple programming model overlays • CCA, FT-MPI, PVM • Concurrent computing environments on demand