NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory http://www.cs.utk.edu/netsolve
Objectives • Harnessing vast computational resources on the network, both hardware and software • Convenient for the scientific computing community • Reducing installation and programming overhead • Masking the complexity of distributed computing
Computation-Sharing Models: Proxy Computing • Client supplies both data and code to the server • Computation on the server
Computation-Sharing Models: Code Shipping • Code is shipped from the server to the client; data stays on the client • Computation on the client
Computation-Sharing Models: Remote Computation • Data is sent from the client to the server; code stays on the server • Computation on the server
Design Issues • Platform independence to accommodate heterogeneity • User-friendliness • Extensibility • Load balancing • Fault tolerance
NetSolve Architecture [Figure: architecture diagram; surviving labels: “OS”, Resources]
NetSolve Client Interface • C, Fortran, Java, Matlab, and Mathematica
Blocking call (Matlab):
>> a = rand(100); b = rand(100,1);
>> x = netsolve('ax = b', a, b);
Non-blocking call (Matlab):
>> a = rand(100); b = rand(100,1);
>> request = netsolve_nb('send', 'ax = b', a, b);
>> x = netsolve_nb('probe', request);
Not ready
>> x = netsolve_nb('wait', request);
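The send / probe / wait pattern of the non-blocking interface can be illustrated with a small local sketch. This is not the NetSolve API: the functions `send`, `probe`, and `wait` below are hypothetical stand-ins built on a Python thread pool, and `solve_linear_system` is a toy 2x2 solver that only simulates a remote computation.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def solve_linear_system(a, b):
    """Toy stand-in for remote work: solve a 2x2 system a*x = b by Cramer's rule."""
    time.sleep(0.2)  # simulated network and compute delay (assumption)
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]

# Hypothetical local "server": a thread pool instead of a remote machine.
executor = ThreadPoolExecutor(max_workers=2)

def send(a, b):
    """Submit the request and return immediately with a handle."""
    return executor.submit(solve_linear_system, a, b)

def probe(request):
    """Report whether the result is ready, without blocking."""
    return request.done()

def wait(request):
    """Block until the result is available and return it."""
    return request.result()

a = [[2.0, 0.0], [0.0, 4.0]]
b = [2.0, 8.0]
request = send(a, b)
ready_right_after_send = probe(request)  # usually False while the task runs
x = wait(request)                        # blocks until the solver finishes
executor.shutdown()
print(ready_right_after_send, x)
```

Between `send` and `wait` the client is free to issue further requests, which is how a non-blocking interface exposes parallelism across servers.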
NetSolve Wrappers • Problem description file for extensibility
@PROBLEM ipars
@INCLUDE "ipars.h"
@LIB /home/user/lib/libipars.a
@DESCRIPTION Parallel Sub-Surface Flow Simulator
@INPUT 2
@OBJECT STRING CHAR model
@OBJECT FILE CHAR infile
• Compiled into wrappers around scientific libraries • XDR for platform-independent data transfer
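To show what platform-independent transfer means in practice, here is a minimal sketch of XDR-style encoding using only Python's `struct` module. It follows the XDR conventions of big-endian byte order, 4-byte alignment, and a 4-byte length prefix for variable-length data. The function names are hypothetical, and this is not the wrapper code that NetSolve generates.

```python
import struct

def xdr_encode_double_array(values):
    """Variable-length array of doubles: 4-byte big-endian count,
    then each element as an 8-byte big-endian IEEE 754 double."""
    return struct.pack(">I", len(values)) + struct.pack(f">{len(values)}d", *values)

def xdr_decode_double_array(payload):
    """Inverse of xdr_encode_double_array."""
    (count,) = struct.unpack_from(">I", payload, 0)
    return list(struct.unpack_from(f">{count}d", payload, 4))

def xdr_encode_string(text):
    """String: 4-byte big-endian length, the bytes, zero padding to a multiple of 4."""
    data = text.encode("ascii")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding

vector = [1.0, -2.5]
wire = xdr_encode_double_array(vector)
roundtrip = xdr_decode_double_array(wire)
name = xdr_encode_string("ipars")
print(len(wire), roundtrip, name)
```

Because the byte order and sizes are fixed by the format rather than by the host, a client and a server on different architectures decode the same values.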
NetSolve Load Balancing • Assigning a task to the “best” machine • Establishing a performance model: network delay, server properties, task properties • Measuring and monitoring dynamic system states • Load balancing at a finer granularity • Parallelism through the non-blocking interface • Task migration
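One way to picture a performance model built from network delay, server properties, and task properties is the sketch below. The formula (latency plus transfer time plus compute time on the currently available share of the CPU) and every field name are illustrative assumptions, not the model NetSolve actually uses.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_s: float              # network delay to the server (assumed known)
    bandwidth_bytes_per_s: float  # measured link bandwidth
    peak_flops: float             # static server property
    load: float                   # dynamic state: fraction of CPU in use, 0..1

def estimated_time(server, data_bytes, task_flops):
    """Predicted completion time: data transfer plus computation."""
    transfer = server.latency_s + data_bytes / server.bandwidth_bytes_per_s
    available = server.peak_flops * (1.0 - server.load)
    if available <= 0:
        return float("inf")  # a saturated server can never be "best"
    return transfer + task_flops / available

def pick_best(servers, data_bytes, task_flops):
    """Assign the task to the server with the smallest predicted time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, task_flops))

servers = [
    Server("fast-but-busy", 0.01, 1e6, 1e9, 0.9),
    Server("slow-but-idle", 0.05, 1e6, 4e8, 0.0),
]
best = pick_best(servers, data_bytes=1e6, task_flops=2e9)
print(best.name)
```

In this example the nominally slower machine wins because the faster one is heavily loaded, which is why the dynamic state has to be measured and monitored rather than assumed.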
NetSolve Fault Tolerance • Inter-server fault tolerance: fault tolerance among NetSolve servers • Intra-server fault tolerance: fault tolerance within a NetSolve server
NetSolve Fault Tolerance: Inter-server Fault Tolerance (performed by NetSolve agents) • Basic approach • Failure detection + task reallocation • Overload detection + task migration • Introducing NetSolve storage servers • Store checkpoints or any information related to fault tolerance (must be platform-independent) • No reliance on the failed or overloaded server for task migration
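The basic approach of failure detection followed by task reallocation can be sketched as a retry loop over a ranked list of servers. Everything here is hypothetical: `ServerFailure`, `run_with_reallocation`, and the two toy servers exist only for illustration and do not reflect how NetSolve agents are implemented.

```python
class ServerFailure(Exception):
    """Raised when a server is detected as failed (hypothetical signal)."""

def run_with_reallocation(task, servers, args):
    """Try servers in ranked order; on a detected failure, reallocate
    the task to the next server. Returns (server used, result, failed list)."""
    failed = []
    for name, execute in servers:
        try:
            return name, execute(task, args), failed
        except ServerFailure:
            failed.append(name)  # failure detected: move on to the next server
    raise RuntimeError(f"all servers failed: {failed}")

def flaky(task, args):
    # Simulates a server that fails before returning a result.
    raise ServerFailure("connection lost")

def healthy(task, args):
    # Simulates a server that completes the task.
    return task(*args)

ranked = [("server-a", flaky), ("server-b", healthy)]
used, result, failed = run_with_reallocation(lambda x, y: x + y, ranked, (2, 3))
print(used, result, failed)
```

This sketch restarts the task from scratch on the next server. Resuming from a checkpoint held by a separate storage server would avoid both the repeated work and any dependence on the failed machine.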
NetSolve Fault Tolerance: Intra-server Fault Tolerance • Not a new problem • Could be invisible to NetSolve • Can take advantage of platform-specific features for fault tolerance • Possible integration with inter-server fault tolerance
Diskless Checkpointing: Checksums and Reverse Computation • Diskless checkpointing eliminates the need for stable storage • N servers + a checkpointing server • At any point, consistent checkpoints taken at N servers (stored in memory) • A checksum of checkpoints stored at the checkpointing server • Rollback using reverse computation • State recovery using the checksum
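State recovery from a checksum can be illustrated with a bytewise XOR parity over the N in-memory checkpoints. The slide does not say which checksum is used, so XOR parity is an assumption here, as is the simplification that all checkpoints have equal length. Rollback by reverse computation is not covered by this sketch.

```python
from functools import reduce

def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))

def parity(checkpoints):
    """Checksum kept at the checkpointing server: XOR of all N checkpoints."""
    return reduce(xor_bytes, checkpoints)

def recover(surviving, checksum):
    """Rebuild the single lost checkpoint from the survivors and the checksum."""
    return reduce(xor_bytes, surviving, checksum)

# In-memory checkpoints of N = 3 servers (toy data of equal length).
checkpoints = [b"state-A0", b"state-B1", b"state-C2"]
checksum = parity(checkpoints)

lost = 1  # index of the server that fails
surviving = [c for i, c in enumerate(checkpoints) if i != lost]
recovered = recover(surviving, checksum)
print(recovered)
```

XOR parity tolerates the loss of one checkpoint at a time. Since the checkpoints and the checksum all live in memory, no stable storage is involved.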
Applications • MCell with NetSolve: large code, small data • Matlab with NetSolve: tradeoffs between parallelism and overhead • IPARS with NetSolve • ImageVision with NetSolve
Conclusion • An interesting infrastructure for sharing computational resources, both software and hardware • Convenience, performance, and reliability • Playground for fault tolerance, both general and specific