Nomad: A Scalable Operating System for Clusters of Uni and Multiprocessors

Eduardo Pinheiro, Ricardo Bianchini
Rutgers University

Goals
• Scalability
  • No centralization.
  • No nodes dedicated to specific tasks.
• Ease of use
  • Single system image.
  • Backward compatible.
• Efficient and automatic management of all resources
  • CPU, memory, and I/O devices.
• Fault tolerance
  • Resistant to individual node crashes.
  • Redundant.

Overview
• Layers shown in the diagram: Applications, Nomad Daemon, Base Operating System.

Mechanisms
• Single system image
  • Process identifiers are unique across the cluster.
  • Signal delivery is independent of process location.
  • Process creation automatically picks the best node.
• Efficient resource utilization
  • Load balancing by migrating processes when a resource (CPU, memory, or I/O bandwidth) is exhausted.
  • Implicit co-scheduling across nodes.
  • Co-scheduling on multiprocessors.
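As a rough illustration of two of the bullets above (cluster-wide unique process identifiers and automatic node selection at process creation), the sketch below packs a node number and a local PID into one integer and picks the node whose most heavily used resource is least loaded. The slides do not say how Nomad does either, so the bit widths, the `NodeLoad` record, and the selection rule are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical layout: high bits hold the creating node, low bits the local PID.
NODE_BITS = 16
LOCAL_BITS = 16


def make_global_pid(node_id: int, local_pid: int) -> int:
    """Combine a node number and a node-local PID into one cluster-wide PID."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id out of range")
    if not 0 <= local_pid < (1 << LOCAL_BITS):
        raise ValueError("local_pid out of range")
    return (node_id << LOCAL_BITS) | local_pid


def split_global_pid(global_pid: int) -> tuple[int, int]:
    """Recover (node_id, local_pid) from a cluster-wide PID."""
    return global_pid >> LOCAL_BITS, global_pid & ((1 << LOCAL_BITS) - 1)


@dataclass
class NodeLoad:
    """Assumed load record: each field is a utilization fraction in [0, 1]."""
    node_id: int
    cpu: float
    memory: float
    io: float


def pick_node(loads: list[NodeLoad]) -> int:
    """Pick the node whose most heavily used resource is the least loaded.

    Ties are broken by the lower node_id so the choice is deterministic.
    """
    if not loads:
        raise ValueError("no candidate nodes")
    best = min(loads, key=lambda n: (max(n.cpu, n.memory, n.io), n.node_id))
    return best.node_id
```

Because processes can migrate, the node encoded in such an identifier would only record where the process was created. Location-independent signal delivery would still need a separate lookup of the current host, which this sketch does not cover.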

Mechanisms
• Scalability
  • No centralization. Nodes are autonomous.
  • High-throughput striped and randomized file system (software RAID).
  • Load information is piggybacked on file system messages, so no extra intra-cluster communication is needed to disseminate it.
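The piggybacking idea in the last bullet can be sketched as follows: every file system message carries the sender's current view of cluster load, and the receiver keeps whichever entry is newer for each node. The slides give no message format, so the `LoadEntry` and `FsMessage` types, the per-node version counter, and the merge rule are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LoadEntry:
    """Assumed load report: version is a counter incremented by the owning node."""
    node_id: int
    load: float
    version: int


@dataclass
class FsMessage:
    """A file system message with load entries attached (hypothetical format)."""
    payload: bytes
    piggyback: list = field(default_factory=list)


class Node:
    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.version = 0
        self.table: dict[int, LoadEntry] = {}  # this node's view of cluster load

    def update_own_load(self, load: float) -> None:
        """Record a fresh measurement of this node's own load."""
        self.version += 1
        self.table[self.node_id] = LoadEntry(self.node_id, load, self.version)

    def make_message(self, payload: bytes) -> FsMessage:
        """Attach everything this node knows to an outgoing file system message."""
        return FsMessage(payload, list(self.table.values()))

    def receive(self, msg: FsMessage) -> bytes:
        """Merge piggybacked entries, keeping the newest version per node."""
        for entry in msg.piggyback:
            known = self.table.get(entry.node_id)
            if known is None or entry.version > known.version:
                self.table[entry.node_id] = entry
        return msg.payload
```

In this sketch load information spreads transitively: a node that has never exchanged a message with node 1 can still learn node 1's load from a message sent by node 2. Stale entries are ignored because their version number is lower.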

Mechanisms
• Fault tolerance
  • Periodic checkpoints to stable storage.
  • If a fault occurs, applications can be restarted with minimal loss of work.
  • The redundant file system can operate with up to one faulty disk/node. Recovery happens online.
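The slides do not say how the file system achieves its redundancy. One common way to tolerate a single failed disk is XOR parity across a stripe, as in RAID 5, and the sketch below shows that technique purely as an illustration. The function names and the stripe layout (data blocks followed by one parity block) are assumptions, not Nomad's design.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same size")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_stripe(data_blocks: list[bytes]) -> list[bytes]:
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe: list) -> list[bytes]:
    """Rebuild a stripe in which at most one block (data or parity) is None."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("XOR parity tolerates only one missing block")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        # XOR of all surviving blocks equals the missing one.
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired
```

With this scheme, losing any one block of a stripe, whether data or parity, leaves enough information to rebuild it. Losing two blocks does not, which matches the "up to one faulty disk/node" limit stated above.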

Results

Results
• Simulated results for load balancing
Note: Overdemand time accounts for the combined CPU, memory, and I/O demands and is expressed in seconds.

Publications & Future Work
• Eduardo Pinheiro and Ricardo Bianchini, "Nomad: An Efficient Operating System for Clusters of Uni and Multiprocessors", in Proceedings of the 1st IEEE Computer Society International Workshop on Cluster Computing (IWCC'99), Melbourne, Australia, December 1999.
• Eduardo Pinheiro, "Nomad, a Scalable Operating System for Clusters of Uni and Multiprocessors", XIII Dissertation Thesis Contest (CTD2000), Curitiba, PR, Brazil, July 16-21. Best MSc thesis of 1999.
Future Work:
• Explore the use of user-level protocols (VIA) for communication between daemons.
• Explore the use of remote memory writes to manage cluster resources more aggressively.
