Internet-Based TSP Computation with Javelin++ Michael Neary & Peter Cappello Computer Science, UCSB. Introduction Goals. Service parallel applications that are: Large : too big for a cluster Coarse-grain : to hide communication latency Simplicity of use
Internet-Based TSP Computation with Javelin++ Michael Neary & Peter Cappello Computer Science, UCSB
Introduction Goals • Service parallel applications that are: • Large: too big for a cluster • Coarse-grain: to hide communication latency • Simplicity of use • Design focus: decomposition [composition] of computation. • Scalable high performance • despite large communication latency • Fault-tolerance • 1000s of hosts, each dynamically [dis]associates.
Introduction Some Applications • Search for extra-terrestrial life • Computer-generated animation • Computer modeling of drugs for: • Influenza • Cancer • Reducing chemotherapy’s side-effects • Financial modeling • Storing nuclear waste
Outline • Architecture • Model of Computation • API • Scalable Computation • Experimental Results • Conclusions & Future Work
Architecture Basic Components Clients Brokers Hosts
Architecture Broker Discovery [Animated diagram: a host (H), a set of brokers (B), and the Broker Naming System; one frame shows a PING (BID?) message.]
Architecture Network of Broker-Managed Host Trees • Each broker manages a tree of hosts • Brokers form a network • Client contacts broker • Client gets host trees
Scalable Computation Deterministic Work-Stealing Scheduler [Diagram: each HOST holds a task container with the operations addTask( task ), getTask( ), and stealTask( ).]
Scalable Computation Deterministic Work-Stealing Scheduler [Diagram: CLIENT at the root of a tree of HOSTS]
    Task getWork( ) {
        if ( my deque has a task )
            return task;
        else if ( any child has a task )
            return child’s task;
        else
            return parent.getWork( );
    }
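The getWork( ) pseudocode above can be sketched in plain Java as follows. This is a minimal illustration, not the Javelin++ implementation: the class name HostNode, the use of String as the task type, and the choice to steal from the opposite end of the deque are all assumptions made for the example. Children are tried in a fixed order, which is one way to read "deterministic".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical node in a host tree; names are illustrative, not Javelin++ classes.
class HostNode {
    private final Deque<String> deque = new ArrayDeque<>(); // local task container
    private final List<HostNode> children = new ArrayList<>();
    private HostNode parent; // null at the root of the tree

    void addChild(HostNode child) {
        child.parent = this;
        children.add(child);
    }

    synchronized void addTask(String task) { deque.addFirst(task); }

    // The owner takes work from the front of its own deque.
    synchronized String getTask() { return deque.pollFirst(); }

    // A thief takes work from the back (an assumption of this sketch).
    synchronized String stealTask() { return deque.pollLast(); }

    // Own deque first, then the children in fixed order, then the parent.
    String getWork() { return getWork(null); }

    private String getWork(HostNode requester) {
        String task = getTask();
        if (task != null) return task;
        for (HostNode child : children) {
            if (child == requester) continue; // the requester is known to be empty
            task = child.stealTask();
            if (task != null) return task;
        }
        return parent == null ? null : parent.getWork(this);
    }
}
```

A host with an empty deque and no busy children ends up asking its parent, which in turn looks at its own deque and at the requester's siblings, so idle hosts pull work across the tree without a central dispatcher.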
Models of Computation • Master-slave • AFAIK all proposed commercial applications • Branch-&-bound optimization • A generalization of master-slave.
Models of Computation Branch & Bound [Animated search tree; node costs as extracted: 0 7 2 3 8 6 10 4 3 8 7 12 10 9 10. Each step lists the nodes shown so far and the current bounds.] • Step 1: nodes 0; UPPER unset, LOWER = 0 • Step 2: nodes 0 2; UPPER unset, LOWER = 2 • Step 3: nodes 0 2 3; UPPER unset, LOWER = 3 • Step 4: nodes 0 2 3 4; UPPER = 4, LOWER = 4 • Step 5: nodes 0 2 3 4 3; UPPER = 3, LOWER = 3 • Step 6: nodes 0 2 3 6 4 3; UPPER = 3, LOWER = 6 • Step 7: nodes 0 7 2 3 6 4 3 (shown beside the full tree); UPPER = 3, LOWER = 7
Models of Computation Branch & Bound [Figure: explored search tree with nodes 0 7 2 3 6 4 3] • Tasks created dynamically • Upper bound is shared • To detect termination: scheduler detects tasks that have been: • Completed • Killed (“bounded”)
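One simple way to picture the termination rule above is a counter of outstanding tasks: every dynamically created task adds one, and every completed or killed task subtracts one. The sketch below is only an illustration under that assumption; TerminationTracker and its method names are hypothetical, and it assumes each task is reported exactly once and that a task registers its children before reporting its own completion.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical bookkeeping; not the Javelin++ scheduler's actual data structure.
class TerminationTracker {
    private final AtomicLong outstanding = new AtomicLong();

    void taskCreated()   { outstanding.incrementAndGet(); }

    void taskCompleted() { outstanding.decrementAndGet(); }

    // A killed ("bounded") task counts as finished for termination purposes.
    void taskKilled()    { outstanding.decrementAndGet(); }

    // True once every created task has been completed or killed.
    boolean isTerminated() { return outstanding.get() == 0; }
}
```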
API
    public class Host implements Runnable {
        . . .
        public void run() {
            while ( (node = jDM.getWork()) != null ) {
                if ( isAtomic() )
                    compute(); // search space; return result
                else {
                    child = node.branch(); // put children in child array
                    for ( int i = 0; i < node.numChildren; i++ )
                        if ( child[i].setLowerBound() < UpperBound )
                            jDM.addWork( child[i] );
                        // else child is killed implicitly
                }
            }
        }
API
        private void compute() {
            . . .
            boolean newBest = false;
            while ( (node = stack.pop()) != null ) {
                if ( node.isComplete() ) {
                    if ( node.getCost() < UpperBound ) {
                        newBest = true;
                        UpperBound = node.getCost();
                        jDM.propagateValue( UpperBound ); // share the improved bound
                        best = node; // remember the best complete node
                    }
                } else {
                    child = node.branch();
                    for ( int i = 0; i < node.numChildren; i++ )
                        if ( child[i].setLowerBound() < UpperBound )
                            stack.push( child[i] );
                        // else child is killed implicitly
                }
            }
            if ( newBest )
                jDM.returnResult( best );
        }
    }
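To make the shape of compute( ) concrete, here is a small, sequential, self-contained branch & bound for the TSP that uses the same stack-driven loop. It is a sketch under stated assumptions, not the Javelin++ code: the class TspBranchAndBound, the bitmask node representation, and the lower bound (the cost of the partial tour, which is valid only for non-negative distances) are all choices made for this example, and there is no jDM or distribution involved.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sequential branch & bound for a small TSP instance.
class TspBranchAndBound {
    // A partial tour: visited cities (bitmask), last city, cost so far, city count.
    private static final class Node {
        final int visited, last, cost, count;
        Node(int visited, int last, int cost, int count) {
            this.visited = visited;
            this.last = last;
            this.cost = cost;
            this.count = count;
        }
    }

    // Returns the cost of a cheapest tour that starts and ends at city 0.
    static int solve(int[][] dist) {
        int n = dist.length;
        int upperBound = Integer.MAX_VALUE; // cost of the best complete tour so far
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(new Node(1, 0, 0, 1)); // start at city 0
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.cost >= upperBound) continue; // bound improved since the push
            if (node.count == n) { // complete: close the tour
                int total = node.cost + dist[node.last][0];
                if (total < upperBound) upperBound = total;
                continue;
            }
            for (int next = 1; next < n; next++) { // branch on the next city
                if ((node.visited & (1 << next)) != 0) continue;
                int lower = node.cost + dist[node.last][next];
                if (lower < upperBound) {
                    stack.push(new Node(node.visited | (1 << next), next, lower,
                                        node.count + 1));
                }
                // else: the child is killed implicitly
            }
        }
        return upperBound;
    }
}
```

In a distributed setting, the local upperBound would be replaced by the shared bound and each improvement would be propagated to other hosts, as the API slides indicate with jDM.propagateValue.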
Scalable Computation Weak Shared Memory Model • Slow propagation of bound affects performance, not correctness. [Animation: the bound is propagated through the host tree step by step.]
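The claim that a late bound costs performance but not correctness can be illustrated with a local copy of the upper bound that only ever decreases. The sketch below is an assumption-laden illustration (the class LocalBound and its methods are hypothetical): each host merges incoming values with a minimum, so late or out-of-order updates are harmless, and a stale copy is merely larger than the best known cost, which means it prunes less rather than pruning wrongly.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical per-host copy of the shared upper bound.
class LocalBound {
    private final AtomicInteger upper = new AtomicInteger(Integer.MAX_VALUE);

    // Merge a propagated value, which may arrive late or out of order.
    // Returns true if the local copy improved.
    boolean offer(int candidate) {
        int previous = upper.getAndAccumulate(candidate, Math::min);
        return candidate < previous;
    }

    int get() { return upper.get(); }

    // Every stored value is the cost of some real solution, so a node whose
    // lower bound reaches it cannot lead to a strictly better solution.
    boolean shouldKill(int lowerBound) { return lowerBound >= upper.get(); }
}
```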
Scalable Computation Fault Tolerance via Eager Scheduling When: • All tasks have been assigned • Some results have not been reported • A host wants a new task Re-assign a task! • Eager scheduling tolerates faults & balances the load. • Computation completes if at least 1 host communicates with client.
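The re-assignment rule above can be sketched as a small scheduler. This is an illustration only: EagerScheduler, integer task ids, and the rotation order for re-assigned tasks are assumptions of the example, not details taken from Javelin++.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical eager scheduler over tasks numbered 0 .. taskCount-1.
class EagerScheduler {
    private final Deque<Integer> unassigned = new ArrayDeque<>();
    private final Set<Integer> pending = new LinkedHashSet<>(); // assigned, no result yet

    EagerScheduler(int taskCount) {
        for (int id = 0; id < taskCount; id++) unassigned.add(id);
    }

    // Returns a task id, or null once every result has been reported.
    synchronized Integer nextTask() {
        Integer id = unassigned.poll();
        if (id != null) {
            pending.add(id);
            return id;
        }
        if (pending.isEmpty()) return null;
        // All tasks assigned, some results missing: re-assign the oldest one.
        id = pending.iterator().next();
        pending.remove(id);
        pending.add(id); // move to the back so re-assignments rotate
        return id;
    }

    // Returns true for the first result of a task; duplicates are ignored.
    synchronized boolean reportResult(int id) {
        return pending.remove(id);
    }

    synchronized boolean isDone() {
        return unassigned.isEmpty() && pending.isEmpty();
    }
}
```

Because a slow or failed host's task is simply handed to whichever host asks next, the same mechanism covers both fault tolerance and load balancing, at the price of some duplicated work.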
Scalable Computation Fault Tolerance via Eager Scheduling [Figure: explored search tree with nodes 0 7 2 3 6 4 3] • Scheduler must know which: • Tasks have completed • Nodes have been killed • Performance balance • Centralized schedule info • Decentralized computation
Experimental Results Example of a “bad” graph [Figure: search tree; node costs as extracted: 0 7 2 3 8 6 10 4 3 8 7 12 10 9 10]
Conclusions • Javelin 2 relieves the designer/programmer of managing a set of [Inter-]networked processors that is: • Dynamic • Faulty • A wide set of applications is covered by: • Master-slave model • Branch & bound model • Weak shared memory performs well. • Use multicast (?) for: • Code distribution • Propagating values
Future Work • Improve support for long-lived computation: • Do not require that the client run continuously. • A dag model of computation • with limited weak shared memory.
Future Work Jini/JavaSpaces Technology “Continuously” disperse Tasks among brokers via a physics model [Diagram: a TaskManager (aka Broker) surrounded by hosts (H)]
Future Work Jini/JavaSpaces Technology • TaskManager uses persistent JavaSpace • Host management: trivial • Eager scheduling: simple • No single point of failure • Fat tree topology
Future Work Advanced Issues • Privacy of data & algorithm • Algorithms • New computational complexity model “Minimize” communication between machines • N-body problem, … • Accounting: Associate specific work with specific host • Correctness • Compensation (how to quantify?) • Create international open source organization • System infrastructure • Application codes
Models of Computation Branch & Bound [Figure: search tree; node costs as extracted: 0 7 2 3 8 6 10 4 3 8 7 12 10 9 10] UPPER = 3 LOWER = 0
Architecture Broker Name Service (BNS) [Diagram: BNS, BROKER, HOST] 1. Register with BNS 2. Get broker list 3. Ping brokers on list 4. Connect to selected broker
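Steps 2 to 4 can be sketched as a selection over the broker list. The slides do not state how a host chooses among the brokers that answer, so picking the lowest round-trip time is purely an assumption of this example, as are the class BrokerSelector, the use of strings as broker addresses, and the ping function passed in by the caller (which keeps the sketch free of real network calls).

```java
import java.util.List;
import java.util.Optional;
import java.util.OptionalLong;
import java.util.function.Function;

// Hypothetical helper: choose a broker from the list returned by the BNS.
class BrokerSelector {
    // ping returns a round-trip time in milliseconds, or empty if there is no answer.
    static Optional<String> select(List<String> brokers,
                                   Function<String, OptionalLong> ping) {
        String best = null;
        long bestRtt = Long.MAX_VALUE;
        for (String broker : brokers) {          // step 3: ping brokers on the list
            OptionalLong rtt = ping.apply(broker);
            if (rtt.isPresent() && rtt.getAsLong() < bestRtt) {
                bestRtt = rtt.getAsLong();
                best = broker;
            }
        }
        return Optional.ofNullable(best);        // step 4: connect to this broker
    }
}
```

An empty result means no broker answered, in which case a host could ask the BNS for a fresh list and try again.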