gluepy: A Framework for Flexible Programming in Complex Grid Environments
Ken Hironaka, Hideo Saito, Kei Takahashi, Kenjiro Taura (the University of Tokyo)
{kenny, h_saito, kay, tau}@logos.ic.i.u-tokyo.ac.jp
Package available from Home Page: www.logos.ic.i.u-tokyo.ac.jp/~kenny/gluepy

Overview
• Grid-enabled distributed object-oriented programming model
    • Distributed objects with implicit synchronization
    • Model that allows join/failure of nodes
    • Incorporates NAT/firewalled clusters via an overlay
• gluepy: “glue Python”
    • Distributed object library extension for Python
    • Implements our proposed programming model
• Real Grid applications on real Grid environments
    • Over 900 real nodes across 9 clusters
    • Heterogeneous network settings (including NAT, firewalls)

Related Works
• Grid-enabled Programming Models
    • Satin [Wrzesinska et al. 2006], Jojo [Nakada et al. 2004], Jojo2 [Aoki et al. 2006]
• Distributed Objects on the Grid
    • ProActive [Huet et al. 2004], Ibis RMI [van Nieuwpoort et al. 2005]
• Wide-area Connection Management
    • SmartSockets [Maassen et al. 2007], MC-MPI [Saito et al. 2007]

Programming Model
• Asynchronous RMIs (Remote Method Invocations) with Futures
    • Any invocation may be made asynchronous
    • An asynchronous invocation returns a future, a placeholder for the result
• Serialization Semantics (Synchronization)
    • At most 1 running thread per object
    • At any given time, at most 1 thread can execute an object's method: the owner thread
        • Eliminates race conditions
    • If a thread blocks while in the method's scope, other threads are permitted to execute methods on the object
        • Eliminates deadlocks for common usage
• Signals to Object
    • Any thread blocking in the object's context will unblock and return None
• Runtime Node Joins
    • Joining nodes need to obtain references to existing objects
    • A fully decentralized remote object lookup scheme
• Node failure (RMI failure) detection
    • RMI failures are returned as Exceptions

[Figure: an object with its owner thread and waiting threads. A blocking owner gives up ownership and a new owner thread takes over; threads that unblock re-contest ownership.]

Example Master-Worker Excerpt

    class Master:
        def __init__(self):
            self.nodes = []
            self.jobs = []

        def nodeJoin(self, node):
            self.nodes.append(node)
            # Signal the thread blocking in the master object
            self.signal()

        def run(self):
            assigned = {}
            while True:
                # Atomic section: async. RMI, doJob() to idle workers
                while len(self.nodes) > 0 and len(self.jobs) > 0:
                    node = self.nodes.pop()
                    job = self.jobs.pop()
                    f = node.doJob.future(job)
                    assigned[f] = (node, job)
                # Block and wait for some results
                readys = wait(assigned.keys())
                # None is returned when unblocked by a signal
                if readys is None:
                    continue
                # Atomic section: collect results
                for f in readys:
                    node, job = assigned.pop(f)
                    try:
                        print("done: %s" % f.get())
                        self.nodes.append(node)
                    except RemoteException:
                        # Exception raised on failure: requeue the job
                        self.jobs.append(job)

Automatic Overlay Construction on Grid
• Construction Scheme: steps for each peer
    • Obtain endpoint information of other peers
    • Attempt TCP connections to a selected few peers
• Firewall-Cluster Peers
    • Automatic SSH port forwarding
• Adaptive routing on overlay [Perkins et al. 1997]

[Figure: overlay construction. Labels: NAT, Global IP, SSH, Attempt connection, Firewall.]

Failure Detection on Overlay
• A communication path is maintained for each RMI
• Intermediate peers remember the next peer: Path Pointer
• On failure of a connection, an error is returned along the path

[Figure: failure detection. Labels: RMI handler, Path pointer, failure, return error.]

Evaluation Results

Experimental Environment
• Figure of clusters used for experiments
    • The numbers denote CPU core count
    • Total of 9 clusters
    • Over 900 CPU cores

[Figure: clusters and core counts: A (98), B (186), C (72), D (28), E (60), F (70), G (88), H (316), I (64). Labels: Global IPs, Private IPs, Firewall, All packets dropped.]

Overlay Connectivity Simulation
• 3 Cluster Combinations
    • Global Peers: 384, Private Peers: 218
    • Global Peers: 100, Private Peers: 218
    • Global Peers: 28, Private Peers: 218
• Achieves high probability of connectivity with ~20 connections per peer

[Plot: x-axis: connect attempts per peer.]

Master-Worker application with node joins/failures
• Task-distributing Master-Worker (10000 tasks)
    • New tasks are given to new workers via async. RMIs
    • Tasks given to failed workers are redistributed
        • By handling RMI failure exceptions
• The master adapts to new nodes immediately, and completes all tasks in the face of worker failures

Application
• Parallel Permutation Flowshop Solver
    • A combinatorial optimization problem
        • Given a sequence of n jobs that use m machines, find a permutation of jobs with the shortest makespan
    • Finds the optimal solution by parallel branch and bound
        • The master divides the search space into sub-tasks
        • Workers periodically exchange the latest bounds with the master

[Figure: Master and Worker. Labels: doJob(), exchange_bound().]

Efficiency
• Efficiency is defined from three quantities [formula image not recovered]:
    • number of cores
    • completion time
    • calculation time per core
• 90% efficiency with ~950 cores

Application
• “Troubleshooting” Search Engine
    • Ever stuck debugging, or troubleshooting?
    • Re-ranks Google query results, giving more weight to web-forum and solution pages
        • Natural language processing and machine learning
    • Parallel Computing Backend
        • On-line Web-page parsing/analysis
        • Real-time response for hundreds of ranked pages

[Figure: a query (“vmware kernel panic”) sent from the Search Engine to the backend, where parallel workers compute and the results are retrieved.]

Future Work
• Apply to a much wider range of applications
• Development of library package
    • A prototype package is available at the Home Page
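
The master-worker pattern in the excerpt (asynchronous calls that return futures, a blocking wait for some results, and redistribution of jobs whose call raised an exception) can be tried out locally without gluepy. The sketch below is only an analogy: it uses Python 3's standard concurrent.futures thread pool in place of remote workers, and the names WorkerFailure, do_job, and run_master are hypothetical, not part of gluepy.

```python
import concurrent.futures as cf


class WorkerFailure(Exception):
    """Hypothetical stand-in for a failed remote method invocation."""


def do_job(job, fail_once):
    # Simulated worker method: a job listed in fail_once fails on its first run.
    if job in fail_once:
        fail_once.discard(job)
        raise WorkerFailure("worker lost while running job %r" % (job,))
    return job * job


def run_master(jobs, num_workers=4, fail_once=()):
    pending = list(jobs)
    fail_once = set(fail_once)
    results = {}
    assigned = {}  # future -> job, like the excerpt's `assigned` table
    with cf.ThreadPoolExecutor(max_workers=num_workers) as pool:
        while pending or assigned:
            # Hand out jobs while there is idle capacity (asynchronous calls).
            while pending and len(assigned) < num_workers:
                job = pending.pop()
                assigned[pool.submit(do_job, job, fail_once)] = job
            # Block until at least one future is ready.
            done, _ = cf.wait(assigned, return_when=cf.FIRST_COMPLETED)
            for f in done:
                job = assigned.pop(f)
                try:
                    results[job] = f.result()
                except WorkerFailure:
                    # The call failed: put the job back for redistribution.
                    pending.append(job)
    return results
```

For instance, run_master(range(10), num_workers=3, fail_once={2, 5}) still returns a result for every job, because the two failed jobs are requeued and retried.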
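
The serialization semantics in the Programming Model section (one owner thread per object, ownership given up while the owner blocks, and a signal that unblocks waiting threads with None) resemble a classic monitor. The following is a minimal local sketch of that idea in Python 3 with threading.Condition; it is an assumption-laden analogy, not gluepy's implementation, and SerializedObject, JobBoard, and wait_for_signal are hypothetical names.

```python
import threading


class SerializedObject:
    """At most one owner thread at a time; a blocked owner gives up ownership."""

    def __init__(self):
        # Holding the condition's (re-entrant) lock models owning the object.
        self._cond = threading.Condition()
        self._signals = 0

    def wait_for_signal(self, timeout=None):
        # Call only while owning the object. The lock is released while
        # blocked, so other threads may run methods, and is re-acquired
        # (ownership re-contested) before this returns.
        seen = self._signals
        self._cond.wait_for(lambda: self._signals != seen, timeout)
        return None  # a signalled wait yields None, as on the poster

    def signal(self):
        with self._cond:
            self._signals += 1
            self._cond.notify_all()


class JobBoard(SerializedObject):
    def __init__(self):
        super().__init__()
        self.jobs = []

    def add_job(self, job):
        with self._cond:           # become the owner thread
            self.jobs.append(job)
            self.signal()          # wake any thread blocked in this object

    def take_job(self):
        with self._cond:           # become the owner thread
            while not self.jobs:
                self.wait_for_signal()  # gives up ownership while blocked
            return self.jobs.pop(0)
```

A consumer thread calling take_job() on an empty board blocks without holding ownership, so a producer can still call add_job() on the same object; that is the property the poster credits with avoiding deadlocks in common usage.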
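
The Efficiency panel lists its three inputs (number of cores, completion time, calculation time per core), but the formula itself is not present in the text. One common definition consistent with those inputs is total calculation time divided by cores times completion time. The sketch below assumes that definition; it may differ from the one the authors used, and parallel_efficiency is a hypothetical helper.

```python
def parallel_efficiency(calc_times, completion_time):
    """Assumed definition: sum of per-core calculation times / (cores * completion time)."""
    num_cores = len(calc_times)
    if num_cores == 0 or completion_time <= 0:
        raise ValueError("need at least one core and a positive completion time")
    return sum(calc_times) / (num_cores * completion_time)
```

Under this assumed definition, four cores that each compute for 9 seconds in a run that takes 10 seconds overall give an efficiency of 36 / 40 = 0.9.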