Cluster Computing on the Fly:

Cluster Computing on the Fly: Peer-to-Peer Scheduling of Idle Cycles in the Internet Virginia Lo, Daniel Zappala, Dayi Zhou, Shanyu Zhao, and Yuhong Liu Network Research Group University of Oregon

CCOF Motivation • A variety of users and their applications need additional computational resources • Many machines throughout the Internet lie idle for large periods of time • Many users are willing to donate cycles • How to provide cycles to the widest range of users? (beyond institutional barriers)

CCOF Scenario #1 • A chess hobbyist wants to test her chess program • She only has a PC at home • She joins the chess interest group cycle-sharing community and discovers hosts who will run her chess state space search algorithm for a few weeks

CCOF Scenario #2 • Experiments with network game due in a week to meet conference deadline • Planet Lab overloaded • Network Research Group machines overloaded • Requests for hosts go out to machines in the department, campus, colleagues at other universities, personal friends, and general donors

CCOF Goals and Assumptions • Cycle sharing in an open peer-to-peer environment • Application-specific scheduling • Long term fairness • Hosts retain local control, sandbox

Cycle Sharing Applications Four classes of applications that can benefit from harvesting idle cycles: • Infinite workpile • Workpile with deadlines • Tree-based search • Point-of-Presence (PoP)

Infinite workpile • Consume huge amounts of compute time • Master-slave model • “Embarrassingly parallel”: no communication among hosts • Ex: SETI@home, Stanford Folding, etc.

Workpile with deadlines • Similar to infinite workpile but more moderate • Must be completed by a deadline (days or weeks) • Some capable of increasingly refined results given extra time • Ex: simulations with a large parameter space, ray tracing, genetic algorithms

Tree-based Search • Tree of slave processes rooted in single master node • Dynamic growth as search space is expanded • Dynamic pruning as costly solutions are abandoned • Low amount of communication among slave processes to share lower bounds • Ex: distributed branch and bound, alpha-beta search, recursive backtracking

Point-of-presence • Minimal consumption of CPU cycles • Require placement of application code dispersed throughout the Internet to meet specific location, topological distribution, or resource requirements • Ex: security monitoring systems, traffic analysis systems, protocol testing, distributed games

CCOF Architecture

CCOF Architecture • Cycle sharing communities based on factors such as interest, geography, performance, trust, or generic willingness to share. • Span institutional boundaries without institutional negotiation • A host can belong to more than one community • May want to control community membership

CCOF Architecture • Application schedulers to discover hosts, negotiate access, export code, and collect and verify results. • Application-specific (tailored to needs of application) • Resource discovery • Monitors jobs for progress; checks jobs for correctness • Kills or migrates jobs as needed

CCOF Architecture (cont.) • Local schedulers enforce local policy • Run in background mode v. preempt when user returns • QoS through admission control and reservation policies • Local machine protected through sandbox • Tight control over communication

CCOF Architecture (cont.) • Coordinated scheduling • Across local schedulers, across application schedulers • Enforce long-term fairness • Enhance resource discovery through information exchange

CCOF Preliminary Work • Wave Scheduler • Resource discovery experiments • Quizzes for Correctness • Point-of-Presence Scheduler

Wave Scheduler • Well-suited for workpile with deadlines • Provides on-going access to dedicated cycles by following night timezones around the globe • Uses a CAN-based overlay to organize hosts by timezone

Wave Scheduler
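The idea of following night around the globe can be illustrated with a minimal sketch. It is not the CCOF implementation: the `Host` record, the 22:00 to 06:00 definition of night, and the "most night remaining" selection rule are all assumptions made for illustration, and the CAN-based overlay is not modeled.

```python
from dataclasses import dataclass

NIGHT_START = 22  # assumed local hour at which a host's night begins
NIGHT_END = 6     # assumed local hour at which a host's night ends


@dataclass(frozen=True)
class Host:
    """Hypothetical host record: a name and a whole-hour offset from UTC."""
    name: str
    utc_offset: int


def local_hour(host, utc_hour):
    """Local hour (0-23) at the host for a given UTC hour."""
    return (utc_hour + host.utc_offset) % 24


def is_night(host, utc_hour):
    """True if the host's local time falls in the assumed night window."""
    hour = local_hour(host, utc_hour)
    return hour >= NIGHT_START or hour < NIGHT_END


def night_hours_left(host, utc_hour):
    """Hours of night remaining at the host, or 0 if it is daytime there."""
    if not is_night(host, utc_hour):
        return 0
    return (NIGHT_END - local_hour(host, utc_hour)) % 24


def pick_host(hosts, utc_hour):
    """Pick the night-time host with the most night remaining, or None."""
    candidates = [h for h in hosts if is_night(h, utc_hour)]
    return max(candidates,
               key=lambda h: night_hours_left(h, utc_hour),
               default=None)
```

A scheduler built this way would call `pick_host` again whenever the current host's night runs out and migrate the job to the returned host, so the work moves westward with the night.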

Resource Discovery (Zhou and Lo, to appear, WGP2P’04 at CC-Grid ’04) • Highly dynamic environment (hosts come, go) • Hosts maintain profiles of blocks of idle time • Four basic search methods: • Rendezvous points • Host advertisements • Client expanding ring search • Client random walk search
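Of the four search methods, the client expanding ring search is the easiest to sketch. The version below is an illustration only, not the protocol evaluated by Zhou and Lo: the overlay is assumed to be an adjacency dictionary, `is_idle` is a hypothetical predicate standing in for a host's idle-time profile, and each ring is simulated with a breadth-first search rather than real messages.

```python
from collections import deque


def hosts_within(graph, start, ttl):
    """Return {node: hop distance} for every node within ttl hops of start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == ttl:
            continue  # do not forward the query beyond the current ring
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def expanding_ring_search(graph, client, is_idle, needed, max_ttl):
    """Widen the search one hop at a time until enough idle hosts are found.

    Returns (ttl, hosts) for the first ring that yields `needed` idle hosts,
    nearest hosts first, or None if max_ttl is reached without success.
    """
    for ttl in range(1, max_ttl + 1):
        reached = hosts_within(graph, client, ttl)
        found = sorted((n for n in reached if n != client and is_idle(n)),
                       key=lambda n: (reached[n], n))
        if len(found) >= needed:
            return ttl, found[:needed]
    return None
```

Starting with a small ring keeps message overhead low when idle hosts are nearby; the cost is repeated queries when they are not.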

Resource Discovery • Rendezvous points perform best: high job completion rate and low message overhead, but they favor large jobs under heavy workloads • ==> coordinated scheduling needed for long-term fairness

CCOF Verification • Goal: Verify correctness of returned results for workpile and workpile-with-deadlines applications • Quizzes = easily verifiable computations that are indistinguishable from the actual work • Standalone quizzes v. embedded quizzes • Quiz performance stored in reputation system • Quizzes v. replication
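A minimal sketch of embedded quizzes, under stated assumptions: work items are plain inputs to some function the host computes, the client already knows the correct answers for the quiz inputs, and reputation is a simple (passed, total) counter per host. All names here are hypothetical and do not come from CCOF.

```python
import random


def build_batch(work_items, quiz_items, rng):
    """Mix quiz inputs into the real work so the host cannot tell them apart.

    quiz_items maps each quiz input to its known correct answer.
    """
    batch = list(work_items) + list(quiz_items)
    rng.shuffle(batch)
    return batch


def check_batch(results, quiz_items):
    """Accept the host's results only if every embedded quiz is answered correctly.

    Returns (passed, accepted), where accepted holds the non-quiz results
    and is empty when any quiz answer is wrong or missing.
    """
    passed = all(results.get(x) == answer for x, answer in quiz_items.items())
    if not passed:
        return False, {}
    return True, {x: r for x, r in results.items() if x not in quiz_items}


def update_reputation(reputation, host, passed):
    """Record one quiz outcome for a host as a (passed, total) pair."""
    good, total = reputation.get(host, (0, 0))
    reputation[host] = (good + int(passed), total + 1)
```

Compared with replication, this checks a host using only a few extra tasks per batch instead of running every task more than once, but it catches a cheating host only when the cheat touches a quiz item.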

Point-of-Presence Scheduler • Scalable protocols for identifying selected hosts in the community overlay network such that each ordinary node is within k hops of at least C of the selected hosts • (C,k) dominating set problem • Useful for leader election, rendezvous point placement, monitor location, etc.

CCOF Dom(C,k) Protocol • Round 1: Each node says HI to its k-hop neighbors. <Each node knows the size of its own k-hop neighborhood.> • Round 2: Each node sends the size of its k-hop neighborhood to all its neighbors. <Each node knows the sizes of all its neighbors' k-hop neighborhoods.> • Round 3: If a node's neighborhood size is maximal among its neighbors, it declares itself a dominator and notifies all its neighbors. <Some nodes hear from some dominators, some don’t.> • Nodes not yet covered by C dominators repeat Rounds 1-3, excluding current dominators, until all nodes are covered.
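The three rounds can be simulated centrally as a rough sketch. Several details are assumptions not stated on the slide: "neighbors" is read as k-hop neighbors throughout, ties in neighborhood size are broken by node id, repeat rounds count only still-uncovered nodes while hop distances are still measured in the full overlay, and a node that can never reach C dominators ends up declaring itself a dominator.

```python
from collections import deque


def k_hop_neighbors(graph, source, k):
    """Set of nodes within k hops of source, excluding source itself."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    del dist[source]
    return set(dist)


def dom_c_k(graph, c, k):
    """Centralized simulation of the Dom(C,k) rounds; returns the dominators.

    graph maps each node id to a list of its direct overlay neighbors.
    Node ids must be mutually comparable (used for tie-breaking, an assumption).
    """
    nbrs = {v: k_hop_neighbors(graph, v, k) for v in graph}
    dominators = set()
    while True:
        # Nodes still taking part: not dominators, fewer than c dominators in range.
        active = {v for v in graph
                  if v not in dominators and len(nbrs[v] & dominators) < c}
        if not active:
            return dominators
        # Rounds 1-2: each active node learns its own and its neighbors' sizes.
        size = {v: len(nbrs[v] & active) for v in active}
        # Round 3: a node whose (size, id) beats all active neighbors wins.
        winners = {v for v in active
                   if all((size[v], v) > (size[u], u) for u in nbrs[v] & active)}
        dominators |= winners
```

The loop terminates because the active node with the largest (size, id) pair always wins its round, so every iteration adds at least one dominator.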

CCOF Research Issues • Incentives and fairness • What incentives are needed to encourage hosts to donate cycles? • How to keep track of resources consumed v. resources donated? • How to prevent resource hogs from taking an unfair share? • Resource discovery • How to discover hosts in a highly dynamic environment (hosts come and go, withdraw cycles, fail)? • How to discover hosts that can be trusted and that will provide the needed resources?

CCOF Research Issues • Verification, trust, and reputation • How to check returned results? • How to catch malicious or misbehaving hosts that change results with low frequency? • Which reputation system? • Application-based scheduling • How do trust and reputation influence scheduling? • How should a host decide from whom to accept work?

CCOF Research Issues • Quality of service and performance monitoring • How to provide local admission control? • How to evaluate and provide QoS: guaranteed versus predictive service? • Security • How to prevent attacks launched from guest code running on the host? • How to prevent denial-of-service attacks in which useless code occupies many hosts?

Related Work • Systems most closely resembling CCOF: SHARP (Fu, Chase, Chun, Schwab, Vahdat, 2003); Partage, Self-organizing Flock of Condors (Hu, Butt, Zhang, 2003); BOINC (Anderson, 2003; limited to donation of cycles to workpile) • Resource discovery: (Iamnitchi and Foster, 2002); Condor matchmaking • Load sharing within and across institutions: Condor, Condor Flocks, Grid computing • Incentives and fairness: see Berkeley Workshop on Economics of P2P Systems; OurGrid (Andrade, Cirne, Brasileiro, Roisenberg, 2003) • Trust and reputation: EigenRep (Kamvar, Schlosser, Garcia-Molina, 2003); TrustMe (Singh and Liu, 2003)
