Maximizing Goodput via Co-scheduling of CPU and Network Capacity
Miron Livny
Computer Sciences Department, University of Wisconsin-Madison
miron@cs.wisc.edu
(joint work with Jim Basney)
Allocated CPU hours per user (6/21/98 - 9/3/98)
400,000 CPU hours in 73 days on 320 desktop machines of the UW-CS Condor pool (~17 hours per day per machine)
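The per-machine figure follows from the totals on the slide. A quick check, with variable names chosen here for illustration:

```python
# Figures taken from the slide.
cpu_hours = 400_000
days = 73
machines = 320

# Average allocated hours per machine per day.
hours_per_machine_day = cpu_hours / (days * machines)
print(round(hours_per_machine_day, 1))  # prints 17.1
```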
Remote Execution Challenge
[Figure labels: Remote Resource (Memory, CPU, File System); Network; Customer File System* (Executable, Checkpoint, Input Files, Output Files)]
*May be distributed.
How useful is the allocated time?
[Figure: allocation timeline; labels: Allocate, Preempt, X, Placement, Periodic Ckpt, Periodic Ckpt, Preempt Ckpt, Remote I/O, Wait and See]
Goodput = Allocation - Overhead
Goodput is the allocation time during which the application makes forward progress.
Overhead = Placement + Migration + Periodic Checkpoints + Remote I/O + Wait and See
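The accounting above can be sketched in a few lines. This is a minimal illustration, not Condor code: the class, the field names, the clamp at zero, and the sample numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Overhead:
    """Overhead components named on the slide, in seconds."""
    placement: float = 0.0
    migration: float = 0.0
    periodic_checkpoints: float = 0.0
    remote_io: float = 0.0
    wait_and_see: float = 0.0

    def total(self) -> float:
        return (self.placement + self.migration + self.periodic_checkpoints
                + self.remote_io + self.wait_and_see)


def goodput(allocation: float, overhead: Overhead) -> float:
    """Goodput = Allocation - Overhead (clamped at zero, an assumption)."""
    return max(0.0, allocation - overhead.total())


# Hypothetical one-hour allocation.
example = Overhead(placement=120, migration=90, periodic_checkpoints=60,
                   remote_io=200, wait_and_see=30)
print(goodput(3600.0, example))  # prints 3100.0
```

Dividing the result by the allocation gives the fraction of allocated time that was useful.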
Placement
• What: Transfer executable and checkpoint data
• How much: Known in advance
  • Executable: usually small
  • Checkpoint: application memory image
    • Can be large! (100MB+)
    • May include cached input data and intermediate file data
• When: Triggered by Resource Manager when CPU is allocated
Migration
• What: Transfer checkpoint data to file system or a hot standby
• How much: Known in advance
• Workstation owner may limit time to migrate
• Failure results in lost work
• When: Initiated by workstation owner or triggered by Resource Manager to enforce priority order
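Since the checkpoint size is known in advance and the owner may cap the migration time, a migration can be judged feasible before it starts. A rough sketch, assuming a constant available bandwidth (the function name and all numbers are hypothetical):

```python
def migration_fits(checkpoint_mb: float, bandwidth_mb_per_s: float,
                   time_limit_s: float) -> bool:
    """True if the checkpoint can be transferred before the owner's time limit."""
    if bandwidth_mb_per_s <= 0:
        return False
    return checkpoint_mb / bandwidth_mb_per_s <= time_limit_s


# Hypothetical case: 100 MB checkpoint, 60 s limit.
print(migration_fits(100, 1.0, 60))  # prints False
print(migration_fits(100, 2.0, 60))  # prints True
```

When the check fails, the work done since the last successful checkpoint is at risk, which is the motivation for the periodic checkpoints described next.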
Remote I/O
• What: Application input/output data
  • Read input files.
  • Write intermediate results.
  • Read intermediate results.
  • Write final results.
• How much: Application may know/tell.
• When: Initiated by application read and write system calls during run.
Periodic Checkpoint
• What: Transfer checkpoint data to file system.
• How much: Known in advance.
• When: Scheduled in advance by shadow.
  • Reduce risk in case of a failed migration.
  • No deadline.
  • All remote resources are available.
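Because periodic checkpoints have no deadline, their start times can be planned ahead. One simple way to do that, assuming several jobs share a checkpoint server and use the same checkpoint interval, is to spread the start times evenly. The function and job names below are hypothetical and this is not how the shadow is known to schedule:

```python
def stagger(jobs, interval_s):
    """Spread checkpoint start offsets evenly across one checkpoint interval."""
    if not jobs:
        return {}
    slot = interval_s / len(jobs)
    return {job: round(i * slot, 1) for i, job in enumerate(jobs)}


print(stagger(["a", "b", "c", "d"], 3600))
# prints {'a': 0.0, 'b': 900.0, 'c': 1800.0, 'd': 2700.0}
```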
Wait and See
• What: Suspend application when resource is revoked
  • Wait and see if resource will become available shortly.
  • Shortens migration time limit.
  • Consumes local resources.
• When: Initiated by owner activity
• How long: Upper bound set by resource owner.
High Throughput Computing Layers
[Figure labels: Application, Application Agent, Customer Agent, Environment Agent, Owner Agent, Local Resource Management, Resource]
Who Does What in the Condor Environment?
• Matchmaker
  • Initiates allocations.
  • Preempts (re-matches) to transfer allocation to higher priority customer.
• Checkpoint Server(s)
  • Store checkpoints (may include data files).
• File system (Unix, NFS, AFS)
  • Stores files.
Who Does What?
• Shadow: Application Resource Manager
  • Application-level scheduling
  • Acts as a proxy for the application in the submit environment.
• Owner Agent: Controls opportunistic resource
  • Owner may preempt application at any time.
  • Owner controls preemption policy.
Approaches for Maximizing Goodput
• Co-matching (scheduling) of network, server, and CPU resources (matchmaker).
• Support high priority data transfers to/from checkpoint servers (checkpoint server).
• Localized checkpointing (shadow).
… Approaches
• Plan in advance for pre-scheduled events (external scheduler).
• Reduce size of data to be transferred (checkpoint server and remote resource).
• Monitor system goodput (all).
Challenges
• Develop an effective model of the network and I/O capabilities of a Condor pool.
• Obtain the information needed to build such a model.
• Add co-matching of ClassAds to the matchmaking framework.
• Develop a multi-resource consumption based priority scheme.
Matchmaker Co-matching
• Problem: Bursty matchmaking causes network or server saturation
  • increases placement and checkpoint costs
  • slow placement results in underutilized CPUs
  • results in failed migrations
• Approach: Don’t allow new matches to exceed predefined usage thresholds
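The threshold approach can be illustrated as an admission check run once per matchmaking cycle. This is a sketch under stated assumptions, not the Condor matchmaker: the `Request` type, the per-cycle network budget in MB, and the sample burst are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    network_mb: float  # data to move at placement (executable + checkpoint)


def admit(requests, capacity_mb):
    """Accept requests in order while the network budget for this cycle holds."""
    accepted, deferred, used = [], [], 0.0
    for req in requests:
        if used + req.network_mb <= capacity_mb:
            accepted.append(req.name)
            used += req.network_mb
        else:
            deferred.append(req.name)  # retried in a later cycle
    return accepted, deferred


burst = [Request("a", 120), Request("b", 40), Request("c", 80), Request("d", 10)]
print(admit(burst, capacity_mb=200))  # prints (['a', 'b', 'd'], ['c'])
```

Deferring "c" keeps the placement transfers of the other three within the budget instead of letting all four compete for the same link.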
… Matchmaker Co-matching
• Application requests an allocation which provides the best possible goodput
  • large data and checkpoint files require high bandwidth to checkpoint server.
  • balance cost of application placement and checkpoint overheads with (estimated) allocation time.
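To make the trade-off concrete, the sketch below ranks two hypothetical offers by estimated goodput. It assumes one placement transfer and one final checkpoint of the same size, which is a simplification introduced here; the offer names, allocation estimates, and bandwidths are invented for the example.

```python
def expected_goodput(alloc_s, data_mb, bandwidth_mb_per_s):
    """Estimated useful seconds: allocation minus placement and one final checkpoint."""
    transfer_s = data_mb / bandwidth_mb_per_s
    return max(0.0, alloc_s - 2 * transfer_s)


# Hypothetical offers: (name, estimated allocation in s, bandwidth to checkpoint server in MB/s)
offers = [("near", 1800.0, 10.0), ("far", 3600.0, 0.2)]
best = max(offers, key=lambda o: expected_goodput(o[1], 200.0, o[2]))
print(best[0])  # prints near
```

The longer allocation loses here because moving a 200 MB checkpoint over the slow link consumes more than half of it.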
… Matchmaker Co-matching
• Best Fit vs. First Fit
  • Match lower priority requests with smaller network requirements first to increase cluster CPU utilization.
  • Preempt one of these requests when you match a high priority request with a large network requirement.
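One reading of this slide is that strict priority order leaves CPUs idle when the head request needs more network capacity than is free, while matching smaller lower-priority requests keeps them busy. The sketch contrasts those two policies only; the function names and numbers are hypothetical, and preemption is not modeled.

```python
def strict_priority(queue, free_mb):
    """Match in priority order and stop at the first request that does not fit."""
    matched = []
    for name, need_mb in queue:
        if need_mb > free_mb:
            break
        matched.append(name)
        free_mb -= need_mb
    return matched


def fit_smaller_first(queue, free_mb):
    """Skip requests that do not fit and keep matching smaller ones behind them."""
    matched = []
    for name, need_mb in queue:
        if need_mb <= free_mb:
            matched.append(name)
            free_mb -= need_mb
    return matched


# Queue sorted by priority, highest first; needs are hypothetical MB of network budget.
queue = [("high", 300), ("mid", 50), ("low", 30)]
print(strict_priority(queue, 100))    # prints []
print(fit_smaller_first(queue, 100))  # prints ['mid', 'low']
```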
Checkpoint Server Support
• Prioritize data streams
  • high priority: migration streams
  • low priority: checkpoint read and periodic checkpoint write streams
• Schedule periodic checkpoints in advance to avoid bursts of network traffic.
• Schedule graceful shutdowns in advance to avoid vacate failures.
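Stream prioritization can be sketched with a priority queue that serves migration transfers before the other two classes. The class names follow the slide; the numeric priority values, the `TransferQueue` class, and the job names are assumptions for illustration.

```python
import heapq
import itertools

# Lower number is served first. The numeric values are assumptions.
PRIORITY = {"migration": 0, "checkpoint_read": 1, "periodic_checkpoint": 1}


class TransferQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, kind, job):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), kind, job))

    def next_transfer(self):
        _, _, kind, job = heapq.heappop(self._heap)
        return kind, job


q = TransferQueue()
q.submit("periodic_checkpoint", "job1")
q.submit("checkpoint_read", "job2")
q.submit("migration", "job3")
print(q.next_transfer())  # prints ('migration', 'job3')
print(q.next_transfer())  # prints ('periodic_checkpoint', 'job1')
```

The migration stream jumps the queue because it is the only one with a deadline; the two low-priority streams are served in arrival order.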
Shadow Support
• Choose most efficient data access method per file.
• Locate checkpoint and file servers.
• Schedule periodic checkpoints in advance.
Minimize Data Size
• compress checkpoints.
• only checkpoint changes (diffs).
• data staging.
• checkpoint staging.
  • write checkpoint to local file system and schedule transfer when resources are available
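The first two items can be combined: compare the new checkpoint image with the previous one page by page and send only the changed pages, compressed. The sketch below uses Python's standard `zlib` module; the 4096-byte page size and the function names are assumptions, and a real checkpoint format would be more involved.

```python
import zlib

PAGE = 4096  # assumed block size for change detection


def changed_pages(old: bytes, new: bytes, page: int = PAGE):
    """Return {page_index: compressed_bytes} for pages that differ from the old image."""
    delta = {}
    for off in range(0, len(new), page):
        block = new[off:off + page]
        if block != old[off:off + page]:
            delta[off // page] = zlib.compress(block)
    return delta


def apply_delta(old: bytes, delta, new_len: int, page: int = PAGE) -> bytes:
    """Rebuild the new checkpoint from the old image plus the changed pages."""
    out = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for idx, blob in delta.items():
        block = zlib.decompress(blob)
        out[idx * page: idx * page + len(block)] = block
    return bytes(out)


old = bytes(4 * PAGE)          # previous checkpoint: four zeroed pages
new = bytearray(old)
new[PAGE + 10] = 7             # touch one byte in the second page
delta = changed_pages(old, bytes(new))
print(sorted(delta))           # prints [1]
```

Only one of the four pages has to cross the network, and `apply_delta` reconstructs the full image on the receiving side.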
Goodput Measurements
• Goodput/Allocation ratio measures health of the system
  • detect problem resources
  • detect overloaded subnets
  • measure QOS per application
• Checkpoint transfer statistics measure network usage
  • success rate
  • throughput
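As an illustration of using the Goodput/Allocation ratio to detect problem resources, the sketch below aggregates per-resource totals and flags those below a cutoff. The record layout, the 0.5 threshold, and the sample log are assumptions made for the example.

```python
def flag_low_goodput(records, threshold=0.5):
    """records: iterable of (resource, allocated_s, goodput_s).

    Returns the resources whose overall goodput/allocation ratio is below threshold.
    """
    totals = {}
    for resource, allocated, good in records:
        a, g = totals.get(resource, (0.0, 0.0))
        totals[resource] = (a + allocated, g + good)
    return sorted(r for r, (a, g) in totals.items() if a > 0 and g / a < threshold)


log = [("ws1", 3600, 3300), ("ws2", 3600, 900), ("ws2", 1800, 300), ("ws1", 1800, 1700)]
print(flag_low_goodput(log))  # prints ['ws2']
```

Grouping the same records by subnet or by application instead of by resource would give the other two measurements listed on the slide.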
Very Large Objects on the Network