320 likes | 434 Views
Migratory File Services for Batch-Pipelined Workloads. John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny WiND and Condor Projects 6 May 2003. How to Run a Batch-Pipelined Workload?. in.1. in.2. in.3. Batch Shared Data. a1. a2. a3. pipe.1. pipe.2.
E N D
Migratory File ServicesforBatch-Pipelined Workloads John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny WiND and Condor Projects 6 May 2003
How to Run aBatch-Pipelined Workload? in.1 in.2 in.3 Batch Shared Data a1 a2 a3 pipe.1 pipe.2 pipe.3 b1 b2 b3 out.1 out.2 out.3
Cluster-to-Cluster Computing Grid Engine Cluster The Internet PBS Cluster Node Node Node Node Node Node Node Node Node Node Node Node Node Node Node Home System Condor Pool Node Node Node Node Archive Node Node Node Node Node Node Node Node Node Node Node Node
How to Run aBatch-Pipelined Workload? • “Remote I/O” • Submit jobs to a remote batch system. • Let all I/O come directly home. • Inefficient if re-use is common. • (( But perfect if no data sharing! )) • “FTP-Net” • User finds remote clusters. • Manually stages data in. • Submits jobs, deals with failures. • Pulls data out. • Lather, rinse, repeat.
Hawk: A Migratory File Servicefor Batch-Pipelined Workloads • Automatically deploys a “task force” across multiple wide-area systems. • Manages applications from a high level, using knowledge of process interactions. • Provides dependable performance with peer-to-peer techniques. (( Locality is key! )) • Understands and reacts to failures using knowledge of the system and workloads.
Dangers • Failures • Physical: Networks fail, disks crash, CPUs halt. • Logical: Out of space/memory, lease expired. • Administrative: You can’t use cluster X today. • Dependencies • A comes before C and D, which are simultaneous. • What do we do if the output of C is lost? • Risk vs Reward • A gamble: Staging input data to a remote CPU. • A gamble: Leaving output data at a remote CPU.
Batch-Pipelined Workload a1 a2 a3 Hawk b1 b2 b3 Hawk Hawk Hawk c1 c2 c3 Hawk Hawk Hawk Hawk Hawk Hawk Hawk Hawk i1 i2 i3 Hawk In Action Grid Engine Cluster The Internet PBS Cluster a2 b2 c2 Node Node Node o2 i2 a3 b3 c3 Node Node Node Node Node Node o3 i3 Node Node Node Node Node Node Home System Condor Pool a1 b1 c1 Node Node Node Node Archive i1 o1 Node Node Node Node Node Node Node Node o1 o2 o3 Node Node Node Node
Workflow Language 1(Start With Condor DAGMan) job a a.condor job b b.condor job c c.condor job d d.condor parent a child c parent b child d a b c d
v1 Archive Storage mydata Workflow Language 2 volume v1 ftp://archive/mydata mount v1 a /data mount v1 b /data volume v2 scratch mount v2 a /tmp mount v2 c /tmp volume v3 scratch mount v3 b /tmp mount v3 d /tmp v2 v3 a b c d
v1 Archive Storage mydata Workflow Language 3 extract v2 x ftp://home/out.1 extract v3 x ftp://home/out.2 v2 v3 a b x x c d out.1 out.2
Mapping the Workflow tothe Migratory File System • Abstract Jobs • Become jobs in a batch system • May start, stop, fail, checkpoint, restart... • Logical “scratch” volumes • Become temporary containers on a scratch disk. • May be created, replicated, and destroyed... • Logical “read” volumes • Become blocks in a cooperative proxy cache. • May be created, cached, and evicted...
System Components Node Node Node Node Node Node Node Node Node Node PBS Cluster Condor Pool Archive Condor MM Condor SchedD Workflow Manager
Master Master Master Master Master Master Proxy Proxy Proxy Proxy Proxy Proxy StartD StartD StartD StartD StartD StartD Glide-In Job Gliding In Node Node Node Node Node Node Node Node Node Node PBS Cluster Condor Pool Archive Condor MM Condor SchedD
System Components Node Node Node Node Node Proxy Proxy Proxy StartD StartD StartD Node Node Node Node Node Proxy Proxy StartD StartD PBS Head Node Condor Pool Archive Condor MM Condor SchedD Workflow Manager
Cooperative Proxies Node Node Node Node Node Proxy Proxy Proxy StartD StartD StartD Node Node Node Node Node Proxy Proxy StartD StartD PBS Head Node Condor Pool Archive Condor MM Condor SchedD Workflow Manager
System Components Node Node Node Node Node Proxy Proxy Proxy StartD StartD StartD Node Node Node Node Node Proxy Proxy StartD StartD PBS Head Node Condor Pool Archive Condor MM Condor SchedD Workflow Manager
StartD StartD StartD StartD StartD Batch Execution System Node Node Node Node Node Proxy Proxy Proxy Node Node Node Node Node Proxy Proxy PBS Head Node Condor Pool Archive Condor MM Condor SchedD Workflow Manager
Node Node Node Node Node Proxy Proxy StartD StartD Node Node Node Node Node Proxy Proxy StartD StartD PBS Head Node Condor Pool System Components Proxy StartD Archive Condor MM Condor SchedD Workflow Manager
Proxy StartD Archive Workflow Manager Detail Condor MM Condor SchedD Workflow Manager
Proxy StartD Proxy StartD Archive Archive Condor MM Condor SchedD Workflow Manager
Container 120 /mydata Create Container d15 d16 Proxy StartD Coop Block Input Cache Local Area Network Wide Area Network Archive Condor MM Condor SchedD Workflow Manager
POSIX Library Interface /mydata Container 120 Agent /tmp cont://host5/120 /data cache://host5/archive/mydata Execute Job d15 d16 Proxy StartD Job Coop Block Input Cache open(“/data/d15”); creat(“/tmp/outfile”); Local Area Network outfile Wide Area Network Archive Condor MM Condor SchedD Workflow Manager
Container 120 /mydata outfile Extract Output Proxy StartD Job Completed Coop Block Input Cache Local Area Network Wide Area Network Archive Condor MM Condor SchedD Workflow Manager d15 out65 d16
Container 120 /mydata outfile Delete Container Proxy StartD Job Completed Coop Block Input Cache Local Area Network Wide Area Network Archive Condor MM Condor SchedD Workflow Manager d15 out65 d16
/mydata Proxy StartD Job Completed Coop Block Input Cache Local Area Network Container Deleted Wide Area Network Archive Condor MM Condor SchedD Workflow Manager d15 out65 d16
Fault Detection and Repair • The proxy, startd, and agent detect failures: • Job evicted by machine owner. • Network disconnection between job and proxy. • Container evicted by storage owner. • Out of space at proxy. • The workflow manager knows the consequences: • Job D couldn’t perform it’s I/O. • Check: Are volumes V1 and V3 still in place? • Aha: Volume V3 was lost -> Run B to create it.
Performance Testbed • Controlled “remote” cluster: • 32 cluster nodes at UW. • Hawk submitter also at UW. • Connected by a restricted 800 Kb/s link. • Also some preliminary tests on uncontrolled systems: • Hawk over PBS cluster at Los Alamos • Hawk over Condor system at INFN Italy.
Rollback Cascading Failure Failure Recovery
A Little Bit of Philosophy • Most systems build from the bottom up: • “This disk must have five nines, or else!” • MFS works from the top down: • “If this disk fails, we know what to do.” • By working from the top down, we finesse many of the hard problems in traditional filesystems.
Future Work • Integration with Stork • P2P Aspects: Discovery & Replication • Optional Knowledge: Size & Time • Delegation and Disconnection • Names, names, names: • Hawk – A migratory file service. • Hawkeye – A system monitoring tool.
job data job data Let Hawk juggle your work! jobs data ? Feeling overwhelmed?