Using Application Structure to Handle Failures and Improve Performance in a Migratory File Service
John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny
WiND and Condor Project
14 April 2003
Disclaimer We have a lot of stuff to describe, so hang in there until the end!
Outline • Data Intensive Applications • Batch and Pipeline Sharing • Example: AMANDA • Hawk: A Migratory File Service • Application Structure • System Architecture • Interactions • Evaluation • Performance • Failure • Philosophizing
CPU Bound • SETI@Home, Folding@Home, etc. • Excellent application of distributed computing. • KB of data, days of CPU time. • Efficient to do tiny I/O on demand. • Supporting Systems: • Condor • BOINC • Google Toolbar • Custom software.
I/O Bound • D-Zero data analysis: • Excellent app for cluster computing. • GB of data, seconds of CPU time. • Efficient to compute whenever data is ready. • Supporting Systems: • Fermi SAM • High-throughput document scanning • Custom software.
Batch-Pipelined Applications (diagram): each of the three pipelines shown runs jobs a → b → c in sequence; data passed between a pipeline's own stages (x, y, z) is pipeline-shared, data read by the same stage of every pipeline is batch-shared, and the number of pipelines submitted together is the batch width.
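To make the two kinds of sharing concrete, here is a small back-of-the-envelope sketch in Python; the sizes and the batch width are hypothetical, not measurements from the talk.

# Hypothetical sizes, not measurements: one batch-shared input read by every
# pipeline, plus pipeline data passed only between a single pipeline's stages.
batch_mb = 500      # e.g. calibration tables, read by all pipelines
pipe_mb = 30        # intermediate files private to one pipeline
width = 100         # batch width: number of pipelines submitted together

remote_io_mb = width * (batch_mb + pipe_mb)   # every read returns to the archive
migratory_io_mb = batch_mb                    # batch data cached once near the
                                              # cluster; pipe data stays local
print(remote_io_mb, migratory_io_mb)          # 53000 vs 500

The only point is that batch-shared traffic is multiplied by the batch width, so caching it once near the cluster, and keeping pipeline data local, removes nearly all wide-area I/O.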
Example: AMANDA (pipeline diagram): four stages, corsika → corama → mmc → amasim. Batch-shared inputs include NUCNUCCS, GLAUBTAR, EGSDATA3.3, QGSDATA4 (1 MB), the ice tables (3 files, 3 MB), and the experiment geometry (100s of files, 500 MB). Per-pipeline files include corsika_input.txt (4 KB), DAT (23 MB), corama.out (26 MB), mmc_input.txt, mmc_output.dat (126 MB), amasim_input.dat, and amasim_output.txt (5 MB).
Computing Environment • Clusters dominate: • Similar configurations. • Fast interconnects. • Single administrative domain. • Underutilized commodity storage. • En masse, quite unreliable. • Users wish to harness multiple clusters, but have jobs that are both I/O and CPU intensive.
Ugly Solutions • “FTP-Net” • User finds remote clusters. • Manually stages data in. • Submits jobs, deals with failures. • Pulls data out. • Lather, rinse, repeat. • “Remote I/O” • Submit jobs to a remote batch system. • Let all I/O come back to the archive. • Return in several decades.
What We Really Need • Access resources outside my domain. • Assemble your own army. • Automatic integration of CPU and I/O access. • Forget optimal: save administration costs. • Replacing remote with local always wins. • Robustness to failures. • Can’t hire babysitters for New Year’s Eve.
Hawk: A Migratory File Service • Automatically deploys a “task force” across an existing distributed system. • Manages applications from a high level, using knowledge of process interactions. • Provides dependable performance through peer-to-peer techniques. • Understands and reacts to failures using knowledge of the system and workloads.
Philosophy of Hawk “In allocating resources, strive to avoid disaster, rather than attempt to obtain an optimum.” - Butler Lampson
Why not AFS+Make? • Quick answer: • Distributed filesystems provide an unnecessarily strong abstraction that is unacceptably expensive to provide in the wide area. • Better answer after we explain what Hawk is and how it works.
Outline • Data Intensive Applications • Batch and Pipeline Sharing • Example: AMANDA • Hawk: A Migratory File Service • Application Structure • System Architecture • Interactions • Evaluation • Performance • Failure • Philosophizing
Workflow Language 1 (declaring jobs and their dependencies):
job a a.sub
job b b.sub
job c c.sub
job d d.sub
parent a child c
parent b child d
(diagram: DAG with edges a → c and b → d)
Workflow Language 2 (declaring volumes and mounting them into jobs):
volume v1 ftp://home/mydata
mount v1 a /data
mount v1 b /data
volume v2 scratch
mount v2 a /tmp
mount v2 c /tmp
volume v3 scratch
mount v3 b /tmp
mount v3 d /tmp
(diagram: read volume v1 comes from mydata on home storage; scratch volumes v2 and v3 are shared within pipelines a → c and b → d)
Workflow Language 3 (extracting results back to home storage):
extract v2 x ftp://home/out.1
extract v3 x ftp://home/out.2
(diagram: file x in scratch volume v2 is committed to out.1 on home storage, and x in v3 to out.2)
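A minimal sketch, in Python rather than the actual workflow manager, of how the declarations on the three preceding slides could be held in memory and ordered for submission; the job, volume, and mount names come from the slides, while the data structures and the topological sort are illustrative assumptions.

from collections import defaultdict, deque

jobs = {"a": "a.sub", "b": "b.sub", "c": "c.sub", "d": "d.sub"}
deps = [("a", "c"), ("b", "d")]                      # parent -> child
volumes = {"v1": "ftp://home/mydata", "v2": "scratch", "v3": "scratch"}
mounts = [("v1", "a", "/data"), ("v1", "b", "/data"),
          ("v2", "a", "/tmp"), ("v2", "c", "/tmp"),
          ("v3", "b", "/tmp"), ("v3", "d", "/tmp")]
extracts = [("v2", "x", "ftp://home/out.1"), ("v3", "x", "ftp://home/out.2")]

def submission_order(jobs, deps):
    # Topological sort: a job becomes ready once all of its parents are done.
    indeg = {j: 0 for j in jobs}
    children = defaultdict(list)
    for parent, child in deps:
        children[parent].append(child)
        indeg[child] += 1
    ready = deque(j for j in jobs if indeg[j] == 0)
    order = []
    while ready:
        j = ready.popleft()
        order.append(j)
        for c in children[j]:
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    return order

print(submission_order(jobs, deps))   # ['a', 'b', 'c', 'd']
# volumes/mounts tell the system where containers and caches must appear,
# and extracts say which scratch files must be committed back home.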
Mapping Logical to Physical • Abstract Jobs • Physical jobs in a batch system • May run more than once! • Logical “scratch” volumes • Temporary containers on a scratch disk. • May be created, replicated, and destroyed. • Logical “read” volumes • Striped across cooperative proxy caches. • May be created, cached, and evicted.
Starting System (diagram): the existing resources before Hawk is deployed: a PBS cluster with a head node, a Condor pool of nodes, the archive, a matchmaker, a batch queue, and the workflow manager.
Gliding In (diagram): a glide-in job submitted through each system's batch queue starts a master, a proxy, and a StartD on every node of the PBS cluster and the Condor pool, overlaying Hawk's task force on the existing resources.
Hawk Architecture (diagram): each job runs under a StartD with an agent; agents talk to per-node proxies, which form a cooperative cache and perform wide-area caching against the archive; the workflow manager holds the application flow and system model and works with the matchmaker and batch queue.
I/O Interactions (diagram): the job issues ordinary POSIX calls such as creat(“/tmp/outfile”) and open(“/data/d15”); the agent interposes at the POSIX library interface and remaps /tmp to container://host5/120 and /data to cache://host5/archive/data; across the local area network, the proxy holds the containers (e.g. 119 and 120) and the cooperative block cache, and talks to other proxies and to the archive.
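A minimal sketch of the remapping step shown in the diagram, assuming a simple prefix table; the two mappings are the ones on the slide, but the lookup function and its behavior on unmapped paths are illustrative, not the agent's real code.

MOUNTS = {
    "/tmp":  "container://host5/120",            # pipeline data -> container
    "/data": "cache://host5/archive/data",       # batch data -> coop cache
}

def remap(path):
    # Translate a logical path used by the job into a physical device URL.
    for prefix, target in MOUNTS.items():
        if path == prefix or path.startswith(prefix + "/"):
            return target + path[len(prefix):]
    return path                                  # unmapped paths pass through

print(remap("/tmp/outfile"))    # container://host5/120/outfile
print(remap("/data/d15"))       # cache://host5/archive/data/d15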
Cooperative Proxies (diagram): proxies A, B, and C discover one another over time (t1 through t4) and build a hash map from paths to proxies; each job's agent sends batch reads through its local proxy to the peer responsible for that path, with the archive behind them.
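A minimal sketch of one way the path-to-proxy hash map could work, assuming a fixed list of discovered peers and a modular hash; the hashing scheme and discovery details of the real protocol are not specified here.

import hashlib

peers = ["proxyA", "proxyB", "proxyC"]   # hypothetical discovered proxies

def responsible_proxy(path):
    # Hash the path so every proxy agrees on a single peer per piece of
    # batch data; only a miss at that peer goes back to the archive.
    digest = hashlib.md5(path.encode()).hexdigest()
    return peers[int(digest, 16) % len(peers)]

print(responsible_proxy("/archive/data/d15"))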
Summary • Archive • Sources input data, chooses coordinator. • Glide-In • Deploy a “task force” of components. • Cooperative Proxies • Provide dependable batch read-only data. • Data Containers • Fault-isolated pipeline data. • Workflow Manager • Directs the operation.
Outline • Data Intensive Applications • Batch and Pipeline Sharing • Example: AMANDA • Hawk: A Migratory File Service • Application Structure • System Architecture • Interactions • Evaluation • Performance • Failure • Philosophizing
Performance Testbed • Controlled testbed: • 32 dual-CPU 550 MHz cluster machines, 1 GB memory, SCSI disks, 100 Mb/s Ethernet. • Simulated WAN: archive storage behind a router restricted to 800 KB/s. • Also some preliminary tests on uncontrolled systems: • MFS over a PBS cluster at Los Alamos • MFS over a Condor system at INFN in Italy.
Synthetic Apps (diagram): two-stage pipelines (a → b) in three configurations: pipe intensive (10 MB of pipe data), mixed (5 MB pipe, 5 MB batch), and batch intensive (10 MB of batch data), run across the system configurations under test.
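For concreteness, a sketch of what one stage of such a synthetic application might do, assuming the sizes from the slide; the file paths and the function itself are hypothetical, not the benchmark's actual code.

import os

def synthetic_stage(batch_path, pipe_in, pipe_out, batch_mb, pipe_mb):
    # Read batch-shared input, read the parent's pipeline output (if any),
    # then write this stage's pipeline output for the child to consume.
    with open(batch_path, "rb") as f:
        f.read(batch_mb * 1024 * 1024)
    if pipe_in:
        with open(pipe_in, "rb") as f:
            f.read()
    with open(pipe_out, "wb") as f:
        f.write(os.urandom(pipe_mb * 1024 * 1024))

# Mixed configuration from the slide: 5 MB of pipe data, 5 MB of batch data.
# synthetic_stage("/data/batch.in", None, "/tmp/pipe.1", batch_mb=5, pipe_mb=5)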
Real Applications • BLAST • Search tool for proteins and nucleotides in genomic databases. • CMS • Simulation of a high-energy physics experiment to begin operation at CERN in 2006. • H-F • Simulation of the non-relativistic interactions between nuclei and electrons. • AMANDA • Simulation of a neutrino detector buried in the ice of the South Pole.
Outline • Data Intensive Applications • Batch and Pipeline Sharing • Example: AMANDA • Hawk: A Migratory File Service • Application Structure • System Architecture • Interactions • Evaluation • Performance • Failure • Philosophizing
Related Work • Workflow management • Dependency managers: TREC, make • Private namespaces: UFO, database views • Cooperative caching: no writes. • P2P systems: wrong semantics. • Filesystems: overly strong abstractions.
Why Not AFS+Make? • Namespaces • Constructed per-process at submit-time • Consistency • Enforced at the workflow level • Selective Commit • Everything tossed unless explicitly saved. • Fault Awareness • CPUs and data can be lost at any point. • Practicality • No special permission required.
Conclusions • Traditional systems build from the bottom up: this disk must have five nines, or we’re in big trouble! • MFS builds from the top down: application semantics drive system structure. • By posing the right problem, we solve the traditional hard problems of file systems.
For More Info... • Paper in progress... • Application study: • “Pipeline and Batch Sharing in Grid Workloads”, to appear in HPDC-2003. • www.cs.wisc.edu/condor/doc/profiling.ps • Talk to us! • Questions now?