This paper focuses on harnessing and managing remote storage for batch-pipelined, I/O-intensive workloads, specifically scientific workloads in wide-area grid computing. The authors propose BAD-FS, a batch-aware distributed file system that combines workload information with explicit storage control to improve performance and simplify implementation.
Focus of work
• Harnessing, managing remote storage
• Batch-pipelined, I/O-intensive workloads
• Scientific workloads
• Wide-area grid computing
Batch-pipelined workloads
• General properties
  • Large number of processes
  • Process and data dependencies
  • I/O intensive
• Different types of I/O (see the sketch below)
  • Endpoint
  • Batch
  • Pipeline
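To make the three I/O types concrete, here is a minimal Python sketch, illustrative only and not BAD-FS code, of how a batch-pipelined workload might be modeled: jobs with dependencies, reading and writing datasets tagged as endpoint, batch, or pipeline.

```python
# Minimal model of a batch-pipelined workload (illustrative, not BAD-FS code).
from dataclasses import dataclass, field
from enum import Enum

class IOType(Enum):
    ENDPOINT = "endpoint"   # unique input/output; must reach home storage
    BATCH = "batch"         # read-shared across many pipelines
    PIPELINE = "pipeline"   # passed only between jobs within one pipeline

@dataclass
class Dataset:
    name: str
    kind: IOType
    size_mb: int

@dataclass
class Job:
    name: str
    parents: list["Job"] = field(default_factory=list)   # process dependencies
    reads: list[Dataset] = field(default_factory=list)
    writes: list[Dataset] = field(default_factory=list)
```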
[Figure: a batch-pipelined workload — parallel pipelines of jobs chained by pipeline data, all reading shared batch datasets, with endpoint inputs and outputs at the edges]
Wide-area grid computing
[Figure: remote clusters connected across the Internet to home storage]
Cluster-to-cluster (c2c)
• Not quite p2p
  • More organized
  • Less hostile
  • More homogeneity
  • Correlated failures
• Each cluster is autonomous
  • Run and managed by different entities
• An obvious bottleneck is the wide-area Internet
How do we manage the flow of data into, within, and out of these clusters?
Current approaches
• Remote I/O
  • Condor standard universe
  • Very easy
  • Consistency through serialization
• Prestaging
  • Condor vanilla universe
  • Manually intensive
  • Good performance through knowledge
• Distributed file systems (AFS, NFS)
  • Easy to use, uniform name space
  • Impractical in this environment
BAD-FS
• Solution: Batch-Aware Distributed File System
  • Leverages workload information with storage control
  • Detailed information about the workload is known
  • Storage layer allows external control
  • External scheduler makes informed storage decisions
• Combining information and control results in
  • Improved performance
  • More robust failure handling
  • Simplified implementation
Practical and deployable
• User-level; requires no privilege
• Packaged as a modified Condor system
  • A Condor system which includes BAD-FS
• General; glide-in works everywhere
[Figure: BAD-FS glided in over SGE-managed clusters, connected across the Internet to home storage]
BAD-FS == Condor ++
1) NeST storage management
2) Batch-Aware Distributed File System
3) Expanded Condor submit language
4) BAD-FS scheduler (Condor DAGMan ++)
[Figure: compute nodes each run a Condor startd alongside BAD-FS; NeST manages storage on each node; a BAD-FS scheduler extending Condor DAGMan drives the job queue from home storage]
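For flavor, here is a hypothetical workflow description in the spirit of the expanded submit language. The keywords (volume, mount, extract) are drawn loosely from the NSDI '04 paper, but the exact syntax shown is an illustrative assumption, not the real grammar.

```
# Hypothetical BAD-FS workflow description (illustrative syntax)
job a a.submit            # first stage of one pipeline
job b b.submit            # second stage, consumes a's output
parent a child b          # DAGMan-style dependency

volume btch ftp://home/input.data 500MB   # batch data, cached near the jobs
volume pipe scratch 200MB                 # pipeline data, never leaves the cluster
mount btch a /mydata
mount pipe a /tmp
mount pipe b /tmp
extract b /tmp/out ftp://home/out.1       # endpoint output shipped back home
```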
BAD-FS knowledge
• Remote cluster knowledge
  • Storage availability
  • Failure rates
• Workload knowledge
  • Data type (batch, pipeline, or endpoint)
  • Data quantity
  • Job dependencies
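A compact sketch of the two kinds of knowledge as the scheduler might hold them; the field names are assumptions for illustration, not structures from the paper.

```python
# Illustrative containers for the scheduler's knowledge (names are assumptions).
from dataclasses import dataclass

@dataclass
class ClusterKnowledge:
    storage_avail_mb: int   # space the remote storage servers will guarantee
    failure_rate: float     # observed failure probability per job

@dataclass
class WorkloadKnowledge:
    batch_mb: int           # shared batch data read by every pipeline
    pipeline_mb: int        # intermediate data private to one pipeline
    endpoint_mb: int        # unique per-pipeline input/output
    job_deps: dict[str, set[str]]   # job -> set of parent jobs
```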
Control through lots
• Abstraction that allows external storage control (sketched below)
  • Guaranteed storage allocations
  • Containers for job I/O
  • e.g. "I need 2 GB of space for at least 24 hours"
• Scheduler
  • Creates lots to cache input data
    • Subsequent jobs can reuse this data
  • Creates lots to buffer output data
    • Destroys pipeline, copies endpoint
  • Configures workload to access lots
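A minimal sketch of what a lot might look like as an API; the class and method names are assumptions for illustration, not NeST's actual interface.

```python
import time

class Lot:
    """A guaranteed storage allocation: a fixed size for a fixed duration."""
    def __init__(self, size_mb: int, duration_s: int):
        self.size_mb = size_mb
        self.used_mb = 0
        self.expires_at = time.time() + duration_s

    def write(self, mb: int) -> None:
        # The allocation is both a guarantee and a hard limit.
        if self.used_mb + mb > self.size_mb:
            raise IOError("lot full")
        self.used_mb += mb

# "I need 2 GB of space for at least 24 hours"
lot = Lot(size_mb=2048, duration_s=24 * 3600)
```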
Knowledge plus control
• Enhanced performance
  • I/O scoping
  • Capacity-aware scheduling
• Improved failure handling
  • Cost-benefit replication
• Simplified implementation
  • No cache consistency protocol
I/O scoping
• Technique to minimize wide-area traffic
• Allocate lots to cache batch data
• Allocate lots for pipeline and endpoint data
• Extract endpoint
• Cleanup
[Figure: AMANDA on BAD-FS — per pipeline, 200 MB pipeline data, 500 MB batch data, 5 MB endpoint data. Steady state: only 5 of 705 MB traverse the wide-area link.]
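The savings are simple arithmetic; a quick check using AMANDA's per-pipeline numbers from the slide:

```python
# Wide-area traffic with and without I/O scoping (AMANDA figures).
pipeline_mb, batch_mb, endpoint_mb = 200, 500, 5

naive_wan_mb = pipeline_mb + batch_mb + endpoint_mb   # 705 MB via home storage
scoped_wan_mb = endpoint_mb                           # batch cached, pipeline stays local

print(f"steady state: {scoped_wan_mb} of {naive_wan_mb} MB traverse the WAN")
```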
Capacity-aware scheduling
• Technique to avoid over-allocations
• Scheduler has knowledge of
  • Storage availability
  • Storage usage within the workload
• Scheduler runs as many jobs as fit (see the sketch below)
  • Avoids wasted utilization
  • Improves job throughput
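One plausible reading of this policy as a greedy sketch, assuming each job declares a storage demand; this is an illustration, not the paper's actual algorithm.

```python
def schedule(jobs: list[tuple[str, int]], storage_avail_mb: int):
    """Greedily dispatch jobs whose storage demand fits; defer the rest."""
    run, defer, free = [], [], storage_avail_mb
    for name, demand_mb in jobs:
        if demand_mb <= free:
            free -= demand_mb
            run.append(name)
        else:
            defer.append(name)   # over-allocating would thrash the caches
    return run, defer

run_now, deferred = schedule([("j1", 700), ("j2", 700), ("j3", 700)], 1500)
# run_now == ['j1', 'j2']; j3 waits until a running job frees its space
```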
Improved failure handling
• Scheduler understands data semantics
  • Data is not just a collection of bytes
• Losing data is not catastrophic
  • Output can be regenerated by rerunning jobs
• Cost-benefit replication (see the sketch below)
  • Replicates only data whose replication cost is cheaper than the cost to rerun the job
  • Can improve throughput in a lossy environment
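The cost-benefit test reduces to comparing two times; a minimal sketch, assuming the per-job failure probability and rerun time are known to the scheduler.

```python
def should_replicate(data_mb: float, wan_mbps: float,
                     rerun_s: float, failure_prob: float) -> bool:
    """Replicate only if copying is cheaper than the expected rerun cost."""
    replication_s = data_mb * 8 / wan_mbps      # time to push a copy over the WAN
    expected_rerun_s = failure_prob * rerun_s   # expected loss if not replicated
    return replication_s < expected_rerun_s

# 200 MB over a 10 Mbps WAN (160 s) vs. a 1-hour job with 10% failure odds (360 s)
should_replicate(200, 10, 3600, 0.10)   # True: replicate
```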
Simplified implementation
• Data dependencies known
• Scheduler ensures proper ordering
• Build a distributed file system
  • With cooperative caching
  • But without a cache consistency protocol
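Why no consistency protocol is needed, in miniature: the scheduler already runs jobs in dependency order, so every cached file is fully written before any consumer reads it. A sketch using Python's standard topological sort:

```python
from graphlib import TopologicalSorter

# child -> parents; outputs of 'a' feed 'b' and 'c', whose outputs feed 'd'
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(list(TopologicalSorter(deps).static_order()))
# ['a', 'b', 'c', 'd'] (b and c may swap): one writer finishes before any reader starts
```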
Real workloads
• AMANDA
  • Astrophysics study of cosmic events such as gamma-ray bursts
• BLAST
  • Biology search for proteins within a genome
• CMS
  • Physics simulation of large particle colliders
• HF
  • Chemistry study of non-relativistic interactions between atomic nuclei and electrons
• IBIS
  • Ecology global-scale simulation of Earth's climate used to study effects of human activity (e.g. global warming)
Real workload experience
• Setup
  • 16 jobs, 16 compute nodes
  • Emulated wide-area
• Configurations
  • Remote I/O
  • AFS-like with /tmp
  • BAD-FS
• Result is an order of magnitude improvement
BAD Conclusions
• Schedulers can obtain workload knowledge
• Schedulers need storage control
  • Caching
  • Consistency
  • Replication
• Combining this control with knowledge
  • Enhanced performance
  • Improved failure handling
  • Simplified implementation
For more information
• "Pipeline and Batch Sharing in Grid Workloads," Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.
• "Explicit Control in a Batch-Aware Distributed File System," John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. NSDI '04, 2004.
• http://www.cs.wisc.edu/condor/publications.html
• Questions?