Large Scale File Distribution
Troy Raeder & Tanya Peters
The Problem
• Distribute a large file to some number of machines
• Useful for deploying new programs and distributing data
• Chirp_distribute, implemented last year, distributes files using a spanning tree
• Goal: improve on the existing method so that files transfer more efficiently
• Choke points exist: many machines may all transfer files through a single router/switch
• Minimize failures, including permission errors
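To see why a spanning tree helps, here is a minimal sketch of round-based tree distribution. It assumes (this is not taken from chirp_distribute itself) that every host already holding the file sends exactly one copy per round, so the number of holders doubles each round and n targets are covered in about log2(n + 1) rounds. The function name `spanning_tree_schedule` is hypothetical.

```python
import math


def spanning_tree_schedule(source, targets):
    """Plan copies in rounds; returns a list of rounds of (sender, receiver).

    Assumption: every host that already holds the file sends exactly one
    copy per round, so the set of holders doubles until all targets are done.
    """
    holders = [source]
    pending = list(targets)
    rounds = []
    while pending:
        this_round = []
        for sender in holders:
            if not pending:
                break
            this_round.append((sender, pending.pop(0)))
        # New receivers can start sending in the next round.
        holders.extend(receiver for _, receiver in this_round)
        rounds.append(this_round)
    return rounds


hosts = [f"host{i}" for i in range(1, 8)]
schedule = spanning_tree_schedule("src", hosts)
print(len(schedule), math.ceil(math.log2(len(hosts) + 1)))  # prints: 3 3
```

The sketch ignores topology entirely: any holder may send to any pending host, which is exactly how many copies can end up crossing the same router, the choke point described above.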
The Solution
• Take advantage of network topology: send copies across routers and switches as early as possible, then let machines in the same cluster transfer to each other.
• Using traceroute, we build a graph that represents the network. The graph is built as needed, saved to a file, and loaded at run time.
• Access Control Lists: if we know a source machine doesn't have permission to transfer to some target, don't even try.
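One possible way to turn traceroute output into a topology graph is sketched below. It is an illustration under stated assumptions, not the authors' code: it parses `traceroute -n` hop lines, links consecutive responding hops into a router graph, and groups hosts into clusters by their last router hop (an assumed rule). All function names are hypothetical. The graph is serialized to JSON so it can be cached in a file and reloaded at run time. A destination that never answers traceroute would be misclassified by this simple rule.

```python
import json
import subprocess
from collections import defaultdict


def trace(host):
    """Run `traceroute -n` and return the hop addresses (requires the binary)."""
    result = subprocess.run(["traceroute", "-n", host],
                            capture_output=True, text=True, check=False)
    return parse_traceroute(result.stdout)


def parse_traceroute(output):
    """Extract one address per hop from `traceroute -n` output.

    Hops that never answered ("* * *") become None; if a hop line lists
    several responders, only the first is kept.
    """
    hops = []
    for line in output.splitlines():
        fields = line.split()
        # Hop lines start with the hop number; skip headers and blank lines.
        if not fields or not fields[0].isdigit():
            continue
        hops.append(next((f for f in fields[1:] if f != "*"), None))
    return hops


def build_topology(paths):
    """Turn {host: [hop, ...]} into a router graph plus clusters.

    Assumed grouping rule: hosts reached through the same last router hop
    belong to one cluster. Silent hops are skipped, so an edge may span them.
    """
    edges = set()
    clusters = defaultdict(list)
    for host, hops in paths.items():
        known = [h for h in hops if h is not None]
        edges.update(zip(known, known[1:]))
        last_router = known[-2] if len(known) >= 2 else "direct"
        clusters[last_router].append(host)
    return edges, dict(clusters)


def dump_topology(edges, clusters):
    """Serialize the graph to JSON text, e.g. to write into a cache file."""
    return json.dumps({"edges": sorted(list(e) for e in edges),
                       "clusters": clusters}, indent=2)


def load_topology(text):
    """Inverse of dump_topology, used when the distributor starts up."""
    data = json.loads(text)
    return {tuple(e) for e in data["edges"]}, data["clusters"]
```

Usage: call `trace(host)` for every target, pass the results to `build_topology`, and write `dump_topology(...)` to disk; later runs can skip traceroute and call `load_topology` on the file contents.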
Picking a Target:
• Check whether every cluster in the graph contains a copy of the file.
• If some cluster does not, copy to it.
• Next, if some node within your cluster doesn't have the file, transfer to it.
• Otherwise, pick some other node that doesn't have the file.
• If a node is unable to transfer to any node that doesn't have the file yet, it is removed from the list of possible sources.
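The target-selection rules above can be sketched as follows. This is a simplified, sequential simulation: it assumes every attempted copy succeeds, represents clusters as a plain dict, and models the ACL as a hypothetical set `denied` of (source, target) pairs known to be refused. `pick_target` and `distribute` are illustrative names, not the authors' implementation.

```python
def pick_target(source, clusters, has_file, denied=frozenset()):
    """Return the next host `source` should copy to, or None if it cannot help.

    clusters: {cluster name: [host, ...]}
    has_file: set of hosts that already hold a copy
    denied:   hypothetical ACL of (source, target) pairs known to be refused
    """
    def allowed(target):
        return target not in has_file and (source, target) not in denied

    # 1. Seed any cluster that has no copy of the file yet.
    for hosts in clusters.values():
        if not any(h in has_file for h in hosts):
            for h in hosts:
                if allowed(h):
                    return h
    # 2. Fill the source's own cluster.
    home = next((hosts for hosts in clusters.values() if source in hosts), [])
    for h in home:
        if allowed(h):
            return h
    # 3. Otherwise, any host that still lacks the file.
    for hosts in clusters.values():
        for h in hosts:
            if allowed(h):
                return h
    return None


def distribute(origin, clusters, denied=frozenset()):
    """Simulate distribution, assuming every attempted copy succeeds."""
    all_hosts = {h for hosts in clusters.values() for h in hosts}
    has_file = {origin}
    sources = [origin]
    transfers = []
    while sources and all_hosts - has_file:
        for src in list(sources):
            target = pick_target(src, clusters, has_file, denied)
            if target is None:
                sources.remove(src)  # drop hosts that can no longer help
                continue
            transfers.append((src, target))
            has_file.add(target)
            sources.append(target)
    return transfers, has_file
```

With clusters `{"A": ["a1", "a2", "a3"], "B": ["b1", "b2"], "C": ["c1"]}` and origin `a1`, the first two copies go to `b1` and `c1`, so every cluster is seeded before any intra-cluster copying starts.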
Initial Results
• The current version of the algorithm doesn't always do better than the existing method.
• As expected, for smaller files and/or fewer hosts, the overhead costs us.
• For larger files and/or more hosts, effects such as timeouts can wash out the relative gains.
What's Next...
• Pick source & target more intelligently
• If an initial attempt to copy from some cluster A to cluster B fails, don't try transferring between those two clusters again unless no other possibilities exist.
• Try to manage straggler transfers
• Dynamically set the timeout for transferring a single copy: set it to some multiple of the maximum or average transfer time seen so far.
• The hoped-for end result is a significant improvement over the existing algorithm.
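The two planned refinements, a dynamic per-copy timeout and avoiding cluster pairs that already failed, could be tracked by a small helper like the one below. It is a hypothetical sketch: the class name, the default multiplier of 3, and the 60-second starting timeout are assumed tuning values, not figures from the slides.

```python
import statistics


class TransferPolicy:
    """Hypothetical helper for dynamic timeouts and failed cluster pairs.

    multiplier:      timeout = multiplier * reference transfer time (assumed knob)
    default_timeout: used before any transfer has completed (assumed value)
    use_max:         base the timeout on the max instead of the mean duration
    """

    def __init__(self, multiplier=3.0, default_timeout=60.0, use_max=False):
        self.multiplier = multiplier
        self.default_timeout = default_timeout
        self.use_max = use_max
        self.durations = []        # seconds taken by completed copies
        self.failed_pairs = set()  # (source cluster, target cluster)

    def record_success(self, seconds):
        self.durations.append(seconds)

    def timeout(self):
        """Timeout for the next single copy, in seconds."""
        if not self.durations:
            return self.default_timeout
        if self.use_max:
            reference = max(self.durations)
        else:
            reference = statistics.mean(self.durations)
        return self.multiplier * reference

    def record_failure(self, src_cluster, dst_cluster):
        self.failed_pairs.add((src_cluster, dst_cluster))

    def rank_targets(self, src_cluster, candidates):
        """Order (host, cluster) candidates so previously failed pairs come last.

        sorted() is stable and False sorts before True, so failed pairs are
        still tried, but only when nothing else is left.
        """
        return sorted(candidates,
                      key=lambda c: (src_cluster, c[1]) in self.failed_pairs)
```

A transfer still running after `timeout()` seconds would be treated as a straggler: abandoned and retried from another source. Using the mean reacts quickly to typical speeds; using the max is more forgiving of occasional slow links.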