New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks. Brian Cho Indranil Gupta University of Illinois at Urbana-Champaign. Motivation: Ad-hoc Data Processing. Data-intensive research on OpenCirrus Federated cloud: diverse geographic locations
New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks Brian Cho Indranil Gupta University of Illinois at Urbana-Champaign
Motivation: Ad-hoc Data Processing • Data-intensive research on OpenCirrus • Federated cloud: diverse geographic locations • Data scale of TBs • Limited wide area bandwidth is a big bottleneck: can take days or weeks to transfer over internet [Garfinkel 07] • Success story: Washington Post • Hillary Clinton White House schedule • Released as 17,481 pages of non-searchable PDF images • Convert to searchable text and deliver to newsroom within the same news cycle • Done within 26 hours with Amazon AWS • Pay for bandwidth and computer usage
Bulk Transfer Options • Internet Transfer • Grid: [GridFTP] • PlanetLab: [CoBlitz 06] • Disk Shipping Transfer • [Jim Gray 03] • [PostManet 04] • [DOT 06] • Amazon AWS Import/Export • Pandora (People and networks moving data around) • First ever solution to transfer data cooperatively between multiple sources with internet and shipping edges • Produce optimal transfer plans that obey time deadlines and minimize dollar cost • Better than internet-only and shipping-only strategies
Option 1: Internet Transfer • Sites: Computation Provider (Amazon), Data Source (CMU), Data Source (Illinois) • Bandwidth: 5-20 Mbps • 1 TB: 5-20 days • Into the Computation Provider: $0.10 per GB • Between Data Sources: No Cost
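The "1 TB: 5-20 days" figure can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and a link that sustains its nominal rate for the whole transfer.

```python
def internet_transfer_days(data_tb: float, mbps: float) -> float:
    """Days needed to push data_tb terabytes through a link of mbps megabits/s."""
    bits = data_tb * 1e12 * 8          # decimal TB -> bits (assumed convention)
    seconds = bits / (mbps * 1e6)      # Mbps -> bits per second
    return seconds / 86400


def internet_transfer_cost(data_tb: float, dollars_per_gb: float = 0.10) -> float:
    """Dollar cost at a flat per-GB rate (1 TB = 1000 GB assumed)."""
    return data_tb * 1000 * dollars_per_gb


slow = internet_transfer_days(1, 5)    # roughly 18.5 days at 5 Mbps
fast = internet_transfer_days(1, 20)   # roughly 4.6 days at 20 Mbps
cost = internet_transfer_cost(1)       # about $100 for 1 TB at $0.10 per GB
```

Under these assumptions the two ends of the bandwidth range bracket the quoted 5-20 days.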
Option 2: Disk Shipping Transfer • Sites: Computation Provider (Amazon), Data Source (CMU), Data Source (Illinois) • Disk Interface: 40 MB/s • Loading: $0.02 per GB • Handling: $80 per Disk • Shipping rates on the three links, as labeled in the figure: • Overnight: $60 per Disk, Two-Day: $30 per Disk, Ground: $10 per Disk • Overnight: $40 per Disk, Two-Day: $15 per Disk, Ground: $5 per Disk • Overnight: $50 per Disk, Two-Day: $25 per Disk, Ground: $5 per Disk
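To make the disk-shipping numbers concrete, here is a minimal sketch of how the per-disk and per-GB charges might combine. The function names are hypothetical, and the 2 TB disk capacity is an assumption (it matches the capacity label on a later slide, but is not stated here).

```python
import math


def disk_load_hours(data_tb: float, interface_mb_per_s: float = 40.0) -> float:
    """Hours to copy data_tb terabytes through the disk interface."""
    bytes_total = data_tb * 1e12                       # decimal TB assumed
    return bytes_total / (interface_mb_per_s * 1e6) / 3600


def shipping_cost(data_tb: float,
                  ship_rate_per_disk: float,
                  handling_per_disk: float = 80.0,
                  loading_per_gb: float = 0.02,
                  disk_capacity_tb: float = 2.0) -> float:
    """Per-disk shipping and handling plus per-GB loading.

    disk_capacity_tb is an assumed value, not taken from this slide.
    """
    disks = math.ceil(data_tb / disk_capacity_tb)      # cost steps up per disk
    per_disk = disks * (ship_rate_per_disk + handling_per_disk)
    loading = data_tb * 1000 * loading_per_gb
    return per_disk + loading


hours = disk_load_hours(2.0)               # roughly 13.9 hours for 2 TB
overnight = shipping_cost(2.0, 60.0)       # 60 + 80 + 40 dollars for one full disk
```

Note the step shape of the cost: sending slightly more than one disk's worth of data doubles the shipping and handling charge, which is why filling disks to capacity matters.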
Cooperative Transfer Solutions • Good solutions • Meet deadlines • Minimize dollar cost • Complexity • Global scale • Many strategies • Collaboration helps • How to find the best solution? (Figure: Open Cirrus Sites)
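Example: Minimize Dollar Cost • Data Source A holds 0.8 TB; Data Source B holds 1.2 TB • Internet transfer between the sources: 15 Days, No Cost • Ground shipment to the Cloud Service Provider: $5, 5 Days • At the provider: Loading $40, Handling $80, 14 hours to load • Total Cost: $125 • Total Time: 20 Days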
Example: Meet Deadline (3 days) while Minimizing Dollar Cost • Data Source A holds 0.8 TB; Data Source B holds 1.2 TB • Overnight shipment from A: $40, 1 Day, then 6 hours to load • Overnight shipment from B to the Cloud Service Provider: $50, 1 Day • At the provider: Loading $40, Handling $80, 14 hours to load • Total Cost: $210 • Total Time: 3 Days
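The totals on the two example slides can be reproduced by summing the labeled steps. The sketch below assumes one reading of the figures: Source A first sends its 0.8 TB to Source B, and B then ships everything to the provider, with all steps running strictly in sequence. The step lists and the `totals` helper are illustrative, not part of Pandora.

```python
# Each step: (description, dollar cost, hours). Numbers are the slide labels;
# the routing (A -> B -> provider) is an assumed reading of the figures.
cheapest_plan = [
    ("internet A -> B, 0.8 TB", 0, 15 * 24),
    ("ground shipment B -> provider", 5, 5 * 24),
    ("handling at provider", 80, 0),
    ("loading 2 TB at provider", 40, 14),
]
deadline_plan = [
    ("overnight shipment A -> B", 40, 24),
    ("load 0.8 TB at B", 0, 6),
    ("overnight shipment B -> provider", 50, 24),
    ("handling at provider", 80, 0),
    ("loading 2 TB at provider", 40, 14),
]


def totals(plan):
    """Sum dollar cost and elapsed hours over sequential steps."""
    cost = sum(step[1] for step in plan)
    hours = sum(step[2] for step in plan)
    return cost, hours


cheap_cost, cheap_hours = totals(cheapest_plan)        # 125 dollars, 494 hours
deadline_cost, deadline_hours = totals(deadline_plan)  # 210 dollars, 68 hours
```

Under this reading the costs match the slides exactly ($125 and $210). The first plan sums to 494 hours, close to the stated 20 days, and the second to 68 hours, inside the 3-day (72-hour) deadline.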
Outline • Motivation • Problem Formulation • Graph Model • Flow Over Time • Solution: Pandora • Experimental Results • Conclusion
Graph Model: Internet Links • Each site (Site A, Site B) has inet_in and inet_out vertices for Incoming/Outgoing BW • Internet link attributes: Capacity (Mb/s), Cost ($/GB), Transit time (almost instantaneous)
Graph Model: Shipment Links • Internet links as before: Capacity (Mb/s), Cost ($/GB), Transit time (almost instantaneous), with inet_in and inet_out vertices for Incoming/Outgoing BW • Each site (Site A, Site B) adds a ship_in vertex • Shipment link: Capacity (almost infinite), Cost: Shipping and Handling ($/Disk), Transit time (Hrs) • Disk Interface: BW e.g., 40 MB/s, Cost: Loading ($/GB)
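One way to hold this graph model in code is an edge record carrying capacity, linear cost, fixed cost, and transit time. The sketch below is an assumption-laden illustration: the `Edge` class, the `build_edges` helper, the hourly time unit, and the choice to put the per-GB internet cost on the link between `inet_out` and `inet_in` are all hypothetical; only the vertex names and attribute kinds come from the slides.

```python
from dataclasses import dataclass

INF = float("inf")


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    capacity_gb_per_hr: float      # INF models "almost infinite" shipment capacity
    cost_per_gb: float = 0.0       # linear cost (internet transfer, disk loading)
    fixed_cost: float = 0.0        # shipping and handling, charged per disk
    transit_hrs: int = 0           # 0 models "almost instantaneous"


def mbps_to_gb_per_hr(mbps: float) -> float:
    return mbps * 1e6 / 8 * 3600 / 1e9


def mb_per_s_to_gb_per_hr(mb_per_s: float) -> float:
    return mb_per_s * 1e6 * 3600 / 1e9


def build_edges(out_mbps, in_mbps, link_mbps, ship_cost, ship_hrs):
    """Edges from Site A to Site B over both the internet and shipping."""
    return [
        Edge("A", "A.inet_out", mbps_to_gb_per_hr(out_mbps)),      # outgoing BW
        Edge("A.inet_out", "B.inet_in", mbps_to_gb_per_hr(link_mbps),
             cost_per_gb=0.10),                                     # internet link
        Edge("B.inet_in", "B", mbps_to_gb_per_hr(in_mbps)),        # incoming BW
        Edge("A", "B.ship_in", INF, fixed_cost=ship_cost,
             transit_hrs=ship_hrs),                                 # shipment link
        Edge("B.ship_in", "B", mb_per_s_to_gb_per_hr(40),
             cost_per_gb=0.02),                                     # disk interface
    ]


edges = build_edges(out_mbps=10, in_mbps=20, link_mbps=10,
                    ship_cost=130.0, ship_hrs=17)
```

Splitting each site into `inet_in`, `inet_out`, and `ship_in` vertices lets a site-wide bandwidth or disk-interface limit be expressed as an ordinary edge capacity.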
Data Transfer Over Time • Goal: Meet time deadline T while minimizing dollar cost C • Hard problem on graph with both Internet and Shipment links • NP-Hard • Formal problem and proof in paper • Solution: Pandora computes optimal and approximate solutions
Solution: Pandora Overview • Transform into static time-expanded network • Decomposition of shipping edges • Solve min-cost flow on static network • Mixed Integer Program • Optimizations to reduce computation time
Time-expanded Network • Intuitively, incorporate time into graph to create an extended graph representation • Make T = deadline copies of each vertex • Draw edges according to transit time • Draw holdover edges • [Ford Fulkerson 58] • Disk shipment represented as time-expanded network (figure: edges with transit times τ = 3 and τ = 1, T = 5)
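The construction above can be sketched in a few lines. This is a generic time-expanded network builder, not Pandora's code: the `time_expand` name, the tuple encoding, and the choice to index the T copies as time steps 0 to T-1 are assumptions made for illustration.

```python
def time_expand(vertices, edges, T):
    """Build a static time-expanded edge list.

    vertices: iterable of vertex names
    edges:    list of (u, v, transit_time) with integer transit times
    T:        number of time copies per vertex, indexed 0 .. T-1 (assumed)
    Returns a list of ((u, t), (v, t2)) pairs.
    """
    expanded = []
    for t in range(T):
        for u, v, tau in edges:
            if t + tau < T:                        # arrival must fall inside the horizon
                expanded.append(((u, t), (v, t + tau)))
    for v in vertices:
        for t in range(T - 1):
            expanded.append(((v, t), (v, t + 1)))  # holdover: data waits at v
    return expanded


# Transit times 3 and 1 with T = 5, echoing the figure labels.
net = time_expand(["A", "B", "C"], [("A", "B", 3), ("B", "C", 1)], T=5)
```

With these inputs the tau = 3 edge gets 2 copies, the tau = 1 edge gets 4, and each of the three vertices gets 4 holdover edges, for 18 edges in total.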
Decomposed Shipping Edges • Decompose shipping edges to fixed cost edges • Transit time • Fixed cost • Capacity • Figure labels: capacity = 2 TB per edge; fixed costs $130, $110, $100
Solution: Min-cost Flow Calculation using Mixed-Integer Program • Fixed-cost edges make min-cost flow calculation NP-Hard • Mixed-Integer Program (MIP) • Binary variable y_e defined on fixed-cost edges • Goal: Minimize dollar cost • Subject to • Capacity constraints (flow_e ≤ capacity_e ∙ y_e) • Conservation of flow • Demands of sources and sink • Proof of NP-Hardness and formal MIP in paper
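For orientation, a generic fixed-charge min-cost flow program with the ingredients listed above might be written as follows. This is a textbook-style sketch, not the formal MIP from the paper. All symbols are assumptions introduced here: E_lin and E_fix are the linear-cost and fixed-cost edges of the static network, c_e is the per-unit cost, k_e the fixed cost, u_e the capacity, f_e the flow, y_e the binary indicator, and b_v the supply at vertex v (positive at sources, negative at the sink, zero elsewhere).

```latex
\begin{aligned}
\min \quad & \sum_{e \in E_{\text{lin}}} c_e f_e \;+\; \sum_{e \in E_{\text{fix}}} k_e y_e \\
\text{s.t.} \quad
& 0 \le f_e \le u_e && \forall e \in E_{\text{lin}} \\
& 0 \le f_e \le u_e \, y_e && \forall e \in E_{\text{fix}} \\
& \sum_{e \in \delta^{+}(v)} f_e \;-\; \sum_{e \in \delta^{-}(v)} f_e = b_v && \forall v \in V \\
& y_e \in \{0, 1\} && \forall e \in E_{\text{fix}}
\end{aligned}
```

The second constraint is the one quoted on the slide: flow can use a fixed-cost edge only when y_e = 1, and setting y_e = 1 charges the full fixed cost k_e no matter how little flow is sent.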
Optimizations: Overview • Size of MIP grows linearly with deadline T • Worst-case running time grows exponentially with T • Reduce size of the MIP • Reduce number of shipment edges • Δ-condensed time-expanded networks • More optimizations in paper
Optimizations: Reduce number of shipment edges • Can remove redundant shipment edges • Example: • Overnight shipment sent anytime before 4pm will arrive at destination at 8am (figure: timeline running from noon through 4pm, then 7am and 8am)
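The pruning idea can be illustrated with a small filter. This sketch assumes that waiting at the source is free (via holdover edges), so among shipment edges with the same arrival time and cost only the latest departure is needed. The `prune_shipments` name and the tuple layout are hypothetical.

```python
def prune_shipments(shipments):
    """Keep, for each (arrival, cost) pair, only the latest departure.

    shipments: list of (depart_hr, arrive_hr, cost) tuples.
    Assumes data can wait at the source at no cost, so an earlier departure
    with the same arrival time and cost never helps.
    """
    latest = {}
    for depart, arrive, cost in shipments:
        key = (arrive, cost)
        if key not in latest or depart > latest[key][0]:
            latest[key] = (depart, arrive, cost)
    return sorted(latest.values())


# Hourly overnight departures from noon (12) to 4pm (16), all arriving at
# 8am the next day (hour 32), at an illustrative cost of $50.
candidates = [(hour, 32, 50) for hour in range(12, 17)]
kept = prune_shipments(candidates)
```

Here the five candidate edges collapse to the single 4pm departure, which shrinks the time-expanded network without excluding any cheaper or faster plan under the stated assumption.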
Optimization: Δ-condensed Time-expanded Network • Each batch of consecutive Δ time units condensed into one virtual time unit • Solution has • Minimum cost • Deadline approximation depending on Δ • More details in paper • [Fleischer Skutella 07] • Figure: example with Δ = 2
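A rough sketch of the condensing step is shown below. It assumes that transit times are rounded up to whole multiples of Δ and that the horizon is divided by Δ; the `condense` helper and the example values are illustrative, and the exact rounding used by Pandora is described in the paper.

```python
import math


def condense(T, transit_times, delta):
    """Rescale a time horizon and transit times to units of delta.

    T:             deadline in original time units
    transit_times: dict mapping edge name -> transit time in original units
    delta:         number of original time units per virtual time unit
    Rounding up (an assumption here) keeps every condensed edge at least as
    slow as the original, at the price of an approximate deadline.
    """
    condensed_T = math.ceil(T / delta)
    condensed = {e: math.ceil(tau / delta) for e, tau in transit_times.items()}
    return condensed_T, condensed


# Illustrative values: a 96-hour deadline, a 17-hour shipment, an instantaneous link.
T_c, taus = condense(96, {"overnight": 17, "internet": 0}, delta=2)
```

With Δ = 2 the number of time layers halves (96 to 48), and the 17-hour shipment is treated as 9 virtual units, that is, 18 hours. That one-hour slack is the kind of deadline approximation the slide refers to.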
Experimental Setup • Trace-driven • Wrote scripts to communicate with FedEx web services: queried package rates and destination time • Internet BW from PlanetLab measurements • GNU Linear Programming Kit (GLPK)
Experimental Results: 8 sources, 0.25 TB per node, Heterogeneous BW • Direct Internet • Cost: $200 • Time: 280 hrs • Cannot take advantage of heterogeneous bandwidth • Direct Overnight • Cost: $1,500 • Time: 38 hrs • Cannot fill disks to capacity • Figure: sources 1-8 each send 0.25 TB directly to sink t; link width proportional to BW
Experimental Results: 8 sources, 0.25 TB per node, Heterogeneous BW • Pandora Deadline = 96 hrs • Cost: $183 • Time: < 96 hrs • Direct Internet • Cost: $200 • Time: 280 hrs • Cannot take advantage of heterogeneous bandwidth • Direct Overnight • Cost: $1,500 • Time: 38 hrs • Cannot fill disks to capacity • Figure: Pandora plan among sources 1-8 and sink t, with flows labeled 0.06 TB, 0.08 TB, 0.14 TB, 1.92 TB
Experimental Results: Optimizations • Reducing shipment edges decreases computation time • Using Δ-condensed time-expanded networks decreases computation time • Deadlines met in our experiments • Figure panels: 2 sources, 1 source
Conclusion • First ever solution to transfer data cooperatively between multiple sources with internet and shipping edges • Produce optimal transfer plans that obey time deadlines and minimize dollar cost • Better than internet-only and shipping-only strategies • Reasonable computation time by using optimizations