
Type and Workload Aware Scheduling of Large-Scale Wide-Area Data Transfers


Presentation Transcript


  1. Type and Workload Aware Scheduling of Large-Scale Wide-Area Data Transfers Raj Kettimuthu Advisors: Gagan Agrawal, P. Sadayappan

  2. Data Deluge Light Source Facilities Cosmology Genomics Climate

  3. Data Need to be Moved • Experimental or observational facility may not have large-scale storage • Dark energy survey • Computing power required for data analysis not available locally • Light source facilities • Specialized computing systems required for analysis • Visualization of large data sets • Data collected from multiple sources for analysis • Common requirement in genomics • Data replicated for efficient distribution, disaster recovery and other policy reasons • Climate data, data produced at Large Hadron Collider

  4. Data Transfer Requirements • Network requirements report for various science domains • Data transfers in science workflows broadly classified into three categories • Best-effort • Move data as soon as possible • Delayed if the load is high • Batch • Loose timing constraints due to manual steps in processing • Flexible mirroring requirements • Response-critical • Analysis of one experiment data guide selection of parameters for next experiment

  5. Wide-area File Transfer [Diagram: Site 1 and Site 2, each with a Data Transfer Node attached to Parallel Storage, connected over a Wide Area Network]

  6. State of the Art • Concurrent transfers often required to achieve high aggregate throughput • Schedule each request immediately with fixed concurrency • This approach has disadvantages • Under heavy load, completion times of all transfer tasks can suffer • Low utilization when the number of transfers is small • Provides best-effort service for all transfers • Need for efficient scheduling of multiple transfers • Improve aggregate performance and performance of individual flows • Differentiated service for different transfer types

  7. Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • Three new file transfer scheduling algorithms • SchEduler Aware of Load: SEAL • Controls scheduled load, maximize performance • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs

  8. GridFTP • High-performance, secure data transfer protocol optimized for high-bandwidth wide-area networks • Parallel TCP streams, PKI security for authentication, integrity and encryption, checkpointing for transfer restarts • Based on FTP protocol - defines extensions for high-performance operation and security • Globus implementation of GridFTP is widely used. • Globus GridFTP servers support usage statistics collection • Transfer type, size in bytes, start time of the transfer, transfer duration etc. are collected for each transfer

  9. Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • SchEduler Aware of Load: SEAL • Controls scheduled load, minimize slowdown • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs

  10. Shared Environment Makes Analytical Modeling Hard [Diagram: two Data Transfer Nodes, each connected through a shared SAN to storage]

  11. Transfer Load Stable Over Short Periods

  12. Data Driven Models • Combines historical data with a correction term for current external load • Takes three pieces of input • Signature for a given transfer • Concurrency level • Total known concurrency at source (“known load at source”) • Total known concurrency at destination (“known load at destination”) • Historical data • Transfer concurrency, known loads, and observed throughput for the source-destination pair • Signatures and observed throughputs from the most recent transfers for the source-destination pair • It produces an estimated throughput as an output.

  13. Data Driven Models • Signature for a given transfer • Concurrency level , known load at source, known load at destination • Historical data • Transfer concurrency, known loads, and observed throughput for the source-destination pair • Signatures and observed throughputs from the most recent transfers for the source-destination pair • Compute the difference between recent transfers and the historical average for the corresponding transfer signature • Gives an estimate of current external load • Determine the average historical throughput for given signature

  14. Data Driven Models • Takes three pieces of input • Signature for a given transfer • Concurrency level • Total known concurrency at source (“known load at source”) • Total known concurrency at destination (“known load at destination”) • File size • Historical data (signatures and observed throughputs) for the source-destination pair • Signatures and observed throughputs from the most recent transfers for the source-destination pair • It produces an estimated throughput as an output.
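
A minimal sketch of how such an estimate could be computed (the function and variable names below are illustrative, not the actual implementation; it assumes the history already contains entries for the given signature):

# Hypothetical sketch of the data-driven throughput model described above.
# A transfer "signature" is (concurrency, known load at source, known load at destination).
from statistics import mean

def estimate_throughput(signature, history, recent):
    """history: {signature: [observed throughputs in Gbps]} for this src-dst pair.
       recent:  [(signature, observed throughput)] for the most recent transfers.
       Returns an estimated throughput for a new transfer with this signature."""
    # 1. Average historical throughput for the given signature.
    base = mean(history[signature])
    # 2. Correction term: how much the most recent transfers deviate from their own
    #    historical averages, used as an estimate of current external load.
    deviations = [obs - mean(history[sig]) for sig, obs in recent if sig in history]
    correction = mean(deviations) if deviations else 0.0
    return base + correction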

  15. Data Driven Models – Experimental Setup (sites: TACC, SDSC, NCAR, PSC, NICS, Indiana)

  16. Data Driven Models - Validation

  17. Data Driven Models - Evaluation • Ratio experiments – allocate the available bandwidth at the source to destinations using a predefined ratio • Available bandwidth at Stampede is 9 Gbps • Ratio 1:2:2:3:3 for Mason, Kraken, Blacklight, Gordon, Yellowstone • Yellowstone = 3 × 9 Gbps / (1+2+2+3+3) = 27/11 ≈ 2.5 Gbps; Mason = 0.8 Gbps, Kraken = 1.6 Gbps, Blacklight = 1.6 Gbps, Gordon = 2.5 Gbps • Factoring experiments – increase a destination's throughput by a factor when the source is saturated • Baseline: Mason = 0.8 Gbps, Kraken = 1.6 Gbps, Blacklight = 1.6 Gbps, Gordon = 2.5 Gbps, Yellowstone = 2.5 Gbps • Target (Mason doubled): Mason = 1.6 Gbps, Kraken = X1 Gbps, Blacklight = X2 Gbps, Gordon = X3 Gbps, Yellowstone = X4 Gbps
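
The ratio arithmetic above can be reproduced with a few lines (a worked sketch of the calculation only, not part of the evaluation harness):

# Splitting the 9 Gbps available at Stampede among destinations in a 1:2:2:3:3 ratio.
ratios = {"Mason": 1, "Kraken": 2, "Blacklight": 2, "Gordon": 3, "Yellowstone": 3}
total_bw = 9.0  # Gbps
share = {dst: r * total_bw / sum(ratios.values()) for dst, r in ratios.items()}
# share ≈ {'Mason': 0.82, 'Kraken': 1.64, 'Blacklight': 1.64, 'Gordon': 2.45, 'Yellowstone': 2.45} Gbps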

  18. Model Evaluation – Ratio and Factoring Experiments • Increasing Gordon's baseline throughput by 2x: concurrency picked by the algorithm for Gordon was 5 • Ratios 4:5:6:8:9 for Kraken, Mason, Blacklight, Gordon, and Yellowstone: concurrencies picked by the algorithm were {1, 3, 3, 1, 1}

  19. Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • SchEduler Aware of Load: SEAL • Controls scheduled load, minimize slowdown • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs

  20. SEAL Motivation – Shared Resources [Diagram: two Data Transfer Nodes, each connected through a shared SAN to storage]

  21. SEAL Motivation – Concurrency Trends

  22. SEAL Motivation – Load Varies Greatly

  23. SEAL - Metrics • Turnaround time – time a job spends in the system: completion time − arrival time • Job slowdown – factor by which a job is slowed relative to its time on an unloaded system: turnaround time / processing time • Bounded slowdown, as used in parallel job scheduling, adapted for wide-area transfers • TTideal – estimated transfer time (TT) under zero load and ideal concurrency, computed using the models described before
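
A small sketch of these metrics in code (assuming the standard bounded-slowdown form from parallel job scheduling, with TTideal from the model as the reference time; the threshold value is an assumption for illustration):

def turnaround(completion_time, arrival_time):
    # Time a job spends in the system.
    return completion_time - arrival_time

def slowdown(turnaround_time, processing_time):
    # Factor by which the job is slowed relative to an unloaded system.
    return turnaround_time / processing_time

def bounded_slowdown(turnaround_time, tt_ideal, threshold=10.0):
    # Adapted for wide-area transfers: tt_ideal is the estimated transfer time
    # under zero load and ideal concurrency, computed from the throughput model.
    # The threshold keeps very small transfers from dominating the average.
    return max(turnaround_time / max(tt_ideal, threshold), 1.0)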

  24. SEAL – Problem definition • Stream of file transfer requests: <source, source file path, destination, destination file path, file size, arrival time> • Future requests not known a priori • Hosts have different capabilities (CPU, memory, disk, SAN, network interfaces, WAN connection) • Maximum achievable throughput differs for each <source host, destination host> pair • Load at a source, destination, and intervening network varies over time • Each host has a max concurrency limit • Schedule transfers to minimize average transfer slowdown

  25. SEAL Algorithm – Main Ideas • Queues to bound concurrency in high-load situations • Increases transfer concurrency during low-load situations • Prioritizes transfers based on their expected slowdown • Four key decisions • Should a new transfer be scheduled or queued? • If scheduled, what concurrency should be used? • When should a transfer be preempted? • When to change the concurrency of an active transfer? • Uses both the models and the observed performance of current transfers • Expected slowdown = est. transfer time under current load / est. transfer time under zero load

  26. SEAL Algorithm – Main Ideas
for task = W.peek() do
  if !saturated(src) and !saturated(dst) then
    Schedule task
  else
    if saturated(src) then
      CLsrc = FindTasksToPreempt(src, task)
    end if
    if saturated(dst) then
      CLdst = FindTasksToPreempt(dst, task)
    end if
    Preempt tasks in CLsrc ∪ CLdst and schedule task
  end if
end for
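
Expressed as runnable Python, the loop might look like the following sketch (the queue, saturation test, and preemption helpers are assumed interfaces rather than the actual SEAL implementation; FindTasksToPreempt is assumed to return a possibly empty set of lower-priority transfers):

def seal_schedule_pass(wait_queue, saturated, find_tasks_to_preempt, schedule, preempt):
    # One scheduling pass: walk the wait queue in priority order (W.peek()).
    for task in list(wait_queue):
        src, dst = task.source, task.destination
        if not saturated(src) and not saturated(dst):
            schedule(task)
            continue
        victims = set()
        if saturated(src):
            victims |= find_tasks_to_preempt(src, task)
        if saturated(dst):
            victims |= find_tasks_to_preempt(dst, task)
        if victims:
            # Assumed behavior: if no suitable victims are found, the task stays queued.
            for v in victims:
                preempt(v)
            schedule(task)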

  27. SEAL Algorithm - Illustration • One source and one destination • Total bandwidth – 5 Gbps • Width – expected runtime • Height – aggregate throughput • Average turnaround time is 10.92; average turnaround time for the baseline is 12.04

  28. SEAL Evaluation - Experimental setup (sites: TACC, SDSC, NCAR, PSC, NICS, Indiana)

  29. SEAL Evaluation - Workload traces • Traces from actual executions • Anonymized GridFTP usage statistics • Top 10 busiest servers over a 1-month period • Day on which those servers transferred the most bytes • Busiest (among the 10) server's log on that day • Length of logs limited due to the production environment • Three 15-minute logs - 25%, 45%, 60% load traces • Load = total bytes transferred / max. that can be transferred • Endpoints anonymized in logs • Weighted random split based on capacities

  30. SEAL Evaluation – Turnaround Time 60% Load

  31. SEAL Evaluation – Worst Case Performance 60% Load

  32. SEAL Evaluation - SEAL vs Improved Baseline – 60% Load

  33. Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • SchEduler Aware of Load: SEAL • Controls scheduled load, minimize slowdown • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs

  34. STEAL Motivation - Transfers Have Different Needs • Certain transfer requests require best-effort service • Some requests tolerate larger delays • Science network requirements reports provide several use cases • Replication often relatively time-insensitive • Science data does not change rapidly • Order of magnitude longer response time than average transfer time • Migration due to changes in storage system availability • Batch transfers • STEAL – differential treatment of best-effort and batch transfers

  35. STEAL - Metrics • Batch transfers • Acceptable to delay them • Slowdown is not a suitable metric • Use as much unused bandwidth as possible • Bi-objective problem • Maximize B_B / (B_T − B_I) for batch jobs • B_T – total BW, B_I – BW used by best-effort jobs, B_B – BW used by batch jobs • Maximize SD_I / SD_I+B for best-effort jobs • SD_I – average slowdown of best-effort jobs with no batch jobs • SD_I+B – average slowdown of best-effort jobs with batch jobs
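
The two objectives can be written down directly (a sketch; symbols follow the slide):

def batch_bw_utilization(bw_batch, bw_total, bw_best_effort):
    # B_B / (B_T - B_I): fraction of the bandwidth left over by best-effort
    # jobs that batch jobs manage to absorb.
    return bw_batch / (bw_total - bw_best_effort)

def best_effort_slowdown_ratio(sd_without_batch, sd_with_batch):
    # SD_I / SD_I+B: equals 1.0 when batch jobs do not hurt best-effort slowdown at all.
    return sd_without_batch / sd_with_batch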

  36. STEAL – Problem definition • Stream of file transfer requests: <source, source file path, destination, destination file path, file size, arrival time, batch> • Batch is a Boolean • Future requests not known a priori, resources involved have different capabilities, and loads vary over time • Batch transfer requests want to use only the unused bandwidth • They do not have any time constraints • Goal • Minimize average slowdown for best-effort transfers (or maximize the normalized average slowdown metric defined on the previous slide) • Maximize spare bandwidth utilization for batch transfers

  37. STEAL Algorithm – Main Ideas • Priority for best-effort transfers • Lower priority for batch tasks • xfactor_FT * 0.00001 • Best-effort tasks preempt batch tasks before low-priority best-effort tasks • Switch a batch task to best-effort when its xfactor goes above a certain value • No preemption of batch tasks by other batch tasks • Improves bandwidth utilization for batch tasks • Preemption by best-effort tasks helps higher-priority batch tasks
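
A sketch of the batch-priority rule described above (the xfactor computation is external; the promotion threshold is an assumed value for illustration):

BATCH_DISCOUNT = 0.00001     # from the slide: batch priority = xfactor * 0.00001
PROMOTE_XFACTOR = 100.0      # assumed threshold for switching a batch task to best-effort

def steal_priority(is_batch, xfactor):
    # Best-effort tasks keep their full xfactor as priority; batch tasks are heavily
    # discounted so best-effort tasks preempt them first. A batch task whose xfactor
    # grows past the threshold is treated as best-effort from then on.
    if is_batch and xfactor < PROMOTE_XFACTOR:
        return xfactor * BATCH_DISCOUNT
    return xfactor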

  38. STEAL Algorithm – Main Ideas • Same loop as SEAL, with two changes: batch tasks are scheduled last, and FindBatchTasksToPreempt is used so that batch tasks are preempted before best-effort tasks
for task = W.peek() do
  if !saturated(src) and !saturated(dst) then
    Schedule task
  else
    if saturated(src) then
      CLsrc = FindTasksToPreempt(src, task)
    end if
    if saturated(dst) then
      CLdst = FindTasksToPreempt(dst, task)
    end if
    Preempt tasks in CLsrc ∪ CLdst and schedule task
  end if
end for

  39. STEAL Evaluation - Experimental setup (sites: TACC, SDSC, NCAR, PSC, NICS, Indiana)

  40. STEAL Evaluation - Workload traces • Three 15-minute logs - 25%, 45%, and 60% load traces • 60% high-variation (60%-HV) trace with greater variation in the best-effort load • Original trace tasks treated as best-effort • 50 GB batch tasks added to consume unused bandwidth • Batch tasks available at the start of the schedule • SEAL2: batch tasks' xfactors increase as if they had arrived an hour before the start of the schedule

  41. STEAL Evaluation – 60% Load

  42. Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • SchEduler Aware of Load: SEAL • Controls scheduled load, minimize slowdown • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs

  43. RESEAL - Motivation [Diagram: data from an instrument node at the APS (ANL) moved via dtn.pnl.gov to the PIC Lustre FS (/pic/projects/cii) at PNNL; analyzed through remote desktop (ciiwin.pnl.gov) and ssh (cii.velo.pnnl.gov) with a Visus converter and custom analysis apps: IDL, AmeriCT, Biofilm Viewer, Visus]

  44. RESEAL - Motivation • State of the art is best-effort data movement • Efforts on reserving resources to provide quality of service • Advanced networks support reservation • LAN, DTN, SAN, storage system resources – no reservation • Backbone networks overprovisioned (source: https://my.es.net)

  45. RESEAL – Problem Definition • Stream of file transfer requests: <source, source file path, destination, destination file path, file size, arrival time, value function> • Value as a function of slowdown • Value = MaxValue if Slowdown ≤ Slowdown_max • Value = MaxValue × (Slowdown_zero − Slowdown) / (Slowdown_zero − Slowdown_max) otherwise • MaxValue = A + log(size) • Goal • Minimize average slowdown for best-effort transfers • Maximize aggregate value for response-critical transfers
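
The value function follows directly from the definition above (a sketch; the constant A and the log base are not specified on the slide, so the defaults here are assumptions):

import math

def max_value(file_size, A=1.0):
    # MaxValue = A + log(size): the maximum value grows slowly with transfer size.
    return A + math.log(file_size)

def value(slowdown, file_size, slowdown_max, slowdown_zero, A=1.0):
    mv = max_value(file_size, A)
    if slowdown <= slowdown_max:
        return mv        # full value as long as the slowdown bound is met
    # Value decays linearly and reaches zero at Slowdown_zero.
    return mv * (slowdown_zero - slowdown) / (slowdown_zero - slowdown_max)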

  46. RESEAL - Metrics • Best-effort – average slowdown / turnaround time • Response-critical – aggregate value • Bi-objective problem • Maximize AV_A / AV_T for response-critical jobs • AV_A – aggregate value achieved, AV_T – maximum possible aggregate value • Maximize SD_I / SD_I+B for best-effort jobs • SD_I – average slowdown of best-effort jobs with no batch jobs • SD_I+B – average slowdown of best-effort jobs with batch jobs

  47. Priority for Response-critical Transfers • Max: maximum value as the priority • MaxEx: accounts for both the maximum value and the expected value • priority = maximum value × maximum value / expected value • expected value = value(xfactor) • Under Max and MaxEx, all response-critical tasks get higher priority than best-effort tasks • MaxExNice: uses the same priority as MaxEx for response-critical tasks • RC tasks yield maximum value as long as they finish with a slowdown ≤ Slowdown_max • High-priority and low-priority RC tasks • Prioritize BE tasks over low-priority RC tasks
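
The three priority schemes might be sketched as follows (value() is the function from the problem-definition slide; the high/low-priority test for MaxExNice is an assumption, since the slide does not give the exact criterion):

def priority_max(task):
    # Max: the task's maximum achievable value is its priority.
    return task.max_value

def priority_maxex(task, xfactor):
    # MaxEx: maximum value * maximum value / expected value,
    # where expected value = value(xfactor).
    expected = task.value(xfactor)
    return task.max_value * task.max_value / expected if expected > 0 else float("inf")

def is_high_priority_rc(task, xfactor):
    # MaxExNice (assumed test): an RC task becomes high priority only when it risks
    # finishing with a slowdown above Slowdown_max; otherwise best-effort tasks go first.
    return xfactor > task.slowdown_max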

  48. RESEAL Algorithm – Main Ideas • SEAL loop, modified: high-priority RC tasks are scheduled first, then best-effort tasks (BE.peek()), then low-priority RC tasks; a task may also be scheduled when src/dst are saturated if it is small
for task = W.peek() do
  if (!saturated(src) and !saturated(dst)) or small then
    Schedule task
  else
    if saturated(src) then
      CLsrc = FindTasksToPreempt(src, task)
    end if
    if saturated(dst) then
      CLdst = FindTasksToPreempt(dst, task)
    end if
    Preempt tasks in CLsrc ∪ CLdst and schedule task
  end if
end for

  49. RESEAL Algorithm - Illustration • One source, one destination • Max throughput 1 GB/s • RC1, a 1 GB file, arrived before time X and is waiting in the queue • Src and dst saturated with other RC tasks • At time X + 1, RC2 and BE1 arrive • At X + 1, all running tasks complete • Only RC1, RC2, BE1 remain to be scheduled • xfactor of RC1 is 2.35 at X + 1 • xfactor of both RC2 and BE1 is 1 • No more tasks arrive until X + 5

  50. RESEAL Algorithm - Illustration
