Type and Workload Aware Scheduling of Large-Scale Wide-Area Data Transfers Raj Kettimuthu Advisors: Gagan Agrawal, P. Sadayappan
Data Deluge • Light Source Facilities • Cosmology • Genomics • Climate
Data Need to be Moved • Experimental or observational facility may not have large-scale storage • Dark energy survey • Computing power required for data analysis not available locally • Light source facilities • Specialized computing systems required for analysis • Visualization of large data sets • Data collected from multiple sources for analysis • Common requirement in genomics • Data replicated for efficient distribution, disaster recovery and other policy reasons • Climate data, data produced at Large Hadron Collider
Data Transfer Requirements • Network requirements reports for various science domains • Data transfers in science workflows broadly classified into three categories • Best-effort • Move data as soon as possible • Delayed if the load is high • Batch • Loose timing constraints due to manual steps in processing • Flexible mirroring requirements • Response-critical • Analysis of one experiment's data guides selection of parameters for the next experiment
Wide-area File Transfer [Diagram: a Data Transfer Node at Site 1 and a Data Transfer Node at Site 2, each backed by parallel storage, connected over a Wide Area Network]
State of the Art • Concurrent transfers often required to achieve high aggregate throughput • Schedule each request immediately with a fixed concurrency • This approach has disadvantages • Under heavy load, completion times of all transfer tasks can suffer • Low utilization when the number of transfers is small • Provides best-effort service for all transfers • Need for efficient scheduling of multiple transfers • Improve aggregate performance and performance of individual flows • Differentiated service for different transfer types
Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • Three new file transfer scheduling algorithms • SchEduler Aware of Load: SEAL • Controls scheduled load, maximize performance • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs
GridFTP • High-performance, secure data transfer protocol optimized for high-bandwidth wide-area networks • Parallel TCP streams, PKI security for authentication, integrity and encryption, checkpointing for transfer restarts • Based on FTP protocol - defines extensions for high-performance operation and security • Globus implementation of GridFTP is widely used. • Globus GridFTP servers support usage statistics collection • Transfer type, size in bytes, start time of the transfer, transfer duration etc. are collected for each transfer
Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • SchEduler Aware of Load: SEAL • Controls scheduled load, minimize slowdown • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs
Shared Environment Makes Analytical Modeling Hard [Diagram: Data Transfer Nodes at both ends connected through shared SANs to shared storage]
Data Driven Models • Combines historical data with a correction term for current external load • Takes three pieces of input • Signature for a given transfer • Concurrency level • Total known concurrency at source (“known load at source”) • Total known concurrency at destination (“known load at destination”) • Historical data • Transfer concurrency, known loads, and observed throughput for the source-destination pair • Signatures and observed throughputs from the most recent transfers for the source-destination pair • It produces an estimated throughput as an output.
Data Driven Models • Signature for a given transfer • Concurrency level, known load at source, known load at destination • Historical data • Transfer concurrency, known loads, and observed throughput for the source-destination pair • Signatures and observed throughputs from the most recent transfers for the source-destination pair • Compute the difference between recent transfers and the historical average for the corresponding transfer signature • Gives an estimate of current external load • Determine the average historical throughput for given signature
Data Driven Models • Takes three pieces of input • Signature for a given transfer • Concurrency level • Total known concurrency at source (“known load at source”) • Total known concurrency at destination (“known load at destination”) • File size • Historical data (signatures and observed throughputs) for the source-destination pair • Signatures and observed throughputs from the most recent transfers for the source-destination pair • It produces an estimated throughput as an output.
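A minimal sketch of such an estimate, assuming a simple mean-based correction (the dictionary layout and helper name are illustrative, not the exact model from the thesis):

```python
from statistics import mean

def estimate_throughput(signature, history, recent):
    """Estimate throughput for a transfer with the given signature.

    signature: (concurrency, known_load_src, known_load_dst)
    history:   {signature: [observed throughputs]} for this src-dst pair
    recent:    [(signature, observed throughput)] for the most recent transfers
    """
    # Average historical throughput for this signature
    # (assumes the signature has been seen before).
    base = mean(history[signature])

    # Correction term: how much recent transfers deviated from their own
    # historical averages, a proxy for the current external load.
    deviations = [obs - mean(history[sig]) for sig, obs in recent if sig in history]
    correction = mean(deviations) if deviations else 0.0

    return base + correction
```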
Data Driven Models – Experimental Setup [Testbed sites: TACC, SDSC, NCAR, PSC, NICS, Indiana]
Data Driven Models - Evaluation • Ratio experiments – allocate available bandwidth at the source to destinations using a predefined ratio • Available bandwidth at Stampede is 9 Gbps • 1:2:2:3:3 for Mason, Kraken, Blacklight, Gordon, Yellowstone • Yellowstone = 3*9 Gbps/(1+2+2+3+3) = 27/11 ≈ 2.5 Gbps • Mason = 0.8 Gbps, Kraken = 1.6 Gbps, Blacklight = 1.6 Gbps, Gordon = 2.5 Gbps • Factoring experiments – increase a destination's throughput by a factor when the source is saturated • Baseline: Mason = 0.8 Gbps, Kraken = 1.6 Gbps, Blacklight = 1.6 Gbps, Gordon = 2.5 Gbps, Yellowstone = 2.5 Gbps • Target: Mason = 1.6 Gbps, Kraken = X1 Gbps, Blacklight = X2 Gbps, Gordon = X3 Gbps, Yellowstone = X4 Gbps
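The ratio targets above can be reproduced with a short helper (the function is illustrative; the 9 Gbps capacity and site names come from the slide):

```python
def ratio_targets(total_gbps, ratios):
    """Split a saturated source's bandwidth among destinations by ratio."""
    denom = sum(ratios.values())
    return {dst: round(total_gbps * r / denom, 1) for dst, r in ratios.items()}

# 1:2:2:3:3 split of Stampede's ~9 Gbps across the five destinations.
print(ratio_targets(9, {"Mason": 1, "Kraken": 2, "Blacklight": 2,
                        "Gordon": 3, "Yellowstone": 3}))
# {'Mason': 0.8, 'Kraken': 1.6, 'Blacklight': 1.6, 'Gordon': 2.5, 'Yellowstone': 2.5}
```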
Model Evaluation – Ratio and Factoring Experiments • Factoring: increasing Gordon's baseline throughput by 2x; the concurrency picked by the algorithm for Gordon was 5 • Ratio: ratios are 4:5:6:8:9 for Kraken, Mason, Blacklight, Gordon, and Yellowstone; the concurrencies picked by the algorithm were {1, 3, 3, 1, 1}
Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • SchEduler Aware of Load: SEAL • Controls scheduled load, minimize slowdown • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs
SEAL Motivation – Shared Resources [Diagram: Data Transfer Nodes at both ends connected through shared SANs to shared storage]
SEAL - Metrics • Turnaround time – time a job spends in the system: completion time - arrival time • Job slowdown – factor by which a job is slowed relative to an unloaded system: turnaround time / processing time • Bounded slowdown in parallel job scheduling: max(1, turnaround time / max(run time, threshold)) • Bounded slowdown for wide-area transfers: max(1, turnaround time / max(TT_ideal, threshold)) • TT_ideal – estimated transfer time (TT) under zero load and ideal concurrency, computed using the models described before
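A small sketch of these metrics in code; the threshold value is an assumed parameter, following the usual bounded-slowdown formulation:

```python
def turnaround(completion_time, arrival_time):
    return completion_time - arrival_time

def slowdown(turnaround_time, tt_ideal):
    # tt_ideal: estimated transfer time under zero load and ideal concurrency,
    # obtained from the throughput model.
    return turnaround_time / tt_ideal

def bounded_slowdown(turnaround_time, tt_ideal, threshold=10.0):
    # Bounding the denominator keeps very short transfers from
    # dominating the average slowdown.
    return max(turnaround_time / max(tt_ideal, threshold), 1.0)
```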
SEAL – Problem definition • Stream of file transfer requests: &lt;source, source file path, destination, destination file path, file size, arrival time&gt; • Future requests not known a priori • Hosts have different capabilities (CPU, memory, disk, SAN, network interfaces, WAN connection) • Maximum achievable throughput differs for each &lt;source host, destination host&gt; pair • Load at a source, destination, and intervening network varies over time • Each host has a max concurrency limit • Schedule transfers to minimize average transfer slowdown
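For concreteness, the request tuple could be represented as below (the field names and types are illustrative, not taken from the implementation):

```python
from dataclasses import dataclass

@dataclass
class TransferRequest:
    source: str            # source host
    source_path: str       # source file path
    destination: str       # destination host
    destination_path: str  # destination file path
    size_bytes: int        # file size
    arrival_time: float    # when the request entered the system
```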
SEAL Algorithm – Main Ideas • Queues to bound concurrency in high-load situations • Increases transfer concurrency during low-load situations • Prioritizes transfers based on their expected slowdown • Four key decisions • Should a new transfer be scheduled or queued? • If scheduled, what concurrency should be used? • When should an active transfer be preempted? • When should the concurrency of an active transfer be changed? • Uses both the models and the observed performance of current transfers • Expected slowdown = est. transfer time under current load / est. transfer time under zero load
SEAL Algorithm – Main ideas
for task = W.peek() do
  if !saturated(src) and !saturated(dst) then
    Schedule task
  else
    if saturated(src) then
      CL_src = FindTasksToPreempt(src, task)
    end if
    if saturated(dst) then
      CL_dst = FindTasksToPreempt(dst, task)
    end if
    Preempt tasks in CL_src ∪ CL_dst and schedule task
  end if
end for
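The same loop, rendered as a Python sketch; saturated, find_tasks_to_preempt, schedule, and preempt stand in for the scheduler's internals and are assumptions, not the actual implementation:

```python
def seal_pass(wait_queue, saturated, find_tasks_to_preempt, schedule, preempt):
    """One pass over the wait queue, mirroring the SEAL pseudocode above."""
    for task in list(wait_queue):                      # W.peek() over waiting tasks
        src, dst = task.source, task.destination
        if not saturated(src) and not saturated(dst):
            schedule(task)
        else:
            victims = set()
            if saturated(src):
                victims |= find_tasks_to_preempt(src, task)   # CL_src
            if saturated(dst):
                victims |= find_tasks_to_preempt(dst, task)   # CL_dst
            for victim in victims:                     # preempt CL_src ∪ CL_dst
                preempt(victim)
            schedule(task)
```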
SEAL Algorithm - Illustration • One source and one destination • Total bandwidth – 5Gbps • Width – expected runtime • Height – aggregate throughput Average turnaround time is 10.92 Average turnaround time for baseline is 12.04
SEAL Evaluation - Experimental setup [Testbed sites: TACC, SDSC, NCAR, PSC, NICS, Indiana]
SEAL Evaluation - Workload traces • Traces from actual executions • Anonymized GridFTP usage statistics • Top 10 busiest servers over a one-month period • Day on which those servers transferred the most bytes • Log of the busiest (among the 10) server on that day • Length of logs limited due to the production environment • Three 15-minute logs - 25%, 45%, 60% load traces • load = total bytes transferred / max. that can be transferred • Endpoints anonymized in logs • Weighted random split based on capacities
Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • SchEduler Aware of Load: SEAL • Controls scheduled load, minimize slowdown • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs
STEAL Motivation - Transfers Have Different Needs • Certain transfer requests require best-effort service • Some requests tolerate larger delays • Science network requirements reports provide several use cases • Replication often relatively time-insensitive • Science data does not change rapidly • Order of magnitude longer response time than average transfer time • Migration due to changes in storage system availability • Batch transfers • STEAL – differential treatment of best-effort and batch transfers
STEAL - Metrics • Batch transfers • Acceptable to delay them • Slowdown is not a suitable metric • Use as much unused bandwidth as possible • Bi-objective problem • Maximize B_B / (B_T – B_I) for batch jobs • B_T – total BW, B_I – BW used by best-effort jobs, B_B – BW used by batch jobs • Maximize SD_I / SD_I+B for best-effort jobs • SD_I – average slowdown of best-effort jobs with no batch jobs • SD_I+B – average slowdown of best-effort jobs with batch jobs
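Written out as code (variable names are illustrative):

```python
def batch_bandwidth_utilization(b_total, b_best_effort, b_batch):
    # Fraction of the bandwidth left over by best-effort jobs
    # that batch jobs manage to use: B_B / (B_T - B_I).
    return b_batch / (b_total - b_best_effort)

def normalized_best_effort_slowdown(sd_alone, sd_with_batch):
    # SD_I / SD_I+B: a value of 1.0 means adding batch jobs
    # did not hurt best-effort slowdown at all.
    return sd_alone / sd_with_batch
```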
STEAL – Problem definition • Stream of file transfer requests: &lt;source, source file path, destination, destination file path, file size, arrival time, batch&gt; • Batch is a Boolean • Future requests not known a priori, resources involved have different capabilities, and loads vary over time • Batch transfer requests want to use only the unused bandwidth • Do not have any time constraints • Goal • Minimize average slowdown for best-effort transfers (or maximize the normalized average slowdown metric defined in the previous slide) • Maximize spare bandwidth utilization for batch transfers
STEAL Algorithm – Main Ideas • Priority for best-effort transfers • Lower priority for batch tasks • xfactor_FT × 0.00001 • Best-effort tasks preempt batch tasks before low-priority best-effort tasks • Switch to best-effort when xfactor goes above a certain value • No preemption of batch tasks by other batch tasks • Improves bandwidth utilization for batch tasks • Preemption by best-effort tasks helps higher-priority batch tasks
STEAL Algorithm – Main Ideas
Same loop as SEAL, with two annotations: batch tasks are scheduled last, and batch tasks are considered for preemption first (FindBatchTasksToPreempt).
for task = W.peek() do                                (batch tasks scheduled last)
  if !saturated(src) and !saturated(dst) then
    Schedule task
  else
    if saturated(src) then
      CL_src = FindTasksToPreempt(src, task)          (FindBatchTasksToPreempt first)
    end if
    if saturated(dst) then
      CL_dst = FindTasksToPreempt(dst, task)          (FindBatchTasksToPreempt first)
    end if
    Preempt tasks in CL_src ∪ CL_dst and schedule task
  end if
end for
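A sketch of how those two changes could sit on top of the SEAL loop; the helper names and signatures are assumptions, only the ordering and the batch-first preemption come from the slides:

```python
def order_wait_queue(wait_queue, is_batch):
    """STEAL change 1: batch tasks are scheduled after all best-effort tasks."""
    # sorted() is stable, so ordering within each class is preserved;
    # False (best-effort) sorts before True (batch).
    return sorted(wait_queue, key=is_batch)

def find_tasks_to_preempt(endpoint, new_task, running_tasks, is_batch, seal_find):
    """STEAL change 2: preempt batch tasks first; fall back to SEAL's choice of
    low-priority best-effort victims only if no batch task is running."""
    batch_victims = {t for t in running_tasks if is_batch(t)}
    if batch_victims:
        return batch_victims
    return seal_find(endpoint, new_task)
```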
STEAL Evaluation - Experimental setup [Testbed sites: TACC, SDSC, NCAR, PSC, NICS, Indiana]
STEAL Evaluation - Workload traces • Three 15-minute logs - 25%, 45%, and 60% load traces • 60% high variation (60%-HV) trace with greater variation in the best-effort load • Original trace tasks treated as best-effort • 50 GB batch tasks added to consume unused bandwidth • Batch tasks available at the start of the schedule • SEAL2: batch tasks' xfactors increase as if they had arrived an hour before the start of the schedule
Our Contributions • Model – predict and control throughput • Data-driven modeling using experimental data • SchEduler Aware of Load: SEAL • Controls scheduled load, minimize slowdown • Scheduler TypE Aware and Load aware: STEAL • Differential treatment of best-effort & batch jobs • Response-critical Enabled SEAL: RESEAL • Differentiates best-effort & response-critical jobs
RESEAL - Motivation [Diagram: light source workflow from the APS instrument node at ANL to PNNL - data moved via dtn.pnl.gov to the PIC Lustre FS (/pic/projects/cii) and analyzed through cii.velo.pnnl.gov / ciiwin.pnl.gov (ssh, Remote Desktop) with custom analysis apps: IDL, AmeriCT, Biofilm Viewer, Visus, Visus Converter]
RESEAL - Motivation • State of the art is best-effort data movement • Existing efforts reserve resources to provide quality of service • Advanced networks support reservation • LAN, DTN, SAN, storage system resources – no reservation • Backbone networks overprovisioned Source: https://my.es.net
RESEAL – Problem Definition • Stream of file transfer requests: &lt;source, source file path, destination, destination file path, file size, arrival time, value function&gt; • Value as a function of slowdown • Max Value if Slowdown ≤ Slowdown_max • Max Value * (Slowdown_zero - Slowdown) / (Slowdown_zero - Slowdown_max) otherwise • Max Value = ‘A + log(size)’ • Goal • Minimize average slowdown for best-effort transfers • Maximize aggregate value for response-critical transfers
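The value function, written out directly from the definition above (the constant A is left as a tunable parameter; whether the value is clamped at zero beyond Slowdown_zero is not specified on the slide, so this sketch does not clamp it):

```python
import math

def max_value(size_bytes, a=1.0):
    # Slide: Max Value = 'A + log(size)'; A is a tunable constant.
    return a + math.log(size_bytes)

def value(slowdown, size_bytes, slowdown_max, slowdown_zero, a=1.0):
    """Value earned by a response-critical transfer as a function of slowdown."""
    v_max = max_value(size_bytes, a)
    if slowdown <= slowdown_max:
        return v_max
    # Linear decay from v_max at Slowdown_max down to 0 at Slowdown_zero.
    return v_max * (slowdown_zero - slowdown) / (slowdown_zero - slowdown_max)
```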
RESEAL - Metrics • Best-effort – average slowdown / turnaround time • Response-critical – aggregate value • Bi-objective problem • Maximize AV_A / AV_T for response-critical jobs • AV_A – aggregate value achieved, AV_T – maximum possible aggregate value • Maximize SD_I / SD_I+RC for best-effort jobs • SD_I – average slowdown of best-effort jobs with no response-critical jobs • SD_I+RC – average slowdown of best-effort jobs with response-critical jobs
Priority for Response-critical Transfers • Max: maximum value as the priority • MaxEx: accounts for both the maximum value and the expected value • maximum value * maximum value / expected value • expected value = value(xfactor) • All response-critical tasks get higher priority than best-effort tasks • MaxExNice: uses the same priority as MaxEx for response-critical tasks • RC tasks yield maximum value as long as they finish with a slowdown ≤ Slowdown_max • High-priority and low-priority RC tasks • Prioritize BE tasks over low-priority RC tasks
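The Max and MaxEx priorities as small functions (MaxExNice reuses the MaxEx priority but lets best-effort tasks go ahead of low-priority RC tasks); the argument names are illustrative:

```python
def priority_max(task_max_value):
    # Max: the maximum attainable value is the priority.
    return task_max_value

def priority_maxex(task_max_value, expected_value):
    # MaxEx: expected value = value(current xfactor), using the value
    # function sketched earlier; a lower expected value raises the priority.
    return task_max_value * (task_max_value / expected_value)
```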
RESEAL Algorithm – Main Ideas
SEAL loop, extended: high-priority response-critical tasks are scheduled first, the best-effort queue (BE) is served next, and low-priority response-critical tasks are scheduled last.
Schedule high-priority RC tasks
for task = W.peek() do                                (BE.peek() – best-effort queue)
  if (!saturated(src) and !saturated(dst)) or small then
    Schedule task
  else
    if saturated(src) then
      CL_src = FindTasksToPreempt(src, task)
    end if
    if saturated(dst) then
      CL_dst = FindTasksToPreempt(dst, task)
    end if
    Preempt tasks in CL_src ∪ CL_dst and schedule task
  end if
end for
Schedule low-priority RC tasks
RESEAL Algorithm - Illustration • One source, one destination • Max throughput 1 GB/s • RC1, a 1 GB file, arrived before time X and is waiting in the queue • Src, dst saturated with other RC tasks • At time X + 1, RC2 and BE1 arrive • At X + 1, all running tasks complete • Only RC1, RC2, BE1 to be scheduled • xfactor of RC1 is 2.35 at X + 1 • xfactor of both RC2 and BE1 is 1 • No more tasks until X + 5