A Flexible GridFTP Client for Implementation of Intelligent Cloud Data Scheduling Services DATACLOUD 2013 Esma Yildirim Department of Computer Engineering Fatih University Istanbul, Turkey
Outline
• Data Scheduling Services in the Cloud
• File Transfer Scheduling Problem History
• Implementation Details of the Client
• Example Algorithms
• Amazon EC2 Experiments
• Conclusions
Cloud Data Scheduling Services
• Data clouds strive for novel services for the management, analysis, access and scheduling of Big Data
• Application-level protocols that provide high performance on high-speed networks are an integral part of data scheduling services
• GridFTP and UDP-based protocols are used frequently in modern-day schedulers (e.g. GlobusOnline, StorkCloud)
Bottlenecks in Data Scheduling Services
• Data is large, diverse and complex
• Transferring large datasets faces many bottlenecks:
  • Transport protocol's underutilization of the network
  • End-system limitations (e.g. CPU, NIC and disk speed)
  • Dataset characteristics: many short-duration transfers, connection startup and teardown overhead
Optimizations in the GridFTP Protocol: Pipelining, Parallelism and Concurrency
• Pipelining: issues transfer commands back-to-back without waiting for the previous file to finish, hiding round-trip delays between small files
• Parallelism: splits a single file across multiple parallel TCP streams
• Concurrency: transfers multiple files simultaneously over separate connections
Application in Data Scheduling Services
• Setting optimal parameters for different datasets is a challenging task
• Data scheduling services set static values based on experience
• The provided tools do not accommodate dynamic, intelligent algorithms that might change settings on the fly
Goals of the Flexible Client
• Flexibility for scalable data scheduling algorithms
• On-the-fly changes to the optimization parameters
• Reshaping of dataset characteristics
File Transfer Scheduling Problem
• Lies at the origin of data scheduling services
• Dates back to the 1980s
• Earliest approaches: list scheduling
  • Sort the transfers based on size, bandwidth of the path or duration of the transfer
  • Near-optimal solution
• Integer programming – not feasible to implement
File Transfer Scheduling Problem
• Scalable approaches:
  • Transferring from multiple replicas
  • Dividing datasets and sending them over different paths to make use of additional network bandwidth
• Adaptive approaches:
  • Divide files into multiple portions to send over parallel streams
  • Divide the dataset into multiple portions and send them at the same time
  • Adaptively change the level of concurrency or parallelism based on network throughput
• Optimization algorithms:
  • Find optimal settings via modeling and set the optimal parameters once and for all
File Transfer Scheduling Problem
• Modern-day data scheduling service examples:
  • Globus Online
    • Hosted SaaS
    • Statically sets pipelining, concurrency and parallelism
  • Stork
    • Multi-protocol support
    • Finds the optimal parallelism level based on modeling
    • Static job concurrency
Ideal Client Interface
• Allow dataset transfers to be (see the sketch below):
  • Enqueued and dequeued
  • Sorted based on a property
  • Divided into and combined from chunks
  • Grouped by source-destination paths
  • Performed from multiple replicas
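A minimal sketch of those list operations, assuming each entry carries src, dst and size attributes (the file structure described on the next slide); the function names are illustrative, not the client's actual API.

    def divide(files, n):
        """Split a dataset into chunks of at most n files."""
        return [files[i:i + n] for i in range(0, len(files), n)]

    def combine(chunks):
        """Merge chunks back into a single flat list."""
        return [f for chunk in chunks for f in chunk]

    def group_by_path(files):
        """Group entries that share the same source-destination pair."""
        groups = {}
        for f in files:
            groups.setdefault((f.src, f.dst), []).append(f)
        return groups

    # Sorting by a property, e.g. smallest files first:
    # files.sort(key=lambda f: f.size)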
Implementation Details
• Shortcomings of globus-url-copy:
  • Does not allow even a static setting of the pipelining level: -pp turns pipelining on, but the depth is an internal default invisible to the user
        globus-url-copy -pp -p 5 -cc 4 srcurl desturl
  • A directory of files cannot be divided into chunks with different optimization parameters
  • The file list option (-f) helps, but, as the developers indicate, pipelining cannot be applied to the list
        globus-url-copy -pp -p 5 -cc 4 -f filelist.txt
Implementation Details
• File data structure properties:
  • File size: used to construct data chunks based on total size, and for throughput and transfer duration calculations
  • Source and destination paths: necessary for combining and dividing datasets and for changing the source path based on replica location
  • File name: necessary to reconstruct full paths
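A sketch of such a file structure; the field and method names are assumptions for illustration, not the client's actual definitions.

    from dataclasses import dataclass

    @dataclass
    class FileEntry:
        name: str   # file name, needed to reconstruct full paths
        src: str    # source directory path (may be switched to another replica)
        dst: str    # destination directory path
        size: int   # file size in bytes; drives chunking and duration estimates

        def src_url(self):
            return self.src.rstrip("/") + "/" + self.name

    def estimated_duration(files, throughput_bps):
        """Expected transfer time of a chunk at a given throughput (bits/s)."""
        return sum(f.size for f in files) * 8 / throughput_bps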
Implementation Details
• Listing the files for a given path:
  • Contacts the GridFTP server
  • Pulls information about the files in the given path
  • Provides a list of file data structures, including the number of files
  • Makes it easier to divide, combine, sort, enqueue and dequeue on a list of files
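A sketch of the listing step, assuming a hypothetical gridftp_list(path) helper that wraps the server's directory listing and returns (name, size) pairs; it builds on the FileEntry sketch above.

    def list_files(src_dir, dst_dir, gridftp_list):
        """Build FileEntry structures for every file the server
        reports under src_dir (gridftp_list is an assumed wrapper)."""
        return [FileEntry(name=n, src=src_dir, dst=dst_dir, size=s)
                for n, s in gridftp_list(src_dir)]

    # The resulting list is easy to manipulate, e.g.:
    # entries.sort(key=lambda f: f.size)   # sort by a property
    # chunks = divide(entries, 100)        # divide into chunks (earlier sketch)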
Implementation Details
• Performing the actual transfer:
  • Sets the optimization parameters on a list of files returned by the list function and manipulated by different algorithms
  • For a data chunk, it sets the parallel stream, concurrency and pipelining values
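A sketch of the transfer step, assuming a hypothetical gridftp_transfer(pairs, p, cc, pp) primitive that wraps the underlying GridFTP client call and blocks until the chunk completes.

    import time

    def transfer_chunk(files, p, cc, pp, gridftp_transfer):
        """Transfer one chunk with the given parallelism (p),
        concurrency (cc) and pipelining (pp) values; return the
        achieved throughput in Mbps for use by adaptive algorithms."""
        pairs = [(f.src_url(), f.dst.rstrip("/") + "/" + f.name)
                 for f in files]
        start = time.time()
        gridftp_transfer(pairs, p=p, cc=cc, pp=pp)
        elapsed = time.time() - start
        return sum(f.size for f in files) * 8 / (elapsed * 1e6)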
Example Algorithm 1: Adaptive Concurrency
• Takes a file list structure returned by the list function as input
• Divides the file list into chunks based on the number of files in a chunk
• Starting with a concurrency level of 1, transfers each chunk with an exponentially increasing concurrency level as long as the throughput increases with each chunk transfer
• If the throughput drops, the concurrency level is decreased adaptively for the subsequent chunk transfer (see the sketch below)
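A sketch of that adaptive loop; transfer(chunk, cc) stands for a blocking chunk transfer that returns the achieved throughput (an assumed callable, e.g. built from the transfer_chunk sketch above).

    def adaptive_concurrency(chunks, transfer):
        """transfer(chunk, cc) performs the transfer and returns
        the achieved throughput for that chunk."""
        cc, prev = 1, 0.0
        for chunk in chunks:
            thr = transfer(chunk, cc)
            if thr > prev:
                cc *= 2                # throughput still rising: double cc
            else:
                cc = max(1, cc // 2)   # throughput dropped: back off
            prev = thr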
Example Algorithm 2: Optimal Pipelining
• A mean-based algorithm constructs clusters of files with different optimal pipelining levels
• Calculates the optimal pipelining level by dividing the BDP by the mean file size of the chunk
• The dataset is recursively divided at the mean file size index as long as the following conditions are met (see the sketch after this list):
  • A chunk can only be divided further if its pipelining level differs from its parent chunk's
  • A chunk cannot be smaller than a preset minimum chunk size
  • The optimal pipelining level for a chunk cannot be greater than a preset maximum pipelining level
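A sketch of the mean-based recursive partitioning, assuming a file list sorted by size and parameters bdp (bandwidth-delay product in bytes), min_files (the minimum chunk size, taken here as a file count) and max_pp; the stopping conditions are paraphrased from the slide, not the paper's code.

    def opt_pp(files, bdp, max_pp):
        """Optimal pipelining level: BDP divided by the mean file size."""
        mean = sum(f.size for f in files) / len(files)
        return min(max_pp, max(1, round(bdp / mean)))

    def partition(files, bdp, max_pp, min_files, parent_pp=None):
        """Recursively split a size-sorted file list at the mean file
        size index; returns a list of (chunk, pp) pairs."""
        pp = opt_pp(files, bdp, max_pp)
        if parent_pp is not None and pp == parent_pp:
            return [(files, pp)]    # same level as parent: stop dividing
        mean = sum(f.size for f in files) / len(files)
        i = next((k for k, f in enumerate(files) if f.size > mean),
                 len(files))
        if i < min_files or len(files) - i < min_files:
            return [(files, pp)]    # a half would fall below the minimum
        return (partition(files[:i], bdp, max_pp, min_files, pp) +
                partition(files[i:], bdp, max_pp, min_files, pp))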
Example Algorithm 2-b: Optimal Pipelining and Concurrency
• After the recursive division of chunks, pp_opt is set for each chunk
• Chunks go through a revision phase where smaller chunks are combined and larger chunks are divided further
• Starting with cc = 1, chunks are transferred with exponentially increasing cc levels until the throughput drops
• The remaining chunks are transferred with the optimal cc level (see the sketch below)
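A sketch of the concurrency probing over the pp-annotated chunks; transfer(files, pp, cc) again stands for an assumed blocking chunk transfer that returns the achieved throughput.

    def transfer_with_opt_cc(chunks, transfer):
        """chunks: (files, pp) pairs from the partitioning step."""
        cc, prev = 1, 0.0
        it = iter(chunks)
        for files, pp in it:
            thr = transfer(files, pp, cc)
            if thr <= prev and cc > 1:
                cc //= 2            # throughput dropped: previous level wins
                break
            prev = thr
            cc *= 2                 # probe the next exponential cc level
        for files, pp in it:        # remaining chunks at the settled cc
            transfer(files, pp, cc)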
Amazon EC2 Experiments
• Large nodes with 2 vCPUs, 8 GB storage, 7.5 GB memory and moderate network performance
• 50 ms artificial delay
• Globus Provision is used for automatic setup of the servers
• Datasets comprise many small files (the most difficult optimization case):
  • 5000 1 MB files
  • 1000 random-size files in the range 1 byte to 10 MB
Amazon EC2 Experiments: 5000 1 MB Files
• Baseline performance: default pipelining + data channel caching
• The achieved throughput is higher than the baseline for the majority of cases
Conclusions
• The flexible GridFTP client can accommodate data scheduling algorithms of different natures
• Adaptive and optimization algorithms can easily sort, divide and combine datasets
• It opens the possibility of implementing intelligent cloud scheduling services in an easier way