
A Flexible GridFTP Client for Implementation of Intelligent Cloud Data Scheduling Services


Presentation Transcript


  1. A Flexible GridFTP Client for Implementation of Intelligent Cloud Data Scheduling Services DATACLOUD 2013 Esma Yildirim Department of Computer Engineering Fatih University Istanbul, Turkey

  2. Outline Data Scheduling Services in the Cloud File Transfer Scheduling Problem History Implementation Details of the Client Example Algorithms Amazon EC2 Experiments Conclusions

  3. Cloud Data Scheduling Services • Data Clouds strive for novel services for the management, analysis, access and scheduling of Big Data • Application-level protocols that provide high performance on high-speed networks are an integral part of data scheduling services • GridFTP and UDP-based protocols are frequently used in modern-day schedulers (e.g. Globus Online, StorkCloud)

  4. Bottlenecks in Data Scheduling Services • Data is large, diverse and complex • Transferring large datasets faces many bottlenecks • The transport protocol's underutilization of the network • End-system limitations (e.g. CPU, NIC and disk speed) • Dataset characteristics • Many short-duration transfers • Connection startup and teardown overhead

  5. Optimizations in GridFTP Protocol: Pipelining, Parallelism and Concurrency

  6. Application in Data Scheduling Services • Setting optimal parameters for different datasets is a challenging task • Data scheduling services set static values based on experience • The provided tools do not support dynamic, intelligent algorithms that might change settings on the fly

  7. Goals of the Flexible Client • Flexibility to support scalable data scheduling algorithms • On-the-fly changes to the optimization parameters • Reshaping of the dataset characteristics

  8. File Transfer Scheduling Problem • Lies at the origin of data scheduling services • Dates back to the 1980s • Earliest approaches: list scheduling • Sort the transfers based on size, bandwidth of the path or duration of the transfer • Near-optimal solutions • Integer programming: not feasible to implement

  9. File Transfer Scheduling Problem • Scalable approaches: • Transferring from multiple replicas • Divided datasets sent over different paths to make use of additional network bandwidth • Adaptive approaches • Divide files into multiple portions to send over parallel streams • Divide dataset into multiple portions and send at the same time • Adaptively change level of concurrency or parallelism based on network throughput • Optimization algorithms • Find optimal settings via modeling and set the optimal parameters once and for all

  10. File Transfer Scheduling Problem • Modern-day data scheduling service examples • Globus Online • Hosted SaaS • Statically set pipelining, concurrency and parallelism • Stork • Multi-protocol support • Finds the optimal parallelism level based on modeling • Static job concurrency

  11. Ideal Client Interface • Allow dataset transfers to be • Enqueued, dequeued • Sorted based on a property • Divided into and combined from chunks • Grouped by source-destination paths • Performed from multiple replicas

  12. Implementation Details • Shortcomings of globus-url-copy • Does not allow even a static setting of pipelining; it uses its own default value, invisible to the user • globus-url-copy -pp -p 5 -cc 4 srcurl desturl • A directory of files cannot be divided and given different optimization parameters per portion • The file-list option helps, but it cannot apply pipelining on the list, as the developers indicate • globus-url-copy -pp -p 5 -cc 4 -f filelist.txt

  13. Implementation Details • File data structure properties (a sketch follows below) • File size: used to construct data chunks based on total size, and for throughput and transfer duration calculations • Source and destination paths: necessary for combining and dividing datasets, and for changing the source path based on replica location • File name: necessary to reconstruct full paths
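
  A minimal sketch of such a file data structure in Python. All names here (FileEntry and its fields) are hypothetical illustrations, not the client's actual API:

    from dataclasses import dataclass

    # Hypothetical sketch of the file data structure; names are illustrative.
    @dataclass
    class FileEntry:
        name: str       # file name, used to reconstruct full paths
        size: int       # size in bytes; drives chunking and throughput/duration estimates
        src_path: str   # source directory; may be rewritten to point at another replica
        dst_path: str   # destination directory

        def src_url(self) -> str:
            return self.src_path.rstrip("/") + "/" + self.name

        def dst_url(self) -> str:
            return self.dst_path.rstrip("/") + "/" + self.name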

  14. Implementation Details • Listing the files for a given path • Contacts the GridFTP server • Pulls information about the files in the given path • Returns a list of file data structures, including the number of files • Makes it easier to divide, combine, sort, enqueue and dequeue on a list of files (see the sketch below)
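
  The list manipulations this enables might look like the following sketch, reusing the hypothetical FileEntry above (the real client's listing talks to the GridFTP server; only the manipulations are shown here):

    from typing import List

    def sort_by_size(files: List[FileEntry]) -> List[FileEntry]:
        # Sort a listing by one property; file size is used here.
        return sorted(files, key=lambda f: f.size)

    def divide(files: List[FileEntry], per_chunk: int) -> List[List[FileEntry]]:
        # Divide a listing into chunks of `per_chunk` files each.
        return [files[i:i + per_chunk] for i in range(0, len(files), per_chunk)]

    def combine(chunks: List[List[FileEntry]]) -> List[FileEntry]:
        # Combine chunks back into a single flat listing.
        return [f for chunk in chunks for f in chunk]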

  15. Implementation Details • Performing the actual transfer • Sets the optimization parameters on a list of files returned by the list function and manipulated by different algorithms • For a data chunk, it sets the parallel stream, concurrency and pipelining values (sketched below)
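
  A sketch of what such a transfer entry point could look like. The signature shows where the three parameters are applied per chunk; the body is a placeholder so the algorithm sketches further down are runnable, since the real client drives GridFTP and measures actual throughput:

    import random
    from typing import List

    def transfer_chunk(chunk: List[FileEntry], parallelism: int = 1,
                       concurrency: int = 1, pipelining: int = 1) -> float:
        # Apply the three optimization parameters to one data chunk and
        # return the achieved throughput. The random value is a stand-in;
        # a real implementation would issue the GridFTP transfers here.
        return random.uniform(50.0, 500.0)  # placeholder throughput in Mbps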

  16. Example Algorithms 1: Adaptive Concurrency • Takes a file list structure returned by the list function as input • Divides the file list into chunks based on the number of files per chunk • Starting with a concurrency level of 1, transfers each chunk with an exponentially increasing concurrency level as long as the throughput increases with each chunk transfer • If the throughput drops, the concurrency level is adaptively decreased for the subsequent chunk transfer (see the sketch below)
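
  A sketch of the adaptive scheme, using the hypothetical transfer_chunk() above; the exponential step (doubling and halving) is an assumption for illustration:

    def adaptive_concurrency(files, files_per_chunk=100):
        # Divide the file list into chunks, then grow the concurrency level
        # exponentially while each chunk's throughput improves, and back
        # off when it drops.
        chunks = [files[i:i + files_per_chunk]
                  for i in range(0, len(files), files_per_chunk)]
        cc, prev_tput = 1, 0.0
        for chunk in chunks:
            tput = transfer_chunk(chunk, concurrency=cc)
            if tput > prev_tput:
                cc *= 2               # throughput increased: raise the level
            else:
                cc = max(1, cc // 2)  # throughput dropped: decrease adaptively
            prev_tput = tput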

  17. Example Algorithms 1: Adaptive Concurrency

  18. Example Algorithm 2: Optimal Pipelining • A mean-based algorithm to construct clusters of files with different optimal pipelining levels • Calculates the optimal pipelining level by dividing the bandwidth-delay product (BDP) by the mean file size of the chunk • The dataset is recursively divided at the mean-file-size index as long as the following conditions are met (sketched below): • A chunk can only be divided further if its pipelining level differs from its parent chunk's • A chunk cannot be smaller than a preset minimum chunk size • The optimal pipelining level for a chunk cannot exceed a preset maximum pipelining level
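
  A sketch of the mean-based recursion under the conditions above; the BDP, minimum chunk size and maximum pipelining level are assumed inputs, and the function names are hypothetical:

    def optimal_pp(files, bdp, pp_max):
        # Optimal pipelining = BDP divided by the mean file size,
        # capped by the preset maximum level.
        mean = sum(f.size for f in files) / len(files)
        return min(pp_max, max(1, round(bdp / mean)))

    def cluster(files, bdp, min_chunk, pp_max, parent_pp=None):
        # Recursively split a size-sorted listing at the mean-file-size
        # index; stop when the chunk's pipelining level matches its
        # parent's or the chunk is already below the minimum size.
        files = sorted(files, key=lambda f: f.size)
        pp = optimal_pp(files, bdp, pp_max)
        if pp == parent_pp or sum(f.size for f in files) < min_chunk:
            return [(files, pp)]
        mean = sum(f.size for f in files) / len(files)
        i = next((k for k, f in enumerate(files) if f.size >= mean), 0)
        if i in (0, len(files)):  # no useful split point (all sizes equal)
            return [(files, pp)]
        return (cluster(files[:i], bdp, min_chunk, pp_max, pp) +
                cluster(files[i:], bdp, min_chunk, pp_max, pp))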

  19. Example Algorithm 2-a: Optimal Pipelining

  20. Example Algorithm 2-b: Optimal Pipelining and Concurrency • After the recursive division of chunks, pp_opt is set for each chunk • Chunks go through a revision phase where smaller chunks are combined and larger chunks are divided further • Starting with cc = 1, each chunk is transferred with exponentially increasing cc levels until the throughput drops • The remaining chunks are transferred with the optimal cc level (see the sketch below)
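
  A sketch of the concurrency-probing phase, consuming (chunk, pp_opt) pairs such as those produced by the cluster() sketch above; the revision phase is omitted:

    def transfer_with_pp_and_cc(chunks_with_pp):
        # chunks_with_pp: list of (chunk, pp_opt) pairs.
        # Probe concurrency exponentially until throughput drops, then
        # transfer the remaining chunks at the best level found.
        cc, best_cc, prev_tput = 1, 1, 0.0
        remaining = list(chunks_with_pp)
        while remaining:
            chunk, pp = remaining.pop(0)
            tput = transfer_chunk(chunk, concurrency=cc, pipelining=pp)
            if tput <= prev_tput:
                break                # throughput dropped: stop probing
            best_cc, prev_tput = cc, tput
            cc *= 2
        for chunk, pp in remaining:  # rest of the dataset
            transfer_chunk(chunk, concurrency=best_cc, pipelining=pp)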

  21. Example Algorithm 2-b: Optimal Pipelining and Concurrency

  22. Amazon EC2 Experiments • Large nodes with 2 vCPUs, 8 GB storage, 7.5 GB memory and moderate network performance • 50 ms artificial delay • Globus Provision is used for the automatic setup of servers • Datasets comprise a large number of small files (the most difficult optimization case) • 5000 1 MB files • 1000 random-size files ranging from 1 byte to 10 MB

  23. Amazon EC2 Experiments: 5000 1 MB files • Baseline performance: default pipelining + data channel caching • The achieved throughput is higher than the baseline in the majority of cases

  24. Amazon EC2 Experiments: 1000 random-size files

  25. Conclusions • The flexible GridFTP client can accommodate data scheduling algorithms of different natures • Adaptive and optimization algorithms can easily sort, divide and combine datasets • Enables easier implementation of intelligent cloud scheduling services

  26. Questions?
