A Flexible GridFTP Client for Implementation of Intelligent Cloud Data Scheduling Services DATACLOUD 2013 Esma Yildirim Department of Computer Engineering Fatih University Istanbul, Turkey
Outline
• Data Scheduling Services in the Cloud
• File Transfer Scheduling Problem History
• Implementation Details of the Client
• Example Algorithms
• Amazon EC2 Experiments
• Conclusions
Cloud Data Scheduling Services
• Data clouds strive for novel services for the management, analysis, access and scheduling of Big Data
• Application-level protocols that provide high performance on high-speed networks are an integral part of data scheduling services
• GridFTP and UDP-based protocols are used frequently in modern-day schedulers (e.g. GlobusOnline, StorkCloud)
Bottlenecks in Data Scheduling Services
• Data is large, diverse and complex
• Transferring large datasets faces many bottlenecks:
  • Transport protocol's underutilization of the network
  • End-system limitations (e.g. CPU, NIC and disk speed)
  • Dataset characteristics: many short-duration transfers, connection startup and teardown overhead
Optimizations in the GridFTP Protocol: Pipelining, Parallelism and Concurrency
• Pipelining: issues transfer commands back-to-back without waiting for the previous file to finish, hiding round-trip delays between small files
• Parallelism: splits a single file across multiple parallel TCP streams
• Concurrency: transfers multiple files simultaneously over separate connections
Application in Data Scheduling Services
• Setting optimal parameters for different datasets is a challenging task
• Data scheduling services set static values based on experience
• The provided tools do not accommodate dynamic, intelligent algorithms that might change settings on the fly
Goals of the Flexible Client
• Flexibility for scalable data scheduling algorithms
• On-the-fly changes to the optimization parameters
• Reshaping of dataset characteristics
File Transfer Scheduling Problem
• Lies at the origin of data scheduling services
• Dates back to the 1980s
• Earliest approaches: list scheduling
  • Sort the transfers based on size, bandwidth of the path or duration of the transfer
  • Near-optimal solution
• Integer programming – not feasible to implement
File Transfer Scheduling Problem
• Scalable approaches:
  • Transferring from multiple replicas
  • Dividing datasets and sending them over different paths to make use of additional network bandwidth
• Adaptive approaches:
  • Divide files into multiple portions to send over parallel streams
  • Divide the dataset into multiple portions and send them at the same time
  • Adaptively change the level of concurrency or parallelism based on network throughput
• Optimization algorithms:
  • Find optimal settings via modeling and set the optimal parameters once and for all
File Transfer Scheduling Problem
• Modern-day data scheduling service examples:
  • Globus Online
    • Hosted SaaS
    • Statically sets pipelining, concurrency and parallelism
  • Stork
    • Multi-protocol support
    • Finds the optimal parallelism level based on modeling
    • Static job concurrency
Ideal Client Interface
• Allow dataset transfers to be (see the sketch below):
  • Enqueued and dequeued
  • Sorted based on a property
  • Divided into and combined from chunks
  • Grouped by source-destination paths
  • Performed from multiple replicas
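A minimal sketch of those list operations, assuming each entry carries src, dst and size attributes (the file structure described on the next slide); the function names are illustrative, not the client's actual API.

    def divide(files, n):
        """Split a dataset into chunks of at most n files."""
        return [files[i:i + n] for i in range(0, len(files), n)]

    def combine(chunks):
        """Merge chunks back into a single flat list."""
        return [f for chunk in chunks for f in chunk]

    def group_by_path(files):
        """Group entries that share the same source-destination pair."""
        groups = {}
        for f in files:
            groups.setdefault((f.src, f.dst), []).append(f)
        return groups

    # Sorting by a property, e.g. smallest files first:
    # files.sort(key=lambda f: f.size)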
Implementation Details
• Shortcomings of globus-url-copy:
  • Does not allow even a static setting of the pipelining level: -pp turns pipelining on, but the depth is an internal default invisible to the user
        globus-url-copy -pp -p 5 -cc 4 srcurl desturl
  • A directory of files cannot be divided into chunks with different optimization parameters
  • The file list option (-f) helps, but, as the developers indicate, pipelining cannot be applied to the list
        globus-url-copy -pp -p 5 -cc 4 -f filelist.txt
Implementation Details
• File data structure properties:
  • File size: used to construct data chunks based on total size, and for throughput and transfer duration calculations
  • Source and destination paths: necessary for combining and dividing datasets and for changing the source path based on replica location
  • File name: necessary to reconstruct full paths
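A sketch of such a file structure; the field and method names are assumptions for illustration, not the client's actual definitions.

    from dataclasses import dataclass

    @dataclass
    class FileEntry:
        name: str   # file name, needed to reconstruct full paths
        src: str    # source directory path (may be switched to another replica)
        dst: str    # destination directory path
        size: int   # file size in bytes; drives chunking and duration estimates

        def src_url(self):
            return self.src.rstrip("/") + "/" + self.name

    def estimated_duration(files, throughput_bps):
        """Expected transfer time of a chunk at a given throughput (bits/s)."""
        return sum(f.size for f in files) * 8 / throughput_bps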
Implementation Details
• Listing the files for a given path:
  • Contacts the GridFTP server
  • Pulls information about the files in the given path
  • Provides a list of file data structures, including the number of files
  • Makes it easier to divide, combine, sort, enqueue and dequeue on a list of files
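A sketch of the listing step, assuming a hypothetical gridftp_list(path) helper that wraps the server's directory listing and returns (name, size) pairs; it builds on the FileEntry sketch above.

    def list_files(src_dir, dst_dir, gridftp_list):
        """Build FileEntry structures for every file the server
        reports under src_dir (gridftp_list is an assumed wrapper)."""
        return [FileEntry(name=n, src=src_dir, dst=dst_dir, size=s)
                for n, s in gridftp_list(src_dir)]

    # The resulting list is easy to manipulate, e.g.:
    # entries.sort(key=lambda f: f.size)   # sort by a property
    # chunks = divide(entries, 100)        # divide into chunks (earlier sketch)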
Implementation Details
• Performing the actual transfer:
  • Sets the optimization parameters on a list of files returned by the list function and manipulated by different algorithms
  • For a data chunk, it sets the parallel stream, concurrency and pipelining values
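A sketch of the transfer step, assuming a hypothetical gridftp_transfer(pairs, p, cc, pp) primitive that wraps the underlying GridFTP client call and blocks until the chunk completes.

    import time

    def transfer_chunk(files, p, cc, pp, gridftp_transfer):
        """Transfer one chunk with the given parallelism (p),
        concurrency (cc) and pipelining (pp) values; return the
        achieved throughput in Mbps for use by adaptive algorithms."""
        pairs = [(f.src_url(), f.dst.rstrip("/") + "/" + f.name)
                 for f in files]
        start = time.time()
        gridftp_transfer(pairs, p=p, cc=cc, pp=pp)
        elapsed = time.time() - start
        return sum(f.size for f in files) * 8 / (elapsed * 1e6)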
Example Algorithm 1: Adaptive Concurrency
• Takes a file list structure returned by the list function as input
• Divides the file list into chunks based on the number of files in a chunk
• Starting with a concurrency level of 1, transfers each chunk with an exponentially increasing concurrency level as long as the throughput increases with each chunk transfer
• If the throughput drops, the concurrency level is decreased adaptively for the subsequent chunk transfer (see the sketch below)
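A sketch of that adaptive loop; transfer(chunk, cc) stands for a blocking chunk transfer that returns the achieved throughput (an assumed callable, e.g. built from the transfer_chunk sketch above).

    def adaptive_concurrency(chunks, transfer):
        """transfer(chunk, cc) performs the transfer and returns
        the achieved throughput for that chunk."""
        cc, prev = 1, 0.0
        for chunk in chunks:
            thr = transfer(chunk, cc)
            if thr > prev:
                cc *= 2                # throughput still rising: double cc
            else:
                cc = max(1, cc // 2)   # throughput dropped: back off
            prev = thr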
Example Algorithm 2: Optimal Pipelining
• A mean-based algorithm constructs clusters of files with different optimal pipelining levels
• Calculates the optimal pipelining level by dividing the BDP by the mean file size of the chunk
• The dataset is recursively divided at the mean file size index as long as the following conditions are met (see the sketch after this list):
  • A chunk can only be divided further if its pipelining level differs from its parent chunk's
  • A chunk cannot be smaller than a preset minimum chunk size
  • The optimal pipelining level for a chunk cannot be greater than a preset maximum pipelining level
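A sketch of the mean-based recursive partitioning, assuming a file list sorted by size and parameters bdp (bandwidth-delay product in bytes), min_files (the minimum chunk size, taken here as a file count) and max_pp; the stopping conditions are paraphrased from the slide, not the paper's code.

    def opt_pp(files, bdp, max_pp):
        """Optimal pipelining level: BDP divided by the mean file size."""
        mean = sum(f.size for f in files) / len(files)
        return min(max_pp, max(1, round(bdp / mean)))

    def partition(files, bdp, max_pp, min_files, parent_pp=None):
        """Recursively split a size-sorted file list at the mean file
        size index; returns a list of (chunk, pp) pairs."""
        pp = opt_pp(files, bdp, max_pp)
        if parent_pp is not None and pp == parent_pp:
            return [(files, pp)]    # same level as parent: stop dividing
        mean = sum(f.size for f in files) / len(files)
        i = next((k for k, f in enumerate(files) if f.size > mean),
                 len(files))
        if i < min_files or len(files) - i < min_files:
            return [(files, pp)]    # a half would fall below the minimum
        return (partition(files[:i], bdp, max_pp, min_files, pp) +
                partition(files[i:], bdp, max_pp, min_files, pp))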
Example Algorithm 2-b: Optimal Pipelining and Concurrency
• After the recursive division of chunks, pp_opt is set for each chunk
• Chunks go through a revision phase where smaller chunks are combined and larger chunks are divided further
• Starting with cc = 1, chunks are transferred with exponentially increasing cc levels until the throughput drops
• The remaining chunks are transferred with the optimal cc level (see the sketch below)
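A sketch of the concurrency probing over the pp-annotated chunks; transfer(files, pp, cc) again stands for an assumed blocking chunk transfer that returns the achieved throughput.

    def transfer_with_opt_cc(chunks, transfer):
        """chunks: (files, pp) pairs from the partitioning step."""
        cc, prev = 1, 0.0
        it = iter(chunks)
        for files, pp in it:
            thr = transfer(files, pp, cc)
            if thr <= prev and cc > 1:
                cc //= 2            # throughput dropped: previous level wins
                break
            prev = thr
            cc *= 2                 # probe the next exponential cc level
        for files, pp in it:        # remaining chunks at the settled cc
            transfer(files, pp, cc)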
Amazon EC2 Experiments
• Large nodes with 2 vCPUs, 8 GB storage, 7.5 GB memory and moderate network performance
• 50 ms artificial delay
• Globus Provision is used for automatic setup of the servers
• Datasets comprise many small files (the most difficult optimization case):
  • 5000 1 MB files
  • 1000 random-size files in the range 1 byte to 10 MB
Amazon EC2 Experiments: 5000 1 MB Files
• Baseline performance: default pipelining + data channel caching
• The achieved throughput is higher than the baseline for the majority of cases
Conclusions
• The flexible GridFTP client can accommodate data scheduling algorithms of different natures
• Adaptive and optimization algorithms can easily sort, divide and combine datasets
• It opens the possibility of implementing intelligent cloud scheduling services in an easier way