Performance Oriented Data Transferring and Sharing Framework for Scientific Computing. Thesis Proposal Ali Kaplan alikapla@cs.indiana.edu. Outline. Motivation Requirements for Scientific Data Transfer Related Works Our Proposal: GridTorrent Framework Test Results Summary Questions.
Performance Oriented Data Transferring and Sharing Framework for Scientific Computing Thesis Proposal Ali Kaplan alikapla@cs.indiana.edu
Outline • Motivation • Requirements for Scientific Data Transfer • Related Works • Our Proposal: GridTorrent Framework • Test Results • Summary • Questions
Motivation • Computational science is becoming data intensive • Scientists are faced with mountains of data that stem from four sources[1]: • New scientific instruments double their output every year or so • Simulations generate floods of data • The Internet and computational Grid allow the replication, creation, and recreation of more data[2]
Motivation (cont.) • Scientific discovery is increasingly driven by data collection[3] • Computationally intensive analyses • Massive data collections • Data distributed across networks of varying capability • Internationally distributed collaborations • Data Intensive Science: 2000-2015 • Dominant factor: data growth (1 Petabyte = 1000 TB) • 2000 ~0.5 Petabyte • 2005 ~10 Petabytes • 2010 ~100 Petabytes • 2015 ~1000 Petabytes?
Motivation (cont.) • Scientific applications that generate petabytes of data are very diverse: • Fusion power • Climate modeling • Earthquake engineering • Astronomy • Bioinformatics • High-energy physics
Motivation (cont.) • Some examples[] • Climate modeling • The Community Climate System Model and other simulation applications generate 1.5 petabytes/year • Bioinformatics • The Pacific Northwest National Laboratory is building new confocal microscopes that will generate 5 petabytes/year • High-energy physics • The Large Hadron Collider (LHC) project at CERN will create 15 petabytes/year
Motivation Conclusion • The scientific community has a large set of distributed data • Scientists who want to analyze or work together on the same data are geographically dispersed
Requirements for Scientific Data Transfer • Transferring scientific data at large scale requires: • Efficient, high-performance, reliable, secure, policy-aware management • A balanced system: CPU farms, storage, network
Is it a new problem? • The answer is no. • Existing attempts to meet the above requirements include: • GridFTP • GridFTP XIO • GridHTTP • TeraGrid Copy (TGCP) • The Replica Location Service (RLS) • gLite
GridFTP • Extension of the standard FTP protocol • Reliable • Secure • High performance • Efficient • The de facto standard for transferring data in many Grid projects • However, GridFTP does not offer a web service interface.
GridFTP (cont.) • Additional features supported by the GridFTP protocol • Grid Security Infrastructure (GSI) and Kerberos support • Reliable and restartable data transfer: a failed transfer restarts from the point of failure • Partial file transfer: transfer of selected regions of a file • Parallel data transfer: multiple TCP streams between two network endpoints to improve bandwidth • Third-party control of data transfer: the ability to control transfers between storage servers from a remote (third-party) host
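Restartable and partial transfers both come down to bookkeeping over byte ranges. The sketch below is only an illustration of that idea, not the GridFTP wire protocol or any Globus API; the name `remaining_ranges` and the half-open range convention are assumptions made for the example.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def remaining_ranges(region: Range, done: List[Range]) -> List[Range]:
    """Return the parts of `region` not yet covered by the received ranges.

    `region` is the whole file for a full transfer, or a sub-range for a
    partial file transfer. `done` holds the ranges received before a failure.
    """
    start, end = region
    missing: List[Range] = []
    cursor = start
    for s, e in sorted(done):
        if e <= cursor:
            continue          # range lies entirely before the cursor
        if s >= end:
            break             # range lies entirely after the region
        if s > cursor:
            missing.append((cursor, s))  # gap that still has to be fetched
        cursor = max(cursor, e)
    if cursor < end:
        missing.append((cursor, end))
    return missing
```

After a failure, a client would request only the ranges returned here instead of starting over.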
GridHTTP • Allows large (gigabyte) files to be transferred at optimal speeds using HTTP • Does not deviate from existing HTTP standards, • But describes how to use existing headers and methods to produce an encrypted data stream. • Supports bulk data transfers via unencrypted HTTP, • Supports authentication and authorization with the usual grid credentials over HTTPS.
GridFTP XIO • The Globus eXtensible Input/Output (XIO) System • Provides an abstraction layer over transport protocols • Enables different I/O problems to be presented uniformly through a simple open/close/read/write (OCRW) interface • A support framework for developing communication protocols • An interface that enables existing applications written with XIO to access their hardware • Primary usage scenarios • Independence from the Transmission Control Protocol • Ease of adding GridFTP support to third-party applications • Ease of providing GridFTP access to data storage
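The open/close/read/write idea can be illustrated with a small abstract interface. This is a hedged sketch in Python, not the Globus XIO C API: the class names `Transport` and `MemoryTransport` and the helper `copy_stream` are hypothetical, and the in-memory driver stands in for a real network or storage driver.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical OCRW-style driver interface."""

    @abstractmethod
    def open(self, target: str) -> None: ...

    @abstractmethod
    def read(self, nbytes: int) -> bytes: ...

    @abstractmethod
    def write(self, data: bytes) -> int: ...

    @abstractmethod
    def close(self) -> None: ...


class MemoryTransport(Transport):
    """Toy driver backed by an in-memory buffer."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self._pos = 0
        self._is_open = False

    def open(self, target: str) -> None:
        self._is_open = True
        self._pos = 0  # reading always starts at the beginning

    def read(self, nbytes: int) -> bytes:
        if not self._is_open:
            raise IOError("transport is closed")
        chunk = bytes(self._buf[self._pos:self._pos + nbytes])
        self._pos += len(chunk)
        return chunk

    def write(self, data: bytes) -> int:
        if not self._is_open:
            raise IOError("transport is closed")
        self._buf.extend(data)
        return len(data)

    def close(self) -> None:
        self._is_open = False


def copy_stream(src: Transport, dst: Transport, chunk: int = 4) -> int:
    """Copy until src is exhausted; works with any pair of drivers."""
    total = 0
    while True:
        block = src.read(chunk)
        if not block:
            return total
        total += dst.write(block)
```

Because `copy_stream` depends only on the interface, swapping the transport underneath requires no change to the application code, which is the point of the abstraction layer.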
TeraGrid Copy (TGCP) • The TeraGrid Copy (TGCP) solution includes three main components: • GridFTP Service • RFT Service • TGCP shell script • In the striped configuration, • The GridFTP service runs on several nodes of a cluster • The data to be transferred is partitioned among the nodes • Each node may use several parallel streams to attain the maximum performance
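The striped configuration can be pictured as a two-level split: first across nodes, then across each node's parallel streams. The sketch below assumes contiguous, near-equal byte ranges purely for illustration; the slides do not say how TGCP or GridFTP actually lay out stripes, and `partition` and `striped_plan` are hypothetical names.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def partition(start: int, end: int, parts: int) -> List[Range]:
    """Split [start, end) into `parts` contiguous, near-equal ranges."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    base, extra = divmod(end - start, parts)
    ranges, cursor = [], start
    for i in range(parts):
        length = base + (1 if i < extra else 0)
        ranges.append((cursor, cursor + length))
        cursor += length
    return ranges


def striped_plan(file_size: int, nodes: List[str],
                 streams_per_node: int) -> Dict[str, List[Range]]:
    """Assign each node a slice of the file, then split it per stream."""
    plan = {}
    for node, (s, e) in zip(nodes, partition(0, file_size, len(nodes))):
        plan[node] = partition(s, e, streams_per_node)
    return plan
```

For example, `striped_plan(100, ["n0", "n1"], 2)` gives node `n0` the ranges `(0, 25)` and `(25, 50)` and node `n1` the ranges `(50, 75)` and `(75, 100)`.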
TGCP (cont.) • The tgcp script can use the globus-url-copy tool either • (A) in third-party transfer mode, or • (B) in conventional GridFTP client mode
TGCP (cont.) • The RFT (Reliable File Transfer) Service is used to manage the transfer. • It adds reliability to the transfer request • The transfer is completed even if a failure occurs during the transfer.
The Replica Location Service (RLS) • Provides a framework for tracking the physical locations of data that has been replicated. • Maps logical names to physical names. • Replication of data items can • reduce access latency, • improve data locality, • increase robustness, scalability and performance for distributed applications. • Does not operate in isolation; • it is used with other components such as the Reliable File Transfer service, GridFTP, and the Metadata Catalog Service.
RLS (cont.) • The current RLS implementation has the following features. • Local Replica Catalogs (LRCs) • Replica Location Indices (RLIs) • LRCs send information about their state to RLIs using soft state protocols. • Optional "Bloom Filter" compression can be used to summarize the contents of the LRC. • The current RLS implementation maintains static information about the LRCs and RLIs participating in the distributed system.
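The LRC/RLI split, soft-state updates, and Bloom filter summaries described above can be sketched in a few lines. This is an illustrative model only, not the Globus RLS implementation or its API: every class and method name is hypothetical, and the filter size, hash construction, and time-to-live are arbitrary assumptions.

```python
import hashlib
import time


class BloomFilter:
    """Compact, lossy summary of a set of logical names."""

    def __init__(self, bits: int = 1024, hashes: int = 3) -> None:
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.array |= 1 << p

    def __contains__(self, item: str) -> bool:
        return all((self.array >> p) & 1 for p in self._positions(item))


class LocalReplicaCatalog:
    """Maps logical names to the physical names stored at one site."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mappings = {}  # logical name -> set of physical names

    def register(self, logical: str, physical: str) -> None:
        self.mappings.setdefault(logical, set()).add(physical)

    def lookup(self, logical: str) -> set:
        return set(self.mappings.get(logical, ()))

    def summary(self) -> BloomFilter:
        bloom = BloomFilter()
        for logical in self.mappings:
            bloom.add(logical)
        return bloom


class ReplicaLocationIndex:
    """Knows which LRCs may hold a logical name; entries expire."""

    def __init__(self, ttl: float = 60.0) -> None:
        self.ttl = ttl
        self.state = {}  # LRC name -> (lrc, bloom summary, expiry time)

    def soft_state_update(self, lrc, now=None) -> None:
        now = time.time() if now is None else now
        self.state[lrc.name] = (lrc, lrc.summary(), now + self.ttl)

    def locate(self, logical: str, now=None) -> set:
        now = time.time() if now is None else now
        found = set()
        for lrc, bloom, expires in self.state.values():
            if expires > now and logical in bloom:
                # A Bloom filter can give false positives, so the LRC
                # itself is asked for the authoritative answer.
                found |= lrc.lookup(logical)
        return found
```

An LRC that stops sending updates simply ages out of the index, which is what makes the state "soft".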
So, if there are solutions…. • There is no pure P2P data transfer mechanism used in this area. • There are several different protocols • Each one has advantages and disadvantages over others
Our proposal: GridTorrent Framework • We propose a new peer-to-peer protocol for distributing scientific data files at an acceptable speed • Just as GridFTP extends FTP, GridTorrent redefines the BitTorrent protocol to adapt it to scientific data transfer • Many studies show that BitTorrent can be used for scientific applications
Why do we need the GridTorrent Framework? • Requirements and characteristics of scientific data transfer • Large and voluminous data sets • Security • Reliability • Efficiency • Scalability • User-friendly environment • Balanced • Collaboration
Why do we need the GridTorrent Framework? (cont.) • GridTorrent has faster download speed • Large and voluminous data sets • Balanced • GridTorrent allows peers to share bandwidth • Efficiency • GridTorrent is based on BitTorrent • Reliability • Scalability
Why do we need the GridTorrent Framework? (cont.) • GridTorrent has a security manager • Security • GridTorrent has a content management framework • User-friendly environment • Collaboration
Why BitTorrent? • Alternative peer-to-peer protocols • FastTrack • Gnutella • eDonkey • Direct Connect • Ares • Why BitTorrent? • Better bandwidth utilization • Higher download speeds than earlier protocols • Limits free riding – tit-for-tat • Limits leech attacks – coupling of upload and download • Spurious files are not propagated • Ability to resume a download
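Tit-for-tat is what couples upload to download: a peer uploads mainly to the peers that currently send it the most data, plus one randomly chosen peer so newcomers get a chance. The sketch below illustrates that selection rule under simplifying assumptions; it is not the code of any particular BitTorrent client, and `choose_unchoked` and its parameters are hypothetical names.

```python
import random
from typing import Dict, Optional, Set


def choose_unchoked(download_rates: Dict[str, float], slots: int = 4,
                    rng: Optional[random.Random] = None) -> Set[str]:
    """Pick the peers we will upload to in the next round.

    download_rates maps each interested peer to the rate (bytes/s) at
    which we are currently receiving data from it.
    """
    rng = rng or random.Random()
    ranked = sorted(download_rates, key=lambda p: download_rates[p],
                    reverse=True)
    # Reward the best uploaders with all but one of the slots.
    regular = ranked[:max(slots - 1, 0)]
    unchoked = set(regular)
    # Optimistic unchoke: one random peer from the rest.
    rest = ranked[len(regular):]
    if rest and slots > 0:
        unchoked.add(rng.choice(rest))
    return unchoked
```

A peer that never uploads only ever receives data through the single optimistic slot, which is how free riding is limited.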
Why BitTorrent? (cont.) • BitTorrent has proven suitable for distributing very large files. • Many companies use BitTorrent as a distribution protocol • Amazon S3 • Microsoft’s Avalanche (inspired by BitTorrent) • Blizzard (game production company) • Movie studios
Advantages of GridTorrent Framework • Saves resources by taking advantage of the unused upload capacity of downloaders. • CPU • Network bandwidth • Disk • Reliable • Jobs can be started and stopped using a web interface • Can be deployed on any system • Secure
GridTorrent Framework Components (cont.) • GridTorrent Framework has three major components: • GridTorrent Client • GridTorrent Content Manager • GridTorrent WS-Tracker
GridTorrent Client • It has five components • Torrent Data Sharing Algorithm • Task Manager • WS-Tracker Client • Data Transfer Layer • Security Manager
GridTorrent Content Manager • Four main components: • Task Manager • ACL Manager • Content Manager • Collaboration Manager
GridTorrent WS-Tracker • It functions as a regular BitTorrent tracker • Sends the source and peer list to peers • Updates their status • It sends the task list obtained from the GridTorrent Content Manager • All communications are secure (SSL) • It is a web service
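The tracker's two roles, peer discovery and task delivery, can be modelled with an in-memory class. This is a sketch under stated assumptions and not the GridTorrent WS-Tracker interface: the `Tracker` class, its method names, and the message fields are hypothetical, and the SSL and web service layers are left out.

```python
from typing import Dict, List


class Tracker:
    """Toy tracker: records peer status and hands out peers and tasks."""

    def __init__(self) -> None:
        # content id -> {peer id: {"addr": ..., "status": ...}}
        self.swarms: Dict[str, Dict[str, dict]] = {}
        # peer id -> tasks queued for that peer (e.g. by a content manager)
        self.tasks: Dict[str, List[str]] = {}

    def assign_task(self, peer_id: str, task: str) -> None:
        self.tasks.setdefault(peer_id, []).append(task)

    def announce(self, content_id: str, peer_id: str,
                 addr: str, status: str) -> dict:
        """Update the caller's status, then return other peers and tasks."""
        swarm = self.swarms.setdefault(content_id, {})
        swarm[peer_id] = {"addr": addr, "status": status}
        peers = [info["addr"] for pid, info in swarm.items()
                 if pid != peer_id]
        # Pending tasks are delivered once and then cleared.
        return {"peers": peers, "tasks": self.tasks.pop(peer_id, [])}
```

Each periodic announce therefore doubles as the channel through which queued tasks reach a client.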
GridTorrent Content Manager • It allows a content owner to publish content at different access levels. • Public level • User level • Group level • It allows a user to create a group and manage it and its members, with upload and download access rights.
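One way to picture the three access levels and per-member rights is the check below. The slides do not define the exact semantics, so the rules here are assumptions made for illustration: the owner may do anything, public content is download-only for everyone, user-level content is download-only for listed users, and group-level content follows each member's rights. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional, Set


class AccessLevel(Enum):
    PUBLIC = "public"
    USER = "user"
    GROUP = "group"


@dataclass
class Group:
    name: str
    manager: str
    # member name -> rights, each right being "upload" or "download"
    members: Dict[str, Set[str]] = field(default_factory=dict)

    def set_member(self, actor: str, member: str, rights: Set[str]) -> None:
        if actor != self.manager:
            raise PermissionError("only the group manager may change members")
        self.members[member] = set(rights)


@dataclass
class Content:
    owner: str
    level: AccessLevel
    users: Set[str] = field(default_factory=set)
    group: Optional[Group] = None


def is_allowed(content: Content, user: str, action: str) -> bool:
    """Return True if `user` may perform `action` on `content`."""
    if user == content.owner:
        return True
    if content.level is AccessLevel.PUBLIC:
        return action == "download"
    if content.level is AccessLevel.USER:
        return action == "download" and user in content.users
    if content.group is None:
        return False
    return action in content.group.members.get(user, set())
```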
Initial Test Results • File size: 300 MB • Number of streams/sources: 4 • Source machines: gridfarm (Bloomington, IN) • LAN test: client machine complexity (Indianapolis, IN), Iperf bandwidth 857 Mbps • WAN test: client machine vlab2 (Tallahassee, FL), Iperf bandwidth 30.2 Mbps
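As a rough reference point for reading the results, the measured Iperf bandwidths give a lower bound on transfer time for the 300 MB test file. This is plain arithmetic, not a measured result, and it assumes 1 MB = 10^6 bytes and no protocol overhead; the function name is hypothetical.

```python
def ideal_transfer_seconds(size_mb: float, bandwidth_mbps: float) -> float:
    """Lower bound on transfer time: megabytes * 8 bits / megabits per second."""
    return size_mb * 8 / bandwidth_mbps


lan_seconds = ideal_transfer_seconds(300, 857)    # about 2.8 s
wan_seconds = ideal_transfer_seconds(300, 30.2)   # about 79.5 s
```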
Initial Test Results (cont.) Table 1: Download speed of PTCP vs. GridTorrent with 4 streams/sources Table 2: GridTorrent bandwidth load balancing on downloaded file segment with 4 streams/sources
Research Issues • The current BitTorrent protocol is designed for the existing general-purpose network environment • Modifications needed to provide pure scientific data transfer • Modifications to message format and frequency • UDP • GridFTP • Requirements needed to provide pure scientific data transfer • Security • Content access management • Searching capability
References • [1] G. Bell, J. Gray, and A. Szalay, Petascale computational systems, Computer, vol. 39, no. 1, Jan. 2006, pp. 110–112 • [2] S. L. Graham, M. Snir, and C. A. Patterson (eds.), Getting Up To Speed: The Future of Supercomputing, NAE Press, 2004, ISBN 0-309-09502-6 • [3] I. Foster, Overview of Grid Computing, http://www-fp.mcs.anl.gov/~foster/Talks/ResearchLibraryGroupGridsApril2002.ppt, last seen 2007 • [4] Science-Driven Network Requirements for ESnet, http://www.es.net/ESnet4/Case-Study-Requirements-Update-With-Exec-Sum-v5.doc, last seen 2007