
Performance Oriented Data Transferring and Sharing Framework for Scientific Computing

Performance Oriented Data Transferring and Sharing Framework for Scientific Computing. Thesis Proposal Ali Kaplan alikapla@cs.indiana.edu. Outline. Motivation Requirements for Scientific Data Transfer Related Works Our Proposal: GridTorrent Framework Test Results Summary Questions.


Presentation Transcript


  1. Performance Oriented Data Transferring and Sharing Framework for Scientific Computing Thesis Proposal Ali Kaplan alikapla@cs.indiana.edu

  2. Outline • Motivation • Requirements for Scientific Data Transfer • Related Works • Our Proposal: GridTorrent Framework • Test Results • Summary • Questions

  3. Motivation • Computational science is becoming data intensive • Scientists are faced with mountains of data that stem from four sources[1]: • New scientific instruments double their output every year or so • Simulations generate a flood of data • The Internet and computational Grid allow the replication, creation, and recreation of more data[2]

  4. Motivation (cont.) • Scientific discovery increasingly driven by data collection[3] • Computationally intensive analyses • Massive data collections • Data distributed across networks of varying capability • Internationally distributed collaborations • Data Intensive Science: 2000-2015 • Dominant factor: data growth (1 Petabyte = 1000 TB) • 2000 ~0.5 Petabyte • 2005 ~10 Petabytes • 2010 ~100 Petabytes • 2015 ~1000 Petabytes?

  5. Motivation (cont.) • Scientific applications that generate petabytes of data are very diverse: • Fusion power • Climate modeling • Earthquake engineering • Astronomy • Bioinformatics • High-energy physics

  6. Motivation (cont.) • Some examples[] • Climate modeling • The Community Climate System Model and other simulation applications generate 1.5 petabytes/year • Bioinformatics • The Pacific Northwest National Laboratory is building new confocal microscopes that will generate 5 petabytes/year • High-energy physics • The Large Hadron Collider (LHC) project at CERN will create 15 petabytes/year

  7. Motivation Conclusion • The scientific community has large sets of distributed data • Scientists who want to analyze or work together on the same data are geographically dispersed

  8. Requirements for Scientific Data Transfer • Transferring scientific data at large scale requires • efficient • high-performance • reliable • secure • policy-aware management • a balanced system • CPU farms • storage • network

  9. Is it a new problem? • The answer is no. • Existing attempts to meet the above requirements include • GridFTP • GridFTPXIO • GridHTTP • TeraGrid Copy (TGCP) • The Replica Location Service (RLS) • gLite

  10. GridFTP • Extension of the standard FTP protocol • Reliable • Secure • High performance • Efficient • The de facto standard for transferring data in many Grid projects • However, GridFTP does not offer a web service interface

  11. GridFTP (cont.) • Additional features supported by the GridFTP protocol • Grid Security Infrastructure (GSI) and Kerberos support • Reliable and restartable data transfer: a failed transfer restarts from the point of failure • Partial file transfer: selected regions of a file can be transferred • Parallel data transfer: multiple TCP streams between two network endpoints improve bandwidth • Third-party control of data transfer: transfers between storage servers can be controlled from a remote (third-party) server
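The parallel, partial, and restartable transfer features above all come down to bookkeeping over byte ranges. The following is a minimal Python sketch of that bookkeeping only, not GridFTP's actual wire protocol or API. The function names `split_ranges` and `remaining_after_failure` are hypothetical and chosen for illustration.

```python
def split_ranges(file_size, streams):
    """Split [0, file_size) into contiguous (offset, length) ranges, one per stream.

    Illustrative only: a real GridFTP client negotiates this with the server.
    """
    if streams < 1:
        raise ValueError("streams must be >= 1")
    base, extra = divmod(file_size, streams)
    ranges = []
    offset = 0
    for i in range(streams):
        # The first `extra` streams carry one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def remaining_after_failure(ranges, bytes_done):
    """Return the (offset, length) regions still missing after an interruption.

    bytes_done[i] is how many bytes of ranges[i] arrived before the failure.
    """
    todo = []
    for (offset, length), done in zip(ranges, bytes_done):
        if done < length:
            todo.append((offset + done, length - done))
    return todo
```

Restarting from the point of failure then means requesting only the regions returned by `remaining_after_failure`, which is itself a partial file transfer.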

  12. GridHTTP • Allows large (gigabyte) files to be transferred at optimal speeds using HTTP • Does not deviate from existing HTTP standards, but describes how to use existing headers and methods to produce an encrypted data stream • Supports bulk data transfers via unencrypted HTTP • Supports authentication and authorization with the usual grid credentials over HTTP

  13. GridFTPXIO • The Globus eXtensible Input/Output (XIO) system • provides an abstraction layer over transport protocols • enables different I/O problems to be presented uniformly through a simple open/close/read/write (OCRW) interface • is a support framework for developing communication protocols • is an interface that enables an existing application written with XIO to access its hardware • Primary usage scenarios • Independence from the Transmission Control Protocol (TCP) • Ease of adding GridFTP support to third-party applications • Ease of providing GridFTP access to data storage

  14. TeraGrid Copy (TGCP) • The TeraGrid Copy (TGCP) solution includes three main components: • GridFTP service • RFT service • TGCP shell script • In the striped configuration, • the GridFTP service runs on several nodes of a cluster • the data to be transferred is partitioned among the nodes • each node may use several parallel streams to attain maximum performance
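The partitioning step in the striped configuration can be pictured as assigning blocks of the file to cluster nodes. The sketch below assumes a simple round-robin layout with a fixed block size. The slide does not say which layout TGCP or GridFTP actually uses, so both the layout and the function name `stripe_blocks` are illustrative assumptions.

```python
def stripe_blocks(file_size, block_size, nodes):
    """Assign fixed-size blocks to nodes round-robin (assumed layout).

    Returns {node_index: [(offset, length), ...]}.
    """
    if block_size < 1 or nodes < 1:
        raise ValueError("block_size and nodes must be >= 1")
    layout = {n: [] for n in range(nodes)}
    offset = 0
    index = 0
    while offset < file_size:
        # The final block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        layout[index % nodes].append((offset, length))
        offset += length
        index += 1
    return layout
```

Each node could then move its own blocks over several parallel streams, as the last bullet describes.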

  15. TGCP (cont.) • The tgcp script can use the globus-url-copy tool in either of two modes: • (A) third-party transfer mode • (B) conventional GridFTP client mode

  16. TGCP (cont.) • The RFT service is used to manage the transfer • It adds reliability to the transfer request • The transfer is completed even if a failure occurs along the way

  17. The Replica Location Service (RLS) • Provides a framework for tracking the physical locations of data that has been replicated • Maps logical names to physical names • Replication of data items can • reduce access latency • improve data locality • increase robustness, scalability, and performance for distributed applications • Does not operate in isolation: it is used with other components such as the Reliable File Transfer service, GridFTP, and the Metadata Catalog Service

  18. RLS (cont.) • The current RLS implementation has the following features. • Local Replica Catalogs (LRCs) • Replica Location Indices (RLIs) • LRCs send information about their state to RLIs using soft state protocols. • Optional "Bloom Filter" compression can be used to summarize the contents of the LRC. • The current RLS implementation maintains static information about the LRCs and RLIs participating in the distributed system.
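To make the Bloom filter idea concrete, the sketch below shows how a Local Replica Catalog could summarize its logical names in a fixed-size bit array before sending that summary to an index. This is a generic Bloom filter, not the RLS implementation. The class name, the bit and hash counts, and the example logical and physical names are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, name):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{name}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, name):
        for pos in self._positions(name):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, name):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(name)
        )


# Hypothetical LRC content: logical name -> list of physical locations.
lrc = {
    "lfn://climate/run42.nc": ["gsiftp://hostA.example.org/data/run42.nc"],
    "lfn://climate/run43.nc": ["gsiftp://hostB.example.org/data/run43.nc"],
}

summary = BloomFilter()
for logical_name in lrc:
    summary.add(logical_name)
```

An index holding only `summary` can answer "this catalog may know that name" in 128 bytes, at the cost of occasionally forwarding a query to a catalog that does not hold the name.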

  19. So, if there are solutions… • No pure P2P data transfer mechanism is used in this area • There are several different protocols • Each one has advantages and disadvantages compared with the others

  20. Our proposal: GridTorrent Framework • We propose a new peer-to-peer protocol for distributing scientific data files at an acceptable speed • Much as GridFTP extends FTP, GridTorrent redefines the BitTorrent protocol to adapt it to scientific data transfer • Many studies show that BitTorrent can be used for scientific applications

  21. Why do we need the GridTorrent Framework? • Requirements and characteristics of scientific data transfer • Large and voluminous data sets • Security • Reliability • Efficiency • Scalability • User-friendly environment • Balanced • Collaboration

  22. Why do we need the GridTorrent Framework? (cont.) • GridTorrent has a faster download speed • Large and voluminous data sets • Balanced • GridTorrent allows peers to share bandwidth • Efficiency • GridTorrent is based on BitTorrent • Reliability • Scalability

  23. Why do we need the GridTorrent Framework? (cont.) • GridTorrent has a security manager • Security • GridTorrent has a content management framework • User-friendly environment • Collaboration

  24. Why BitTorrent? • Alternative peer-to-peer protocols • FastTrack • Gnutella • eDonkey • Direct Connect • Ares • Why BitTorrent? • Better bandwidth utilization • Unprecedented speeds • Limits free riding through tit-for-tat • Limits leech attacks by coupling upload and download • Spurious files are not propagated • Ability to resume a download
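Tit-for-tat couples upload to download: a peer uploads preferentially to the peers that have recently sent it the most data. The sketch below is a simplified version of that selection step. The function name `choose_unchoked`, the default of four slots, and the way the optimistic peer is passed in are illustrative assumptions, not a specification of any particular client.

```python
def choose_unchoked(download_rates, slots=4, optimistic=None):
    """Pick which peers to upload to in the next round.

    download_rates: {peer_id: bytes/sec recently received from that peer}.
    The `slots` fastest senders are rewarded with upload; one extra
    "optimistic" peer may be added so newcomers get a chance to prove themselves.
    """
    # Sort by rate descending, then by peer id for a deterministic tie-break.
    ranked = sorted(download_rates, key=lambda p: (-download_rates[p], p))
    unchoked = set(ranked[:slots])
    if optimistic is not None and optimistic in download_rates:
        unchoked.add(optimistic)
    return unchoked
```

A peer that never uploads stays at the bottom of this ranking and is served only through the occasional optimistic slot, which is how free riding is limited.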

  25. Why BitTorrent? (cont.) • BitTorrent has proved suitable for distributing very large files • Many companies use BitTorrent as a distribution protocol • Amazon S3 • Microsoft’s Avalanche (inspired by BitTorrent) • Blizzard (game production company) • Movie studios

  26. Advantages of GridTorrent Framework • Saves resources by taking advantage of the unused upload capacity of downloaders • CPU • Network bandwidth • Disk • Reliable • Jobs can be started and stopped through a web interface • Can be deployed on any system • Secure

  27. GridTorrent Framework Components

  28. GridTorrent Framework Components (cont.) • GridTorrent Framework has three major components: • GridTorrent Client • GridTorrent Content Manager • GridTorrent WS-Tracker

  29. GridTorrent Client • It has five components • Torrent Data Sharing Algorithm • Task Manager • WS-Tracker Client • Data Transfer Layer • Security Manager

  30. GridTorrent Content Manager • Four main components: • Task Manager • ACL Manager • Content Manager • Collaboration Manager

  31. GridTorrent WS-Tracker • Functions as a regular BitTorrent tracker • Sends the source and peer list to peers • Updates their status • Sends the task list obtained from the GridTorrent Content Manager • All communication is secured with SSL • It is a web service

  32. GridTorrent Content Manager • Allows a content owner to publish content at different access levels • Public level • User level • Group level • Allows a user to create a group and manage it and its members with upload and download access rights
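One way to picture the three access levels and the per-member group rights is as a small access check. The sketch below is only a guess at the semantics, since the slide does not define them. It assumes that public content is downloadable by anyone, that user-level content is downloadable by explicitly listed users, and that group-level content follows the rights each member holds. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Content:
    owner: str
    level: str  # assumed values: "public", "user", or "group"
    allowed_users: set = field(default_factory=set)  # used at user level
    group: str = ""  # used at group level


@dataclass
class Group:
    # member name -> set of rights, e.g. {"upload", "download"}
    members: dict = field(default_factory=dict)


def can_access(user, right, content, groups):
    """Return True if `user` may exercise `right` on `content` (assumed rules)."""
    if user == content.owner:
        return True
    if content.level == "public":
        return right == "download"
    if content.level == "user":
        return right == "download" and user in content.allowed_users
    if content.level == "group":
        group = groups.get(content.group)
        return group is not None and right in group.members.get(user, set())
    return False
```

For example, with a group `"climate"` in which `bob` holds only the download right, `can_access("bob", "upload", ...)` on group-level content returns `False`, while the owner is always allowed.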

  33. Initial Test Results • File size: 300 MB • Number of streams/sources: 4 • Source machines: gridfarm (Bloomington, IN) • LAN test: Iperf bandwidth 857 Mbps, client machine complexity (Indianapolis, IN) • WAN test: Iperf bandwidth 30.2 Mbps, client machine vlab2 (Tallahassee, FL)
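The Iperf figures give a useful baseline: the shortest time in which a 300 MB file could cross each link if the measured bandwidth were fully used. The short calculation below assumes decimal units (1 MB = 8 megabits, 1 Mbps = 10^6 bits per second) and ignores protocol overhead. It is a back-of-the-envelope bound, not a measured result, and the function name is hypothetical.

```python
def min_transfer_seconds(size_megabytes, bandwidth_mbps):
    """Lower bound on transfer time, assuming the link is fully utilized."""
    megabits = size_megabytes * 8
    return megabits / bandwidth_mbps


lan_seconds = min_transfer_seconds(300, 857)   # about 2.8 s
wan_seconds = min_transfer_seconds(300, 30.2)  # about 79.5 s
```

Any measured download time can be compared against these bounds to see how close a transfer tool comes to saturating the link.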

  34. Initial Test Results (cont.) • Table 1: Download speed of PTCP vs. GridTorrent with 4 streams/sources • Table 2: GridTorrent bandwidth load balancing on downloaded file segments with 4 streams/sources

  35. Initial Test Results (cont.)

  36. Research Issues • The current BitTorrent protocol is designed for the existing network environment • Modifications needed to provide pure scientific data transfer • Modification of message format and frequency • UDP • GridFTP • Requirements that must be met to provide pure scientific data transfer • Security • Content access management • Searching capability

  37. Questions?

  38. References • [1] Bell, G.; Gray, J.; Szalay, A., Petascale computational systems, Computer, Volume 39, Issue 1, Jan. 2006, pp. 110–112 • [2] Graham, S.L.; Snir, M.; Patterson, C.A. (eds), Getting Up To Speed: The Future of Supercomputing, NAE Press, 2004, ISBN 0-309-09502-6 • [3] Ian Foster, Overview of Grid Computing, http://www-fp.mcs.anl.gov/~foster/Talks/ResearchLibraryGroupGridsApril2002.ppt, last seen 2007 • [4] Science-Driven Network Requirements for ESnet, http://www.es.net/ESnet4/Case-Study-Requirements-Update-With-Exec-Sum-v5.doc, last seen 2007
