Clustering the Reliable File Transfer Service
Jim Basney and Patrick Duda
NCSA, University of Illinois
This material is based upon work supported by the National Science Foundation under Grant No. 0426972.
TeraGrid '07
Goal
• Provide a highly available Reliable File Transfer (RFT) Service
• Tolerate server failures
  • Hardware/software faults and resource exhaustion
• Continue to handle incoming requests
• Continue to make forward progress on file transfers in the queue
Globus Toolkit Reliable File Transfer Service
[Diagram labels: Client, RFT, GridFTP, GridFTP]
RFT and GridFTP Clustering
[Diagram labels: RFT (x3), GridFTP control (x2), GridFTP data (x4)]
Clustering Approach
[Diagram labels: Load Balancer, RFT (x3), HA DBMS]
RFT State Management
[Diagram labels: Client, Web Service Container, Delegation Service, RFT, DBMS]
RFT DB Tables: Added Fields
New Tables
RFT Fail-Over
• Based on time-outs
• Periodically query database for pending requests with no recent activity
• Stalled requests could be caused by RFT service crash, hardware failure, RFT service overload, etc.
• If found, obtain DB write lock, query again, claim stalled requests, and release lock
• Configuration values:
  • Query interval (default: 30 seconds)
  • Recent interval (default: 60 seconds)
Evaluation Environment
• Dedicated 12-node Linux cluster
• Red Hat Enterprise Linux AS Release 3
• Switched Gigabit Ethernet
• 2 GB RAM
• Dual 2 GHz Intel Xeon CPUs, 512 KB cache
• Globus Toolkit 4.0.3
• MySQL Standard 5.0.27
Evaluation
• Correctness / Effectiveness
  • Submitted multiple RFT requests of different sizes to 12 RFT instances
  • Verified fail-over and notification functionality
• Performance
  • Evaluate overhead of shared DBMS
  • Stress test: transfer many small files
[Figure annotations: web services container stopped; 60 second fail-over interval; fail-over]
[Figure data labels: 95%, 82%, 57%, 43%, 22%, 14%, 10%, 6%, 4%]
Related Work
• HAND: Highly Available Dynamic Deployment Infrastructure for GT4
  • Migrate services between containers to maintain availability during planned outages
  • Does not address management of persistent service state or fail-over for unplanned outages
• myGrid
  • DBMS persistence of WS-ResourceProperties in Apache WSRF
  • Points to a general-purpose approach for DBMS-based persistence of stateful WSRF services
Conclusion
• Clustering RFT provides load-balancing and fail-over with acceptable performance for small clusters
• Clustering is a promising approach for application to other grid services
Future Work
• Correctly handle replay of FTP deletes
• Implement credentialRefreshListener
• Evaluate use of different DBMS solutions
• Investigate GT4 DBMS persistence in general
• Investigate use of WS-Naming
Thanks! Questions? Comments?
This material is based upon work supported by the National Science Foundation under Grant No. 0426972. Performance experiments were conducted on computers at the Technology Research, Education, and Commercialization Center (TRECC), a program of the University of Illinois at Urbana-Champaign, funded by the Office of Naval Research and administered by the National Center for Supercomputing Applications. We thank Tom Roney for his assistance with the TRECC cluster. We also thank Ravi Madduri from the Globus project for answering our questions about RFT.