This presentation discusses the state of performance, application design, performance measurements, server scaling, OS and filesystem impact, clustering, future leaf node SRM, alternative root node SRM, SRM integration status, and the next big thing in high-performance data access servers.
Xrootd Present & Future: The Drama Continues Andrew Hanushevsky Stanford Linear Accelerator Center Stanford University HEPiX 13-October-05 http://xrootd.slac.stanford.edu
Outline • The state of performance • Single server • Clustered servers • The SRM Debate • The Next Big Thing • Conclusion 2: http://xrootd.slac.stanford.edu
Application Design Point • Complex embarrassingly parallel analysis • Determine particle decay products • 1000’s of parallel clients hitting the same data • Small block sparse random access • Median size < 3K • Uniform seek across whole file (mean 650MB) • Only about 22% of the file read (mean 140MB) 3: http://xrootd.slac.stanford.edu
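The access pattern above is easy to make concrete. Below is a minimal, hypothetical workload generator (not from the talk): it assumes a local test file, seeks uniformly across it, issues small reads whose log-normal size distribution keeps the median under 3 KB, and stops once roughly 22% of the file has been read.

```python
import os
import random

TARGET_FRACTION = 0.22            # ~22% of the file read (slide: ~140 MB of a ~650 MB file)

def sparse_random_read(path, seed=1):
    """Hypothetical sketch of the analysis access pattern: uniform seeks
    across the whole file, small reads with a median under 3 KB, stopping
    once about 22% of the file has been read."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    target = int(TARGET_FRACTION * size)
    total = 0
    with open(path, "rb") as f:
        while total < target:
            f.seek(rng.randrange(size))                      # uniform seek over the file
            nbytes = min(int(rng.lognormvariate(7.5, 0.8)), 64 * 1024)  # median ~1.8 KB
            total += len(f.read(nbytes))
    return total
```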
Performance Measurements • Goals • Very low latency • Handle many parallel clients • Test setup • Sun V20z dual 1.86 GHz Opteron, 2 GB RAM • 1Gb on-board Broadcom NIC (same subnet) • Solaris 10 x86 • Linux RHEL3 2.4.21-2.7.8.ELsmp • Client running BetaMiniApp with analysis removed 4: http://xrootd.slac.stanford.edu
Latency Per Request (xrootd) 5: http://xrootd.slac.stanford.edu
Capacity vs Load (xrootd) 6: http://xrootd.slac.stanford.edu
xrootd Server Scaling • Linear scaling relative to load • Allows deterministic sizing of server • Disk • NIC • CPU • Memory • Performance tied directly to hardware cost • Competitive with best-in-class commercial file servers 7: http://xrootd.slac.stanford.edu
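A back-of-envelope sizing sketch under the linear-scaling claim above. All input numbers are illustrative placeholders, not measurements from the talk.

```python
def clients_supported(nic_gbps=1.0, mean_read_kb=3.0, reads_per_client_per_s=50,
                      cpu_us_per_request=100, cores=2):
    """Return the client count at which the NIC or the CPU saturates first
    (placeholder inputs; linear scaling assumed as on the slide)."""
    bytes_per_client_per_s = mean_read_kb * 1024 * reads_per_client_per_s
    nic_limit = (nic_gbps * 1e9 / 8) / bytes_per_client_per_s
    cpu_limit = cores * 1e6 / (cpu_us_per_request * reads_per_client_per_s)
    return int(min(nic_limit, cpu_limit))

print(clients_supported())   # which resource binds first for these assumed numbers
```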
OS Impact on Performance 8: http://xrootd.slac.stanford.edu
Device & Filesystem Impact • Chart: I/O-limited vs. CPU-limited regions • UFS good on small reads • VXFS good on big reads • 1 event ≈ 2K 9: http://xrootd.slac.stanford.edu
Overhead Distribution 10: http://xrootd.slac.stanford.edu
Network Overhead Dominates 11: http://xrootd.slac.stanford.edu
Xrootd Clustering (SLAC) • Diagram: client machines contact the redirectors (kanolb-a, bbr-olb03, bbr-olb04) and are redirected to the data servers kan01, kan02, kan03, kan04 … kanxx (internal details hidden from clients) 12: http://xrootd.slac.stanford.edu
Clustering Performance • Design can scale to at least 256,000 servers • SLAC runs a 1,000 node test server cluster • BNL runs a 350 node production server cluster • Self-regulating (via minimal spanning tree algorithm) • 280 nodes self-cluster in about 7 seconds • 890 nodes self-cluster in about 56 seconds • Client overhead is extremely low • Overhead added to meta-data requests (e.g., open) • ~200us * log64(number of servers) / 2 • Zero overhead for I/O 13: http://xrootd.slac.stanford.edu
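The quoted meta-data overhead is easy to evaluate for different cluster sizes; the sketch below is a direct transcription of the slide's ~200 µs × log64(servers) / 2 estimate.

```python
import math

def open_overhead_us(n_servers, per_hop_us=200):
    """Estimated latency added to meta-data requests (e.g. open), per the slide:
    ~200 us * log64(number of servers) / 2. I/O itself adds zero overhead."""
    if n_servers <= 1:
        return 0.0
    return per_hop_us * math.log(n_servers, 64) / 2

for n in (64, 1000, 256000):
    print(n, "servers ->", round(open_overhead_us(n)), "us per open")
```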
Current MSS Support • Lightweight agnostic interfaces provided • oss.mssgwcmd command • Invoked for each create, dirlist, mv, rm, stat • oss.stagecmd |command • Long running command, request stream protocol • Used to populate disk cache (i.e., “stage-in”) • Diagram: xrootd (oss layer) drives the MSS via mssgwcmd and stagecmd 15: http://xrootd.slac.stanford.edu
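For illustration, a minimal mssgwcmd-style gateway sketch. The real oss layer defines the exact argument conventions; the "operation plus path" interface and the site_mss helper below are assumptions made only to show the dispatch idea.

```python
#!/usr/bin/env python
# Hypothetical mssgwcmd-style gateway; argument format and MSS helper are assumed.
import subprocess
import sys

def site_mss(*args):
    """Placeholder for whatever command talks to the site's mass-storage system."""
    return subprocess.call(["echo", "MSS request:"] + list(args))

OPERATIONS = {"create", "dirlist", "mv", "rm", "stat"}   # operations named on the slide

def main(argv):
    if len(argv) < 3 or argv[1] not in OPERATIONS:
        sys.stderr.write("usage: mssgw {create|dirlist|mv|rm|stat} <path> [path2]\n")
        return 2
    return site_mss(*argv[1:])

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```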
Future Leaf Node SRM • MSS interface is the ideal spot for an SRM hook • Use existing hooks or a new long running hook • mssgwcmd & stagecmd • oss.srm |command • Processes external disk cache management requests • Should scale quite well • Diagram: Grid → srm → xrootd (oss layer) → MSS 16: http://xrootd.slac.stanford.edu
BNL/LBL Proposal • Diagram: the GRID and generic standard clients use replica services, namely the BNL Replica Registration Service & DataMover (rc, dm) and the LBL components (srm, drm, das), layered over xrootd 17: http://xrootd.slac.stanford.edu
Alternative Root Node SRM • Team olbd with SRM • File management & discovery • Tight management control • Several issues need to be considered • Introduces many new failure modes • Will not generally scale • Diagram: Grid → srm → olbd (root node) → MSS 18: http://xrootd.slac.stanford.edu
SRM Integration Status • Unfortunately, SRM interface in flux • Heavy vs light protocol • Working with LBL team • Working towards OSG sanctioned future proposal • Trying to use the Fermilab SRM • Artem Turnov at IN2P3 exploring issues 19: http://xrootd.slac.stanford.edu
The Next Big Thing • High performance data access servers plus efficient large-scale clustering • Allows novel, cost-effective, super-fast massive storage optimized for sparse random access • Imagine 30TB of DRAM at commodity prices 20: http://xrootd.slac.stanford.edu
Device Speed Delivery 21: http://xrootd.slac.stanford.edu
Memory Access Characteristics • Server: zsuntwo • CPU: SPARC • NIC: 100Mb • OS: Solaris 10 • UFS: standard 22: http://xrootd.slac.stanford.edu
The Peta-Cache • Cost-effective memory access impacts science • Nature of all random access analysis • Not restricted to just High Energy Physics • Enables faster and more detailed analysis • Opens new analytical frontiers • Have a 64-node test cluster • V20z each with 16GB RAM • 1TB “toy” machine 23: http://xrootd.slac.stanford.edu
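The arithmetic behind the test cluster, and an extrapolation to the 30 TB goal mentioned earlier (the node count for 30 TB is an inference, not a figure from the talk):

```python
nodes, gb_per_node = 64, 16
print(nodes * gb_per_node / 1024.0, "TB of DRAM in the 64-node test cluster")      # 1.0 TB
target_tb = 30
print(target_tb * 1024 // gb_per_node, "such nodes would be needed for", target_tb, "TB")
```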
Conclusion • High performance data access systems are achievable • The devil is in the details • Must understand the processing domain and deployment infrastructure • Comprehensive, repeatable measurement strategy • High performance and clustering are synergistic • Allows unique performance, usability, scalability, and recoverability characteristics • Such systems produce novel software architectures • Challenges • Creating application algorithms that can make use of such systems • Opportunities • Fast, low-cost access to huge amounts of data to speed discovery 24: http://xrootd.slac.stanford.edu
Acknowledgements • Fabrizio Furano, INFN Padova • Client-side design & development • Bill Weeks • Performance measurement guru • 100’s of measurements repeated 100’s of times • US Department of Energy • Contract DE-AC02-76SF00515 with Stanford University • And our next mystery guest! 25: http://xrootd.slac.stanford.edu