This presentation discusses the state of performance, application design, performance measurements, server scaling, OS and filesystem impact, clustering, future leaf node SRM, alternative root node SRM, SRM integration status, and the next big thing in high-performance data access servers.
Xrootd Present & Future: The Drama Continues Andrew Hanushevsky Stanford Linear Accelerator Center Stanford University HEPiX 13-October-05 http://xrootd.slac.stanford.edu
Outline • The state of performance • Single server • Clustered servers • The SRM Debate • The Next Big Thing • Conclusion 2: http://xrootd.slac.stanford.edu
Application Design Point • Complex embarrassingly parallel analysis • Determine particle decay products • 1000’s of parallel clients hitting the same data • Small block sparse random access • Median size < 3K • Uniform seek across whole file (mean 650MB) • Only about 22% of the file read (mean 140MB) 3: http://xrootd.slac.stanford.edu
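The access pattern above is easy to make concrete. Below is a minimal, hypothetical workload generator (not from the talk): it assumes a local test file, seeks uniformly across it, issues small reads whose log-normal size distribution keeps the median under 3 KB, and stops once roughly 22% of the file has been read.

```python
import os
import random

TARGET_FRACTION = 0.22            # ~22% of the file read (slide: ~140 MB of a ~650 MB file)

def sparse_random_read(path, seed=1):
    """Hypothetical sketch of the analysis access pattern: uniform seeks
    across the whole file, small reads with a median under 3 KB, stopping
    once about 22% of the file has been read."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    target = int(TARGET_FRACTION * size)
    total = 0
    with open(path, "rb") as f:
        while total < target:
            f.seek(rng.randrange(size))                      # uniform seek over the file
            nbytes = min(int(rng.lognormvariate(7.5, 0.8)), 64 * 1024)  # median ~1.8 KB
            total += len(f.read(nbytes))
    return total
```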
Performance Measurements • Goals • Very low latency • Handle many parallel clients • Test setup • Sun V20z dual 1.86 GHz Opteron, 2 GB RAM • 1Gb on-board Broadcom NIC (same subnet) • Solaris 10 x86 • Linux RHEL3 2.4.21-2.7.8.ELsmp • Client running BetaMiniApp with analysis removed 4: http://xrootd.slac.stanford.edu
Latency Per Request (xrootd) 5: http://xrootd.slac.stanford.edu
Capacity vs Load (xrootd) 6: http://xrootd.slac.stanford.edu
xrootd Server Scaling • Linear scaling relative to load • Allows deterministic sizing of server • Disk • NIC • CPU • Memory • Performance tied directly to hardware cost • Competitive with best-in-class commercial file servers 7: http://xrootd.slac.stanford.edu
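A back-of-envelope sizing sketch under the linear-scaling claim above. All input numbers are illustrative placeholders, not measurements from the talk.

```python
def clients_supported(nic_gbps=1.0, mean_read_kb=3.0, reads_per_client_per_s=50,
                      cpu_us_per_request=100, cores=2):
    """Return the client count at which the NIC or the CPU saturates first
    (placeholder inputs; linear scaling assumed as on the slide)."""
    bytes_per_client_per_s = mean_read_kb * 1024 * reads_per_client_per_s
    nic_limit = (nic_gbps * 1e9 / 8) / bytes_per_client_per_s
    cpu_limit = cores * 1e6 / (cpu_us_per_request * reads_per_client_per_s)
    return int(min(nic_limit, cpu_limit))

print(clients_supported())   # which resource binds first for these assumed numbers
```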
OS Impact on Performance 8: http://xrootd.slac.stanford.edu
Device & Filesystem Impact • Chart: I/O-limited vs. CPU-limited regions • UFS good on small reads • VXFS good on big reads • 1 event ≈ 2K 9: http://xrootd.slac.stanford.edu
Overhead Distribution 10: http://xrootd.slac.stanford.edu
Network Overhead Dominates 11: http://xrootd.slac.stanford.edu
Xrootd Clustering (SLAC) • Diagram: client machines contact the redirectors (kanolb-a, bbr-olb03, bbr-olb04) and are redirected to the data servers kan01, kan02, kan03, kan04 … kanxx (internal details hidden from clients) 12: http://xrootd.slac.stanford.edu
Clustering Performance • Design can scale to at least 256,000 servers • SLAC runs a 1,000 node test server cluster • BNL runs a 350 node production server cluster • Self-regulating (via minimal spanning tree algorithm) • 280 nodes self-cluster in about 7 seconds • 890 nodes self-cluster in about 56 seconds • Client overhead is extremely low • Overhead added to meta-data requests (e.g., open) • ~200us * log64(number of servers) / 2 • Zero overhead for I/O 13: http://xrootd.slac.stanford.edu
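The quoted meta-data overhead is easy to evaluate for different cluster sizes; the sketch below is a direct transcription of the slide's ~200 µs × log64(servers) / 2 estimate.

```python
import math

def open_overhead_us(n_servers, per_hop_us=200):
    """Estimated latency added to meta-data requests (e.g. open), per the slide:
    ~200 us * log64(number of servers) / 2. I/O itself adds zero overhead."""
    if n_servers <= 1:
        return 0.0
    return per_hop_us * math.log(n_servers, 64) / 2

for n in (64, 1000, 256000):
    print(n, "servers ->", round(open_overhead_us(n)), "us per open")
```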
Current MSS Support • Lightweight agnostic interfaces provided • oss.mssgwcmd command • Invoked for each create, dirlist, mv, rm, stat • oss.stagecmd |command • Long running command, request stream protocol • Used to populate disk cache (i.e., “stage-in”) • Diagram: xrootd (oss layer) drives the MSS via mssgwcmd and stagecmd 15: http://xrootd.slac.stanford.edu
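For illustration, a minimal mssgwcmd-style gateway sketch. The real oss layer defines the exact argument conventions; the "operation plus path" interface and the site_mss helper below are assumptions made only to show the dispatch idea.

```python
#!/usr/bin/env python
# Hypothetical mssgwcmd-style gateway; argument format and MSS helper are assumed.
import subprocess
import sys

def site_mss(*args):
    """Placeholder for whatever command talks to the site's mass-storage system."""
    return subprocess.call(["echo", "MSS request:"] + list(args))

OPERATIONS = {"create", "dirlist", "mv", "rm", "stat"}   # operations named on the slide

def main(argv):
    if len(argv) < 3 or argv[1] not in OPERATIONS:
        sys.stderr.write("usage: mssgw {create|dirlist|mv|rm|stat} <path> [path2]\n")
        return 2
    return site_mss(*argv[1:])

if __name__ == "__main__":
    sys.exit(main(sys.argv))
```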
Future Leaf Node SRM • MSS interface is the ideal spot for an SRM hook • Use existing hooks or a new long running hook • mssgwcmd & stagecmd • oss.srm |command • Processes external disk cache management requests • Should scale quite well • Diagram: Grid → srm → xrootd (oss layer) → MSS 16: http://xrootd.slac.stanford.edu
BNL/LBL Proposal • Diagram: the GRID and generic standard clients use replica services, namely the BNL Replica Registration Service & DataMover (rc, dm) and the LBL components (srm, drm, das), layered over xrootd 17: http://xrootd.slac.stanford.edu
Alternative Root Node SRM • Team olbd with SRM • File management & discovery • Tight management control • Several issues need to be considered • Introduces many new failure modes • Will not generally scale • Diagram: Grid → srm → olbd (root node) → MSS 18: http://xrootd.slac.stanford.edu
SRM Integration Status • Unfortunately, SRM interface in flux • Heavy vs light protocol • Working with LBL team • Working towards OSG sanctioned future proposal • Trying to use the Fermilab SRM • Artem Turnov at IN2P3 exploring issues 19: http://xrootd.slac.stanford.edu
The Next Big Thing • High performance data access servers plus efficient large-scale clustering • Allows novel, cost-effective, super-fast massive storage optimized for sparse random access • Imagine 30TB of DRAM at commodity prices 20: http://xrootd.slac.stanford.edu
Device Speed Delivery 21: http://xrootd.slac.stanford.edu
Memory Access Characteristics • Server: zsuntwo • CPU: SPARC • NIC: 100Mb • OS: Solaris 10 • UFS: standard 22: http://xrootd.slac.stanford.edu
The Peta-Cache • Cost-effective memory access impacts science • Nature of all random access analysis • Not restricted to just High Energy Physics • Enables faster and more detailed analysis • Opens new analytical frontiers • Have a 64-node test cluster • V20z each with 16GB RAM • 1TB “toy” machine 23: http://xrootd.slac.stanford.edu
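The arithmetic behind the test cluster, and an extrapolation to the 30 TB goal mentioned earlier (the node count for 30 TB is an inference, not a figure from the talk):

```python
nodes, gb_per_node = 64, 16
print(nodes * gb_per_node / 1024.0, "TB of DRAM in the 64-node test cluster")      # 1.0 TB
target_tb = 30
print(target_tb * 1024 // gb_per_node, "such nodes would be needed for", target_tb, "TB")
```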
Conclusion • High performance data access systems are achievable • The devil is in the details • Must understand the processing domain and deployment infrastructure • Comprehensive, repeatable measurement strategy • High performance and clustering are synergistic • Allows unique performance, usability, scalability, and recoverability characteristics • Such systems produce novel software architectures • Challenges • Creating application algorithms that can make use of such systems • Opportunities • Fast, low-cost access to huge amounts of data to speed discovery 24: http://xrootd.slac.stanford.edu
Acknowledgements • Fabrizio Furano, INFN Padova • Client-side design & development • Bill Weeks • Performance measurement guru • 100’s of measurements repeated 100’s of times • US Department of Energy • Contract DE-AC02-76SF00515 with Stanford University • And our next mystery guest! 25: http://xrootd.slac.stanford.edu