Explore key insights and best practices in achieving end-to-end performance for real users in grid networking. Topics include network monitoring, capacity planning, throughput optimization, hardware, protocol investigations, and application analysis.
Lessons Learned in Grid Networking
or: How do we get end-to-end performance to Real Users?
Richard Hughes-Jones
GNEW2004, CERN, March 2004 – R. Hughes-Jones, Manchester
Network Monitoring is Essential
Why monitor:
• Detect or cross-check problem reports; isolate / determine a (user) performance issue
• Capacity planning and SLA verification
• Publication of data: network "cost" for middleware
  • Resource Brokers (RBs) for optimized matchmaking
  • WP2 Replica Manager
• Isolate / determine a throughput bottleneck – work with real user problems
• Test conditions for protocol / hardware investigations
  • Protocol performance / development
  • Hardware performance / development
• Application analysis
  • Input to middleware – e.g. gridftp throughput
What to measure:
• End-to-end time series: UDP/TCP throughput, rtt, packet loss
• Passive monitoring: routers & switches via SNMP, with MRTG history
• Packet / protocol dynamics: tcpdump, web100
• Output from application tools
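The measured quantities above (rtt, loss, MSS) are exactly what a published network "cost" can be derived from. As an illustration only – the function name and parameters below are my own, not any WP7 tool – the well-known Mathis approximation estimates achievable standard-TCP throughput from those measurements:

```python
import math

def mathis_tcp_rate(mss_bytes, rtt_s, loss_prob, c=math.sqrt(3.0 / 2.0)):
    """Mathis et al. approximation for steady-state standard-TCP
    throughput in bits/s: rate <= (MSS / RTT) * C / sqrt(p)."""
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_prob)

# A 1460-byte MSS path with 20 ms rtt and 1e-4 loss tops out near
# 70 Mbit/s: long fat networks need very low loss (or a new TCP stack).
rate = mathis_tcp_rate(1460, 0.020, 1e-4)
print(f"{rate / 1e6:.0f} Mbit/s")
```

Feeding a cost like this to a Resource Broker lets matchmaking prefer replicas behind low-rtt, low-loss paths.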
Multi-Gigabit transfers are possible and stable
10 GigEthernet at the SC2003 Bandwidth Challenge
• Three server systems with 10 GigEthernet NICs
• Used the DataTAG altAIMD stack with 9000 byte MTU
• Sent memory-to-memory iperf TCP streams from the SLAC/FNAL booth in Phoenix to:
  • Palo Alto PAIX – rtt 17 ms, window 30 MB; shared with the Caltech booth
    • 4.37 Gbit HSTCP, I=5%; then 2.87 Gbit, I=16% (the fall corresponds to 10 Gbit on the link)
    • 3.3 Gbit Scalable TCP, I=8%; tested 2 flows, sum 1.9 Gbit, I=39%
  • Chicago Starlight – rtt 65 ms, window 60 MB; Phoenix CPU 2.2 GHz
    • 3.1 Gbit HSTCP, I=1.6%
  • Amsterdam SARA – rtt 175 ms, window 200 MB; Phoenix CPU 2.2 GHz
    • 4.35 Gbit HSTCP, I=6.9% – very stable
• Both used Abilene to Chicago
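The window sizes quoted above follow directly from the bandwidth-delay product: a window W sustains at most W/rtt, and filling a path needs W = bandwidth × rtt. A quick check against the slide's figures (helper names are illustrative):

```python
def bdp_mbytes(rate_bit_s, rtt_s):
    """Window needed to fill a path: bandwidth * delay, in Mbytes."""
    return rate_bit_s * rtt_s / 8 / 1e6

def window_ceiling_gbit(window_bytes, rtt_s):
    """Maximum TCP throughput a given window can sustain: W / rtt."""
    return window_bytes * 8 / rtt_s / 1e9

# Filling a 10 Gbit path to Amsterdam (rtt 175 ms) needs ~219 MB of
# window, hence the 200 MB setting; Palo Alto at 17 ms needs only ~21 MB.
print(f"{bdp_mbytes(10e9, 0.175):.0f} MB, {bdp_mbytes(10e9, 0.017):.0f} MB")

# Chicago's 60 MB window at 65 ms caps throughput below the 10 Gbit link:
print(f"{window_ceiling_gbit(60e6, 0.065):.1f} Gbit/s")
```

The 3.1 Gbit actually achieved to Chicago sits well under this ~7.4 Gbit window ceiling, so the window was not the binding limit there.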
The performance of the end host / disks is really important
BaBar Case Study: RAID Throughput & PCI Activity
• 3Ware 7500-8 RAID5 controller with parallel EIDE disks
• The 3Ware card forces the PCI bus to 33 MHz
• BaBar Tyan to MB-NG SuperMicro: network memory-to-memory 619 Mbit/s
• Disk-to-disk throughput with bbcp: 40–45 Mbytes/s (320–360 Mbit/s)
• PCI bus effectively full!
• User throughput ~250 Mbit/s – the user was surprised!
• Plots: read from RAID5 disks / write to RAID5 disks
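The "bus effectively full" observation survives back-of-envelope arithmetic: a 32-bit PCI bus forced to 33 MHz peaks at about 1 Gbit/s, and in a disk-to-network transfer each payload byte crosses the shared bus twice (disk→memory, memory→NIC). The crossing count and the efficiency factor below are illustrative assumptions, not measurements from the slide:

```python
def pci_capacity_mbit(clock_hz, width_bits):
    """Theoretical peak PCI bandwidth, ignoring arbitration/setup."""
    return clock_hz * width_bits / 1e6

def user_ceiling_mbit(clock_hz, width_bits, crossings, efficiency):
    """Ceiling on user throughput when each payload byte crosses the
    shared bus `crossings` times at an assumed bus efficiency."""
    return pci_capacity_mbit(clock_hz, width_bits) * efficiency / crossings

# 32-bit PCI at 33 MHz peaks at 1056 Mbit/s; with two crossings and an
# assumed ~50% real-world efficiency the ceiling is already in the
# ~250 Mbit/s region the user actually measured.
print(pci_capacity_mbit(33e6, 32), user_ceiling_mbit(33e6, 32, 2, 0.5))
```

The lesson generalises: a single slow peripheral can drag the whole shared bus, and with it every transfer on the host, down to its own clock.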
MB-NG: Application design – Throughput + Web100
• 2 Gbyte file transferred from RAID0 disks
• Web100 output every 10 ms
• Gridftp: throughput alternates between 600/800 Mbit and zero
• Apache web server + curl-based client: steady 720 Mbit
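The alternating-vs-steady pattern appears as soon as you difference a cumulative byte counter (such as Web100's per-connection byte count) at a fixed interval, which is essentially what the 10 ms plots do. A minimal sketch with made-up counter values chosen to mimic the two behaviours:

```python
def throughput_mbit(cumulative_bytes, interval_s):
    """Per-interval throughput (Mbit/s) from a cumulative byte counter
    sampled every interval_s seconds, Web100-style."""
    return [(b - a) * 8 / interval_s / 1e6
            for a, b in zip(cumulative_bytes, cumulative_bytes[1:])]

# A stalling transfer (gridftp-like) vs a steady one, sampled every 10 ms:
stalling = [0, 750_000, 750_000, 1_750_000, 1_750_000]
steady   = [0, 900_000, 1_800_000, 2_700_000, 3_600_000]
print([round(x) for x in throughput_mbit(stalling, 0.010)])  # bursts and zeros
print([round(x) for x in throughput_mbit(steady, 0.010)])    # flat 720 Mbit
```

The stalls typically come from the application serialising disk I/O and network sends; the steady curl transfer keeps the pipe continuously fed.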
Summary
• Network monitoring is vital
• Development of new TCP stacks and non-TCP protocols is required
• Multi-Gigabit transfers are possible and stable on current networks
• Complementary provision of packet IP & λ-networks is needed
• The performance of the end host / disks is really important
• Application design can determine perceived network performance
• Helping real users is a must – it can be harder than herding cats
• Cooperation between network providers, network researchers, and network users has been impressive
• Standards (e.g. GGF / IETF) are the way forward
• Many grid projects just assume the network will work!
• It takes a lot of cooperation to put all the components together
Tuning PCI-X: Variation of mmrbc (IA32)
• 16080 byte packets sent every 200 µs
• Intel PRO/10GbE LR adapter
• PCI-X bus occupancy vs mmrbc: 512, 1024, 2048, 4096 bytes
• PCI-X sequence: CSR access, data transfer, interrupt & CSR update
• Plots show: measured times; times based on PCI-X timings from the logic analyser; expected throughput
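The effect of mmrbc (maximum memory read byte count) can be modelled as a per-burst setup cost: fetching a 16080-byte packet takes ceil(16080/mmrbc) bus transactions, each paying address/arbitration overhead before streaming data. The overhead and clock figures below are illustrative assumptions, not the logic-analyser numbers from the slide:

```python
import math

PCIX_CLOCK_HZ = 133e6    # 64-bit PCI-X at 133 MHz
BYTES_PER_CYCLE = 8      # 64-bit data phase moves 8 bytes per clock
SETUP_CYCLES = 20        # assumed per-transaction overhead (illustrative)

def transfer_time_us(packet_bytes, mmrbc):
    """Estimated PCI-X occupancy to fetch one packet from host memory."""
    bursts = math.ceil(packet_bytes / mmrbc)
    data_cycles = math.ceil(packet_bytes / BYTES_PER_CYCLE)
    return (bursts * SETUP_CYCLES + data_cycles) / PCIX_CLOCK_HZ * 1e6

# Larger mmrbc means fewer bursts, so less setup overhead per packet:
for mmrbc in (512, 1024, 2048, 4096):
    print(mmrbc, round(transfer_time_us(16080, mmrbc), 2))
```

This is why the measured occupancy in the plots falls as mmrbc grows: the data cycles are fixed, so only the per-burst overhead shrinks.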
GGF: Hierarchy Characteristics Document
• "A Hierarchy of Network Performance Characteristics for Grid Applications and Services"
• The document defines terms & relations for:
  • Network characteristics
  • Measurement methodologies
  • Observation
• Discusses nodes & paths
• For each characteristic it:
  • Defines the meaning
  • Lists attributes that SHOULD be included
  • Notes issues to consider when making an observation
• Status:
  • Originally submitted to GFSG as a Community Practice Document: draft-ggf-nmwg-hierarchy-00.pdf, Jul 2003
  • Revised to Proposed Recommendation: http://www-didc.lbl.gov/NMWG/docs/draft-ggf-nmwg-hierarchy-02.pdf, 7 Jan 04
  • Now in a 60-day public comment period from 28 Jan 04 – 18 days to go