ATLAS Canada Lightpath Data Transfer Trial. Corrie Kost, Steve McDonald (TRIUMF), Bryan Caron (U of Alberta), Wade Hong (Carleton)
Brownie 2.5 TeraByte RAID array • 16 x 160 GB IDE disks (5400 rpm, 2 MB cache) • Hot-swap capable • Dual Ultra160 SCSI interface to host • Maximum transfer ~65 MB/sec • Triple hot-swap power supplies • ~CAN$15k • Arrives July 8th 2002
What to do while waiting for the server to arrive • IBM IntelliStation Pro 6850 (loan) • Dual 2.2 GHz Xeons • 2 PCI 64-bit/66 MHz slots • 4 PCI 32-bit/33 MHz slots • 1.5 GB RAMBUS • Add 2 Promise Ultra100 IDE controllers and 5 disks • Each disk on its own IDE controller for maximum I/O • Begin Linux software RAID performance tests: ~170/130 MB/sec read/write
The Long Road to High Disk I/O • IBM cluster x330's under RH7.2: disk I/O ~15 MB/sec (slow?? expect ~45 MB/sec from any modern single drive) • Need the 2.4.18 Linux kernel to support >1 TB filesystems • IBM cluster x330's under RH7.3: disk I/O ~3 MB/sec. What was going on? Red Hat's modified ServerWorks driver broke DMA on the x330's • The x330's have ATA 100 drives, but the onboard controller is only UDMA 33 • Promise controllers are capable of UDMA 100, but need the latest kernel patches for 2.4.18 before the drives are recognised at UDMA 100 • Finally drives and controller both working at UDMA 100 = 45 MB/sec • Linux software raid0: 2 drives 90 MB/sec, 3 drives 125 MB/sec, 4 drives 155 MB/sec, 5 drives 175 MB/sec • Now we are ready to start network transfers
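A minimal sketch of the kind of read/write checks behind numbers like these (device and mount names are illustrative, not from the original slides):
hdparm -t /dev/hde                                   # raw sequential read from a single drive
dd if=/dev/md0 of=/dev/null bs=1M count=4096         # streaming read from the software RAID0 set
dd if=/dev/zero of=/raid0/testfile bs=1M count=4096  # streaming write through the filesystem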
So what did we do? • Demonstrate a manually provisioned "e2e" lightpath • Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN • Test out 10GbE technology and channel bonding • Establish a new benchmark for high performance disk-to-disk throughput over a large distance
What is an e2e Lightpath • Core design principle of CA*net 4 • Ultimately to give control of lightpath creation, teardown and routing to the end user • Hence, “Customer Empowered Networks” • Provides a flexible infrastructure for emerging grid applications • Alas, can only do things manually today
The Chicago Loopback • Needed to test TCP/IP and Tsunami protocols over long distances, so an optical loop was arranged via StarLight (TRIUMF-BCNET-Chicago-BCNET-TRIUMF), ~91 ms RTT • TRIUMF - CERN RTT is ~200 ms, so we told Damir we really needed a double loopback • "No problem" • Loopback 2 was set up a few days later, RTT = 193 ms (TRIUMF-BCNET-Chicago-BCNET-Chicago-BCNET-TRIUMF)
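One quick sanity check of the loop before launching transfers (the address is an illustrative placeholder for the far end of the loopback, not from the slides):
ping -c 20 10.0.0.2    # report min/avg/max RTT; expect ~91 ms on loopback 1 and ~193 ms on loopback 2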
TRIUMF Server • SuperMicro P4DL6 (dual Xeon 2 GHz) • 400 MHz front side bus • 1 GB DDR2100 RAM • Dual channel Ultra160 onboard SCSI • SysKonnect 9843 SX GbE • 2 independent PCI buses • 6 PCI-X 64-bit/133 MHz capable slots • 3ware 7850 RAID controller • 2 Promise Ultra 100 TX2 controllers
CERN Server • SuperMicro P4DL6 (dual Xeon 2 GHz) • 400 MHz front side bus • 1 GB DDR2100 RAM • Dual channel Ultra160 onboard SCSI • SysKonnect 9843 SX GbE • 2 independent PCI buses • 6 PCI-X 64-bit/133 MHz capable slots • 2 3ware 7850 RAID controllers with 6 IDE drives on each • RH7.3 on a 13th drive connected to the onboard IDE • WD Caviar 120 GB drives with 8 MB cache • RMC4D from HARDDATA
TRIUMF Backup Server • SuperMicro P4DL6 (dual Xeon 1.8 GHz) • Supermicro 742I-420 17" 4U chassis with 420 W power supply • 400 MHz front side bus • 1 GB DDR2100 RAM • Dual channel Ultra160 onboard SCSI • SysKonnect 9843 SX GbE • 2 independent PCI buses • 6 PCI-X 64-bit/133 MHz capable slots • 2 Promise Ultra 133 TX2 controllers & 1 Promise Ultra 100 TX2 controller
Back-to-back tests over 12,000km loopback using designated servers
Operating System • Red Hat 7.3 based, Linux kernel 2.4.18-3 • Needed to support filesystems > 1 TB • Upgrades and patches: patched to 2.4.18-10 • Intel Pro/10GbE Linux driver (early stable) • SysKonnect 9843 SX Linux driver (latest) • Ported Sylvain Ravot's TCP tuning patches
Intel 10GbE Cards • Intel kindly loaned us 2 of their Pro/10GbE LR server adapter cards despite the end of their alpha program • Based on the Intel 82597EX 10 Gigabit Ethernet Controller • Note the length of the card!
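A minimal sketch of bringing one of these cards up under the early Linux driver (ixgb is the standard module name for the 82597EX; the interface name, address, and MTU setting are illustrative assumptions):
modprobe ixgb                                        # load the Intel Pro/10GbE driver
ifconfig eth2 192.168.2.1 netmask 255.255.255.0 up   # illustrative address on the lightpath subnet
ifconfig eth2 mtu 9000                               # jumbo frames, if the path supports them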
Extreme Networks switches [photos: TRIUMF and CERN installations]
IDE Disk Arrays [photos: TRIUMF send host and CERN receive host]
Disk Read/Write Performance • TRIUMF send host: • 1 3ware 7850 and 2 Promise Ultra 100TX2 PCI controllers • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB) • Tuned for optimal read performance (227/174 MB/s) • CERN receive host: • 2 3ware 7850 64-bit/33 MHz PCI IDE controllers • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB) • Tuned for optimal write performance (295/210 MB/s)
THUNDER RAID DETAILS
raidstop /dev/md0
mkraid -R /dev/md0
mkfs -t ext3 /dev/md0
mount -t ext2 /dev/md0 /raid0
/root/raidtab:
raiddev /dev/md0
  raid-level 0
  nr-raid-disks 12
  persistent-superblock 1
  chunk-size 512          # kbytes
  device /dev/sdc
  raid-disk 0
  device /dev/sdd
  raid-disk 1
  device /dev/sde
  raid-disk 2
  device /dev/sdf
  raid-disk 3
  device /dev/sdg
  raid-disk 4
  device /dev/sdh
  raid-disk 5
  device /dev/sdi
  raid-disk 6
  device /dev/sdj
  raid-disk 7             # sdc-sdj: 8 drives on the 3ware
  device /dev/hde
  raid-disk 8
  device /dev/hdg
  raid-disk 9
  device /dev/hdi
  raid-disk 10
  device /dev/hdk
  raid-disk 11            # hde-hdk: 4 drives on the 2 Promise controllers
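After mkraid, a quick sanity check that all 12 drives joined the array (standard tools, not from the original slide):
cat /proc/mdstat      # md0 should show raid0 across the 12 devices listed above
hdparm -t /dev/md0    # rough sequential read figure for the whole set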
Black Magic • We are novices in the art of optimizing system performance • It is also time consuming • We followed most conventional wisdom, much of which we don’t yet fully understand
Testing Methodologies • Began testing with a variety of bandwidth characterization tools • pipechar, pchar, ttcp, iperf, netpipe, pathchar, etc. • Evaluated high performance file transfer applications • bbftp, bbcp, tsunami, pftp • Developed scripts to automate the tools and to scan their parameter space (see the sketch below)
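As an illustration of the kind of runs those scripts automated (host names, the username, and the specific flag values are assumptions, not taken from the slides):
iperf -s -w 4M                                                   # memory-to-memory baseline: receiver with a 4 MB window
iperf -c cern-10g -w 4M -P 4 -t 60                               # sender: 4 parallel streams for 60 s
bbftp -u atlas -p 8 -e "put /raid/mcfile /raid/mcfile" cern-10g  # disk-to-disk: 8 parallel TCP streams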
Disk I/O Black Magic
• min/max readahead on both systems:
  sysctl -w vm.min-readahead=127
  sysctl -w vm.max-readahead=256
• bdflush on receive host:
  sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
  or echo 2 500 0 0 500 1000 60 20 0 >/proc/sys/vm/bdflush
• bdflush on send host:
  sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
  or echo 30 500 0 0 500 3000 60 20 0 >/proc/sys/vm/bdflush
Misc. Tuning and other tips
/sbin/elvtune -r 512 /dev/sdc    (same for the other 11 disks)
/sbin/elvtune -w 1024 /dev/sdc   (same for the other 11 disks)
-r sets the max latency that the I/O scheduler will provide on each read
-w sets the max latency that the I/O scheduler will provide on each write
When the /raid disk refuses to dismount (works for kernels 2.4.11 or later):
umount -l /raid    (the -l is a "lazy" unmount; then mount & umount)
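Running elvtune with just a device argument prints its current settings, a handy check after tuning all 12 drives (a usage note, not from the slides):
/sbin/elvtune /dev/sdc    # shows the current read_latency and write_latency values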
Disk I/O Black Magic
• Disk I/O elevators (minimal impact noticed)
  /sbin/elvtune allows some control of latency vs throughput
  read_latency set to 512 (default 8192)
  write_latency set to 1024 (default 16384)
• atime
  Disables updating the last time a file was accessed (typical for file servers)
  mount -t ext2 -o noatime /dev/md0 /raid
• Typically ext3 writes at ~90 MB/sec while ext2 writes at ~190 MB/sec; reads are minimally affected. We always used ext2.
Disk I/O Black Magic
• IRQ affinity (process affinity would be better, but that requires a 2.5 kernel)
[root@thunder root]# more /proc/interrupts
           CPU0       CPU1
  0:   15723114          0   IO-APIC-edge   timer
  1:         12          0   IO-APIC-edge   keyboard
  2:          0          0   XT-PIC         cascade
  8:          1          0   IO-APIC-edge   rtc
 10:          0          0   IO-APIC-level  usb-ohci
 14:         22          0   IO-APIC-edge   ide0
 15:     227234          2   IO-APIC-edge   ide1
 16:        126          0   IO-APIC-level  aic7xxx
 17:         16          0   IO-APIC-level  aic7xxx
 18:         91          0   IO-APIC-level  ide4, ide5, 3ware Storage Controller
 20:         14          0   IO-APIC-level  ide2, ide3
 22:    2296662          0   IO-APIC-level  SysKonnect SK-98xx
 24:          2          0   IO-APIC-level  eth3
 26:    2296673          0   IO-APIC-level  SysKonnect SK-98xx
 30:   26640812          0   IO-APIC-level  eth0
NMI:          0          0
LOC:   15724196   15724154
ERR:          0
MIS:          0
echo 1 >/proc/irq/18/smp_affinity                          # use CPU0
echo 2 >/proc/irq/18/smp_affinity                          # use CPU1
echo 3 >/proc/irq/18/smp_affinity                          # use either
cat /proc/irq/prof_cpu_mask >/proc/irq/18/smp_affinity     # reset to default
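The same trick extends naturally to splitting the NIC and the disk controller across CPUs; for example (IRQ numbers taken from the listing above, the particular split is an assumption):
echo 1 >/proc/irq/22/smp_affinity     # SysKonnect GbE interrupts on CPU0
echo 2 >/proc/irq/18/smp_affinity     # 3ware / IDE interrupts on CPU1
cat /proc/irq/22/smp_affinity         # verify the mask took effect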
TCP Black Magic
• Typically suggested TCP and net buffer tuning:
  sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
  sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
  sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"
  sysctl -w net.core.rmem_default=65535
  sysctl -w net.core.rmem_max=8388608
  sysctl -w net.core.wmem_default=65535
  sysctl -w net.core.wmem_max=8388608
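For reference, these buffer sizes are driven by the bandwidth-delay product of the path; a back-of-envelope check (not from the slides): 1 Gbit/sec x 0.2 s RTT = 200 Mbit, or about 25 MB of data in flight, so a single TCP stream needs a window of that order to fill the pipe, while several parallel streams can share the path with the ~4-8 MB buffers above.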
TCP Black Magic
• Sylvain Ravot's tcp tune patch parameters:
  sysctl -w net.ipv4.tcp_tune="115 115 0"
• Linux 2.4 retentive TCP: caches TCP control information for a destination for 10 minutes
• To avoid caching:
  sysctl -w net.ipv4.route.flush=1
We are live continent to continent! • e2e lightpath up and running Friday Sept 20, 21:45 CET
traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
BBFTP Transfer [throughput plots: Vancouver ONS ons-van01, ports enet_15/1 and enet_15/2]
BBFTP Transfer [throughput plots: Chicago ONS GigE ports 1 and 2]
Tsunami Transfer [throughput plots: Vancouver ONS ons-van01, ports enet_15/1 and enet_15/2]
Tsunami Transfer [throughput plots: Chicago ONS GigE ports 1 and 2]
Exceeding 1 Gbit/sec … (using Tsunami)
What does it mean for TRIUMF in the long term • Established a relationship with a 'grid' of people for future networking projects • Upgraded the WAN connection from 100 Mbit to 4 x 1 GbE connections directly to BCNET: CANARIE (educational/research network), WestGrid (grid computing), commercial Internet, and a spare (research & development) • Recognition that TRIUMF has the expertise and the network connectivity for the large-scale, high-speed data transfers required by upcoming scientific programs (ATLAS, WestGrid, etc.)
Lessons Learned (1) • Linux software RAID is faster than most conventional SCSI and IDE hardware RAID based systems • One controller for each drive; the more disk spindles the better • More than 2 Promise controllers per machine are possible (100/133 MHz) • Unless programs are multi-threaded or the kernel permits process locking, dual CPUs will not give the best performance: a single 2.8 GHz is likely to outperform dual 2.0 GHz for a single-purpose machine like our fileservers • The more memory the better
Misc. comments • No hardware failures - not even among the 50 disks! • Largest file transferred: 114 GB (Sep 24) • Tar, compression, etc. take longer than the transfer • Deleting files can take a lot of time • Low project cost: ~$20,000, with most of that recycled
[Plot: 220 MB/sec and 175 MB/sec]
Acknowledgements • Canarie • Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian • Atlas Canada • Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebvre, Richard Keeler • HEPnet Canada • Dean Karlen • TRIUMF • Renee Poutissou, Konstantin Olchanski, Mike Vetterli (SFU / Westgrid) • BCNET • Mike Hrybyk, Marilyn Hay, Dennis O'Reilly, Don McWilliams
Acknowledgements • Extreme Networks • Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner • Intel Corporation • Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg
Acknowledgements • Indiana University • Mark Meiss, Stephen Wallace • Caltech • Sylvain Ravot, Harvey Newman • CERN • Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin • SURFnet/Universiteit van Amsterdam • Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat
Acknowledgements • Yotta Yotta • Geoff Hayward, Reg Joseph, Ying Xie, E. Siu • BCIT • Bill Rutherford • Jalaam • Loki Jorgensen • Netera • Gary Finley
ATLAS Canada: Alberta, SFU, Montreal, Victoria, UBC, Carleton, York, TRIUMF, Toronto