ATLAS Canada Lightpath Data Transfer Trial. Corrie Kost, Steve McDonald (TRIUMF), Bryan Caron (U of Alberta), Wade Hong (Carleton)
Brownie 2.5 TeraByte RAID array • 16 x 160 GB IDE disks (5400 rpm, 2 MB cache) • Hot-swap capable • Dual Ultra160 SCSI interface to host • Maximum transfer ~65 MB/sec • Triple hot-swap power supplies • ~CAN$15k • Arrives July 8th 2002
What to do while waiting for the server to arrive • IBM IntelliStation Pro 6850 (loan) • Dual 2.2 GHz Xeons • 2 PCI 64-bit/66 MHz slots • 4 PCI 32-bit/33 MHz slots • 1.5 GB RAMBUS • Add 2 Promise Ultra100 IDE controllers and 5 disks • Each disk on its own IDE controller for maximum I/O • Begin Linux software RAID performance tests: ~170/130 MB/sec read/write
The Long Road to High Disk I/O • IBM cluster x330's under RH7.2: disk I/O ~15 MB/sec (slow?? expect ~45 MB/sec from any modern single drive) • Need the 2.4.18 Linux kernel to support >1 TB filesystems • IBM cluster x330's under RH7.3: disk I/O ~3 MB/sec. What was going on? Red Hat's modified ServerWorks driver broke DMA on the x330's • The x330's have ATA 100 drives, but the onboard controller is only UDMA 33 • Promise controllers are capable of UDMA 100, but need the latest kernel patches for 2.4.18 before the drives are recognised at UDMA 100 • Finally drives and controller both working at UDMA 100 = 45 MB/sec • Linux software raid0: 2 drives 90 MB/sec, 3 drives 125 MB/sec, 4 drives 155 MB/sec, 5 drives 175 MB/sec • Now we are ready to start network transfers
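A minimal sketch of the kind of read/write checks behind numbers like these (device and mount names are illustrative, not from the original slides):
hdparm -t /dev/hde                                   # raw sequential read from a single drive
dd if=/dev/md0 of=/dev/null bs=1M count=4096         # streaming read from the software RAID0 set
dd if=/dev/zero of=/raid0/testfile bs=1M count=4096  # streaming write through the filesystem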
So what did we do? • Demonstrate a manually provisioned "e2e" lightpath • Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN • Test out 10GbE technology and channel bonding • Establish a new benchmark for high performance disk-to-disk throughput over a large distance
What is an e2e Lightpath • Core design principle of CA*net 4 • Ultimately to give control of lightpath creation, teardown and routing to the end user • Hence, “Customer Empowered Networks” • Provides a flexible infrastructure for emerging grid applications • Alas, can only do things manually today
The Chicago Loopback • Needed to test TCP/IP and Tsunami protocols over long distances, so an optical loop was arranged via StarLight (TRIUMF-BCNET-Chicago-BCNET-TRIUMF), ~91 ms RTT • TRIUMF - CERN RTT is ~200 ms, so we told Damir we really needed a double loopback • "No problem" • Loopback 2 was set up a few days later, RTT = 193 ms (TRIUMF-BCNET-Chicago-BCNET-Chicago-BCNET-TRIUMF)
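One quick sanity check of the loop before launching transfers (the address is an illustrative placeholder for the far end of the loopback, not from the slides):
ping -c 20 10.0.0.2    # report min/avg/max RTT; expect ~91 ms on loopback 1 and ~193 ms on loopback 2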
TRIUMF Server • SuperMicro P4DL6 (dual Xeon 2 GHz) • 400 MHz front side bus • 1 GB DDR2100 RAM • Dual channel Ultra160 onboard SCSI • SysKonnect 9843 SX GbE • 2 independent PCI buses • 6 PCI-X 64-bit/133 MHz capable slots • 3ware 7850 RAID controller • 2 Promise Ultra 100 TX2 controllers
CERN Server • SuperMicro P4DL6 (dual Xeon 2 GHz) • 400 MHz front side bus • 1 GB DDR2100 RAM • Dual channel Ultra160 onboard SCSI • SysKonnect 9843 SX GbE • 2 independent PCI buses • 6 PCI-X 64-bit/133 MHz capable slots • 2 3ware 7850 RAID controllers with 6 IDE drives on each • RH7.3 on a 13th drive connected to the onboard IDE • WD Caviar 120 GB drives with 8 MB cache • RMC4D from HARDDATA
TRIUMF Backup Server • SuperMicro P4DL6 (dual Xeon 1.8 GHz) • Supermicro 742I-420 17" 4U chassis with 420 W power supply • 400 MHz front side bus • 1 GB DDR2100 RAM • Dual channel Ultra160 onboard SCSI • SysKonnect 9843 SX GbE • 2 independent PCI buses • 6 PCI-X 64-bit/133 MHz capable slots • 2 Promise Ultra 133 TX2 controllers & 1 Promise Ultra 100 TX2 controller
Back-to-back tests over 12,000km loopback using designated servers
Operating System • Red Hat 7.3 based, Linux kernel 2.4.18-3 • Needed to support filesystems > 1 TB • Upgrades and patches: patched to 2.4.18-10 • Intel Pro/10GbE Linux driver (early stable) • SysKonnect 9843 SX Linux driver (latest) • Ported Sylvain Ravot's TCP tuning patches
Intel 10GbE Cards • Intel kindly loaned us 2 of their Pro/10GbE LR server adapter cards despite the end of their alpha program • Based on the Intel 82597EX 10 Gigabit Ethernet Controller • Note the length of the card!
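A minimal sketch of bringing one of these cards up under the early Linux driver (ixgb is the standard module name for the 82597EX; the interface name, address, and MTU setting are illustrative assumptions):
modprobe ixgb                                        # load the Intel Pro/10GbE driver
ifconfig eth2 192.168.2.1 netmask 255.255.255.0 up   # illustrative address on the lightpath subnet
ifconfig eth2 mtu 9000                               # jumbo frames, if the path supports them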
Extreme Networks switches [photos: TRIUMF and CERN installations]
IDE Disk Arrays [photos: TRIUMF send host and CERN receive host]
Disk Read/Write Performance • TRIUMF send host: • 1 3ware 7850 and 2 Promise Ultra 100TX2 PCI controllers • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB) • Tuned for optimal read performance (227/174 MB/s) • CERN receive host: • 2 3ware 7850 64-bit/33 MHz PCI IDE controllers • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB) • Tuned for optimal write performance (295/210 MB/s)
THUNDER RAID DETAILS
raidstop /dev/md0
mkraid -R /dev/md0
mkfs -t ext3 /dev/md0
mount -t ext2 /dev/md0 /raid0
/root/raidtab:
raiddev /dev/md0
  raid-level 0
  nr-raid-disks 12
  persistent-superblock 1
  chunk-size 512          # kbytes
  device /dev/sdc
  raid-disk 0
  device /dev/sdd
  raid-disk 1
  device /dev/sde
  raid-disk 2
  device /dev/sdf
  raid-disk 3
  device /dev/sdg
  raid-disk 4
  device /dev/sdh
  raid-disk 5
  device /dev/sdi
  raid-disk 6
  device /dev/sdj
  raid-disk 7             # sdc-sdj: 8 drives on the 3ware
  device /dev/hde
  raid-disk 8
  device /dev/hdg
  raid-disk 9
  device /dev/hdi
  raid-disk 10
  device /dev/hdk
  raid-disk 11            # hde-hdk: 4 drives on the 2 Promise controllers
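After mkraid, a quick sanity check that all 12 drives joined the array (standard tools, not from the original slide):
cat /proc/mdstat      # md0 should show raid0 across the 12 devices listed above
hdparm -t /dev/md0    # rough sequential read figure for the whole set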
Black Magic • We are novices in the art of optimizing system performance • It is also time consuming • We followed most conventional wisdom, much of which we don’t yet fully understand
Testing Methodologies • Began testing with a variety of bandwidth characterization tools • pipechar, pchar, ttcp, iperf, netpipe, pathchar, etc. • Evaluated high performance file transfer applications • bbftp, bbcp, tsunami, pftp • Developed scripts to automate the tools and to scan their parameter space (see the sketch below)
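As an illustration of the kind of runs those scripts automated (host names, the username, and the specific flag values are assumptions, not taken from the slides):
iperf -s -w 4M                                                   # memory-to-memory baseline: receiver with a 4 MB window
iperf -c cern-10g -w 4M -P 4 -t 60                               # sender: 4 parallel streams for 60 s
bbftp -u atlas -p 8 -e "put /raid/mcfile /raid/mcfile" cern-10g  # disk-to-disk: 8 parallel TCP streams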
Disk I/O Black Magic
• min/max readahead on both systems:
  sysctl -w vm.min-readahead=127
  sysctl -w vm.max-readahead=256
• bdflush on receive host:
  sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
  or echo 2 500 0 0 500 1000 60 20 0 >/proc/sys/vm/bdflush
• bdflush on send host:
  sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
  or echo 30 500 0 0 500 3000 60 20 0 >/proc/sys/vm/bdflush
Misc. Tuning and other tips
/sbin/elvtune -r 512 /dev/sdc    (same for the other 11 disks)
/sbin/elvtune -w 1024 /dev/sdc   (same for the other 11 disks)
-r sets the max latency that the I/O scheduler will provide on each read
-w sets the max latency that the I/O scheduler will provide on each write
When the /raid disk refuses to dismount (works for kernels 2.4.11 or later):
umount -l /raid    (the -l is a "lazy" unmount; then mount & umount)
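Running elvtune with just a device argument prints its current settings, a handy check after tuning all 12 drives (a usage note, not from the slides):
/sbin/elvtune /dev/sdc    # shows the current read_latency and write_latency values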
Disk I/O Black Magic
• Disk I/O elevators (minimal impact noticed)
  /sbin/elvtune allows some control of latency vs throughput
  read_latency set to 512 (default 8192)
  write_latency set to 1024 (default 16384)
• atime
  Disables updating the last time a file was accessed (typical for file servers)
  mount -t ext2 -o noatime /dev/md0 /raid
• Typically ext3 writes at ~90 MB/sec while ext2 writes at ~190 MB/sec; reads are minimally affected. We always used ext2.
Disk I/O Black Magic
• IRQ affinity (process affinity would be better, but that requires a 2.5 kernel)
[root@thunder root]# more /proc/interrupts
           CPU0       CPU1
  0:   15723114          0   IO-APIC-edge   timer
  1:         12          0   IO-APIC-edge   keyboard
  2:          0          0   XT-PIC         cascade
  8:          1          0   IO-APIC-edge   rtc
 10:          0          0   IO-APIC-level  usb-ohci
 14:         22          0   IO-APIC-edge   ide0
 15:     227234          2   IO-APIC-edge   ide1
 16:        126          0   IO-APIC-level  aic7xxx
 17:         16          0   IO-APIC-level  aic7xxx
 18:         91          0   IO-APIC-level  ide4, ide5, 3ware Storage Controller
 20:         14          0   IO-APIC-level  ide2, ide3
 22:    2296662          0   IO-APIC-level  SysKonnect SK-98xx
 24:          2          0   IO-APIC-level  eth3
 26:    2296673          0   IO-APIC-level  SysKonnect SK-98xx
 30:   26640812          0   IO-APIC-level  eth0
NMI:          0          0
LOC:   15724196   15724154
ERR:          0
MIS:          0
echo 1 >/proc/irq/18/smp_affinity                          # use CPU0
echo 2 >/proc/irq/18/smp_affinity                          # use CPU1
echo 3 >/proc/irq/18/smp_affinity                          # use either
cat /proc/irq/prof_cpu_mask >/proc/irq/18/smp_affinity     # reset to default
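The same trick extends naturally to splitting the NIC and the disk controller across CPUs; for example (IRQ numbers taken from the listing above, the particular split is an assumption):
echo 1 >/proc/irq/22/smp_affinity     # SysKonnect GbE interrupts on CPU0
echo 2 >/proc/irq/18/smp_affinity     # 3ware / IDE interrupts on CPU1
cat /proc/irq/22/smp_affinity         # verify the mask took effect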
TCP Black Magic
• Typically suggested TCP and net buffer tuning:
  sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
  sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
  sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"
  sysctl -w net.core.rmem_default=65535
  sysctl -w net.core.rmem_max=8388608
  sysctl -w net.core.wmem_default=65535
  sysctl -w net.core.wmem_max=8388608
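For reference, these buffer sizes are driven by the bandwidth-delay product of the path; a back-of-envelope check (not from the slides): 1 Gbit/sec x 0.2 s RTT = 200 Mbit, or about 25 MB of data in flight, so a single TCP stream needs a window of that order to fill the pipe, while several parallel streams can share the path with the ~4-8 MB buffers above.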
TCP Black Magic
• Sylvain Ravot's tcp tune patch parameters:
  sysctl -w net.ipv4.tcp_tune="115 115 0"
• Linux 2.4 retentive TCP: caches TCP control information for a destination for 10 minutes
• To avoid caching:
  sysctl -w net.ipv4.route.flush=1
We are live continent to continent! • e2e lightpath up and running Friday Sept 20, 21:45 CET
traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
BBFTP Transfer [throughput plots: Vancouver ONS ons-van01, ports enet_15/1 and enet_15/2]
BBFTP Transfer [throughput plots: Chicago ONS GigE ports 1 and 2]
Tsunami Transfer [throughput plots: Vancouver ONS ons-van01, ports enet_15/1 and enet_15/2]
Tsunami Transfer [throughput plots: Chicago ONS GigE ports 1 and 2]
Exceeding 1 Gbit/sec … (using Tsunami)
What does it mean for TRIUMF in the long term • Established a relationship with a 'grid' of people for future networking projects • Upgraded the WAN connection from 100 Mbit to 4 x 1 GbE connections directly to BCNET: CANARIE (educational/research network), WestGrid (grid computing), commercial Internet, and a spare (research & development) • Recognition that TRIUMF has the expertise and the network connectivity for the large-scale, high-speed data transfers required by upcoming scientific programs (ATLAS, WestGrid, etc.)
Lessons Learned (1) • Linux software RAID is faster than most conventional SCSI and IDE hardware RAID based systems • One controller for each drive; the more disk spindles the better • More than 2 Promise controllers per machine are possible (100/133 MHz) • Unless programs are multi-threaded or the kernel permits process locking, dual CPUs will not give the best performance: a single 2.8 GHz is likely to outperform dual 2.0 GHz for a single-purpose machine like our fileservers • The more memory the better
Misc. comments • No hardware failures - not even among the 50 disks! • Largest file transferred: 114 GB (Sep 24) • Tar, compression, etc. take longer than the transfer • Deleting files can take a lot of time • Low project cost: ~$20,000, with most of that recycled
[Plot: 220 MB/sec and 175 MB/sec]
Acknowledgements • Canarie • Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian • Atlas Canada • Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebvre, Richard Keeler • HEPnet Canada • Dean Karlen • TRIUMF • Renee Poutissou, Konstantin Olchanski, Mike Vetterli (SFU / Westgrid) • BCNET • Mike Hrybyk, Marilyn Hay, Dennis O'Reilly, Don McWilliams
Acknowledgements • Extreme Networks • Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner • Intel Corporation • Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg
Acknowledgements • Indiana University • Mark Meiss, Stephen Wallace • Caltech • Sylvain Ravot, Harvey Newman • CERN • Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin • SURFnet/Universiteit van Amsterdam • Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat
Acknowledgements • Yotta Yotta • Geoff Hayward, Reg Joseph, Ying Xie, E. Siu • BCIT • Bill Rutherford • Jalaam • Loki Jorgensen • Netera • Gary Finley
ATLAS Canada: Alberta, SFU, Montreal, Victoria, UBC, Carleton, York, TRIUMF, Toronto