Ultimate Integration


Presentation Transcript


  1. Ultimate Integration • Joseph Lappa, Pittsburgh Supercomputing Center • ESCC/Internet2 Joint Techs Workshop

  2. Agenda • Supercomputing 2004 Conference • Application • Ultimate Integration • Resource Overview • Did it work? • What did we take from it?

  3. Supercomputing 2004 • Annual conference covering supercomputers, storage, and network hardware • The original reason for the application was the Bandwidth Challenge • Didn't apply due to time constraints

  4. Application Requirements • Runs on Lemieux (PSC's supercomputer) • Uses the Application Gateways (AGWs) • Uses the Cisco CRS-1 with 40 Gb/s OC-768 cards (few of which exist) • A single application • Can be used with another demo on the show floor if possible

  5. Ultimate Integration Application • Checkpoint Recovery System • A garden-variety Laplace solver instrumented to save its memory state in checkpoint files • Checkpoints memory to remote network clients • Runs on 34 Lemieux nodes
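The checkpointing idea on this slide can be pictured with the following sketch. It is illustrative Python only, not PSC's actual code: a toy Jacobi/Laplace iteration that ships its in-memory state over TCP to a remote checkpoint receiver every few iterations. The host name, port, grid size, and checkpoint interval are placeholders.

```python
# Illustrative sketch only: a toy Jacobi/Laplace iteration that ships its
# memory state over TCP every few iterations. The host, port, grid size,
# and checkpoint interval are placeholders, not the demo's real values.
import socket
import struct

import numpy as np

CHECKPOINT_HOST = "agw01.example.org"   # hypothetical checkpoint receiver
CHECKPOINT_PORT = 9000                  # hypothetical port
N = 512                                 # toy grid size
CHECKPOINT_EVERY = 100                  # iterations between checkpoints


def checkpoint(grid: np.ndarray, step: int) -> None:
    """Stream the solver's state (the grid) to a remote receiver."""
    payload = grid.tobytes()
    header = struct.pack("!II", step, len(payload))  # step number + byte count
    with socket.create_connection((CHECKPOINT_HOST, CHECKPOINT_PORT)) as sock:
        sock.sendall(header + payload)


def solve(iterations: int = 1000) -> np.ndarray:
    grid = np.zeros((N, N))
    grid[0, :] = 100.0   # fixed boundary condition along one edge
    for step in range(1, iterations + 1):
        # Jacobi update: each interior point becomes the mean of its neighbours.
        grid[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                                   grid[1:-1, :-2] + grid[1:-1, 2:])
        if step % CHECKPOINT_EVERY == 0:
            checkpoint(grid, step)
    return grid
```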

  6. Lemieux TCS System • 750 Compaq AlphaServer ES45 nodes • SMP: four 1 GHz Alpha processors and 4 GB of memory per node • Interconnection: Quadrics cluster interconnect with a shared-memory library

  7. Application Gateways • 750 GigE connections are very expensive • Reuse the Quadrics network to attach cheap Linux boxes with GigE • 15 AGWs: single-processor Xeons, 1 Quadrics card, 2 Intel GigE NICs • Each GigE card maxes out at 990 Mb/s • Only need 30 GigE to fill the link to TeraGrid • Web100 kernel
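A quick back-of-the-envelope check of the sizing above, using only the numbers on this slide (a sketch, not a measurement):

```python
# Back-of-the-envelope sizing using only the figures on this slide.
PER_NIC_MBPS = 990          # measured ceiling of a single GigE card
NICS_PER_AGW = 2            # two Intel GigE cards per gateway
AGWS = 15

streams = AGWS * NICS_PER_AGW
aggregate_gbps = streams * PER_NIC_MBPS / 1000
print(f"{AGWS} AGWs x {NICS_PER_AGW} NICs = {streams} GigE streams")
print(f"aggregate ceiling ~ {aggregate_gbps:.1f} Gb/s")   # about 29.7 Gb/s
# which is why ~30 GigE ports are enough to fill the link to TeraGrid
```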

  8. Application Gateways

  9. Network • Cisco 6509 with Sup720, WS-X6748-SFP, and two WS-X6704-10GE cards • Used four 10GE interfaces • OSPF load balancing was my real worry: >30 GigE streams over 4 links
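The load-balancing concern is easy to picture with a toy model: equal-cost multipath pins each flow to one of the parallel links by hashing its headers, so roughly 30 flows over 4 links can land unevenly and oversubscribe a single 10GE link. The sketch below is illustrative only; the flow hash and addresses are made up and are not the Sup720 or CRS-1 algorithm.

```python
# Toy model of per-flow load balancing: ~32 GigE flows hashed onto 4 x 10GE
# links. The flow hash below is illustrative only, not the router's algorithm.
import zlib
from collections import Counter

LINKS = 4
GBPS_PER_FLOW = 0.99            # each GigE stream carries ~990 Mb/s

# Stand-in flows: (source address, source port) pairs for 32 AGW streams.
flows = [(f"10.0.{agw}.1", 5000 + nic) for agw in range(16) for nic in range(2)]

def link_for(src_ip: str, src_port: int) -> int:
    # Per-flow hash: CRC32 of the header fields modulo the number of links.
    return zlib.crc32(f"{src_ip}:{src_port}".encode()) % LINKS

loads = Counter(link_for(ip, port) for ip, port in flows)
for link in range(LINKS):
    gbps = loads[link] * GBPS_PER_FLOW
    flag = "OVERSUBSCRIBED" if gbps > 10 else "ok"
    print(f"link {link}: {loads[link]:2d} flows, {gbps:4.1f} Gb/s  {flag}")
```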

  10. Network • Cisco CRS-1 • 40 Gb/s per slot, 16 slots • For the demo: two OC-768 cards and two 8-port 10GE cards • Ken Goodwin's and Kevin McGratten's big worry was the OC-768 transport • Running production IOS-XR code • Had problems with tracking hardware • Ran both without two of the switching fabrics, with no effect on traffic

  11. Network • Cisco CRS-1 • One in the Westinghouse machine room, one on the show floor • A forklift was needed to place it • 7 feet tall • 939 lbs empty, 1657 lbs fully loaded

  12. The Magic Box • Stratalight OTS 4040 transponder "compresses" the 40 Gb/s signal to fit into the spectral bandwidth of a traditional 10G wave • http://www.stratalight.com/ • Uses proprietary encoding techniques • The Stratalight transponder was connected to the mux/demux of the 15454 as an alien wavelength

  13. Time Dependencies • OC-768 wasn't worked on until one week before the conference

  14. OC-768

  15. OC-768

  16. OC-768

  17. Where Does the Data Land? • Lustre filesystem • http://www.lustre.org/ • Developed by Cluster File Systems • http://www.clusterfs.com/ • A POSIX-compliant, open-source, parallel file system • Separates metadata and data objects to allow for speed and scaling
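The metadata/data split can be illustrated with a toy model (this is not Lustre code; the stripe size is made up, and only the OST count of 5 comes from the show-floor setup): a metadata server records each file's layout, while the file's bytes are striped across object storage targets that clients could read in parallel.

```python
# Toy illustration of the Lustre idea on this slide (not Lustre code):
# the MDS keeps only file layouts; the bytes are striped across OSTs.
STRIPE_SIZE = 4          # bytes per stripe in this toy; real stripes are ~1 MiB
OSTS = 5                 # the show floor had 5 OSTs

mds = {}                              # filename -> layout (OST per stripe)
osts = [dict() for _ in range(OSTS)]  # (filename, stripe index) -> bytes

def write(name: str, data: bytes) -> None:
    stripes = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
    layout = []
    for idx, stripe in enumerate(stripes):
        ost = idx % OSTS                     # round-robin placement
        osts[ost][(name, idx)] = stripe      # data object lands on an OST
        layout.append(ost)
    mds[name] = layout                       # the MDS stores only the layout

def read(name: str) -> bytes:
    layout = mds[name]                       # one metadata lookup...
    # ...then the stripes could be fetched from all OSTs in parallel.
    return b"".join(osts[ost][(name, idx)] for idx, ost in enumerate(layout))

write("checkpoint.dat", b"abcdefghijklmnopqrstuvwxyz")
assert read("checkpoint.dat") == b"abcdefghijklmnopqrstuvwxyz"
```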

  18. The Show Floor • 8 checkpoint servers with 10GigE and InfiniBand connections • 5 Lustre OSTs connected via InfiniBand with 2 SCSI disk shelves (RAID5) • Lustre metadata server (MDS) connected via InfiniBand

  19. The Show Floor

  20. The Demo

  21. How well did it run? • Laplace solver with Checkpoint Recovery: 31.1 Gb/s using 16 Application Gateways (32 GigE connections) • Only 32 Lemieux nodes were available • iperf: 35 Gb/s using 17 Application Gateways + 3 single-GigE-attached machines • Zero SONET errors reported on the interface • Over 44 TB were transferred
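A quick sanity check of these rates against the 990 Mb/s per-GigE ceiling from slide 7 (arithmetic only, no new data):

```python
# Sanity check of the reported rates against the 990 Mb/s per-GigE ceiling.
PER_NIC_GBPS = 0.990

solver_streams = 16 * 2         # 16 AGWs x 2 GigE connections
iperf_streams = 17 * 2 + 3      # 17 AGWs x 2 GigE + 3 single-GigE hosts

print(f"solver ceiling: {solver_streams * PER_NIC_GBPS:.1f} Gb/s (measured 31.1)")
print(f"iperf ceiling:  {iperf_streams * PER_NIC_GBPS:.1f} Gb/s (measured 35)")
# Both measurements sit close to the aggregate per-NIC limit of the GigE cards.
```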

  22. The Team

  23. Just Demoware? • AGWs: the qsub command now has an AGW option • Can do accounting (and possibly billing): MySQL database with Web100 stats • Validated that the AGW was a cost-effective solution • OC-768 metro can be done by mere mortals
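One way such accounting could be wired up is sketched below. This is an assumption-laden illustration: the table layout, the Web100 counters chosen, and the job and host names are made up, and Python's sqlite3 stands in for the MySQL database mentioned above.

```python
# Sketch of per-job AGW accounting: store a few Web100-style per-connection
# counters keyed by job ID. Table/column names and the counters recorded are
# assumptions; sqlite3 is used here as a stand-in for the MySQL database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE agw_accounting (
        job_id      TEXT,       -- qsub job that reserved the AGW
        agw_host    TEXT,       -- which gateway carried the traffic
        data_bytes  INTEGER,    -- e.g. a Web100 bytes-out counter
        retrans     INTEGER     -- e.g. a Web100 retransmission counter
    )
""")

def record(job_id: str, agw_host: str, data_bytes: int, retrans: int) -> None:
    db.execute("INSERT INTO agw_accounting VALUES (?, ?, ?, ?)",
               (job_id, agw_host, data_bytes, retrans))

record("12345.lemieux", "agw07", 3_200_000_000, 42)   # made-up sample row
total, = db.execute("SELECT SUM(data_bytes) FROM agw_accounting "
                    "WHERE job_id = '12345.lemieux'").fetchone()
print(f"bytes billed to job 12345.lemieux: {total}")
```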

  24. Just Demoware?? • Application receiver • The Laplace solver ran at PSC • The checkpoint receiver program was tested/run at both NCSA and SDSC • Ten IA-64 compute nodes as the receiver • ~10 Gb/s network to network (to /dev/null) • 990 Mb/s × 10 streams
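A minimal sketch of such a network-to-/dev/null receiver, assuming a plain TCP stream per GigE connection (the port and buffer size are placeholders; this is not the actual checkpoint receiver run at NCSA and SDSC):

```python
# Illustrative "/dev/null" receiver: accept one TCP connection per GigE stream
# and discard the bytes as they arrive. The port and buffer size are
# placeholders; this is not the program that ran at NCSA and SDSC.
import socketserver


class DiscardHandler(socketserver.BaseRequestHandler):
    def handle(self) -> None:
        total = 0
        while True:
            chunk = self.request.recv(1 << 20)   # read up to 1 MiB at a time
            if not chunk:                        # sender closed the connection
                break
            total += len(chunk)                  # count the bytes, then drop them
        print(f"connection closed after {total} bytes")


if __name__ == "__main__":
    # A threading server lets several GigE streams land concurrently.
    with socketserver.ThreadingTCPServer(("", 9000), DiscardHandler) as server:
        server.serve_forever()
```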

  25. Thank You
