
CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger

Allan Snavely, San Diego Supercomputer Center. June 19, 1998.


  1. CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger. Allan Snavely, San Diego Supercomputer Center, June 19, 1998

  2. Background
     • CRAY vector computers have been the workhorses of scientific computing for over 2 decades.
     • CRAY PVPs have been ‘effort/performance’ leaders due to vector processors, flat shared memory, and great tools.
     • Vector machines are still very popular in terms of number of users and available scientific applications software.
     • NPACI currently offers T916/14, J98/5, J916/16.
     • There is lots of legacy vector code, much of which will never see an MPI_Send call.
     • T90s are the last in the line of CRAY PVP computers.

  3. More Background
     • Tera has developed a revolutionary new architecture, the MTA, for parallel computing, with a programming model as simple as the PVP model.
     • The MTA can exploit more levels of parallelism than the T90.
     • The first Tera machine (MTA, for MultiThreaded Architecture) was delivered to SDSC in November 1997 with a single 145 MHz processor (< 1/2 final speed).
     • Tera delivered a two-processor system to SDSC in early 1998 with two 255 MHz processors (still not final) and a network board (not final, either), but no UNIX.

  4. Caveats, Disclaimers, and Excuses
     • MTA software is still being debugged.
     • Processors are not running at full speed:
         • theoretical peak is 765 Mflops/CPU (255 MHz), but will rise to 0.9-1.0 Gflops
     • Interconnect is not up to specification:
         • memory-intensive codes cannot speed up by more than a factor of 1.75 until new network boards are installed
     • All of the above are improving daily and are production issues, not research issues.
     • We have had 2 processors running and a stable OS (but not UNIX yet) for only a few weeks. Machine time is shared with Tera.
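The peak and speedup figures on this slide follow from simple arithmetic, sketched below. The function names are illustrative, not from the talk. The value of 3 flops/cycle for the MTA comes from the hardware comparison slide, and the 300 MHz target clock is an assumption taken from that slide's "300+ MHz" figure.

```python
def peak_mflops(clock_mhz, flops_per_cycle):
    """Theoretical peak in Mflops: clock rate in MHz times flops issued per cycle."""
    return clock_mhz * flops_per_cycle


def parallel_efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors


MTA_FLOPS_PER_CYCLE = 3  # from the hardware comparison slide

# Current processors: 255 MHz x 3 flops/cycle = 765 Mflops, matching the slide.
print("MTA peak at 255 MHz:", peak_mflops(255, MTA_FLOPS_PER_CYCLE), "Mflops")

# Assumed target clock of 300 MHz gives the low end of the quoted 0.9-1.0 Gflops.
print("MTA peak at 300 MHz:", peak_mflops(300, MTA_FLOPS_PER_CYCLE), "Mflops")

# A speedup cap of 1.75 on 2 processors corresponds to 87.5% parallel efficiency.
print("Efficiency at the 1.75 cap:", parallel_efficiency(1.75, 2))
```

In other words, the interim network boards limit memory-intensive codes to at most 87.5% of ideal two-processor scaling.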

  5. T90/MTA Hardware Comparison
     CRAY T90:
     • 440 MHz frequency
     • 8 128-element vector registers/CPU
     • Dual vector pipes into FUs
     • Pipelined ADD and MULT units
     • Can execute 4 flops/cycle (commonly 2)
     • Flat shared memory
     • DRAM, high bandwidth, low latency
     • Can issue 2 loads + 1 store / cycle
     • Peak 1.76 Gflops / CPU
     • Practical peak of 1 Gflops
     • Currently observe 400-800 Mflops in 'good' user codes
     Tera MTA-1:
     • 300+ MHz clock (255 MHz now)
     • 128 streams (HW for threads)/CPU
     • Effective depth of pipeline is 21
     • Additional FMA unit
     • Can execute 3 flops/cycle (commonly 2)
     • Flat shared memory
     • SRAM, moderate latency, moderate bandwidth
     • Can issue 1 memory ref / cycle
     • Peak 0.9+ Gflops / CPU
     • Practical peak of 600 Mflops
     • Tera expects sustained 30-60% of peak in 'good' user codes
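Two relationships in this comparison can be made concrete with a rough sketch. The first expresses observed performance as a fraction of peak. The second estimates how many active streams are needed to keep the MTA's issue slot busy, under the simplifying assumption (not stated on the slide) that each stream has at most one instruction in the 21-deep pipeline at a time. Memory latency is not modeled, and it would raise the number of streams needed in practice. All function names are illustrative.

```python
import math


def fraction_of_peak(sustained_mflops, peak_mflops):
    """Sustained performance as a fraction of theoretical peak."""
    return sustained_mflops / peak_mflops


def min_streams_to_saturate(pipeline_depth, in_flight_per_stream=1):
    """Streams needed to issue every cycle.

    Assumes each stream keeps at most `in_flight_per_stream` instructions
    in the pipeline at once (an assumption of this sketch).
    """
    return math.ceil(pipeline_depth / in_flight_per_stream)


def issue_utilization(active_streams, pipeline_depth):
    """Fraction of cycles with an instruction issued, under the same assumption."""
    return min(1.0, active_streams / pipeline_depth)


# T90: 400-800 Mflops observed against a 1760 Mflops peak (440 MHz x 4 flops/cycle).
t90_low = fraction_of_peak(400, 1760)
t90_high = fraction_of_peak(800, 1760)
print(f"T90 'good' codes: {t90_low:.0%} to {t90_high:.0%} of peak")

# MTA: Tera's expected 30-60% of a 900 Mflops peak, in Mflops.
print(f"MTA expectation: {0.30 * 900:.0f} to {0.60 * 900:.0f} Mflops")

# With a pipeline depth of 21, a single stream fills about 1/21 of the issue slots.
print("Streams to saturate:", min_streams_to_saturate(21))
print(f"Utilization with 1 stream: {issue_utilization(1, 21):.1%}")
print(f"Utilization with 128 streams: {issue_utilization(128, 21):.0%}")
```

Under these assumptions a single-threaded code uses only a small fraction of an MTA processor, which is why the 128 hardware streams per CPU matter.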

  6. NAS 2.3-Serial Benchmarks
     • NAS Parallel Benchmarks version 2.3
         • Level 2 benchmarks are not pencil-and-paper; they must be executed as is or with minimal tuning
         • Written using MPI for distributed-memory, RISC-based machines
     • NAS 2.3-Serial
         • ‘Reverse-engineered’ from NPB 2.3; MPI versions were ‘serialized’
         • Not necessarily optimal for vector or multithreaded platforms ‘as is’

  7. NAS 2.3-Serial Benchmarks Results

  8. Applications Performance: Disclaimer
     • The MTA wasn’t available long enough to port and tune many applications.
     • 2 processors weren’t available long enough to obtain many multiprocessor results.
     • Most of the tuning effort was performed by Tera staff.
     • Applications selected were not chosen for superior T90 performance:
         • LCPFCT performs very well on T90
         • AMBER performs fairly well on T90
         • LS-DYNA3D performs less well on T90 for many interesting cases

  9. LCPFCT Performance Comparison

  10. AMBER Performance Comparison

  11. LS-DYNA3D Comparison

  12. Conclusions
     • T90 multitasking doesn't allow the user fine control over load balancing.
     • Porting T90 codes to the MTA is easy.
     • Tuning on both platforms is facilitated by excellent compilers and simple programming models.
     • The MTA can exploit the same parallelism in a problem that the T90 can, and can also exploit levels that the T90 doesn’t.
     • The MTA is likely to give good performance and scalability on most T90 codes.
     • The T90 is still the world's fastest vector machine, but the MTA may outperform it across a wider spectrum of problems: those that use vectors but also offer outer-loop and higher-level parallelism.
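The distinction between inner-loop (vector) parallelism and outer-loop parallelism can be illustrated with a small sketch. It is not taken from the talk, and Python threads are only a stand-in for hardware streams, so it shows the structure of the parallelism rather than any performance behavior. The inner loop carries a dependence from one iteration to the next, so it does not vectorize in this straightforward form, while the outer iterations over rows are independent and can run concurrently. The names `running_sum` and `running_sums_outer_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def running_sum(row):
    """Inner loop with a loop-carried dependence: each output needs the previous total."""
    total = 0
    out = []
    for x in row:
        total += x
        out.append(total)
    return out


def running_sums_outer_parallel(rows, workers=4):
    """Outer loop over independent rows, distributed across worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order, so results line up with rows.
        return list(pool.map(running_sum, rows))


rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(running_sums_outer_parallel(rows))  # [[1, 3, 6], [4, 9, 15], [7, 15, 24]]
```

A loop nest of this shape is the kind of case the last bullet refers to: the parallelism sits in the outer loop rather than in the inner one.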

  13. Future MTA Hardware Plans
     • 4-processor network to be delivered soon (July?)
     • 2 more processors delivered shortly thereafter (August?)
         [With each processor comes one or two 1 GB memory modules (not associated directly with the processor, just how the network is built)]
     • UNIX will be completed by end of summer (Aug-Sept?)
     • Pending results of evaluations, increase size to 8 (end of year?), then 16 (next year)
     • Fortran 90, OpenMP, other tools on the way...

  14. Future Work
     • SC98:
         • updated NAS benchmarks (‘final’ processors, network)
         • multiprocessor benchmarks
         • applications as well as kernels
     • Applications Porting and Tuning:
         • More work on AMBER, LS-DYNA3D
         • Port GAMESS, MPIRE, OVERFLOW
         • Port other vendor and research codes
     • Suggestions? (allans@sdsc.edu)
