CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger

Allan Snavely
San Diego Supercomputer Center
June 19, 1998

Background
• CRAY vector computers have been the workhorses of scientific computing for more than two decades.
• CRAY PVPs have been 'effort/performance' leaders thanks to vector processors, flat shared memory, and great tools.
• Vector machines are still very popular in terms of number of users and available scientific applications software.
• NPACI currently offers T916/14, J98/5, J916/16.
• There is lots of legacy vector code, much of which will never see an MPI_Send call.
• T90s are the last in the line of CRAY PVP computers.

More Background
• Tera has developed a revolutionary new architecture for parallel computing, the MTA (MultiThreaded Architecture), with a programming model as simple as the PVP model.
• The MTA can exploit more levels of parallelism than the T90.
• The first Tera machine was delivered to SDSC in November 1997 with a single 145 MHz processor (< 1/2 final speed).
• Tera delivered a two-processor system to SDSC in early 1998 with two 255 MHz processors (still not final) and a network board (not final, either), but no UNIX.

Caveats, Disclaimers, and Excuses
• MTA software is still being debugged.
• Processors are not running at full speed:
  - theoretical peak is 765 Mflops/CPU (255 MHz), but will rise to 0.9-1.0 Gflops
• Interconnect is not up to specification:
  - memory-intensive codes cannot speed up by more than a factor of 1.75 until new network boards are installed
• All of the above are improving daily and are production issues, not research issues.
• We have had 2 processors running and a stable OS (but not UNIX yet) for only a few weeks. Machine time is shared with Tera.
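As a quick check on the interconnect caveat, a speedup ceiling of 1.75 on the current two-processor system corresponds to a parallel efficiency of at most 87.5% for memory-intensive codes. The sketch below works that out. The function name is illustrative, and reading the 1.75 limit as applying to the two-processor configuration is an assumption based on the system described here.

```python
# Illustrative arithmetic only; `parallel_efficiency` is a hypothetical helper.

def parallel_efficiency(speedup, processors):
    """Fraction of ideal (linear) speedup actually achieved."""
    return speedup / processors

SPEEDUP_CAP = 1.75   # limit quoted for memory-intensive codes
PROCESSORS = 2       # ASSUMPTION: the cap refers to the current 2-CPU system

efficiency_cap = parallel_efficiency(SPEEDUP_CAP, PROCESSORS)
print(f"Best-case parallel efficiency: {efficiency_cap:.1%}")
```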

T90/MTA Hardware Comparison

CRAY T90
• 440 MHz frequency
• 8 128-element vector registers/CPU
• Dual vector pipes into FUs
• Pipelined ADD and MULT units
• Can execute 4 flops/cycle (commonly 2)
• Flat shared memory
• DRAM, high bandwidth, low latency
• Can issue 2 loads + 1 store / cycle
• Peak 1.76 Gflops / CPU
• Practical peak of 1 Gflops
• Currently observe 400-800 Mflops in 'good' user codes

Tera MTA-1
• 300+ MHz clock (255 MHz now)
• 128 Streams (HW for threads)/CPU
• Effective depth of pipeline is 21
• Additional FMA unit
• Can execute 3 flops/cycle (commonly 2)
• Flat shared memory
• SRAM, moderate latency, moderate bandwidth
• Can issue 1 memory ref / cycle
• Peak 0.9+ Gflops / CPU
• Practical peak of 600 Mflops
• Tera expects sustained 30-60% of peak in 'good' user codes
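The peak figures in the comparison follow from clock rate times flops per cycle. The sketch below reproduces that arithmetic and the sustained range Tera expects. The helper names are illustrative. The stream-utilization model (each stream issues at most once per 21 cycles, so roughly 21 ready streams keep a processor busy) is a simplifying assumption inferred from the stated pipeline depth, not a Tera specification.

```python
# Back-of-envelope check of the rates quoted in the hardware comparison.
# Helper names are hypothetical; the input numbers come from the slides.

def peak_mflops(clock_mhz, flops_per_cycle):
    """Theoretical peak in Mflops = clock (MHz) * flops issued per cycle."""
    return clock_mhz * flops_per_cycle

def issue_utilization(ready_streams, pipeline_depth=21):
    """ASSUMED model: a stream issues at most once per `pipeline_depth`
    cycles, so utilization grows linearly until that many streams are ready."""
    return min(1.0, ready_streams / pipeline_depth)

t90_peak = peak_mflops(440, 4)    # T90: 440 MHz, 4 flops/cycle
mta_now = peak_mflops(255, 3)     # MTA today: 255 MHz, 3 flops/cycle
mta_final = peak_mflops(300, 3)   # MTA at the 300 MHz target clock

# Tera's expected sustained range: 30-60% of peak (integer math, in Mflops).
sustained_low = mta_final * 30 // 100
sustained_high = mta_final * 60 // 100

print(f"T90 peak:        {t90_peak} Mflops")
print(f"MTA peak (now):  {mta_now} Mflops")
print(f"MTA peak (300):  {mta_final} Mflops")
print(f"MTA sustained:   {sustained_low}-{sustained_high} Mflops expected")
print(f"Utilization with 7 streams:   {issue_utilization(7):.2f}")
print(f"Utilization with 128 streams: {issue_utilization(128):.2f}")
```

Under this model a single thread of control uses only about 1/21 of the issue slots, which is why the MTA depends on many concurrent streams rather than on vector length.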

NAS 2.3-Serial Benchmarks
• NAS Parallel Benchmarks version 2.3
  - Level 2 benchmarks are not pencil-and-paper; they must be executed as is or with minimal tuning
  - Written using MPI for distributed-memory, RISC-based machines
• NAS 2.3-Serial
  - 'Reverse-engineered' from NPB 2.3; MPI versions were 'serialized'
  - Not necessarily optimal for vector or multithreaded platforms 'as is'

NAS 2.3-Serial Benchmarks Results

Applications Performance: Disclaimer
• The MTA wasn't available long enough to port and tune many applications.
• 2 processors weren't available long enough to obtain many multiprocessor results.
• Most of the tuning effort was performed by Tera staff.
• The applications selected were not chosen for superior T90 performance:
  - LCPFCT performs very well on T90
  - AMBER performs fairly well on T90
  - LS-DYNA3D performs less well on T90 for many interesting cases

LCPFCT Performance Comparison

AMBER Performance Comparison

LS-DYNA3D Comparison

Conclusions
• T90 multitasking doesn't allow the user fine control over load balancing.
• Porting T90 codes to the MTA is easy.
• Tuning on both platforms is facilitated by excellent compilers and simple programming models.
• The MTA can exploit the same parallelism in a problem that the T90 can, and it can also exploit levels that the T90 doesn't.
• The MTA is likely to give good performance and scalability on most T90 codes.
• The T90 is still the world's fastest vector machine, but the MTA may outperform it across a wider spectrum of problems: those that use vectors but also have more potential outer-loop and higher-level parallelism.

Future MTA Hardware Plans
• 4-processor network to be delivered soon (July?)
• 2 more processors delivered shortly thereafter (August?)
  - With each processor comes one or two 1 GB memory modules (not associated directly with the processor; that is just how the network is built)
• UNIX will be completed by end of summer (Aug-Sept?)
• Pending results of evaluations, increase size to 8 (end of year?), then 16 (next year)
• Fortran 90, OpenMP, other tools on the way...

Future Work
• SC98:
  - updated NAS benchmarks ('final' processors, network)
  - multiprocessor benchmarks
  - applications as well as kernels
• Applications Porting and Tuning:
  - More work on AMBER, LS-DYNA3D
  - Port GAMESS, MPIRE, OVERFLOW
  - Port other vendor and research codes
• Suggestions? (allans@sdsc.edu)
