Tera MTA (Multi-Threaded Architecture)

Tera MTA (Multi-Threaded Architecture) • Thriveni Movva (CMPS 5433)

Presentation Contents • Evolution of Tera MTA • Design goals of Tera MTA • Tera MTA Architecture • Interconnection Network • Applications • Advantages & Drawbacks • Current MTA Status

Evolution of Tera MTA • 1987: Tera Computer Company was established by Burton Smith in Washington, USA • 1988: Software development starts • 1991: Hardware development starts • 1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center)

Tera MTA: Design Goals • To solve the two major problems then faced by high-performance parallel computers: • Scalability • Programmability • To be suitable for very high-speed implementations • To be applicable to a wide spectrum of problems • To ease compiler implementation • To overcome the von Neumann bottleneck (the limited throughput between processor and memory)

About Tera MTA • The Tera MTA is a high-performance system with: • scalar multithreaded processors with synchronization among threads • uniform-access shared memory, i.e. all data is accessible with equal ease: no locality, no cache, no mapping • simple programming • zero-cost context switching

About Multi-Threading Architecture (MTA) • Uses a technique called multithreading that lets multiprocessors share memory without using caches • Because each processor in a multithreaded machine always has other threads to run, its processors (potentially thousands of them) stay almost constantly busy instead of waiting on slow memory accesses • Multithreading allows each processor to switch thread contexts between execution cycles, so the processor stays busy • Whenever a processor starts a slow memory or I/O instruction, rather than waiting tens of cycles for the stalled instruction to complete, the processor executes its next instruction from a different thread using different registers • Each processor has many copies of the programming and pipeline control registers, one copy for each execution thread that it can support

Tera MTA Overview • Up to 256 processors, each running at 260 MHz • Up to 128 active threads per processor • Up to 256 I/O processors • Peak performance of 256 GFlop/s • Processors and memory modules populate a sparse 3D torus interconnection network • 4096 interconnection network nodes • Flat, shared main memory ranging from 16 to 512 GB • Cost: $5 million to $40 million

A View of the Tera Multiprocessor

Key Architecture Details • Each MTA processor has 128 "streams", each of which is hardware (including 32 registers and a program counter) devoted to running a single thread of control • The processor executes instructions from the streams that are not blocked, in a fair round-robin fashion • A stream can issue an instruction only once every 21 cycles (the length of the instruction pipeline), so at least 21 ready threads are required to keep a processor fully busy • The processor makes a context switch on each cycle, choosing the next instruction from one of the streams that is ready to execute • A "rich" interconnection network is intended to keep delays caused by references to data in memory hidden, provided enough threads are ready • Randomized memory mapping and a highly connected network provide near-uniform access time from any processor to any memory location
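
The 21-cycle issue rule on this slide can be illustrated with a toy simulation. This is a minimal sketch, not Tera's actual issue logic: it assumes every stream is always ready apart from the 21-cycle spacing between its own instructions, it ignores memory stalls, and the names `PIPELINE_DEPTH` and `utilization` are made up for this example.

```python
from collections import deque

PIPELINE_DEPTH = 21  # cycles between two issues from the same stream (from the slide)


def utilization(num_streams: int, cycles: int = 21_000) -> float:
    """Return the fraction of cycles in which the processor issues an instruction.

    Streams are served in round-robin order. After a stream issues at cycle c,
    it may not issue again before cycle c + PIPELINE_DEPTH.
    """
    # Queue of (earliest cycle at which the stream may issue, stream id).
    # Every stream waits the same amount, so FIFO order is also readiness order.
    queue = deque((0, s) for s in range(num_streams))
    issued = 0
    for cycle in range(cycles):
        if queue and queue[0][0] <= cycle:
            _, stream = queue.popleft()
            queue.append((cycle + PIPELINE_DEPTH, stream))
            issued += 1
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 7, 21, 128):
        print(f"{n:3d} streams -> utilization {utilization(n):.3f}")
```

Under these assumptions utilization rises linearly with the number of ready streams and saturates at 21, which is the slide's point: fewer than 21 ready threads leave issue slots empty.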

Key Architecture Details • Hardware multithreading is used to tolerate high latencies to memory, typically on the order of 150 clock cycles • Expected benefits of the MTA include high processor utilization, near-linear scalability, and reduced programming effort, especially compared to distributed-memory machines using explicit message passing • The current MTA interconnection network is a 3-D toroidal mesh

Tera MTA's Interconnection Network • The interconnection network is a sparsely populated three-dimensional torus of pipelined packet-switching nodes, each of which is linked to some of its neighbors • Each link can transport a packet containing source and destination addresses, an operation, and 64 data bits in both directions simultaneously on every clock tick • Some of the nodes are also linked to resources, i.e., processors, data memory units, I/O processors, and I/O cache units • Instead of locating the processors on one side of the network and the memories on the other, the resources are distributed more or less uniformly throughout the network

Tera MTA's Interconnection Network • The interconnection network of one 256-processor Tera system contains 4096 nodes arranged in a 16 × 16 × 16 toroidal mesh • As the Tera architecture scales to larger numbers of processors p, the number of network nodes grows as p^(3/2) rather than as the p log p associated with the more commonly used multistage networks. For example, a 1024-processor system would have 32,768 nodes
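
The node counts quoted on this slide follow directly from the p^(3/2) rule, and a few lines of arithmetic reproduce them. This is only a sketch: it assumes the processor count is a perfect square, so that the torus is a cube with sqrt(p) nodes per side, and the function names are hypothetical.

```python
import math


def torus_side(processors: int) -> int:
    """Nodes per dimension of the cubic torus, assuming side = sqrt(p)."""
    side = math.isqrt(processors)
    if side * side != processors:
        raise ValueError("this sketch only handles perfect-square processor counts")
    return side


def torus_nodes(processors: int) -> int:
    """Total network nodes: side**3 = p**(3/2)."""
    return torus_side(processors) ** 3


if __name__ == "__main__":
    for p in (256, 1024):
        side = torus_side(p)
        print(f"p = {p}: {side} x {side} x {side} torus, {torus_nodes(p)} nodes")
```

For p = 256 this gives the 16 × 16 × 16 mesh with 4096 nodes, and for p = 1024 it gives 32,768 nodes, matching the figures on the slide.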

Multithreading on one processor (figure; label: unused streams)

Multithreading on multiple processors

Latency Tolerance in Tera MTA • The latency incurred in memory references is hidden by multithreading • With up to 128 instruction streams (threads) per processor, and with each stream able to issue 8 memory references without waiting for the preceding ones, a latency of 128 × 8 = 1024 cycles can be tolerated • This lookahead allows threads to achieve peak performance • Three operations (M, A, C: memory, arithmetic, control) can be executed simultaneously per instruction per processor
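
The 1024-cycle figure is a product of two numbers from this slide, and the same arithmetic can be turned around to ask how many streams a given latency needs. The sketch below is a simplified model, not a description of the real hardware: it assumes one issue per cycle shared round-robin among the streams, treats every instruction as a memory reference, and uses hypothetical function names. The 150-cycle value is the typical latency quoted on the earlier "Key Architecture Details" slide.

```python
import math


def tolerated_latency(streams: int, lookahead: int) -> int:
    """Cycles of memory latency that can be covered.

    Each stream may have `lookahead` references in flight, and a stream gets
    one issue slot every `streams` cycles, so a result is first needed
    streams * lookahead cycles after its reference was issued.
    """
    return streams * lookahead


def streams_needed(latency_cycles: int, lookahead: int) -> int:
    """Smallest stream count whose tolerated latency covers `latency_cycles`."""
    return math.ceil(latency_cycles / lookahead)


if __name__ == "__main__":
    print(tolerated_latency(128, 8))  # the slide's 128 x 8 case
    print(streams_needed(150, 8))     # typical latency, full lookahead
    print(streams_needed(150, 1))     # typical latency, no lookahead
```

In this model the lookahead matters as much as the stream count: without it, hiding a 150-cycle latency would take more streams than one processor provides.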

The Tera Idea: Higher investment in hardware yields improved utilization and reduces software overhead

Tera MTA Applications • PULSE 3D, used for simulating real-time heartbeats to better treat heart diseases. • MSC Software’s NASTRAN, a structural analysis code used extensively by the automobile and aerospace industries. • Livermore Software's LS-DYNA, which can simulate physical occurrences such as car crashes and metal stamping. • GAUSSIAN 98, a computational chemistry application used in molecular modeling. • MPIRE (for Massively Parallel Interactive Rendering Environment), a powerful graphics and animation application that visualizes complex phenomena. • Used in seismic analysis, national security and weather forecasting.

Advantages of Tera MTA • Tera MTA uses multiple contexts to hide latency • Tera machines perform a context switch every clock cycle • Both pipeline latency and memory latency are hidden in the Tera approach • Thread creation is very cheap • With 128 contexts per processor, a large number (2K) of registers must be shared at a fine grain between threads • As long as there is plenty of parallelism in user programs to hide latency, and plenty of compiler support, the performance is potentially very high • The advantages of Tera's architecture are available to users via minimal changes to their application code

Drawbacks of Tera MTA • Performance is poor when parallelism is limited; in particular, single-context performance is guaranteed to be low • A large number of contexts demands many registers and other hardware resources, which in turn implies higher cost and complexity • Because the design puts little emphasis on latency reduction and caching, it needs a lot of slack parallelism to hide latency as well as a lot of memory bandwidth; both raise the cost of building the machine • Bandwidth (not latency) limits practical MTA system size, and large MTA systems will have expensive memory networks

Tera MTA: Tools • Tera provides two powerful tools, Traceview and Canal, that allow the programmer to understand: • how the compiler has multithreaded a program • how effectively the program actually utilizes the hardware

Customers • San Diego Supercomputer Center (SDSC) • Logicon, under a Naval Research Lab • Tera Computer Company

Tera MTA Macro Architecture

Problems Solved Using Tera MTA • Irregular memory access patterns • Synchronization among threads • Load balancing

Current Industry Status: Cray Inc (ex-Tera) • Cray Research: • 1972: Est. by Seymour Cray in Minnesota, USA • 1976: First Cray-1 shipment to Los Alamos • 1980s: Ship follow-on products: Cray XMP, Cray YMP, Cray-2 • 1990s: More follow-on products: Cray C90, Cray J90, Cray T3D, Cray T90, Cray T3E, Cray SV1 • 1996: Merged with Silicon Graphics (SGI) • Tera Computer: • 1987: Est. by Burton Smith in Washington, USA • 1988: Software development starts • 1991: Hardware development starts • 1997: First MTA-1 shipment to SDSC (San Diego Supercomputer Center) • 2000: Purchased Cray business unit from SGI • Cray Inc. (Nasdaq NM: CRAY) • Est.: April 1, 2000 (Tera Computer + Cray Research) • HQ: Seattle, WA, USA • Products: Supercomputers (vector, microprocessor, multithreaded) • Market: Government, industry, academic research

Cray Inc. (2000–present; result of the merger between Tera Computer and Cray Research) • Cray SX-6 • Cray MTA-2 • Cray SV1 • Cray Red Storm • Cray X1 • Cray XD1

Cray MTA-2, Multi-threaded Architecture • 128 virtual processors in a CPU module • Up to 1 TB scalable shared memory • Zero-overhead thread switching

Cray MTA-2 Overview (figure: Cray MTA-2 multithreaded system)

Unique Capability of Cray MTA • Visualization of a nebula using the MPIRE application on a Cray MTA system

