Multithreaded Processors

Multi-Threaded Processor Architectures: Multithreaded Processors • The Tera MTA • Frank Casilio • Computer Engineering • May 15, 1997

Problems with Multiprocessors • Memory Latency • Context Switching Time • Communication/Synchronization Latency • Poor Programming Model • Cache Coherence • Writes To Memory

Motivation • Reduce/Tolerate Memory Latency • General Purpose Machine • Scalability • Shared Memory • Simpler Programming Model

Typical Ways To Reduce Latency • Fast Buses & Networks • Hardware Synchronization • Prefetching • On-Chip Cache • Shortens Round Trip To Memory

Multi-Threading: The Concept • Support For Multiple Concurrent Hardware Contexts • Swap Contexts During Latencies • Tolerates Latency Instead of Reducing It • Experimental Systems Have Existed Since The 50’s • Only 2 Commercial Systems Ever Produced • HEP • Tera MTA

Parameters That Affect Efficiency • Number Of Contexts Supported • Switching Overhead • Run Length (Granularity) • Average Latency To Be Hidden
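The four parameters above can be tied together with a simple analytic model. This is a minimal sketch, not something taken from the slides: it assumes every context runs for a fixed run length, pays a fixed switching cost, and then waits out a fixed latency. The function names and the example numbers (a 69-cycle latency, 35 or 128 contexts) are hypothetical.

```python
import math


def utilization(contexts, run_length, switch_overhead, latency):
    """Fraction of cycles doing useful work under a simple analytic model.

    Assumption (not from the slides): each context runs for `run_length`
    cycles, costs `switch_overhead` cycles to swap out, then waits
    `latency` cycles before it can run again.
    """
    if contexts < 1 or run_length <= 0:
        raise ValueError("need at least one context and a positive run length")
    # With few contexts the processor idles part of the time (linear region).
    linear = contexts * run_length / (run_length + switch_overhead + latency)
    # With enough contexts only the switching overhead is lost (saturation).
    saturated = run_length / (run_length + switch_overhead)
    return min(linear, saturated)


def contexts_to_saturate(run_length, switch_overhead, latency):
    """Smallest number of contexts that fully hides `latency` in this model."""
    busy = run_length + switch_overhead
    return math.ceil((busy + latency) / busy)


# Hypothetical numbers: 1-cycle run length, free switching, 69-cycle latency.
half_hidden = utilization(35, 1, 0, 69)      # 35 of the 70 contexts needed
fully_hidden = utilization(128, 1, 0, 69)    # more contexts than needed
needed = contexts_to_saturate(1, 0, 69)
```

In this model, adding contexts helps only up to the saturation point; beyond it, utilization is limited by switching overhead alone.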

Switching Theory • Determines How Often Contexts Switch • Directly Related to Cost • Two Different Types • Fine Grained • Coarse Grained

Fine Grained Switching • Switches Contexts Every Cycle • Many Long Latency Operations Tolerated • Requires More Contexts • Workload Requirements • Can Simplify Overall Processor Complexity
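Why fine grained switching "requires more contexts" can be seen in a small cycle-level simulation. This is a sketch under stated assumptions rather than a model of any real machine: switching is free, every instruction is a memory operation, and a stream that issues at cycle c cannot issue again until cycle c + latency. The function name and the stream counts used below are hypothetical.

```python
def simulate_fine_grained(num_streams, latency, cycles):
    """Round-robin issue of at most one instruction per cycle.

    Assumptions (illustrative only): zero switching cost, and every
    instruction blocks its stream for `latency` cycles after issue.
    Returns the fraction of cycles in which some stream issued.
    """
    ready_at = [0] * num_streams  # first cycle at which each stream may issue
    next_stream = 0               # round-robin starting point
    issued = 0
    for cycle in range(cycles):
        for offset in range(num_streams):
            s = (next_stream + offset) % num_streams
            if ready_at[s] <= cycle:
                ready_at[s] = cycle + latency  # stream now waits on memory
                next_stream = (s + 1) % num_streams
                issued += 1
                break  # only one issue slot per cycle
    return issued / cycles


# With an 8-cycle latency, 4 streams keep the pipeline half busy,
# while 8 streams keep it busy every cycle.
half_busy = simulate_fine_grained(num_streams=4, latency=8, cycles=80)
fully_busy = simulate_fine_grained(num_streams=8, latency=8, cycles=80)
```

The simulation shows the trade-off in the bullets: the processor itself stays simple (no stall logic beyond a ready check), but the workload must supply at least as many runnable streams as the latency is long.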

Coarse Grained Switching • Switches Contexts After A Couple Of Cycles • Has Problems With Sporadic Latencies • Requires Fewer Contexts • Requires More Complex Processors

The Tera MTA • First Commercial Multithreaded Machine Since 1978 • Uniform Shared Memory • Fine Grained Architecture • Scalable • Direct Relationship b/w PE’s & Throughput

The Tera MTA Cont’d • Toroidal Interconnection • 16-256 Processor Versions • 12 Million Dollar Base System

Processor Characteristics • Support For 128 Threads • 16 Protection Domains • 0 Context Switching Overhead!!! • 1 GFLOP Peak Performance • 333 MHz Nominal Speed

Processor Characteristics Cont’d • 3 Operations Per Instruction: 1 Memory Reference, 1 Arithmetic Operation, 1 Control (i.e., Branch) • 31 64-bit GPR’s • 6KW Of Power Dissipation Per Processor • Load-Store Architecture • 3 Addressing Modes

Interconnection Network • 3-D Torus Contains 3p/2 Nodes • Packet Switching • 3 Cycles of Latency Per Node • Messages Are Assigned Random Priorities • 2 HIPPI Channels / Processor For Net Connection • 164 Bit Packets • 64 Bits Are Data • 2.67 GB/s Bandwidth In Each Direction
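The torus figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below is illustrative only: the 4 x 4 x 4 dimensions are hypothetical (the slide gives only the node count 3p/2), and charging 3 cycles per hop is one assumed way of reading "3 cycles of latency per node". Contention and random priorities are ignored.

```python
CYCLES_PER_NODE = 3   # from the slide
PACKET_BITS = 164     # from the slide
DATA_BITS = 64        # from the slide


def torus_hops(src, dst, dims):
    """Minimum hop count between two nodes of a torus with wraparound links."""
    hops = 0
    for s, d, n in zip(src, dst, dims):
        delta = abs(s - d)
        hops += min(delta, n - delta)  # go whichever way around is shorter
    return hops


def network_latency_cycles(src, dst, dims):
    """Uncontended latency, assuming 3 cycles are charged per hop."""
    return CYCLES_PER_NODE * torus_hops(src, dst, dims)


def payload_fraction():
    """Share of each packet that carries data rather than header bits."""
    return DATA_BITS / PACKET_BITS


dims = (4, 4, 4)  # hypothetical torus size, chosen only for illustration
near = network_latency_cycles((0, 0, 0), (3, 3, 3), dims)  # wraparound helps
far = network_latency_cycles((0, 0, 0), (2, 2, 2), dims)   # worst case here
```

Because of the wraparound links, the node at (3, 3, 3) is only three hops from the origin, while (2, 2, 2) is the farthest point in this hypothetical 4 x 4 x 4 layout.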

Memory • Either 2p or 4p Units, Interleaved 64 Ways • 8, 16, 32 and 64 Bit Addressable • 4 Bits per Word Of Access State For Synchronization • Memory Units Equipped With Error Correcting Code • Memory Usage Is Random To All Banks • 16 MB DRAM Chips
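Per-word access state can be used for synchronization in the style of a full/empty flag. The slide does not spell out what the 4 bits are, so the sketch below assumes one of them behaves as a full/empty bit and models that behavior in software with a condition variable. The class `SyncWord`, its method names, and `demo` are hypothetical and are not Tera's interface.

```python
import threading


class SyncWord:
    """One memory word guarded by a full/empty flag (software model only)."""

    def __init__(self):
        self._value = None
        self._full = False  # assumed meaning of one access-state bit
        self._cond = threading.Condition()

    def write_when_empty(self, value):
        """Block until the word is empty, store the value, mark it full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        """Block until the word is full, take the value, mark it empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def demo(n=5):
    """Pass n values from a producer thread to the caller through one word."""
    word = SyncWord()

    def produce():
        for i in range(n):
            word.write_when_empty(i)

    producer = threading.Thread(target=produce)
    producer.start()
    received = [word.read_when_full() for _ in range(n)]
    producer.join()
    return received
```

Calling `demo()` hands the values 0 through 4 from producer to consumer in order, with no lock other than the word's own state. In hardware the waiting would be done by the memory system rather than by a condition variable.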

Input / Output • 20p MB/s In Each Direction • At Least p/16 Disk Arrays Are Required • System Capacity of 300p GB • Maximum Strategy Gen5 XL RAID • Sustained Bandwidth of 130 MB/s
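The I/O figures scale with the processor count p, so they can be tabulated for any configuration. A small sketch follows; the function name and dictionary keys are hypothetical, and rounding p/16 up to a whole disk array is an assumption.

```python
import math


def io_sizing(p):
    """I/O figures for a p-processor system, using the per-p slide numbers."""
    return {
        # "At least p/16" is rounded up to whole arrays (assumption).
        "min_disk_arrays": math.ceil(p / 16),
        "io_bandwidth_mb_s_each_direction": 20 * p,
        "capacity_gb": 300 * p,
    }


smallest = io_sizing(16)   # smallest processor count listed in the slides
largest = io_sizing(256)   # largest processor count listed in the slides
```

For the 16-processor system this gives 1 disk array, 320 MB/s in each direction, and 4800 GB; for 256 processors it gives 16 arrays, 5120 MB/s, and 76800 GB.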

Operating System • Allows Systems To Run p Tasks Truly Parallel • Streams Are Dynamically Created w/o OS Intervention • Processes Are Broken Up Into Tasks By OS • Distributed Parallel Version Of Unix • Highly Concurrent Version Of Berkeley • Two Tier Scheduler Provides Better Resource Allocation • PL Scheduler • PB Scheduler

Software / Languages • Implicit And Explicit Parallelism Is Allowed • High Degree of Cray Compatibility • Easy To Program b/c Of Architecture • Automatic Parallelization Of C, C++ & Fortran By The Compiler

System Performance • 3.84-12.8 Times Performance Of Cray T90/32 • 1K x 1K Matrix Multiply in 50 ms • Integer Sort of 100M Keys in 36 ms

Conclusion • Proven Effectiveness • Logical Step For Multiprocessor Computers • Still Very Pricey • Allow General Purpose Workload • Scalable • Shared Memory

Questions?

Instruction Pipeline

Breakdown Of A Task • [Figure: hierarchy of one Task, four Teams, and eight VPs]

Deciding The Number Of Contexts
