Multi-Threaded Processor Architectures: Multithreaded Processors • The Tera MTA • Frank Casilio • Computer Engineering • May 15, 1997
Problems with Multiprocessors • Cache Coherence • Writes To Memory • Memory Latency • Context Switching Time • Communication/Synchronization Latency • Poor Programming Model
Motivation • Reduce/Tolerate Memory Latency • General Purpose Machine • Scalability • Shared Memory • Simpler Programming Model
Typical Ways To Reduce Latency • On-Chip Cache • Shortens Round Trip To Memory • Fast Buses & Networks • Hardware Synchronization • Prefetching
Multi-Threading: The Concept • Experimental Systems Have Existed Since The 1950s • Only 2 Commercial Systems Ever Produced • HEP • Tera MTA • Support For Multiple Concurrent Hardware Contexts • Swap Contexts During Latencies • Tolerates Latency Instead of Reducing It
Parameters That Affect Efficiency • Number Of Contexts Supported • Switching Overhead • Run Length (Granularity) • Average Latency To Be Hidden
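The four parameters on this slide can be combined in a simplified deterministic model of processor utilization. This is a sketch of a common textbook-style model, not something stated on the slides; the function name and the parameter names (`contexts`, `run_length`, `switch_cost`, `latency`) are illustrative assumptions. Utilization grows linearly with the number of contexts until the latency is fully hidden, then saturates at a level set only by the switching overhead.

```python
def utilization(contexts: int, run_length: float,
                switch_cost: float, latency: float) -> float:
    """Fraction of cycles doing useful work (simplified, deterministic model).

    Assumptions (not from the slides): every context runs for `run_length`
    cycles, pays `switch_cost` cycles to switch out, then waits `latency`
    cycles before it can run again.
    """
    if contexts < 1 or run_length <= 0 or switch_cost < 0 or latency < 0:
        raise ValueError("contexts >= 1, run_length > 0, costs >= 0 required")

    # Linear region: too few contexts, so the processor idles part of the time.
    linear = contexts * run_length / (run_length + switch_cost + latency)

    # Saturation: latency is fully hidden; only switch overhead is lost.
    saturated = run_length / (run_length + switch_cost)

    # The two expressions meet at the saturation point, so min() selects
    # whichever region applies.
    return min(linear, saturated)
```

Under these assumptions, a run length of 1 cycle with zero switch cost needs 128 contexts to hide a 127-cycle latency completely; with fewer contexts utilization falls off in proportion.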
Switching Theory • Two Different Types • Fine Grained • Coarse Grained • Determines How Often Contexts Switch • Directly Related to Cost
Fine Grained Switching • Requires More Contexts • Workload Requirements • Can Simplify Overall Processor Complexity • Switches Contexts Every Cycle • Many Long-Latency Operations Tolerated
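Why fine-grained switching requires many contexts can be seen in a small cycle-by-cycle simulation. The sketch below is a hypothetical model, not a description of any real machine: each thread issues one instruction, then cannot issue again until `latency` further cycles have passed, and switching between threads is free. The function and variable names are assumptions made for illustration.

```python
def simulate_fine_grained(num_threads: int, latency: int,
                          cycles: int = 12_800) -> float:
    """Return the fraction of cycles in which some thread issued.

    Hypothetical model: round-robin selection, one instruction per turn,
    zero switch cost, fixed `latency` between a thread's instructions.
    """
    ready_at = [0] * num_threads  # earliest cycle each thread may issue
    next_thread = 0               # where the round-robin scan starts
    issued = 0

    for cycle in range(cycles):
        # Scan threads in round-robin order for one that is ready now.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + 1 + latency
                next_thread = (t + 1) % num_threads
                issued += 1
                break
        # If no thread was ready, this cycle is an idle issue slot.

    return issued / cycles
```

In this model the issue slots stay full only when there are at least `latency + 1` threads; halving the thread count roughly halves utilization.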
Coarse Grained Switching • Switches Contexts After A Couple Of Cycles • Has Problems With Sporadic Latencies • Requires Fewer Contexts • Requires More Complex Processors
The Tera MTA • Scalable • Direct Relationship Between PEs & Throughput • First Commercial Multithreaded Machine Since 1978 • Uniform Shared Memory • Fine Grained Architecture
The Tera MTA Cont’d • Toroidal Interconnection • 16-256 Processor Versions • 12 Million Dollar Base System
Processor Characteristics • Support For 128 Threads • 16 Protection Domains • 0 Context Switching Overhead!!! • 1 GFLOP Peak Performance • 333 MHz Nominal Speed
Processor Characteristics Cont’d • Load-Store Architecture • 3 Addressing Modes • 3 Operations Per Instruction • 1 Memory Reference • 1 Arithmetic Operation • 1 Control (i.e., Branch) • 31 64-bit GPRs • 6 kW Of Power Dissipation Per Processor
Interconnection Network • 164-Bit Packets • 64 Bits Are Data • 2.67 GB/s Bandwidth In Each Direction • 3-D Torus Contains 3p/2 Nodes • Packet Switching • 3 Cycles of Latency Per Node • Messages Are Assigned Random Priorities • 2 HIPPI Channels / Processor For Net Connection
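The network figures on this slide lend themselves to a short worked calculation. The sketch below assumes minimal routing on a torus with wraparound links, one node traversed per hop, and no contention; the torus dimensions passed in are hypothetical, since the slide gives only the total node count (3p/2). The constants are taken from the slide; all function names are illustrative.

```python
PACKET_BITS = 164     # from the slide
DATA_BITS = 64        # from the slide
CYCLES_PER_NODE = 3   # from the slide


def torus_hops(src, dst, dims):
    """Minimum hop count between two nodes of a torus with wraparound links."""
    hops = 0
    for s, d, size in zip(src, dst, dims):
        delta = abs(s - d)
        # Going the other way around the ring may be shorter.
        hops += min(delta, size - delta)
    return hops


def network_latency_cycles(src, dst, dims):
    """Contention-free latency, assuming 3 cycles per node traversed."""
    return CYCLES_PER_NODE * torus_hops(src, dst, dims)


def payload_fraction():
    """Share of each packet that carries data rather than overhead."""
    return DATA_BITS / PACKET_BITS
```

For example, in a hypothetical 4 x 4 x 4 torus the farthest node is 6 hops away, which under these assumptions costs 18 cycles one way. Only about 39% of each packet is data, so the 2.67 GB/s link bandwidth is not all usable payload.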
Memory • Either 2p or 4p Units, Interleaved 64 Ways • 8, 16, 32 and 64 Bit Addressable • 4 Bits per Word Of Access State For Synchronization • Memory Units Equipped With Error Correcting Code • Memory Usage Is Randomized Across All Banks • 16 MB DRAM Chips
Input / Output • Maximum Strategy Gen5 XL RAID • Sustained Bandwidth of 130 MB/s • 20p MB/s In Each Direction • At Least p/16 Disk Arrays Are Required • System Capacity of 300p GB
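The scaling rules quoted on the slides (3p/2 network nodes, 2p or 4p memory units, 20p MB/s of I/O in each direction, at least p/16 disk arrays, 300p GB of capacity) can be tabulated for any processor count p. The sketch below simply evaluates those expressions; the class and field names are illustrative, and the 16 to 256 range check reflects the configurations listed earlier in the deck.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemSize:
    processors: int
    network_nodes: int       # 3p/2
    memory_units_min: int    # 2p
    memory_units_max: int    # 4p
    io_mb_per_s: int         # 20p, in each direction
    min_disk_arrays: int     # p/16, rounded up
    capacity_gb: int         # 300p


def size_system(p: int) -> SystemSize:
    """Evaluate the slides' scaling expressions for p processors."""
    if not 16 <= p <= 256:
        raise ValueError("the slides list 16 to 256 processor versions")
    return SystemSize(
        processors=p,
        network_nodes=(3 * p) // 2,  # assumes p is even
        memory_units_min=2 * p,
        memory_units_max=4 * p,
        io_mb_per_s=20 * p,
        min_disk_arrays=math.ceil(p / 16),
        capacity_gb=300 * p,
    )
```

Calling `size_system(256)` gives the largest listed configuration: 384 network nodes, 512 to 1024 memory units, 5120 MB/s of I/O in each direction, at least 16 disk arrays and 76800 GB of capacity.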
Operating System • Distributed Parallel Version Of Unix • Highly Concurrent Version Of Berkeley • Two Tier Scheduler Provides Better Resource Allocation • PL Scheduler • PB Scheduler • Allows Systems To Run p Tasks Truly In Parallel • Streams Are Dynamically Created Without OS Intervention • Processes Are Broken Up Into Tasks By OS
Software / Languages • Automatic Parallelization Of: • C, C++ & Fortran By The Compiler • Implicit And Explicit Parallelism Is Allowed • High Degree of Cray Compatibility • Easy To Program Because Of Architecture
System Performance • 3.84-12.8 Times Performance Of Cray T90/32 • 1K x 1K Matrix Multiply in 50 ms • Integer Sort of 100M Keys in 36 ms
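The benchmark times on this slide can be converted into rates for comparison with the 1 GFLOP per-processor peak quoted earlier. The sketch below assumes the classical matrix multiply algorithm (2n^3 floating-point operations) and reads "1K" as 1024; neither assumption is stated on the slide, and the slide does not say how many processors were used. The function names are illustrative.

```python
import math


def matmul_gflops(n: int, seconds: float) -> float:
    """Sustained GFLOPS for an n x n multiply, assuming 2 * n**3 operations."""
    return 2 * n ** 3 / seconds / 1e9


def sort_keys_per_second(keys: int, seconds: float) -> float:
    """Sorting throughput in keys per second."""
    return keys / seconds


def min_processors(gflops: float, peak_gflops_per_processor: float = 1.0) -> int:
    """Fewest processors that could reach `gflops` if each ran at peak."""
    return math.ceil(gflops / peak_gflops_per_processor)


rate = matmul_gflops(1024, 0.050)
print(f"matrix multiply: {rate:.1f} GFLOPS, "
      f"needs at least {min_processors(rate)} processors at peak")
print(f"integer sort: {sort_keys_per_second(100_000_000, 0.036):.3g} keys/s")
```

Under these assumptions the matrix multiply figure corresponds to roughly 43 GFLOPS sustained, which implies a configuration of at least 43 processors even at perfect efficiency.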
Conclusion • Proven Effectiveness • Logical Step For Multiprocessor Computers • Still Very Pricey • Allows General Purpose Workloads • Scalable • Shared Memory
Breakdown Of A Task • [Figure: tree diagram showing one Task divided into four Teams, which are divided into eight VPs]