On-chip Parallelism Alvin R. Lebeck CPS 220/ECE 252
Administrivia
Projects
• Presentations Dec 5 & 7
• Documents ~10 pages
• Good writing is important
• Progress is important
• Final is Dec 11 (7pm to 10pm)
CPS 220
Multithreaded Processors
• Exploit thread-level parallelism to improve performance
• Multiple Program Counters
• Thread
  • independent programs (multiprogramming)
  • threads from same program
Denelcor HEP
• General purpose scientific computer
• Organized as MP
  • up to 16 processors
  • each processor multithreaded
  • up to 128 memory modules
  • up to 4 I/O cache modules
• Three-input switches and chaotic routing
HEP Processor Organization
• Multiple contexts (threads)
  • each has own Program Status Word (PSW)
• PSWs circulate in control loop
  • control and data loops pipelined 8 deep
  • PSW in control loop can circulate no faster than data in data loop
  • PSW at queue head fetches and starts execution of next instruction
• Clock period: 100 ns
• 8 PSWs in control loop => 10 MIPS
• Each thread gets 1/8 the processor
• Maximum performance per thread => 1.25 MIPS
(And they tried to sell it as a supercomputer)
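The throughput figures on this slide follow directly from the clock period and the number of PSWs in the loop. The sketch below reproduces that arithmetic; the function name `hep_throughput` and its parameters are illustrative and not part of the original slides, and it assumes one instruction completes per clock when the loop is full.

```python
def hep_throughput(clock_ns: float, threads_in_loop: int) -> tuple[float, float]:
    """Return (aggregate MIPS, per-thread MIPS) for a HEP-style pipeline.

    Assumption: with the control loop full, one instruction completes
    every clock period, and the threads share the processor equally.
    """
    # 1e9 ns per second / clock_ns gives instructions per second;
    # dividing by 1e6 converts to MIPS, hence the factor of 1000.
    aggregate_mips = 1_000.0 / clock_ns
    per_thread_mips = aggregate_mips / threads_in_loop
    return aggregate_mips, per_thread_mips


total, per_thread = hep_throughput(clock_ns=100.0, threads_in_loop=8)
print(total, per_thread)  # 10.0 1.25
```

A single thread can never exceed 1.25 MIPS here, because its PSW must travel the full 8-deep loop before its next instruction can start.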
Simultaneous Multithreading
• Goal: use hardware resources more efficiently
  • especially for superscalar processors
• Assume 4-issue superscalar
• Alpha 21464
[Figure: issue-slot diagram with labels Horizontal Waste, Vertical Waste, Thread, Instruction]
Operation of Simultaneous Multithreading
• Standard multithreading can reduce vertical waste
• Issue from multiple threads in same clock cycle
• Eliminate both horizontal and vertical waste
• Larger Register Files
[Figure: thread instructions in issue slots, Standard Multithreading vs. Simultaneous Multithreading]
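The difference between the two kinds of waste can be sketched with a toy issue-slot model. Vertical waste counts the slots of cycles in which nothing issues. Horizontal waste counts the unused slots of cycles in which at least one instruction issues. Everything below is a simplified illustration, not the model used in the lecture: the per-cycle "ready instruction" counts are made-up data, ready instructions that do not issue are simply dropped, and standard multithreading is assumed to pick the one thread with the most ready instructions each cycle.

```python
def issue_waste(issued_per_cycle, width=4):
    """Split unused issue slots into (vertical, horizontal) waste."""
    vertical = horizontal = 0
    for issued in issued_per_cycle:
        if issued == 0:
            vertical += width             # whole cycle empty
        else:
            horizontal += width - issued  # partly filled cycle
    return vertical, horizontal


def single_thread(ready, width=4):
    """Issue from one thread only."""
    return [min(r, width) for r in ready]


def standard_mt(threads, width=4):
    """One thread per cycle (assumed: the one with most ready instructions)."""
    return [min(max(cycle), width) for cycle in zip(*threads)]


def smt(threads, width=4):
    """Fill the slots of a cycle from all threads together."""
    return [min(sum(cycle), width) for cycle in zip(*threads)]


# Hypothetical ready-instruction counts per cycle for two threads.
t0 = [2, 0, 1, 0, 3, 0]
t1 = [2, 3, 2, 4, 1, 3]

print(issue_waste(single_thread(t0)))      # (12, 6)
print(issue_waste(standard_mt([t0, t1])))  # (0, 7)
print(issue_waste(smt([t0, t1])))          # (0, 3)
```

In this toy data, standard multithreading removes the vertical waste but leaves the partly filled cycles, while issuing from both threads in the same cycle also recovers most of the horizontal waste.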
Limitations of SuperScalar Architectures
Instruction Fetch
• branch prediction
• alignment of packet of instructions
Dynamic Instruction Issue
• Need to identify ready instructions
• Rename Table
  • No compares
  • Large number of ports (Operands x Width)
• Issue Queue Size
  • n x Q x O x W 1-bit comparators (src and dest)
  • Quadratic increase in queue size with issue width
  • PA-8000: 20% of die area to issue queue (56 instruction window)
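The comparator count n x Q x O x W can be made concrete with a small calculation. The slide does not define n; the sketch below assumes it is the number of bits in a register tag, with Q the queue entries, O the source operands per instruction, and W the issue width. The 7-bit tag, the 2 operands, and the rule that the queue grows in proportion to issue width are illustrative assumptions; only the 56-entry window comes from the PA-8000 bullet.

```python
def wakeup_comparators(tag_bits, queue_entries, operands, issue_width):
    """Count 1-bit comparators as n x Q x O x W (n assumed to be tag bits)."""
    return tag_bits * queue_entries * operands * issue_width


# 56-entry window from the slide; other values are assumed for illustration.
print(wakeup_comparators(tag_bits=7, queue_entries=56, operands=2, issue_width=4))
# 3136

# Assumption: the queue holds 14 entries per unit of issue width.
counts = {
    w: wakeup_comparators(tag_bits=7, queue_entries=14 * w, operands=2, issue_width=w)
    for w in (2, 4, 8)
}
print(counts)  # {2: 784, 4: 3136, 8: 12544}
```

Under that scaling assumption, doubling the issue width quadruples the comparator count, which is one way to read the quadratic growth the slide mentions.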
SuperScalar Limitations (Continued)
Instruction Execute
• Register File
  • more rename registers
  • more access ports
  • complexity quadratic with issue width
• Bypass logic
  • complexity quadratic with issue width
  • wire delays
• Functional Units
  • replicate
  • add ports to data cache (complexity adds to access time)
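The quadratic growth in the register file and bypass logic can be sketched with first-order models. These are assumptions for illustration, not formulas from the slides: each issued instruction is taken to need 2 read ports and 1 write port, register file area is taken to grow with the square of the port count, and bypass paths are counted as sources x width² x result stages.

```python
def regfile_ports(issue_width, reads_per_instr=2, writes_per_instr=1):
    """Ports needed if every issued instruction reads 2 and writes 1 register."""
    return issue_width * (reads_per_instr + writes_per_instr)


def relative_regfile_area(issue_width):
    """Assumed model: area grows with the square of the port count."""
    return regfile_ports(issue_width) ** 2


def bypass_paths(issue_width, result_stages=1, sources_per_instr=2):
    """Assumed model: every result may be forwarded to every source operand."""
    return sources_per_instr * issue_width ** 2 * result_stages


for w in (2, 6):
    print(w, regfile_ports(w), relative_regfile_area(w), bypass_paths(w))
# 2 6 36 8
# 6 18 324 72
```

Going from 2-issue to 6-issue triples the width but multiplies both estimates by 9, which is the kind of cost that several simple processors avoid.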
Why Single Chip MP?
• Technology Push
  • Benefits of wide issue are limited
  • Decentralized microarchitecture: easier to build several simple fast processors than one complex processor
• Application Pull
  • Applications exhibit parallelism at different grains
  • < 10 instructions per cycle (Integer codes)
  • > 40 instructions per cycle (FP loops)
A 6-Way SuperScalar Processor
[Figure: floorplan of a 21 mm x 21 mm die. Blocks: I-Cache (32 KB); D-Cache (32 KB); L2 Cache (256 KB); External Interface; Instruction Fetch; TLB; Instruction Decode & Rename; Reorder Buffer, Instruction Queues, and Out-of-Order Logic; Integer Unit; Floating Point Unit; Clocking & Pads]
A 4 x 2 Single Chip Multiprocessor
[Figure: floorplan of a 21 mm x 21 mm die. Blocks: Processor #1 to #4; Icache 1 to 4; Dcache 1 to 4; L2 Communication Crossbar; L2 Cache (256 KB); External Interface; Clocking & Pads]
Performance Comparison
Summary of Performance
• 4 x 2 MP works well for coarse grain apps
• How well would Message Passing Architecture do?
• Can SUIF handle pointer intensive codes?
• For "tough" codes 6-way does slightly better, but neither is > 60% better than 2-issue