Processor Architecture
Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
Rajeev Balasubramonian, Naveen Muralimanohar, Karthik Ramani, Venkatanand Venkatachalapathy
University of Utah
Overview/Motivation
• Wire delays hamper performance
• Power is incurred in moving data across the chip
  • 50% of dynamic power goes to interconnect switching (Magen et al., SLIP '04)
  • The MIT Raw processor's on-chip network consumes 36% of total chip power (Wang et al., 2003)
• Abundant metal layers are available to exploit
Wire characteristics
• Wire resistance and capacitance per unit length determine its behavior
  • Increasing width lowers resistance but raises capacitance
  • Increasing spacing lowers coupling capacitance
• Delay ∝ RC, so wide, well-spaced wires are fast
• But fewer such wires fit in a fixed routing area: bandwidth drops
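The width/spacing trade-offs above can be sketched with a simple first-order RC model. The constants below are illustrative placeholders, not values from any particular process technology:

```python
# First-order RC wire model: delay grows with resistance x capacitance.
# rho, c_area, c_fringe are illustrative constants, not process data.

def wire_rc(width, spacing, rho=0.1, c_area=0.2, c_fringe=0.05):
    """Per-unit-length resistance and capacitance of a wire.

    Wider wires lower resistance; wider spacing lowers coupling capacitance.
    """
    r = rho / width                           # R falls as width grows
    c = c_area * width + c_fringe / spacing   # C rises with width, falls with spacing
    return r, c

def wire_delay(width, spacing):
    """Delay is proportional to the RC product."""
    r, c = wire_rc(width, spacing)
    return r * c

# Doubling width and spacing (L-wire style) cuts delay, but only half as
# many wires fit in the same routing area -- the bandwidth cost.
assert wire_delay(2.0, 2.0) < wire_delay(1.0, 1.0)
```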
Design space exploration
• Tuning wire width and spacing
[figure: wire cross-sections at spacing d vs. 2d, annotated with the resistance, capacitance, and bandwidth trends relative to B wires]
Transmission Lines
• Like L wires, extremely low delay
• But constraining implementation requirements:
  • Large width
  • Large spacing between wires
  • Design of sensing circuits
• Implemented in test CMOS chips
Design space exploration
• Tuning repeater size and spacing
  • Traditional wires: large repeaters at optimum spacing → minimum delay
  • Power-optimal wires: smaller repeaters with increased spacing → lower power at higher delay
Design space exploration
• B wires: delay-optimized
• W wires: bandwidth-optimized
• P wires: power-optimized
• PW wires: power- and bandwidth-optimized
• L wires: fast, low bandwidth
Heterogeneous Interconnects
• Inter-cluster global interconnect:
  • 72 B wires
    • Repeaters sized and spaced for optimum delay
  • 18 L wires
    • Wide wires with large spacing
    • Occupy more area
    • Low latency
  • 144 PW wires
    • Poor delay
    • High bandwidth
    • Low power
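A minimal data model of this heterogeneous link, using the wire counts from this slide and the crossbar latencies quoted in the evaluation; the relative-energy numbers are illustrative assumptions, not measured values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WireClass:
    name: str
    width_bits: int    # wires available per link (from the slides)
    latency: int       # cycles to cross the link (4-cluster crossbar numbers)
    rel_energy: float  # energy per bit relative to B wires (assumed, illustrative)

B  = WireClass("B",  72,  2, 1.0)
L  = WireClass("L",  18,  1, 1.6)   # assumption: fat, widely spaced wires cost more energy
PW = WireClass("PW", 144, 3, 0.5)   # assumption: roughly half the energy per bit

def transfer_cycles(wire: WireClass, message_bits: int) -> int:
    """Cycles to push a message across the link, pipelining flits."""
    flits = -(-message_bits // wire.width_bits)  # ceiling division
    return wire.latency + flits - 1

# A 144-bit message: one flit on PW wires, two on B wires, eight on L wires.
assert transfer_cycles(PW, 144) == 3
assert transfer_cycles(B, 144) == 3
assert transfer_cycles(L, 144) == 8
```

Note how the wide PW link matches B-wire transfer time for bulk messages despite its higher per-hop latency, while the narrow L wires only win for short, critical transfers.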
Outline • Overview • Design Space Exploration • Heterogeneous Interconnects • Employing L wires for performance • PW wires: The power optimizers • Evaluation • Results • Conclusion
L1 Cache pipeline (baseline)
• Effective-address transfer from the LSQ to the L1 D-cache: 10 cycles
• Cache access: 5 cycles
• Memory-dependence resolution: 5 cycles
• Data returns at cycle 20
Exploiting L-Wires
• 8-bit LSB transfer from the LSQ on L wires: 5 cycles
• Partial memory-dependence resolution: 3 cycles
• Full effective-address transfer: 10 cycles; cache access: 5 cycles
• Data returns at cycle 14
L wires: Accelerating cache access
• Transmit the LSBs of the effective address on L wires
  • Partial comparison of loads and stores in the LSQ
  • Faster memory disambiguation
  • Introduces some false dependences (< 9%)
• Early indexing of the data and tag RAM arrays
  • The LSBs can prefetch data out of the L1$
  • Reduces load access latency
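The partial comparison above can be sketched as follows: only the low-order address bits, which arrive early on L wires, are compared. Unequal LSBs prove the accesses are independent; equal LSBs only flag a possible conflict, which is where the < 9% false dependences come from:

```python
LSB_BITS = 8                  # low-order address bits sent early on L wires
MASK = (1 << LSB_BITS) - 1

def may_alias(load_addr: int, store_addr: int) -> bool:
    """Partial disambiguation on low-order bits only.

    False  -> provably independent: the load may bypass the store.
    True   -> possible conflict: the load must wait (may be a false dependence).
    """
    return (load_addr & MASK) == (store_addr & MASK)

# Different low bits: the load safely proceeds before the store resolves.
assert not may_alias(0x1234, 0x1278)
# Same low bits but different full addresses: a false dependence --
# the load stalls even though the accesses never actually conflict.
assert may_alias(0x1234, 0x5634)
```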
L wires: Narrow bit-width operands
• Transfer 10-bit integers on L wires
  • Early scheduling of wakeup operations
  • Reduces the branch-mispredict penalty
• A predictor table of 8K two-bit counters
  • Identifies 95% of all narrow bit-width results
  • With 98% accuracy
• Similar narrow-operand handling is implemented in the PowerPC!
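Detecting a narrow result amounts to checking that the value is representable in its low-order bits; a sketch using the 10-bit threshold above (in the design, a two-bit-counter predictor learns this property per instruction, and a check like this one verifies the prediction):

```python
NARROW_BITS = 10  # results this narrow fit on the 18 L wires with room to spare

def is_narrow(value: int, bits: int = NARROW_BITS) -> bool:
    """True if value fits in a signed `bits`-bit integer,
    i.e. it sign-extends correctly from its low `bits` bits."""
    lo = -(1 << (bits - 1))        # -512 for 10 bits
    hi = (1 << (bits - 1)) - 1     #  511 for 10 bits
    return lo <= value <= hi

assert is_narrow(511) and is_narrow(-512)   # extremes of the 10-bit range
assert not is_narrow(512)                   # needs 11 bits: use B wires
```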
PW wires: Power/bandwidth efficient
• Idea: steer non-critical data through the energy-efficient PW interconnect
  • Operand transfers at instruction dispatch
  • Input operands sent to remote register files
    • Latency hidden by the long dispatch-to-issue delay
  • Store data
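The steering idea reduces to a small decision rule: latency-critical narrow traffic takes the L wires, non-critical traffic takes the PW wires, and everything else defaults to B wires. A sketch, where the criticality input and the exact thresholds are assumptions for illustration:

```python
L_WIDTH = 18  # bits available on the fast L wires (from the slides)

def steer(bits: int, critical: bool) -> str:
    """Pick a wire class for a transfer (simplified decision rule).

    - narrow and critical -> fast L wires
    - non-critical        -> low-power PW wires (e.g. dispatch-time operand
                             transfers hidden by the dispatch-to-issue delay,
                             or store data)
    - otherwise           -> baseline B wires
    """
    if critical and bits <= L_WIDTH:
        return "L"
    if not critical:
        return "PW"
    return "B"

assert steer(8, critical=True) == "L"      # LSBs of a load address
assert steer(64, critical=False) == "PW"   # store data, not latency-critical
assert steer(64, critical=True) == "B"     # wide, critical operand
```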
Evaluation methodology
• A dynamically scheduled clustered processor with 4 clusters, modeled in Simplescalar-3.0
• Crossbar interconnect
• Centralized front end, I-cache, D-cache, LSQ, and branch predictor
• Crossbar latencies:
  • B wires: 2 cycles
  • L wires: 1 cycle
  • PW wires: 3 cycles
Evaluation methodology
• A dynamically scheduled 16-cluster processor modeled in Simplescalar-3.0
• Crossbars connected by a ring interconnect
• Ring latencies:
  • B wires: 4 cycles
  • PW wires: 6 cycles
  • L wires: 2 cycles
IPC improvements: L wires
• L wires improve performance by 4% on the four-cluster system and 7.1% on the sixteen-cluster system
Conclusions
• Exposing the wire design space to the architecture
• A case for microarchitectural wire management!
• A low-latency, low-bandwidth network alone improves performance by up to 7%
• ED² improves by about 11% over a baseline processor with a homogeneous interconnect
• But entails hardware complexity
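The ED² metric weights delay quadratically, so an energy saving only pays off if the accompanying slowdown is small. A quick sanity check with illustrative numbers (not the paper's measurements):

```python
def ed2(energy: float, delay: float) -> float:
    """Energy-delay-squared product: lower is better."""
    return energy * delay ** 2

baseline = ed2(energy=1.00, delay=1.00)
# Hypothetical heterogeneous interconnect: 20% energy saving at a 5% slowdown
# (illustrative values chosen for this example, not results from the paper).
hetero = ed2(energy=0.80, delay=1.05)

improvement = 1 - hetero / baseline
assert improvement > 0.10   # the energy saving outweighs the squared slowdown
```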
Future work
• A preliminary evaluation looks promising
• A heterogeneous interconnect entails complexity
• Design of heterogeneous clusters
• Energy-efficient interconnects
Questions and Comments? Thank you!
L wires: Accelerating cache access
• TLB access for page lookup
  • Transmit a few bits of the virtual page number on L wires
  • Prefetch data out of the L1$ and TLB
• 18 L wires: 6 tag bits, 8 L1 index bits, and 4 TLB index bits
Model parameters
• Simplescalar-3.0 with separate integer and floating-point queues
• 32 KB 2-way instruction cache
• 32 KB 4-way data cache
• 128-entry 8-way I- and D-TLBs
Overview/Motivation:
• Three wire implementations are employed in this study
  • B wires: traditional
    • Optimal delay
    • High power consumption
  • L wires:
    • Faster than B wires
    • Lower bandwidth
  • PW wires:
    • Reduced power consumption
    • Higher bandwidth than B wires
    • Increased delay