A Distributed Control Path Architecture for VLIW Processors
Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker*
Advanced Computer Architecture Laboratory, University of Michigan
*HP Laboratories
Motivation
• VLIW scaling problems
  • Centralized resources
  • Highly ported structures
  • Wire delays
(Figure: a wide VLIW datapath with a single register file, many FUs, and centralized instruction fetch/decode)
Multicluster VLIW
• Distribute the register files
• Cluster the function units
• Distribute the data caches
• Clusters communicate through an interconnection network
• Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC
(Figure: two clusters, each with its own register file and FUs, joined by an interconnection network but fed by a single instruction fetch/decode unit)
Control Path Scaling Problem
• Larger I-cache
• Latency
• Long wires for control-signal distribution
• Code compression
  • Hardware cost, power
  • Grows quadratically with the number of FUs (illustrated below)
(Figure: a centralized control path: one PC, one I-cache, and an align/shift network expanding compressed instructions into a wide, NOP-padded instruction register)
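As a rough, assumed illustration of that last point (the exact network structure is not spelled out in the talk): if the align/shift network must be able to route any of the n compressed instruction slots to any of the n FU decode slots, its crosspoint count scales with the square of the issue width, so doubling the number of FUs roughly quadruples the alignment hardware:

\[
C(n) \approx n \times n = n^{2}, \qquad \frac{C(2n)}{C(n)} = 4 .
\]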
Straightforward Approach
• Distribute I-fetch in the same spirit as the distributed data path
• Local communication of control signals
• Reduces latency, hardware cost, power
• Used in the Multiflow Trace 14/300 processors
(Figure: each cluster has its own I-cache and instruction register, with the PC replicated across clusters)
DVLIW Approach
• Simple distribution has problems
  • Doesn't support code compression
  • The PC is still a centralized resource
(Figure: DVLIW gives each cluster its own PC (PC0, PC1), I-cache, align/shift logic, and instruction register)
DVLIW Execution Model
• Clusters execute in lock-step (see the sketch below)
  • When one cluster stalls, all clusters stall
• Clusters collectively execute one thread
  • Each cluster runs its own instruction stream
  • The compiler orchestrates the execution of the streams
• The compiler manages communication
  • Lightweight synchronization
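A minimal, hypothetical Python sketch of this lock-step model (the class and field names are illustrative, not from the paper): each cluster fetches from its own instruction stream with its own PC, all clusters advance together, and a stall in any cluster holds every cluster for that cycle.

```python
# Minimal, hypothetical sketch of the DVLIW lock-step model (not the authors' simulator).
class Cluster:
    def __init__(self, name, stream):
        self.name = name
        self.stream = stream   # this cluster's instruction stream (contiguous in memory)
        self.pc = 0            # per-cluster program counter

    def fetch(self):
        return self.stream[self.pc] if self.pc < len(self.stream) else None

    def stalls(self, op):
        # Model a stall (e.g. an I-cache miss) as extra cycles attached to an op.
        if op is not None and op.get("stall", 0) > 0:
            op["stall"] -= 1
            return True
        return False

def run_lockstep(clusters, max_cycles=100):
    for cycle in range(max_cycles):
        ops = [c.fetch() for c in clusters]
        if all(op is None for op in ops):
            break                                     # every stream is finished
        if any(c.stalls(op) for c, op in zip(clusters, ops)):
            continue                                  # one cluster stalls -> all clusters stall
        for c, op in zip(clusters, ops):
            if op is not None:
                print(f"cycle {cycle}: {c.name} executes {op['op']}")
                c.pc += 1                             # each cluster computes its own next PC

run_lockstep([
    Cluster("cluster0", [{"op": "add", "stall": 1}, {"op": "br"}]),
    Cluster("cluster1", [{"op": "mul"}, {"op": "br"}]),
])
```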
DVLIW Benefits
• Completely decentralized architecture
  • Distributed data path
  • Distributed control path
• Supports arbitrary code compression
• Exploits ILP on a multicore-style system
• Good for embedded applications
  • Low cost
  • Compiler support
DVLIW Architecture
(Figure: four VLIW clusters, each with its own register files, FUs, L1 I-cache and L1 D-cache, align/shift logic, instruction register, and PC/next-PC and branch-target logic, backed by a banked L2)
Code Organization
• Code for each cluster is consecutive in memory
• Operations in the same MultiOp are stored at different memory locations
• Each cluster computes its own next PC (see the layout sketch below)
(Figure: conventional VLIW code layout addressed by a single PC vs. DVLIW layout with a separate code region per cluster, addressed by PC0 and PC1)
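A small, hypothetical sketch of the layout idea (the data structures are illustrative, not from the paper): each cluster's operations are laid out contiguously in its own region with no cross-cluster NOP padding, so the same logical block resolves to a different physical address in each cluster, and each cluster advances its own PC through its own region.

```python
# Hypothetical sketch of DVLIW code layout: per-cluster regions, per-cluster label addresses.
# Each logical basic block carries one operation list per cluster.
blocks = {
    "BB1": {"cluster0": ["add", "cmpp", "bcast"], "cluster1": ["mul"]},
    "BB2": {"cluster0": ["st"],                   "cluster1": ["ld", "add"]},
}

def layout(blocks, clusters):
    """Lay each cluster's code out contiguously and record per-cluster label addresses."""
    code = {c: [] for c in clusters}
    labels = {c: {} for c in clusters}
    for name, per_cluster in blocks.items():
        for c in clusters:
            labels[c][name] = len(code[c])   # same logical label, cluster-local address
            code[c].extend(per_cluster[c])
    return code, labels

code, labels = layout(blocks, ["cluster0", "cluster1"])
print(labels)   # {'cluster0': {'BB1': 0, 'BB2': 3}, 'cluster1': {'BB1': 0, 'BB2': 1}}
```

Because the same logical block sits at a different address in each cluster's stream, a branch must name its target per cluster, which is what the unbundled branch on the next slide provides.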
Branch Mechanism
• Maintain correct execution order
  • All clusters transfer control in the same cycle
  • All clusters branch to the same logical MultiOp
• Unbundled branch in HPL-PD
  • PBR btr1, TARGET: each cluster specifies its own target
  • CMPP pr0, (x>100)?: branch predicate, broadcast to all clusters
  • BR btr1, pr0: replicated in each cluster
Branch Handling Example

Conventional VLIW:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  …
  br btr1, pr0

DVLIW, cluster 0:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  bcast pr0
  br btr1, pr0

DVLIW, cluster 1:
  …
  pbr btr1, BB2'
  …
  …
  br btr1, pr0

Cluster 0 computes the predicate and broadcasts it; cluster 1 prepares its own target BB2' and branches on the broadcast predicate in the same cycle (a sketch of this handshake follows).
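Following the example above, a hypothetical sketch of the branch handshake (names and addresses are illustrative): the predicate is computed once and broadcast, every cluster has already prepared its own cluster-local target for the same logical block, and all clusters redirect in the same cycle.

```python
# Hypothetical sketch of the DVLIW branch handshake (illustrative, not the authors' code).
# Per-cluster physical address of the same logical block, e.g. from a layout like the one above.
labels = {"cluster0": {"BB2": 3}, "cluster1": {"BB2": 1}}
pcs = {"cluster0": 0, "cluster1": 0}

def take_branch(pcs, predicate, logical_target, labels):
    """All clusters transfer control in the same cycle; each jumps to its own
    physical address for the same logical target block."""
    if predicate:                                     # computed once (cmpp) and broadcast (bcast)
        for name in pcs:
            pcs[name] = labels[name][logical_target]  # pbr supplied the cluster-local target

x = 200
take_branch(pcs, x > 100, "BB2", labels)
print(pcs)   # {'cluster0': 3, 'cluster1': 1}: same logical branch, different physical PCs
```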
Sleep Mode
• Some clusters have idle blocks after operation distribution
• Put such a cluster into sleep mode (see the sketch below)
  • Compiler managed
  • Saves energy
  • Reduces code size
• Mode changes happen at block boundaries
(Figure: cluster 1 executes a SLEEP op and skips several blocks while cluster 0 keeps branching; a WAKE at a block boundary brings cluster 1 back)
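A minimal, hypothetical sketch of compiler-managed sleep mode (illustrative only, assuming a SLEEP/WAKE decision at each block boundary): a sleeping cluster stops fetching from its own stream, and a wake at a block boundary brings it back in step with the others.

```python
# Hypothetical sketch of compiler-managed sleep mode (illustrative, not the authors' code).
class ClusterState:
    def __init__(self, name):
        self.name = name
        self.asleep = False   # set by a SLEEP op, cleared by a WAKE at a block boundary

def enter_block(cluster, ops_in_block, wake):
    """At a block boundary the compiler may wake a sleeping cluster; a cluster with no
    useful ops in the block sleeps instead of fetching NOP-only code."""
    if wake:
        cluster.asleep = False
    if not ops_in_block:          # idle block after operation distribution
        cluster.asleep = True
    return not cluster.asleep     # True if the cluster fetches during this block

c1 = ClusterState("cluster1")
print(enter_block(c1, ops_in_block=[], wake=False))       # False: cluster sleeps
print(enter_block(c1, ops_in_block=["add"], wake=True))   # True: woken at a block boundary
```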
Experimental Setup
• Trimaran toolset
• Processor configuration
  • 4 clusters; 2 INT, 1 FP, 1 MEM, 1 BR unit per cluster
  • 16K L1 I-cache total
  • Perfect data cache assumed
• Power model
  • Verilog for the instruction align/shift logic
  • Wire model
  • CACTI cache model
• 21 benchmarks from MediaBench and SPECint2000
Change in Global Communication Bits
(Chart: per-benchmark results for MediaBench and SPECint)
Normalized Energy Consumption on the Control Path
• Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy), restated as a formula below
(Chart: per-benchmark normalized energy, with callouts of 40%, 67%, 80%, and 21% savings)
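Restating the slide's decomposition as a formula; the normalization against a conventional, centralized-control-path VLIW baseline is an assumption, consistent with the baseline named on the code-size slide:

\[
E_{\text{control}} = E_{\text{align/shift}} + E_{\text{wire}} + E_{\text{I-cache}},
\qquad
E_{\text{norm}} = \frac{E_{\text{control}}^{\mathrm{DVLIW}}}{E_{\text{control}}^{\mathrm{baseline}}} .
\]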
Normalized Code Size
• Baseline: conventional VLIW with compressed encoding
• Traditional method (single PC): 7x increase
• DVLIW: 40% increase
Result Summary
• DVLIW benefits
  • Order-of-magnitude reduction in global communication
  • 40% savings in control path energy
  • 5x code size reduction vs. simple distribution
• Small overhead for ILP execution on a CMP
  • 3% increase in execution cycles
  • 4% increase in I-cache stalls
Conclusions
• DVLIW removes the last centralized resource in a multicluster VLIW
  • Fully distributed control path
  • Scalable architecture
  • More energy efficient
• A stylized CMP architecture
  • Exploits ILP
  • Multiple instruction streams
  • Compiler orchestrated
Thank You
• For more information: http://cccp.eecs.umich.edu