A Distributed Control Path Architecture for VLIW Processors
Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker*
Advanced Computer Architecture Laboratory, University of Michigan
*HP Laboratories
Motivation
• VLIW scaling problems
  • Centralized resources
  • Highly ported structures
  • Wire delays
(Figure: a wide VLIW datapath with a single register file, many FUs, and centralized instruction fetch/decode)
Multicluster VLIW
• Distribute the register files
• Cluster the function units
• Distribute the data caches
• Clusters communicate through an interconnection network
• Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC
(Figure: two clusters, each with its own register file and FUs, joined by an interconnection network but fed by a single instruction fetch/decode unit)
Control Path Scaling Problem
• Larger I-cache
• Latency
• Long wires for control-signal distribution
• Code compression
  • Hardware cost, power
  • Grows quadratically with the number of FUs (illustrated below)
(Figure: a centralized control path: one PC, one I-cache, and an align/shift network expanding compressed instructions into a wide, NOP-padded instruction register)
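As a rough, assumed illustration of that last point (the exact network structure is not spelled out in the talk): if the align/shift network must be able to route any of the n compressed instruction slots to any of the n FU decode slots, its crosspoint count scales with the square of the issue width, so doubling the number of FUs roughly quadruples the alignment hardware:

\[
C(n) \approx n \times n = n^{2}, \qquad \frac{C(2n)}{C(n)} = 4 .
\]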
Straightforward Approach
• Distribute I-fetch in the same spirit as the distributed data path
• Local communication of control signals
• Reduces latency, hardware cost, power
• Used in the Multiflow Trace 14/300 processors
(Figure: each cluster has its own I-cache and instruction register, with the PC replicated across clusters)
DVLIW Approach
• Simple distribution has problems
  • Doesn't support code compression
  • The PC is still a centralized resource
(Figure: DVLIW gives each cluster its own PC (PC0, PC1), I-cache, align/shift logic, and instruction register)
DVLIW Execution Model
• Clusters execute in lock-step (see the sketch below)
  • When one cluster stalls, all clusters stall
• Clusters collectively execute one thread
  • Each cluster runs its own instruction stream
  • The compiler orchestrates the execution of the streams
• The compiler manages communication
  • Lightweight synchronization
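A minimal, hypothetical Python sketch of this lock-step model (the class and field names are illustrative, not from the paper): each cluster fetches from its own instruction stream with its own PC, all clusters advance together, and a stall in any cluster holds every cluster for that cycle.

```python
# Minimal, hypothetical sketch of the DVLIW lock-step model (not the authors' simulator).
class Cluster:
    def __init__(self, name, stream):
        self.name = name
        self.stream = stream   # this cluster's instruction stream (contiguous in memory)
        self.pc = 0            # per-cluster program counter

    def fetch(self):
        return self.stream[self.pc] if self.pc < len(self.stream) else None

    def stalls(self, op):
        # Model a stall (e.g. an I-cache miss) as extra cycles attached to an op.
        if op is not None and op.get("stall", 0) > 0:
            op["stall"] -= 1
            return True
        return False

def run_lockstep(clusters, max_cycles=100):
    for cycle in range(max_cycles):
        ops = [c.fetch() for c in clusters]
        if all(op is None for op in ops):
            break                                     # every stream is finished
        if any(c.stalls(op) for c, op in zip(clusters, ops)):
            continue                                  # one cluster stalls -> all clusters stall
        for c, op in zip(clusters, ops):
            if op is not None:
                print(f"cycle {cycle}: {c.name} executes {op['op']}")
                c.pc += 1                             # each cluster computes its own next PC

run_lockstep([
    Cluster("cluster0", [{"op": "add", "stall": 1}, {"op": "br"}]),
    Cluster("cluster1", [{"op": "mul"}, {"op": "br"}]),
])
```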
DVLIW Benefits
• Completely decentralized architecture
  • Distributed data path
  • Distributed control path
• Supports arbitrary code compression
• Exploits ILP on a multicore-style system
• Good for embedded applications
  • Low cost
  • Compiler support
DVLIW Architecture
(Figure: four VLIW clusters, each with its own register files, FUs, L1 I-cache and L1 D-cache, align/shift logic, instruction register, and PC/next-PC and branch-target logic, backed by a banked L2)
Code Organization
• Code for each cluster is consecutive in memory
• Operations in the same MultiOp are stored at different memory locations
• Each cluster computes its own next PC (see the layout sketch below)
(Figure: conventional VLIW code layout addressed by a single PC vs. DVLIW layout with a separate code region per cluster, addressed by PC0 and PC1)
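A small, hypothetical sketch of the layout idea (the data structures are illustrative, not from the paper): each cluster's operations are laid out contiguously in its own region with no cross-cluster NOP padding, so the same logical block resolves to a different physical address in each cluster, and each cluster advances its own PC through its own region.

```python
# Hypothetical sketch of DVLIW code layout: per-cluster regions, per-cluster label addresses.
# Each logical basic block carries one operation list per cluster.
blocks = {
    "BB1": {"cluster0": ["add", "cmpp", "bcast"], "cluster1": ["mul"]},
    "BB2": {"cluster0": ["st"],                   "cluster1": ["ld", "add"]},
}

def layout(blocks, clusters):
    """Lay each cluster's code out contiguously and record per-cluster label addresses."""
    code = {c: [] for c in clusters}
    labels = {c: {} for c in clusters}
    for name, per_cluster in blocks.items():
        for c in clusters:
            labels[c][name] = len(code[c])   # same logical label, cluster-local address
            code[c].extend(per_cluster[c])
    return code, labels

code, labels = layout(blocks, ["cluster0", "cluster1"])
print(labels)   # {'cluster0': {'BB1': 0, 'BB2': 3}, 'cluster1': {'BB1': 0, 'BB2': 1}}
```

Because the same logical block sits at a different address in each cluster's stream, a branch must name its target per cluster, which is what the unbundled branch on the next slide provides.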
Branch Mechanism
• Maintain correct execution order
  • All clusters transfer control in the same cycle
  • All clusters branch to the same logical MultiOp
• Unbundled branch in HPL-PD
  • PBR btr1, TARGET: each cluster specifies its own target
  • CMPP pr0, (x>100)?: branch predicate, broadcast to all clusters
  • BR btr1, pr0: replicated in each cluster
Branch Handling Example

Conventional VLIW:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  …
  br btr1, pr0

DVLIW, cluster 0:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  bcast pr0
  br btr1, pr0

DVLIW, cluster 1:
  …
  pbr btr1, BB2'
  …
  …
  br btr1, pr0

Cluster 0 computes the predicate and broadcasts it; cluster 1 prepares its own target BB2' and branches on the broadcast predicate in the same cycle (a sketch of this handshake follows).
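Following the example above, a hypothetical sketch of the branch handshake (names and addresses are illustrative): the predicate is computed once and broadcast, every cluster has already prepared its own cluster-local target for the same logical block, and all clusters redirect in the same cycle.

```python
# Hypothetical sketch of the DVLIW branch handshake (illustrative, not the authors' code).
# Per-cluster physical address of the same logical block, e.g. from a layout like the one above.
labels = {"cluster0": {"BB2": 3}, "cluster1": {"BB2": 1}}
pcs = {"cluster0": 0, "cluster1": 0}

def take_branch(pcs, predicate, logical_target, labels):
    """All clusters transfer control in the same cycle; each jumps to its own
    physical address for the same logical target block."""
    if predicate:                                     # computed once (cmpp) and broadcast (bcast)
        for name in pcs:
            pcs[name] = labels[name][logical_target]  # pbr supplied the cluster-local target

x = 200
take_branch(pcs, x > 100, "BB2", labels)
print(pcs)   # {'cluster0': 3, 'cluster1': 1}: same logical branch, different physical PCs
```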
Sleep Mode
• Some clusters have idle blocks after operation distribution
• Put such a cluster into sleep mode (see the sketch below)
  • Compiler managed
  • Saves energy
  • Reduces code size
• Mode changes happen at block boundaries
(Figure: cluster 1 executes a SLEEP op and skips several blocks while cluster 0 keeps branching; a WAKE at a block boundary brings cluster 1 back)
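A minimal, hypothetical sketch of compiler-managed sleep mode (illustrative only, assuming a SLEEP/WAKE decision at each block boundary): a sleeping cluster stops fetching from its own stream, and a wake at a block boundary brings it back in step with the others.

```python
# Hypothetical sketch of compiler-managed sleep mode (illustrative, not the authors' code).
class ClusterState:
    def __init__(self, name):
        self.name = name
        self.asleep = False   # set by a SLEEP op, cleared by a WAKE at a block boundary

def enter_block(cluster, ops_in_block, wake):
    """At a block boundary the compiler may wake a sleeping cluster; a cluster with no
    useful ops in the block sleeps instead of fetching NOP-only code."""
    if wake:
        cluster.asleep = False
    if not ops_in_block:          # idle block after operation distribution
        cluster.asleep = True
    return not cluster.asleep     # True if the cluster fetches during this block

c1 = ClusterState("cluster1")
print(enter_block(c1, ops_in_block=[], wake=False))       # False: cluster sleeps
print(enter_block(c1, ops_in_block=["add"], wake=True))   # True: woken at a block boundary
```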
Experimental Setup
• Trimaran toolset
• Processor configuration
  • 4 clusters; 2 INT, 1 FP, 1 MEM, 1 BR unit per cluster
  • 16K L1 I-cache total
  • Perfect data cache assumed
• Power model
  • Verilog for the instruction align/shift logic
  • Wire model
  • CACTI cache model
• 21 benchmarks from MediaBench and SPECint2000
Change in Global Communication Bits
(Chart: per-benchmark results for MediaBench and SPECint)
Normalized Energy Consumption on the Control Path
• Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy), restated as a formula below
(Chart: per-benchmark normalized energy, with callouts of 40%, 67%, 80%, and 21% savings)
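Restating the slide's decomposition as a formula; the normalization against a conventional, centralized-control-path VLIW baseline is an assumption, consistent with the baseline named on the code-size slide:

\[
E_{\text{control}} = E_{\text{align/shift}} + E_{\text{wire}} + E_{\text{I-cache}},
\qquad
E_{\text{norm}} = \frac{E_{\text{control}}^{\mathrm{DVLIW}}}{E_{\text{control}}^{\mathrm{baseline}}} .
\]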
Normalized Code Size
• Baseline: conventional VLIW with compressed encoding
• Traditional method (single PC): 7x increase
• DVLIW: 40% increase
Result Summary
• DVLIW benefits
  • Order-of-magnitude reduction in global communication
  • 40% savings in control path energy
  • 5x code size reduction vs. simple distribution
• Small overhead for ILP execution on a CMP
  • 3% increase in execution cycles
  • 4% increase in I-cache stalls
Conclusions
• DVLIW removes the last centralized resource in a multicluster VLIW
  • Fully distributed control path
  • Scalable architecture
  • More energy efficient
• A stylized CMP architecture
  • Exploits ILP
  • Multiple instruction streams
  • Compiler orchestrated
Thank You
• For more information: http://cccp.eecs.umich.edu