Architectural Enhancements for Efficient Operand Transport in Multimedia Systems ECE7102 Class Presentation Date: 2006. 4. 13 Hongkyu Kim hongkim@ece.gatech.edu
Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary
Interconnect Complexity • Exponential increase of chip capacity → More devices • Exponential decrease of feature size → Interconnect limitation J.D. Meindl, Interconnect Opportunities for Gigascale Integration, IEEE MICRO, vol. 23, no. 4, pp. 28-35, May/June 2003.
Interconnect Bottleneck • Disparity between wire delay and gate delay [Figure: relative delay (log scale) versus process technology node (nm); curves labeled 1/α² and α] ITRS 2002 Documents, http://public.itrs.net/Files/2002Update/Home.pdf.
Problem Statement • High-performance interconnect • Interconnect organizations • Interconnect technologies • Why are architectural responses limited? • Compatibility with old ISAs • Sequentially-specified operations • Restricted register file-based operand namespace • ILP mechanisms • Operand bypass network, register renaming, and instruction scheduling • Poorly scaling broadcast buses
Research Objective and Approach • Objective: Reduce latency of operand transport for multimedia • Development of dynamic execution techniques • Development of low-cost operand bypass networks • Approach summary
Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary
Motivation and Approach • Motivation • Shift of microarchitectural design focus: operand computation → operand communication • Recognizing and understanding operand usage and transport properties → efficiently controlling operand traffic • Approach summary • Operand usage characteristics • How often operands are used → examine temporal property • Where operands are used → examine spatial property • Operand transport properties • What accounts for the majority of communication needs → explore the impact of architectural techniques on operand transport
Operand Usage Analysis • General terms • Operands: values in registers, memory locations, or memory addresses • Operand transport: buffering and delivery of operands to FUs • Operands’ temporal characteristics • Which instructions consume operands after they are produced • Metrics: Degree of use, Age, Lifetime • Operands’ spatial characteristics • From/to which FU operands are moved in the execution model • Metrics: Degree of functionality, Transport pattern
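The temporal metrics named above can be illustrated with a minimal sketch. The slide does not define them, so the definitions here are assumptions: degree of use is the number of instructions that read a value before its register is overwritten, age is the distance in instructions from production to first use, and lifetime is the distance from production to last use. The `Inst` record and `operand_usage` function are hypothetical names, not part of the original work.

```python
from collections import namedtuple

# Hypothetical trace record: destination register (or None) and source registers.
Inst = namedtuple("Inst", ["dest", "srcs"])

def operand_usage(trace):
    """Return one dict per produced value with its assumed temporal metrics."""
    live = {}      # register -> record of the value it currently holds
    records = []   # one record per produced value, in program order
    for i, inst in enumerate(trace):
        # Sources are read before the destination is written.
        for reg in set(inst.srcs):
            rec = live.get(reg)
            if rec is not None:
                rec["uses"].append(i)
        if inst.dest is not None:
            rec = {"reg": inst.dest, "born": i, "uses": []}
            live[inst.dest] = rec   # overwriting ends the previous value
            records.append(rec)
    result = []
    for rec in records:
        uses = rec["uses"]
        result.append({
            "reg": rec["reg"],
            "degree_of_use": len(uses),
            "age": uses[0] - rec["born"] if uses else None,
            "lifetime": uses[-1] - rec["born"] if uses else None,
        })
    return result

trace = [
    Inst("r1", ()),            # 0: r1 produced
    Inst("r2", ("r1",)),       # 1: reads r1
    Inst("r3", ("r1", "r2")),  # 2: reads r1 and r2
    Inst("r1", ("r3",)),       # 3: overwrites r1
    Inst("r4", ("r1",)),       # 4: reads the new r1
]
stats = operand_usage(trace)
```

Here the first value of r1 is read by instructions 1 and 2, so under these assumed definitions it has degree of use 2, age 1, and lifetime 2, while r4 is never read and has degree of use 0.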
Operand Transport Analysis • Operand transport model
Preliminary Results • Operand usage properties (MediaBench average) [Figure: distribution charts of operand usage metrics; legend bins: 0, 1, 2, 3, >3 • 1, 2, 3~5, >5 • 1, 2, 3~5, 6~10, >10 • 0, 1 (same), 1 (different), >1] H. Kim, D. Wills, and L. Wills, “Empirical analysis of operand usage and transport in multimedia applications,” Proc. of the International Workshop on System-on-Chip for Real-Time Applications, pp. 168-171, July 2004.
Preliminary Results (cntd.) • Operand transport pattern (MediaBench average) • integer → integer: 43.0% • integer → branch: 14.9% • ld/st → integer: 13.8% • integer → ld/st: 13.6% • ld/st → ld/st: 6.6% • Others: 8.1%
Preliminary Results (cntd.) • Architectural techniques effective for operand transport • Storage hierarchy: local buffering • Dedicated transport network • Lifetime detection: compile-time/run-time • Smart instruction steering
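A transport-pattern breakdown of this kind can be gathered by tagging each register dependence with the functional-unit class of its producer and of its consumer. The sketch below is a simplified assumption of how such a count might be made: the trace encoding, the class names, and the `transport_pattern` function are hypothetical, and memory-carried dependences are ignored.

```python
from collections import Counter

def transport_pattern(trace):
    """trace: list of (fu_class, dest, srcs). Returns the fraction per (producer, consumer) pair."""
    producer = {}      # register -> FU class of the instruction that last wrote it
    counts = Counter()
    for fu_class, dest, srcs in trace:
        for reg in srcs:
            if reg in producer:
                counts[(producer[reg], fu_class)] += 1
        if dest is not None:
            producer[dest] = fu_class
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}

trace = [
    ("ld/st", "r1", ()),          # load produces r1
    ("integer", "r2", ("r1",)),   # ld/st -> integer
    ("integer", "r3", ("r2",)),   # integer -> integer
    ("branch", None, ("r3",)),    # integer -> branch
]
pattern = transport_pattern(trace)
```

On this four-instruction trace each of the three observed pairs accounts for one third of the transports.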
Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary
Motivation and Approach • Motivation: Multimedia applications • Operand movement is highly regular • Most operands are short-lived, transient operands • Develop dynamic execution technique exploiting regular operand distribution patterns and local properties • Approach summary • Instruction clustering: dynamic instruction grouping • Recognition of regular operand transport pattern • Efficient execution unit: reduce transport latency
Related Work • Solutions for multimedia processing • Multimedia-specific ISA extensions • Exploit data-level parallelism at subword level • General-purpose domain: Intel’s MMX and SSE, AMD’s 3DNow!, Sun’s VIS, IBM’s AltiVec • Application-specific signal processing domain: Analog Devices’ TigerSHARC, TriMedia • Vectorization and retargeting • Manual assembly coding • Hand-optimization: in-lined assembly code, library routines • Automatic vectorization: compiler/retargeting technology
Related Work (cntd.) • Solutions for reducing operand transport complexity • Communication-aware execution • Network-connected tile architecture: RAW, GPA • Transport triggered architecture: MOVE • Resource partitioning: Clustered architectures • Heterogeneous: decoupled architecture • Commercial: DEC Alpha 21264 • Academia: Multicluster, Palacharla’s, PEWs, ILDP, CTCP • Dynamic optimizations • Fill unit: reform instructions in H/W, and cache them • Small-scale dependence collapsing: combine dependences among multiple instructions → macro instruction
Dynamic Instruction Clustering • Instruction Cluster • A connected subgraph of instructions joined by local operands • Dataflow graph → Dependence edge classification → Instruction grouping • Dependence edge types • External: produced/consumed by previous/next blocks • Non-clusterable: operands from/to memory • Local: produced and consumed within the same block
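The three steps above (dataflow graph, edge classification, grouping) can be sketched as follows. This is an illustration under stated assumptions, not the hardware mechanism from the talk: a block is a list of `(op, dest, srcs)` tuples, the opcodes `ld` and `st` mark memory instructions, any edge touching a memory instruction is treated as non-clusterable, and only incoming external edges are detected. `MEM_OPS`, `classify_edges`, and `form_clusters` are hypothetical names.

```python
MEM_OPS = {"ld", "st"}  # assumed opcode names for memory instructions

def classify_edges(block):
    """Return (producer_index_or_None, consumer_index, kind) for every source operand."""
    producer = {}   # register -> index of the in-block instruction that wrote it
    edges = []
    for i, (op, dest, srcs) in enumerate(block):
        for reg in srcs:
            p = producer.get(reg)
            if p is None:
                kind = "external"            # produced by a previous block
            elif block[p][0] in MEM_OPS or op in MEM_OPS:
                kind = "non-clusterable"     # operand from/to memory
            else:
                kind = "local"               # produced and consumed in this block
            edges.append((p, i, kind))
        if dest is not None:
            producer[dest] = i
    return edges

def form_clusters(block):
    """Group instructions into connected subgraphs joined by local edges (union-find)."""
    parent = list(range(len(block)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for p, c, kind in classify_edges(block):
        if kind == "local":
            parent[find(c)] = find(p)
    groups = {}
    for i in range(len(block)):
        groups.setdefault(find(i), []).append(i)
    # Single instructions with no local edge are not reported as clusters.
    return sorted(g for g in groups.values() if len(g) > 1)

block = [
    ("ld",  "r1", ("r10",)),       # 0
    ("mul", "r2", ("r1", "r11")),  # 1
    ("add", "r3", ("r2", "r12")),  # 2
    ("sub", "r4", ("r3", "r2")),   # 3
    ("st",  None, ("r13", "r4")),  # 4
    ("add", "r5", ("r14", "r15")), # 5
]
clusters = form_clusters(block)
```

In this example instructions 1, 2, and 3 are joined by local operands and form one cluster; the load, the store, and the independent add stay outside it.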
Instruction Clustering Example • Color conversion block in JPEG encoder
Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary
Implementation Example - I • Raw cluster execution on inter-ALU network • Focus on intermediate, short-lived operands • Local operands: inter-ALU dedicated bypass network • Others: traditional global bypass network • Organization • Instruction cluster formation • Cluster queue and scheduling • Cluster execution: inter-ALU network H. Kim, D. Wills, and L. Wills, “Reducing operand communication overhead using instruction clustering for multimedia applications,” Proc. of the 7th International Symposium on Multimedia, December 2005.
Cluster Queue and Scheduling • Organization of cluster queue • Single entry per cluster (2D) • Ready flags for local operands are always set • Issue pointer for each entry, in-order issue
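The queue organization above can be modeled in a few lines. This is a behavioral sketch only, under the assumption that each entry issues at most one instruction per call and that only non-local source operands need a readiness check; the `ClusterEntry` class and its methods are hypothetical.

```python
class ClusterEntry:
    """One cluster-queue entry holding a whole cluster (hypothetical model)."""

    def __init__(self, external_srcs):
        # external_srcs[i]: set of non-local registers instruction i waits on.
        # Local operands are omitted because their ready flags are always set.
        self.external_srcs = external_srcs
        self.issue_ptr = 0  # next instruction to issue, in program order

    def done(self):
        return self.issue_ptr == len(self.external_srcs)

    def try_issue(self, ready_regs):
        """Issue the instruction at the issue pointer if its operands are ready."""
        if self.done():
            return None
        if self.external_srcs[self.issue_ptr] <= ready_regs:
            issued = self.issue_ptr
            self.issue_ptr += 1
            return issued
        return None  # stall: younger instructions in this entry may not bypass

entry = ClusterEntry([{"r1"}, set(), {"r2"}])
first = entry.try_issue(set())           # r1 not ready yet, so the entry stalls
second = entry.try_issue({"r1"})         # instruction 0 issues
third = entry.try_issue({"r1"})          # instruction 1 has only local operands
fourth = entry.try_issue({"r1"})         # instruction 2 waits for r2
fifth = entry.try_issue({"r1", "r2"})    # instruction 2 issues
```

Because issue within an entry is in order, a single pointer per entry replaces per-instruction wakeup logic for the local operands.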
Cluster Execution Unit • Cluster mapping on inter-ALU network • Local operands: dedicated bypass network • Others: traditional global bypass network
Experimental Setup • Simulation Environment • SimpleScalar sim-outorder simulator • MediaBench application programs • Processor Configurations
Experimental Result • Dynamic instruction coverage
Experimental Result (cntd.) • Operand transport types [Chart: share of operand transports by type; extracted values: 59.5%, 57.8%, 11.0%, 10.6%, 29.5%, 31.5%]
Experimental Result (cntd.) • IPC speedup
Summary • Summary of approach • Dynamically group dependent instructions into clusters • Store regular operand transport patterns • Execute them on the inter-ALU network, where intermediate values are propagated among ALUs without using global buses • Summary of results (MediaBench average) • Dynamic instruction coverage • Shortest transport rate • IPC speedup 57.3% @ 256-entry cluster cache 30% 32% 16-way 8-way 16-way 8-way 16.2% 35.2%
Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary
Implementation Example - II • Data parallel execution using dynamic SIMDization • Observation (Image processing applications) • Operand movement within a loop iteration is highly regular • A small number of inner loops covers most of the execution time • Focus on regular operand transport pattern between iterations of innermost loop • Stride prediction: break loop-carried dependences → data-parallel execution • Operand lifetime detection → operand traffic control • Organization • Instruction cluster formation • SIMD instruction queue and scheduling • SIMD PE array
Dynamic Instruction Clustering • External dependence edge types • External-input: serving only as input • External-output: serving only as output • External-updated: serving as both input and output • Parallel and non-parallel region detection • p-cluster: producing no external-updated output and not having unpredicted external-updated input • np-cluster
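The edge types and the p-cluster rule above can be written down as a small check. The sketch assumes one loop iteration is given as a list of `(dest, srcs)` tuples and that the set of predicted external-updated registers is supplied by some predictor; `classify_external` and `is_p_cluster` are hypothetical names. Without liveness information, the external-output set computed here also contains temporaries that never leave the iteration.

```python
def classify_external(body):
    """Classify registers of one loop iteration by how they cross its boundary."""
    written = set()
    upward_exposed = set()   # read before any write in this iteration
    for dest, srcs in body:
        upward_exposed |= {r for r in srcs if r not in written}
        if dest is not None:
            written.add(dest)
    return {
        "external-input": upward_exposed - written,    # only read
        "external-updated": upward_exposed & written,  # read, then rewritten
        "external-output": written - upward_exposed,   # only written (see note above)
    }

def is_p_cluster(cluster, updated, predicted):
    """True if the cluster writes no external-updated register and
    every external-updated register it reads has been predicted."""
    for dest, srcs in cluster:
        if dest in updated:
            return False
        if any(r in updated and r not in predicted for r in srcs):
            return False
    return True

body = [
    ("r2", ("r1",)),        # 0: r2 = f(r1)
    ("r3", ("r2", "r5")),   # 1: r3 = r2 * r5
    ("r1", ("r1",)),        # 2: r1 = r1 + stride (index update)
    ("r4", ("r4", "r3")),   # 3: r4 = r4 + r3 (accumulation)
]
kinds = classify_external(body)
updated = kinds["external-updated"]
compute_is_parallel = is_p_cluster(body[0:2], updated, predicted={"r1"})
index_is_parallel = is_p_cluster(body[2:3], updated, predicted={"r1"})
```

In this example the compute cluster (instructions 0 and 1) qualifies as a p-cluster once r1 is predicted, while the index update and the accumulation each write an external-updated register and are np-clusters.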
Instruction Clustering Example • Image convolution code in TI’s IMGLIB
SIMD Execution Unit • Cluster scheduling on SIMD PE array
SIMD Execution Unit (cntd.) • Operand transport model
Summary of Approach • Dynamic parallelization • Detect regular operand transport pattern on external-updated operands • Compute stride → predict external-updated values • Optimizing operand transport • Identify the lifetime of operands • Remove needless communication → localize transport • Execute the clusters on 1-D mesh SIMD PE array
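The stride step can be illustrated with a short sketch. It assumes an external-updated value is treated as predictable when its last few observed values differ by a constant, and that one predicted value is handed to each PE so that iterations can start without waiting for the loop-carried update. The three-observation threshold and the names `detect_stride` and `predict_values` are assumptions made for illustration.

```python
def detect_stride(history):
    """Return the constant difference between consecutive values, or None."""
    if len(history) < 3:          # assumed confirmation threshold
        return None
    deltas = {b - a for a, b in zip(history, history[1:])}
    return deltas.pop() if len(deltas) == 1 else None

def predict_values(last, stride, lanes):
    """Predicted external-updated value for each of the next `lanes` iterations."""
    return [last + stride * (k + 1) for k in range(lanes)]

history = [0, 4, 8, 12]               # observed values of a loop index
stride = detect_stride(history)
per_pe_inputs = predict_values(history[-1], stride, lanes=4)
```

With the history above the detected stride is 4, so four PEs would receive 16, 20, 24, and 28 as their index inputs; an irregular history such as 0, 4, 9 yields no prediction and the affected clusters stay non-parallel.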
Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary
Summary • Characterization and modeling of operand usage and transport • Examine the operand usage properties • Explore the impact of architectural techniques on the operand transport • Development of a dynamic execution technique • Instruction clustering • Recognition of regular operand transport pattern • Efficient execution unit