
Architectural Enhancements for Efficient Operand Transport in Multimedia Systems

Architectural Enhancements for Efficient Operand Transport in Multimedia Systems. ECE7102 Class Presentation Date: 2006. 4. 13 Hongkyu Kim hongkim@ece.gatech.edu. Overview. Introduction Characterization and modeling of operand usage and transport





Presentation Transcript


  1. Architectural Enhancements for Efficient Operand Transport in Multimedia Systems ECE7102 Class Presentation Date: 2006. 4. 13 Hongkyu Kim hongkim@ece.gatech.edu

  2. Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary

  3. Interconnect Complexity • Exponential increase of chip capacity → More devices • Exponential decrease of feature size → Interconnect limitation J.D. Meindl, Interconnect Opportunities for Gigascale Integration, IEEE MICRO, vol. 23, no. 4, pp. 28-35, May/June 2003.

  4. Interconnect Bottleneck • Disparity between wire delay and gate delay [Figure: relative delay (log scale, 0.1 to 100) versus process technology node (nm; ticks 250, 180, 130, 90, 65, 45, 42); curve annotations 1/α² and α] ITRS 2002 Documents, http://public.itrs.net/Files/2002Update/Home.pdf.

  5. Problem Statement • High-performance interconnect • Interconnect organizations • Interconnect technologies • Why are architectural responses limited? • Compatibility with old ISAs • Sequentially-specified operations • Restricted register file-based operand namespace • ILP mechanisms • Operand bypass network, register renaming, and instruction scheduling • Poorly scaling broadcast buses

  6. Research Objective and Approach • Objective: Reduce latency of operand transport for multimedia • Development of dynamic execution techniques • Development of low-cost operand bypass networks • Approach summary

  7. Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary

  8. Motivation and Approach • Motivation • Shift of microarchitectural design focus: Operand computation → Operand communication • Recognizing and understanding operand usage and transport properties → Efficiently controlling operand traffic • Approach summary • Operand usage characteristics • How often operands are used → Examine temporal property • Where operands are used → Examine spatial property • Operand transport properties • What accounts for the majority of communication needs → Explore the impact of architectural techniques on the operand transport

  9. Operand Usage Analysis • General terms • Operands: values in registers, memory locations, or memory addresses • Operand transport: buffering and delivery of operands to FUs • Operands’ temporal characteristics • Which inst. consumes operands after they are produced • Metrics: Degree of use, Age, Lifetime • Operands’ spatial characteristics • From/to which FU operands are moved in the execution model • Metrics: Degree of functionality, Transport pattern
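The temporal metrics named on this slide can be sketched over a register trace. The slide lists the metrics but does not define them, so the definitions below are assumptions made for illustration: degree of use is the number of reads of a produced value, age is the distance in dynamic instructions from production to first use, and lifetime is the distance from production to last use. The function name `operand_usage_metrics` and the trace format are hypothetical.

```python
def operand_usage_metrics(trace):
    """Compute per-value temporal metrics from a dynamic instruction trace.

    trace: list of (dest, srcs) tuples in program order, where dest is a
    register name or None and srcs is a list of register names.
    Metric definitions are assumptions (see the lead-in), not the
    presentation's own definitions.
    """
    live = {}   # register -> record of the value it currently holds
    done = []   # records of values that have been overwritten
    for idx, (dest, srcs) in enumerate(trace):
        # Sources are read before the destination is written.
        for reg in srcs:
            rec = live.get(reg)
            if rec is not None:
                rec["uses"].append(idx)
        if dest is not None:
            if dest in live:
                done.append(live[dest])
            live[dest] = {"reg": dest, "born": idx, "uses": []}
    done.extend(live.values())

    results = []
    for rec in done:
        uses = rec["uses"]
        results.append({
            "reg": rec["reg"],
            "born": rec["born"],
            "degree_of_use": len(uses),
            "age": uses[0] - rec["born"] if uses else None,
            "lifetime": uses[-1] - rec["born"] if uses else 0,
        })
    return results


# Small hand-written trace: r1 is produced, read twice, then overwritten.
example_trace = [
    ("r1", []),
    ("r2", ["r1"]),
    ("r3", ["r1", "r2"]),
    ("r1", ["r3"]),
]
metrics = operand_usage_metrics(example_trace)
```

Histogramming `degree_of_use`, `age`, and `lifetime` over a full benchmark trace would give distributions of the kind the presentation reports, under the stated definitions.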

  10. Operand Transport Analysis • Operand transport model

  11. Preliminary Results • Operand usage properties (MediaBench average) [Charts: four distributions; legend bins as extracted: 0, 1, 2, 3, >3 · 1, 2, 3~5, >5 · 1, 2, 3~5, 6~10, >10 · 0, 1 (same), 1 (different), >1] H. Kim, D. Wills, and L. Wills, “Empirical analysis of operand usage and transport in multimedia applications,” Proc. of the International Workshop on System-on-Chip for Real-Time Applications, pp. 168-171, July 2004.

  12. Preliminary Results (cntd.) • Operand transport pattern (MediaBench average) • integer → integer: 43.0% • integer → branch: 14.9% • ld/st → integer: 13.8% • integer → ld/st: 13.6% • ld/st → ld/st: 6.6% • Others: 8.1%
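A breakdown like the one on this slide could be tallied from a trace by recording which functional-unit class produced each register value and which class consumes it. This is a minimal sketch, not the authors' tool; the function name `transport_pattern_shares`, the trace format, and the class labels are assumptions.

```python
from collections import Counter


def transport_pattern_shares(trace):
    """Return the percentage of operand transfers per (producer, consumer) pair.

    trace: list of (fu_class, dest, srcs) tuples in program order.
    fu_class is a label such as "integer", "ld/st", or "branch"
    (hypothetical labels chosen to match the slide).
    """
    producer_class = {}   # register -> FU class that last wrote it
    counts = Counter()
    for fu_class, dest, srcs in trace:
        for reg in srcs:
            if reg in producer_class:
                counts[(producer_class[reg], fu_class)] += 1
        if dest is not None:
            producer_class[dest] = fu_class
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: 100.0 * n / total for pair, n in counts.items()}


# Five transfers: two integer->integer, one each of the other three kinds.
example_trace = [
    ("integer", "r1", []),
    ("integer", "r2", ["r1"]),
    ("ld/st", "r3", ["r2"]),
    ("integer", "r4", ["r3", "r1"]),
    ("branch", None, ["r4"]),
]
shares = transport_pattern_shares(example_trace)
```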

  13. Preliminary Results (cntd.) • Effective architectural techniques on operand transport • Storage hierarchy: local buffering • Dedicated transport network • Lifetime detection: compile-time/run-time • Smart instruction steering

  14. Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary

  15. Motivation and Approach • Motivation: Multimedia applications • Operand movement is highly regular • Most operands are short-lived, transient operands • Develop dynamic execution technique exploiting regular operand distribution patterns and local properties • Approach summary • Instruction clustering: dynamic instruction grouping • Recognition of regular operand transport pattern • Efficient execution unit: reduce transport latency

  16. Related Work • Solutions for multimedia processing • Multimedia-specific ISA extensions • Exploit data-level parallelism at subword level • General-purpose domain: Intel’s MMX and SSE, AMD’s 3DNow!, Sun’s VIS, IBM’s AltiVec • Application-specific signal processing domain: Analog Devices’ TigerSHARC, TriMedia • Vectorization and retargeting • Manual assembly coding • Hand-optimization: in-lined assembly code, library routines • Automatic vectorization: compiler/retargeting technology

  17. Related Work (cntd.) • Solutions for reducing operand transport complexity • Communication-aware execution • Network-connected tile architecture: RAW, GPA • Transport triggered architecture: MOVE • Resource partitioning: Clustered architectures • Heterogeneous: decoupled architecture • Commercial: DEC Alpha 21264 • Academia: Multicluster, Palacharla’s, PEWs, ILDP, CTCP • Dynamic optimizations • Fill unit: reform instructions in H/W, and cache them • Small-scale dependence collapsing: combine dependences among multiple instructions → macro instruction

  18. Related Research Landscape

  19. Research Methodology

  20. Dynamic Instruction Clustering • Instruction Cluster • A connected subgraph of instructions joined by local operands • Dataflow graph → Dependence edge classification → Instruction grouping • Dependence edge types • External: produced/consumed by previous/next blocks • Non-clusterable: operands from/to memory • Local: produced and consumed within the same block
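The three steps on this slide (dataflow graph, dependence edge classification, instruction grouping) can be sketched for a single basic block. This is an illustration under stated assumptions, not the hardware mechanism from the presentation: an edge is treated as non-clusterable when its producer or consumer is a memory operation, only input edges are classified because live-out information is not modeled, and the names `form_clusters`, `MEMORY_OPS`, and the instruction format are hypothetical.

```python
from collections import defaultdict

MEMORY_OPS = {"load", "store"}  # assumed opcode names


def form_clusters(block):
    """Classify dependence edges in one basic block and group instructions.

    block: list of dicts {"op": str, "dest": reg or None, "srcs": [regs]}.
    Returns (edges, clusters): edges are (producer_idx, consumer_idx, reg,
    kind) tuples, clusters are lists of instruction indices connected by
    local edges.
    """
    parent = list(range(len(block)))  # union-find forest over instructions

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    producer = {}  # register -> index of the instruction that last wrote it
    edges = []
    for i, ins in enumerate(block):
        for reg in ins["srcs"]:
            p = producer.get(reg)
            if p is None:
                kind = "external"           # produced by a previous block
            elif ins["op"] in MEMORY_OPS or block[p]["op"] in MEMORY_OPS:
                kind = "non-clusterable"    # operand from/to memory
            else:
                kind = "local"              # produced and consumed here
                ra, rb = find(p), find(i)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)
            edges.append((p, i, reg, kind))
        if ins["dest"] is not None:
            producer[ins["dest"]] = i

    groups = defaultdict(list)
    for i in range(len(block)):
        groups[find(i)].append(i)
    clusters = [members for members in groups.values() if len(members) > 1]
    return edges, clusters


# Hand-written block: two loads feed a multiply-add chain that is stored.
example_block = [
    {"op": "load", "dest": "r1", "srcs": ["r10"]},
    {"op": "load", "dest": "r2", "srcs": ["r11"]},
    {"op": "mul", "dest": "r3", "srcs": ["r1", "r20"]},
    {"op": "mul", "dest": "r4", "srcs": ["r2", "r21"]},
    {"op": "add", "dest": "r5", "srcs": ["r3", "r4"]},
    {"op": "shr", "dest": "r6", "srcs": ["r5"]},
    {"op": "store", "dest": None, "srcs": ["r6", "r12"]},
]
edges, clusters = form_clusters(example_block)
```

In this example the two multiplies, the add, and the shift form one cluster, while the loads and the store stay outside it.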

  21. Instruction Clustering Example • Color conversion block in JPEG encoder

  22. Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary

  23. Implementation Example - I • Raw cluster execution on inter-ALU network • Focus on intermediate, short-lived operands • Local operands: inter-ALU dedicated bypass network • Others: traditional global bypass network • Organization • Instruction cluster formation • Cluster queue and scheduling • Cluster execution: inter-ALU network H. Kim, D. Wills, and L. Wills, “Reducing operand communication overhead using instruction clustering for multimedia applications,” Proc. of the 7th International Symposium on Multimedia, December 2005.

  24. Cluster Queue and Scheduling • Organization of cluster queue • Single entry per cluster (2D) • Ready flags for local operands are always set • Issue pointer for each entry, in-order issue
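One way to read this organization is sketched below: each queue entry holds a whole cluster, instructions inside the entry issue in order through an issue pointer, and only external operands are checked for readiness because local operands are treated as always ready. The class `ClusterEntry` and its interface are hypothetical, and execution latency is deliberately not modeled.

```python
class ClusterEntry:
    """One cluster-queue entry (illustrative sketch, not the actual design)."""

    def __init__(self, external_srcs):
        # external_srcs[k] is the set of external registers that
        # instruction k of the cluster needs. Local operands are not
        # listed because their ready flags are always set.
        self.external_srcs = external_srcs
        self.issue_ptr = 0

    def done(self):
        return self.issue_ptr == len(self.external_srcs)

    def try_issue(self, ready_regs):
        """Issue the instruction at the issue pointer if it is ready.

        Returns the index of the issued instruction, or None when the
        entry is finished or the next instruction still waits on an
        external operand. Younger instructions never bypass it.
        """
        if self.done():
            return None
        if self.external_srcs[self.issue_ptr].issubset(ready_regs):
            issued = self.issue_ptr
            self.issue_ptr += 1
            return issued
        return None


# Four-instruction cluster: the first two need one external register each.
entry = ClusterEntry([{"r20"}, {"r21"}, set(), set()])
first = entry.try_issue({"r20"})      # instruction 0 issues
stalled = entry.try_issue({"r20"})    # instruction 1 waits for r21
```

Keeping one pointer per entry means wakeup logic only has to track external operands, which is the saving the slide points at.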

  25. Cluster Execution Unit • Cluster mapping on inter-ALU network • Local operands: dedicated bypass network • Others: traditional global bypass network

  26. Experimental Setup • Simulation Environment • SimpleScalar sim-outorder simulator • MediaBench application programs • Processor Configurations

  27. Experimental Result • Dynamic instruction coverage

  28. Experimental Result (cntd.) • Operand transport types [Chart: breakdown of operand transport types; values as extracted: 59.5%, 57.8%, 11.0%, 10.6%, 29.5%, 31.5%]

  29. Experimental Result (cntd.) • IPC speedup

  30. Summary • Summary of approach • Dynamically group dependent instructions into clusters • Store regular operand transport patterns • Execute them on inter-ALU network where intermediate values are propagated among ALUs without using global buses • Summary of results (MediaBench average) • Dynamic instruction coverage • Shortest transport rate • IPC speedup [Values as extracted from the slide: 57.3% @ 256-entry cluster cache; 30%, 32%; 16-way, 8-way; 16-way, 8-way; 16.2%, 35.2%]

  31. Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary

  32. Implementation Example - II • Data parallel execution using dynamic SIMDization • Observation (Image processing applications) • Operand movement within a loop iteration is highly regular • Small number of inner loops covers most of execution time • Focus on regular operand transport pattern between iterations of innermost loop • Stride prediction: break loop-carried dependences → data-parallel execution • Operand lifetime detection → operand traffic control • Organization • Instruction cluster formation • SIMD instruction queue and scheduling • SIMD PE array

  33. Dynamic Instruction Clustering • External dependence edge types • External-input: serving only as input • External-output: serving only as output • External-updated: serving as both input and output • Parallel and non-parallel region detection • p-cluster: producing no external-updated output and not having unpredicted external-updated input • np-cluster
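The external edge types and the p-cluster condition can be sketched for one loop body. This reads the slide literally and fills the gaps with assumptions: a register is external-updated when it is read before being written and is also written in the body, external-output registers require a caller-supplied live-out set, and the names `classify_external` and `is_p_cluster` are hypothetical.

```python
def classify_external(body, live_out):
    """Classify the registers of a loop body by external edge type.

    body: list of (dest, srcs) tuples in program order for one iteration.
    live_out: set of registers read after the loop (assumed to be known).
    """
    read_first, written = set(), set()
    for dest, srcs in body:
        for reg in srcs:
            if reg not in written:
                read_first.add(reg)   # value comes from outside this iteration
        if dest is not None:
            written.add(dest)
    return {
        "external-input": read_first - written,
        "external-output": (written - read_first) & live_out,
        "external-updated": read_first & written,
    }


def is_p_cluster(cluster, updated, predicted):
    """A cluster is parallel if it writes no external-updated register and
    every external-updated register it reads has a predicted value."""
    writes_updated = any(dest in updated for dest, _ in cluster)
    reads_unpredicted = any(
        reg in updated and reg not in predicted
        for _, srcs in cluster
        for reg in srcs
    )
    return not writes_updated and not reads_unpredicted


# Loop body: address computation, accumulate, index increment.
example_body = [
    ("t1", ["i", "base"]),
    ("t2", ["t1"]),
    ("acc", ["acc", "t2"]),
    ("i", ["i"]),
]
kinds = classify_external(example_body, live_out={"t2"})
```

Here `i` and `acc` are loop-carried, so the first two instructions form a p-cluster only once `i` is predicted, and the accumulate instruction stays in an np-cluster.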

  34. Instruction Clustering Example • Image convolution code in TI’s IMGLIB

  35. SIMD Execution Unit • Cluster scheduling on SIMD PE array

  36. SIMD Execution Unit (cntd.) • Operand transport model

  37. Summary of Approach • Dynamic parallelization • Detect regular operand transport pattern on external-updated operands • Compute stride → predict external-updated values • Optimizing operand transport • Identify the lifetime of operands • Remove needless communication → localize transport • Execute the clusters on 1-D mesh SIMD PE array
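The "compute stride, then predict" step can be sketched as a simple last-value-plus-stride predictor. This is a generic stride predictor written for illustration, not the presentation's hardware; the class name `StridePredictor` and the rule that the same stride must be seen twice in a row before predicting are assumptions.

```python
class StridePredictor:
    """Track one external-updated register across loop iterations."""

    def __init__(self):
        self.last = None        # most recent observed value
        self.stride = None      # most recent observed difference
        self.confident = False  # True once the same stride repeats

    def observe(self, value):
        """Record the value the register holds at the start of an iteration."""
        if self.last is not None:
            new_stride = value - self.last
            self.confident = new_stride == self.stride
            self.stride = new_stride
        self.last = value

    def predict(self, n):
        """Predict the next n values, or return None when not confident."""
        if not self.confident:
            return None
        return [self.last + self.stride * k for k in range(1, n + 1)]


# An address that advances by 4 each iteration.
predictor = StridePredictor()
for observed in (100, 104, 108):
    predictor.observe(observed)
next_values = predictor.predict(4)
```

With predicted values for the next `n` iterations, each iteration no longer waits on the previous one for that register, so the iterations can be handed to separate processing elements.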

  38. Overview • Introduction • Characterization and modeling of operand usage and transport • Dynamic execution technique exploiting regular operand transport patterns in multimedia • Instruction cluster mapping on the inter-ALU network for general-purpose domain • Dynamic SIMDization for application-specific domain • Summary

  39. Summary • Characterization and modeling of operand usage and transport • Examine the operand usage properties • Explore the impact of architectural techniques on the operand transport • Development of a dynamic execution technique • Instruction clustering • Recognition of regular operand transport pattern • Efficient execution unit

  40. Thank you. Any questions?
