Hardware Architectures to Support Low Power Natural I/O Applications • Rajeev Krishna • Advanced Computer Architecture Lab, University of Michigan
Why Look at Natural I/O? • Wave of the present! • Example: Basic speech recognition everywhere • Representative of Natural I/O applications • Why? • Ubiquitous computing • More versatile • What is the problem? • Computational complexity vs. available performance • Constraints of mobile computing platforms
Computation and Energy: Supply vs. Demand • Continuous, speaker-independent, large-vocabulary recognition • Embedded Processor Performance • SA1110 (200MHz) – 20 wpm, 6 hours • XScale (400MHz) – 50 wpm, 2 hours
Computation and Energy: Supply vs. Demand • Mobile (laptop) Processor Performance • PIII (1GHz) – 200 wpm, 6 minutes • Performance vs. Accuracy vs. Energy
Outline of Presentation • Speech Recognition Theory • Architectural and Programming Model • Architectural Evaluation • Memory System Design • Power Management • Conclusions
Algorithmic Challenges • What is so hard about speech recognition? • Time Warping • Co-articulations • Boundary Identification • Word Selection • Imagine listening in a noisy room • Estimate likely sounds • Apply context clues • Guess
The Process • DSP signal processing (straightforward) • Pattern mapping to the knowledge base (the key behaviors) • Acoustic scoring • Linguistic scoring — sketched below
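A minimal sketch of these stages as a per-frame pipeline. The function names and the toy scoring math are illustrative placeholders I am assuming for exposition, not the CMU-Sphinx implementation:

```c
/* Hypothetical per-frame recognition pipeline -- a sketch of the
 * stages above, not the actual Sphinx code path. */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_FEATS   13  /* e.g., MFCC coefficients per frame */
#define NUM_SENONES 8   /* toy acoustic-model size */

/* Stage 1: DSP front end (straightforward) -- reduce a window of raw
 * samples to a small feature vector. A real front end uses FFT + mel
 * filter banks; a dummy log-energy measure stands in here. */
static void extract_features(const short *samples, size_t n,
                             float feats[NUM_FEATS])
{
    float energy = 0.0f;
    for (size_t i = 0; i < n; i++)
        energy += (float)samples[i] * (float)samples[i];
    for (int j = 0; j < NUM_FEATS; j++)
        feats[j] = logf(1.0f + energy / (float)(j + 1));
}

/* Stage 2: acoustic scoring -- rate each acoustic unit against the
 * feature vector (placeholder distance metric; higher = more likely). */
static void acoustic_score(const float feats[NUM_FEATS],
                           float scores[NUM_SENONES])
{
    for (int s = 0; s < NUM_SENONES; s++) {
        float d = 0.0f;
        for (int j = 0; j < NUM_FEATS; j++) {
            float diff = feats[j] - (float)s;
            d += diff * diff;
        }
        scores[s] = -d;
    }
}

int main(void)
{
    short window[160] = {0};  /* 10ms of 16kHz audio */
    float feats[NUM_FEATS], scores[NUM_SENONES];

    extract_features(window, 160, feats);
    acoustic_score(feats, scores);
    /* Stage 3, linguistic scoring, consumes `scores` each frame --
     * see the search sketch later in the deck. */
    printf("senone 0 score: %f\n", scores[0]);
    return 0;
}
```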
Linguistic Search • Example: matching "Their Car" against the phone sequence DH EH R [word] K AA R • [animation: the lexical search tree grows frame by frame from the root phone DH, weighted by transition probabilities such as P("DH"); candidate paths branch into competing words ("Their", "The", "Ear", "Cap", "Cat", ...) until the phone lattice covers thousands of active nodes]
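Under the hood this search is dynamic programming over the phone lattice. Here is a minimal sketch of one propagation step; the data layout (`node_t`, `propagate`) is my illustrative assumption, not Sphinx's internal structure:

```c
/* One step of the linguistic search: each active node of the lexical
 * tree extends its best path score into its children, adding a
 * transition log-probability and the child phone's acoustic score for
 * this frame (Viterbi-style dynamic programming). */
typedef struct node {
    int           phone_id;    /* index into this frame's acoustic scores */
    float         path_score;  /* best log-probability of any path ending here */
    struct node **children;    /* next phones in the lexical tree */
    int           n_children;
} node_t;

/* Assumes every child's path_score was reset to -INFINITY at the
 * start of the frame, so the first incoming path always wins. */
static void propagate(node_t *parent, const float *acoustic_scores,
                      float trans_logp)
{
    for (int i = 0; i < parent->n_children; i++) {
        node_t *c = parent->children[i];
        float s = parent->path_score + trans_logp
                + acoustic_scores[c->phone_id];
        if (s > c->path_score)
            c->path_score = s;  /* keep only the best incoming path */
    }
}
```

Each frame touches thousands of such nodes with near-random addresses, which is exactly the poor-locality, high-concurrency behavior characterized on the next slide.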
General Characteristics • Poor Memory Performance • Large memory footprint • Little locality in reference stream • Little low-level predictability • Thread-Level Concurrency • 1,000s to 10,000s of active nodes per iteration • Relatively little interdependence
Target Model • Exploit Concurrency • Fine-grain thread management • Minimal communication • Parallel execution • Tolerate Latency • Maximize processor utilization • Hardware Multithreading • Runtime Adaptation • Unknown, input-driven behavior • Dynamic Programming Model
Architectural Model – Overview • Base XScale 400MHz embedded processor • Speech processing unit • Memory system interface
Architectural Model – Processing Element • Execution model based on simple integer pipeline • Per-thread register contexts • Control logic / Work Queue • Small cache
Programming Model • Maximum concurrency, minimum communication, dynamic • Expose all reasonable concurrency to hardware • Initial static workload distribution + dynamic balancing • Key-based lock-less fine-grain mutual exclusion • Primitive: spawn([PC], [arguments], [exclusion ID]) — e.g., the node address serves as the exclusion ID • Fork/join vector model on the XScale (see the sketch below)
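A sketch of how work is expressed in this model: each graph-node update is spawned as a lightweight hardware thread, and the node's address doubles as the exclusion key, so threads touching the same node are serialized by hardware rather than by software locks. Only the three-argument spawn primitive comes from the slides; its C binding and the surrounding names are hypothetical:

```c
/* Hypothetical C binding of the hardware spawn primitive. */
typedef struct lex_node lex_node_t;

extern void spawn(void (*pc)(void *), void *arguments, void *exclusion_id);
extern void score_node(void *node);  /* thread body: update one search node */

static void expand_node(lex_node_t **children, int n_children)
{
    for (int i = 0; i < n_children; i++)
        /* One thread per successor; passing the node address as the
         * exclusion ID makes concurrent updates to a shared node
         * (e.g., a tree join point) mutually exclusive without locks. */
        spawn(score_node, children[i], children[i]);
}
```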
Programming Model • [figure: search-graph nodes statically distributed across three memory partitions]
Analysis Framework • Multi-pipeline simulator based on SimpleScalar/ARM • Hand-parallelized copy of the CMU-Sphinx library • 11,447-word vocabulary, ~17 MB • Static load balancing via hMetis (profiled graph) • Ideal Memory System • Fixed memory latency, unlimited bandwidth • Power Model • Activity-based, component-level energy estimation • Extensive details in Appendix B
Performance • Near-ideal performance • Loss mitigated by added contexts • 40% overhead
Idealized Energy Consumption • Energy for the ideal system • Energy reduction comes from less time spent dissipating static power • Demonstrates the potential to offset the increased energy consumption of the added hardware
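One way to read this result — a first-order model in my own notation, not an equation from the slides:

```latex
E_{\mathrm{total}} \approx P_{\mathrm{static}} \cdot T_{\mathrm{exec}} + E_{\mathrm{dynamic}}
```

A parallel design that shrinks $T_{\mathrm{exec}}$ cuts the static term directly, which can offset a moderately larger dynamic term from the extra hardware.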
Latency Tolerance • Relative performance at a 100-cycle memory latency compared to a 50-cycle memory latency • Still unlimited bandwidth • Added contexts tolerate much of the added delay
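The standard multithreading arithmetic explains why — this is the textbook rule of thumb, not a figure from the slides. If each thread computes for $C$ cycles between memory accesses of latency $L$, keeping a pipeline busy needs roughly

```latex
N_{\mathrm{contexts}} \gtrsim 1 + \frac{L}{C}
```

so doubling the latency from 50 to 100 cycles roughly doubles the number of contexts needed to hide it.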
Meet the Memory Wall • High-detail 100MHz SDRAM latency simulator
Memory System Design • Decrease memory demand • Caching • Compression • Increase memory bandwidth • Increase channel width / clock rate / banking • Flash / ROM subsystem for immutable data • Embedded DRAM for mutable data • Focus on the data stream
Caching • Per-pipeline L1 data cache • [diagram: cache and cache control attached to each pipeline]
Caching • Global L2 data cache • [diagram: XScale processor and speech processor pipelines sharing the L2 cache and cache control in front of the DRAM controller]
Caching • Miss ratios in the L1 data cache stream (2K, 4-way)
Caching • Miss ratios in the L2 data cache stream (128K, 4-way)
Caching • Performance and energy-delay product (EDP) with a 128K L2
Caching • Where is this locality?
Data Compression • Ineffective at the L2 • Data elements span multiple cache lines either way • Somewhat algorithm-dependent • Great potential in the memory system • Off-chip decompression = no performance impact
DDR Memory • Performance and EDP with a 200MHz DDR memory system
DDR Memory • Gain from adding the L2 over DDR alone • Gain of L2 + DDR over L2 + SDRAM
Bandwidth Optimizations • Stream partitioning of immutable data • Dual-banked Flash / ROM required • Added latency not an issue • Significant potential energy savings • Mutable data in partitioned, on-chip embedded DRAM • Still requires a small L2 for shared metadata • 25%+ greater performance • 15-30% greater energy consumption
Power Management • What to do with extra time? • Enter low-power standby • 10% energy savings in ideal case • 2% with no frame buffering • Scale frequency / voltage • 25-30% energy savings in ideal case • 20-25% with per-frame modulation
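The gap between the two techniques follows from first-order CMOS scaling — the standard model, not a derivation from the slides. Dynamic power goes as

```latex
P_{\mathrm{dyn}} \propto C_{\mathrm{load}} \, V^{2} f
```

and lowering $f$ permits a lower $V$, so stretching each frame's work to fill its deadline saves quadratically on the dynamic term, while racing to standby only recovers the static term.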