Hardware Architectures to Support Low Power Natural I/O Applications • Rajeev Krishna • Advanced Computer Architecture Lab, University of Michigan
Why Look at Natural I/O? • Wave of the present! • Example: Basic speech recognition everywhere • Representative of Natural I/O applications • Why? • Ubiquitous computing • More versatile • What is the problem? • Computational complexity vs. available performance • Constraints of mobile computing platforms
Computation and Energy: Supply vs. Demand • Continuous, speaker-independent, large-vocabulary recognition • Embedded Processor Performance • SA1110 (200MHz) – 20 wpm, 6 hours • XScale (400MHz) – 50 wpm, 2 hours
Computation and Energy: Supply vs. Demand • Mobile (laptop) Processor Performance • PIII (1GHz) – 200 wpm, 6 minutes • Performance vs. Accuracy vs. Energy
Outline of Presentation • Speech Recognition Theory • Architectural and Programming Model • Architectural Evaluation • Memory System Design • Power Management • Conclusions
Algorithmic Challenges • What is so hard about speech recognition? • Time Warping • Co-articulations • Boundary Identification • Word Selection • Imagine listening in a noisy room • Estimate likely sounds • Apply context clues • Guess
The Process • DSP signal processing (straightforward) • Pattern mapping to the knowledge base (the key behaviors) • Acoustic scoring • Linguistic scoring — sketched below
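A minimal sketch of these stages as a per-frame pipeline. The function names and the toy scoring math are illustrative placeholders I am assuming for exposition, not the CMU-Sphinx implementation:

```c
/* Hypothetical per-frame recognition pipeline -- a sketch of the
 * stages above, not the actual Sphinx code path. */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

#define NUM_FEATS   13  /* e.g., MFCC coefficients per frame */
#define NUM_SENONES 8   /* toy acoustic-model size */

/* Stage 1: DSP front end (straightforward) -- reduce a window of raw
 * samples to a small feature vector. A real front end uses FFT + mel
 * filter banks; a dummy log-energy measure stands in here. */
static void extract_features(const short *samples, size_t n,
                             float feats[NUM_FEATS])
{
    float energy = 0.0f;
    for (size_t i = 0; i < n; i++)
        energy += (float)samples[i] * (float)samples[i];
    for (int j = 0; j < NUM_FEATS; j++)
        feats[j] = logf(1.0f + energy / (float)(j + 1));
}

/* Stage 2: acoustic scoring -- rate each acoustic unit against the
 * feature vector (placeholder distance metric; higher = more likely). */
static void acoustic_score(const float feats[NUM_FEATS],
                           float scores[NUM_SENONES])
{
    for (int s = 0; s < NUM_SENONES; s++) {
        float d = 0.0f;
        for (int j = 0; j < NUM_FEATS; j++) {
            float diff = feats[j] - (float)s;
            d += diff * diff;
        }
        scores[s] = -d;
    }
}

int main(void)
{
    short window[160] = {0};  /* 10ms of 16kHz audio */
    float feats[NUM_FEATS], scores[NUM_SENONES];

    extract_features(window, 160, feats);
    acoustic_score(feats, scores);
    /* Stage 3, linguistic scoring, consumes `scores` each frame --
     * see the search sketch later in the deck. */
    printf("senone 0 score: %f\n", scores[0]);
    return 0;
}
```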
Linguistic Search • Example: matching "Their Car" against the phone sequence DH EH R [word] K AA R • [animation: the lexical search tree grows frame by frame from the root phone DH, weighted by transition probabilities such as P("DH"); candidate paths branch into competing words ("Their", "The", "Ear", "Cap", "Cat", ...) until the phone lattice covers thousands of active nodes]
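Under the hood this search is dynamic programming over the phone lattice. Here is a minimal sketch of one propagation step; the data layout (`node_t`, `propagate`) is my illustrative assumption, not Sphinx's internal structure:

```c
/* One step of the linguistic search: each active node of the lexical
 * tree extends its best path score into its children, adding a
 * transition log-probability and the child phone's acoustic score for
 * this frame (Viterbi-style dynamic programming). */
typedef struct node {
    int           phone_id;    /* index into this frame's acoustic scores */
    float         path_score;  /* best log-probability of any path ending here */
    struct node **children;    /* next phones in the lexical tree */
    int           n_children;
} node_t;

/* Assumes every child's path_score was reset to -INFINITY at the
 * start of the frame, so the first incoming path always wins. */
static void propagate(node_t *parent, const float *acoustic_scores,
                      float trans_logp)
{
    for (int i = 0; i < parent->n_children; i++) {
        node_t *c = parent->children[i];
        float s = parent->path_score + trans_logp
                + acoustic_scores[c->phone_id];
        if (s > c->path_score)
            c->path_score = s;  /* keep only the best incoming path */
    }
}
```

Each frame touches thousands of such nodes with near-random addresses, which is exactly the poor-locality, high-concurrency behavior characterized on the next slide.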
General Characteristics • Poor Memory Performance • Large memory footprint • Little locality in reference stream • Little low-level predictability • Thread-Level Concurrency • 1,000s to 10,000s of active nodes per iteration • Relatively little interdependence
Target Model • Exploit Concurrency • Fine-grain thread management • Minimal communication • Parallel execution • Tolerate Latency • Maximize processor utilization • Hardware Multithreading • Runtime Adaptation • Unknown, input-driven behavior • Dynamic Programming Model
Architectural Model – Overview • Base XScale 400MHz embedded processor • Speech processing unit • Memory system interface
Architectural Model – Processing Element • Execution model based on simple integer pipeline • Per-thread register contexts • Control logic / Work Queue • Small cache
Programming Model • Maximum concurrency, minimum communication, dynamic • Expose all reasonable concurrency to hardware • Initial static workload distribution + dynamic balancing • Key-based lock-less fine-grain mutual exclusion • Primitive: spawn([PC], [arguments], [exclusion ID]) — e.g., the node address serves as the exclusion ID • Fork/join vector model on the XScale (see the sketch below)
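A sketch of how work is expressed in this model: each graph-node update is spawned as a lightweight hardware thread, and the node's address doubles as the exclusion key, so threads touching the same node are serialized by hardware rather than by software locks. Only the three-argument spawn primitive comes from the slides; its C binding and the surrounding names are hypothetical:

```c
/* Hypothetical C binding of the hardware spawn primitive. */
typedef struct lex_node lex_node_t;

extern void spawn(void (*pc)(void *), void *arguments, void *exclusion_id);
extern void score_node(void *node);  /* thread body: update one search node */

static void expand_node(lex_node_t **children, int n_children)
{
    for (int i = 0; i < n_children; i++)
        /* One thread per successor; passing the node address as the
         * exclusion ID makes concurrent updates to a shared node
         * (e.g., a tree join point) mutually exclusive without locks. */
        spawn(score_node, children[i], children[i]);
}
```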
Programming Model • [figure: search-graph nodes statically distributed across three memory partitions]
Analysis Framework • Multi-pipeline simulator based on SimpleScalar/ARM • Hand-parallelized copy of the CMU-Sphinx library • 11,447-word vocabulary, ~17 MB • Static load balancing via hMetis (profiled graph) • Ideal Memory System • Fixed memory latency, unlimited bandwidth • Power Model • Activity-based, component-level energy estimation • Extensive details in Appendix B
Performance • Near-ideal performance • Loss mitigated by added contexts • 40% overhead
Idealized Energy Consumption • Energy for the ideal system • Energy reduction comes from less time spent dissipating static power • Demonstrates the potential to offset the increased energy consumption of the added hardware
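One way to read this result — a first-order model in my own notation, not an equation from the slides:

```latex
E_{\mathrm{total}} \approx P_{\mathrm{static}} \cdot T_{\mathrm{exec}} + E_{\mathrm{dynamic}}
```

A parallel design that shrinks $T_{\mathrm{exec}}$ cuts the static term directly, which can offset a moderately larger dynamic term from the extra hardware.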
Latency Tolerance • Relative performance at a 100-cycle memory latency compared to a 50-cycle memory latency • Still unlimited bandwidth • Added contexts tolerate much of the added delay
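The standard multithreading arithmetic explains why — this is the textbook rule of thumb, not a figure from the slides. If each thread computes for $C$ cycles between memory accesses of latency $L$, keeping a pipeline busy needs roughly

```latex
N_{\mathrm{contexts}} \gtrsim 1 + \frac{L}{C}
```

so doubling the latency from 50 to 100 cycles roughly doubles the number of contexts needed to hide it.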
Meet the Memory Wall • High-detail 100MHz SDRAM latency simulator
Memory System Design • Decrease memory demand • Caching • Compression • Increase memory bandwidth • Increase channel width / clock rate / banking • Flash / ROM subsystem for immutable data • Embedded DRAM for mutable data • Focus on the data stream
Caching • Per-pipeline L1 data cache • [diagram: cache and cache control attached to each pipeline]
Caching • Global L2 data cache • [diagram: XScale processor and speech processor pipelines sharing the L2 cache and cache control in front of the DRAM controller]
Caching • Miss ratios in the L1 data cache stream (2K, 4-way)
Caching • Miss ratios in the L2 data cache stream (128K, 4-way)
Caching • Performance and energy-delay product (EDP) with a 128K L2
Caching • Where is this locality?
Data Compression • Ineffective at the L2 • Data elements span multiple cache lines either way • Somewhat algorithm-dependent • Great potential in the memory system • Off-chip decompression = no performance impact
DDR Memory • Performance and EDP with a 200MHz DDR memory system
DDR Memory • Gain from adding the L2 over DDR alone • Gain of L2 + DDR over L2 + SDRAM
Bandwidth Optimizations • Stream partitioning of immutable data • Dual-banked Flash / ROM required • Added latency not an issue • Significant potential energy savings • Mutable data in partitioned, on-chip embedded DRAM • Still requires a small L2 for shared metadata • 25%+ greater performance • 15-30% greater energy consumption
Power Management • What to do with extra time? • Enter low-power standby • 10% energy savings in ideal case • 2% with no frame buffering • Scale frequency / voltage • 25-30% energy savings in ideal case • 20-25% with per-frame modulation
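The gap between the two techniques follows from first-order CMOS scaling — the standard model, not a derivation from the slides. Dynamic power goes as

```latex
P_{\mathrm{dyn}} \propto C_{\mathrm{load}} \, V^{2} f
```

and lowering $f$ permits a lower $V$, so stretching each frame's work to fill its deadline saves quadratically on the dynamic term, while racing to standby only recovers the static term.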