
Explicit HW and SW Hierarchies: High-Level Abstractions for giving the system what it wants

Mattan Erez, The University of Texas at Austin. Salishan 2011.


Presentation Transcript


  1. Explicit HW and SW Hierarchies: High-Level Abstractions for giving the system what it wants. Mattan Erez, The University of Texas at Austin. Salishan 2011.

  2. (c) Mattan Erez, UT Austin. Power and reliability bound performance
  • More and more components
  • Per-component improvement too slow
  [Figure: system power versus machine scale; power axis from 1 kW to 1 GW against Tera-, Peta-, and Exa-scale systems]

  3. (c) Mattan Erez, UT Austin. Power and reliability bound performance
  • More and more components
  • Per-component improvement too slow
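For a sense of scale on the power axis in slide 2's figure, a back-of-the-envelope sketch; the 2 GFLOPS/W sustained efficiency is an assumed, roughly 2011-era figure, not a number from the talk:

  #include <cstdio>

  int main() {
    // Assumed sustained efficiency (illustrative, ~2011-era hardware).
    const double gflops_per_watt = 2.0;

    // 1 PFLOPS = 1e6 GFLOPS, 1 EFLOPS = 1e9 GFLOPS.
    const double petaflop_watts = 1e6 / gflops_per_watt;
    const double exaflop_watts  = 1e9 / gflops_per_watt;

    std::printf("1 PFLOPS: %.0f kW\n", petaflop_watts / 1e3);  // ~500 kW
    std::printf("1 EFLOPS: %.0f MW\n", exaflop_watts  / 1e6);  // ~500 MW
  }

Scaling the same parts from peta to exa multiplies power by a thousand, which is why the figure's axis runs from kilowatts to a gigawatt: without per-component efficiency gains, an exascale machine lands in the hundreds of megawatts.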

  4. (c) Mattan Erez, UT Austin. What can we do?
  • Compute less and store less
    • Use better algorithms
  • Specialize more
    • But still innovate on algorithms
  • Waste less
    • Minimize movement
    • Dynamically rebalance hardware
  • Efficient resiliency for reliability
    • Minimize redundancy
    • Trade off inherent reliability and resiliency

  5. (c) Mattan Erez, UT Austin. Power is a zero-sum game
  • Trade off control, compute, storage, and communication
  • Dense algebra
  • Large sparse data
  • Building data structures

  6. (c) Mattan Erez, UT Austin. Hierarchy enables HW/SW co-tuning and co-design
  • Hierarchy as common abstraction for HW and SW
  • Basic engineering
  • Match abstractions
  • Portability to ensure progress
  • Co-design cycle
  • Portability to ensure efficiency
  • Co-tune for proportionality

  7. Hardware hierarchy – locality
  • Communication and storage dominate energy
  • Closer and smaller == better
  • Amortize cost of global operations
  [Figure: per-operation energy at 28nm on a 20 mm die. 64-bit DP operation: 20 pJ; 256-bit access to 8 kB SRAM: 50 pJ; 256-bit buses: 26 pJ locally, 256 pJ across the die; efficient off-chip link: 500 pJ; DRAM read/write: 16 nJ]
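A quick consequence of these numbers, as a minimal sketch (charging two 256-bit operand fetches per operation, and reading the 256 pJ bus number as the cross-die cost, are illustrative assumptions):

  #include <cstdio>

  int main() {
    // Per-operation energies from the figure (28nm, picojoules).
    const double dp_op_pj      = 20.0;     // 64-bit DP arithmetic op
    const double sram_256b_pj  = 50.0;     // 256-bit access to 8 kB SRAM
    const double cross_chip_pj = 256.0;    // 256-bit bus across the 20 mm die
    const double dram_rw_pj    = 16000.0;  // DRAM read/write

    // Cost of feeding one op with two 256-bit operands, relative to
    // the 20 pJ the arithmetic itself costs.
    std::printf("from local SRAM:  %4.0fx\n", 2 * sram_256b_pj  / dp_op_pj);
    std::printf("from across chip: %4.0fx\n", 2 * cross_chip_pj / dp_op_pj);
    std::printf("from DRAM:        %4.0fx\n", 2 * dram_rw_pj    / dp_op_pj);
  }

Data motion, not arithmetic, dominates: operands from DRAM cost three orders of magnitude more than the math, which is the quantitative case for "closer and smaller == better".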

  8. (c) Mattan Erez, UT Austin. Locality hierarchy “minimizes” hardware
  • Efficiency/performance tradeoffs
  • Efficiency goes up as BW goes down

  9. (c) Mattan Erez, UT Austin. Hardware hierarchy – control
  • Specialization is a form of hierarchy
  • Amortize SW control decisions in HW
  • Sophisticated high-level control
    • Dynamic rebalancing
  • Simple low-level control
    • Minimize hardware waste
  • How far can we push this?

  10. Hierarchical HW → hierarchical SW
  • Hierarchy is least abstract common denominator
  [Figure: the same matmul task tree mapped onto four machines. Dual-core PC: main memory, shared L2 cache, per-core L1 caches, ALUs. 4-node cluster of PCs: aggregate cluster memory (virtual level), per-node memory, L2, L1, ALUs. System with a GPU: main memory, GPU memory, streaming multiprocessors (SMs), ALUs. Cluster of dual Cell blades: aggregate cluster memory (virtual level), node memory, per-SPE local stores (LS), ALUs. At each machine level a matmul variant runs at matching granularity: matmul (large matrix mult) at the top, matmul_L2 (256x256 matrix mult) in the middle, matmul_L1 (32x32 matrix mult) at the leaves.]

  11. Task hierarchies

  task matmul::inner( in float A[M][T], in float B[T][N],
                      inout float C[M][N] )
  {
    tunable int P, Q, R;
    mappar( int i=0 to M/P, int j=0 to N/R ) {
      mapseq( int k=0 to T/Q ) {
        matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
                B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
                C[P*i:P*(i+1);P][R*j:R*(j+1);R] );
      }
    }
  }

  task matmul::leaf( in float A[M][T], in float B[T][N],
                     inout float C[M][N] )
  {
    for (int i=0; i<M; i++)
      for (int j=0; j<N; j++)
        for (int k=0; k<T; k++)
          C[i][j] += A[i][k] * B[k][j];
  }

  Variant call graph: matmul::inner → matmul::inner | matmul::leaf

  12. Task hierarchies (same matmul::inner / matmul::leaf code as slide 11)

  Calling task: matmul::inner, located at level X
  Callee task:  matmul::leaf, located at level Y
  The argument blocks A, B, and C are moved from level X down to level Y for the call.
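The code on these two slides is Sequoia-style, not standard C. As orientation, here is a rough C++ analogue of the same pattern under simplifying assumptions (sequential execution, row-major arrays, block sizes that divide the matrix dimensions; the function names are illustrative): an inner variant that partitions work into blocks sized by the tunables P, Q, R, and a leaf variant that does the arithmetic.

  // Leaf variant: plain matrix multiply on one block, sized so its
  // working set fits the target level of the memory hierarchy.
  void matmul_leaf(const float* A, const float* B, float* C,
                   int M, int N, int T, int lda, int ldb, int ldc) {
    for (int i = 0; i < M; i++)
      for (int j = 0; j < N; j++)
        for (int k = 0; k < T; k++)
          C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
  }

  // Inner variant: partitions C into P x R blocks (the mappar loops,
  // independent and parallelizable) and walks the reduction dimension
  // in Q-sized steps (the mapseq loop, ordered). Assumes P|M, Q|T, R|N.
  void matmul_inner(const float* A, const float* B, float* C,
                    int M, int N, int T, int P, int Q, int R) {
    for (int i = 0; i < M / P; i++)         // mappar
      for (int j = 0; j < N / R; j++)       // mappar
        for (int k = 0; k < T / Q; k++)     // mapseq
          matmul_leaf(&A[(i * P) * T + k * Q],
                      &B[(k * Q) * N + j * R],
                      &C[(i * P) * N + j * R],
                      P, R, Q, /*lda=*/T, /*ldb=*/N, /*ldc=*/N);
  }

In Sequoia the recursion depth and the tunables are chosen per machine level by the compiler and runtime, which is what lets the same source map onto all four machines in slide 10.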

  13. (c) Mattan Erez, UT Austin. Hierarchical software enables efficiency
  • Portability
    • Hierarchy is least abstract common denominator
    • It’s what systems want
  • Proportionality
    • Co-tune hardware and software
    • Path to true efficiency
  • Co-design cycles
    • Maintain efficiency with new technology
  • How strict is the hierarchy?

  14. (c) Mattan Erez, NVIDIA. Hierarchical software enables co-tuning
  • Locality profiles drive dynamic rebalancing

  15. (c) Mattan Erez, UT Austin. Proportional and efficient resiliency
  • Resiliency principles:
    • Detect fault
    • Correct erroneous data if possible
    • Contain fault
    • Repair/reconfigure
    • Restore state and re-execute
  • Each step can be improved with co-tuning
    • Ignore certain faults (allow some errors)
    • Detect at coarse granularity
    • Contain where cheapest
    • Re-map application instead of repairing/reconfiguring hardware
    • Preserve and restore minimally and effectively

  16. (c) Mattan Erez, UT Austin. Hierarchical resiliency – containment domains
  • Containment domains enable proportionality
    • Match locality hierarchy with resiliency hierarchy
    • Efficient state preservation and restoration
    • Predictable (minimal) overhead
  • Hierarchy provides natural domains for managing faults (and rebalancing)
  • Co-tune resiliency scheme in HW and SW
    • Range of hardware error detection and correction mechanisms
    • Mechanisms introduce minimal overhead when not in use

  17. (c) Mattan Erez, UT Austin. Containment Domains: a full-system approach to resiliency
  • Hierarchy provides natural domains for containing faults
  • Containment domains enable software-controlled resilience
    • Preserve data on domain start
    • Detect faults before domain commits
    • Recover: restore data and re-execute when necessary
  • Arbitrary nesting: tasks, functions, loop iterations, instructions
  • Amenable to compiler analysis
  • Constructs for programmer tuning
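As a concrete reading of "preserve on start, detect before commit, recover when necessary", here is a minimal sketch of what a containment domain could look like in software; the ContainmentDomain type and its hooks are hypothetical illustrations, not an API from the talk:

  #include <functional>
  #include <stdexcept>

  // Hypothetical containment domain: preserve state on entry, run the
  // body, run a user-supplied detector before committing, and restore
  // plus re-execute locally if a fault is detected.
  struct ContainmentDomain {
    std::function<void()> preserve;  // e.g. copy inputs to NV memory
    std::function<void()> restore;   // undo partial updates from the preserved copy
    std::function<bool()> detect;    // domain-specific check; true means fault-free

    template <typename Body>
    void run(Body body, int max_retries = 3) {
      preserve();
      for (int attempt = 0; attempt <= max_retries; ++attempt) {
        body();                      // may itself run nested child domains
        if (detect()) return;        // commit: the parent never sees the fault
        restore();                   // contained: recover locally, then retry
      }
      // Local recovery failed: escalate to the enclosing domain.
      throw std::runtime_error("containment domain escalation");
    }
  };

Nesting falls out naturally: a body can run child domains of its own, and the throw at the end is the escalation path to the enclosing domain. The detect hook is also where slide 18's tunable protection plugs in, since each domain can carry its own, possibly algorithm-specific, detector.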

  18. (c) Mattan Erez, UT Austin. Tunable error protection
  • High AMTTI (application mean time to interrupt) requires strong error protection
  • Global redundancy overhead can be high
    • Hardware mechanisms can help
    • Can do even better with software control
  • Containment domains enable specialized protection
    • Each domain can have unique detection routine
    • May even be scenario specific
    • Redundancy can be added at any granularity

  19. (c) Mattan Erez, UT Austin. State preservation and restoration
  • Match storage hierarchy
  • Utilize NV memory
  • Explicit software control
  • Trade off overheads: storage, local and global bandwidth, recomputation, complexity, and effort

  20. (c) Mattan Erez, UT Austin. Faults and default behaviors encompass current approaches
  • Soft memory errors
    • Detect: hardware ECC
    • Recover: retry; if that fails, restore and re-execute
  • Hard memory fault
    • Detect: runtime liveness
    • Recover: map out bad memory; if enough space remains, recover and re-execute, else escalate failure
  • Soft arithmetic error
    • Detect: user-selectable (duplicated execution in HW/SW, other HW techniques, algorithm-specific assert)
    • Recover: retry; if that fails, restore and re-execute
  • Soft control errors
    • Detect: user-selectable signatures, implicit exceptions
    • Recover: restore, re-execute
  • Hard compute fault
    • Detect: runtime liveness
    • Recover: map out bad PE; if OK without the resource or a spare is available, recover and re-execute, else escalate failure
  • High-level unhandled faults
    • Detect: runtime heartbeat
    • Recover: escalate failure
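Read as a policy table, each fault class on this slide pairs a detection mechanism with a default recovery action. A compact sketch of that pairing (the enum and function names are illustrative, not a real runtime's API):

  enum class Fault    { SoftMemory, HardMemory, SoftArithmetic,
                        SoftControl, HardCompute, Unhandled };
  enum class Recovery { Retry, RestoreReexec, MapOutAndReexec, Escalate };

  // First-choice recovery per fault class, mirroring the slide; a real
  // runtime falls through to the next policy when one fails (retry,
  // then restore/re-execute, then escalate to the enclosing domain).
  Recovery default_policy(Fault f) {
    switch (f) {
      case Fault::SoftMemory:     return Recovery::Retry;           // detect: hardware ECC
      case Fault::SoftArithmetic: return Recovery::Retry;           // detect: user-selectable
      case Fault::SoftControl:    return Recovery::RestoreReexec;   // detect: signatures/exceptions
      case Fault::HardMemory:     return Recovery::MapOutAndReexec; // detect: runtime liveness
      case Fault::HardCompute:    return Recovery::MapOutAndReexec; // detect: runtime liveness
      case Fault::Unhandled:      return Recovery::Escalate;        // detect: runtime heartbeat
    }
    return Recovery::Escalate;  // unreachable; keeps compilers happy
  }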

  21. (c) Mattan Erez, UT Austin. Containment domains example

  void task<inner> SpMV( in matrix, in veci, out resi )
  {
    forall(…) reduce(…)
      SpMV( matrix[…], veci[…], resi[…] );
  }
  preserve { preserve_NV(matrix); }  // inner
  restore_for_child { … }

  void task<leaf> SpMV(…)
  {
    for r = 0..N
      for c = rowS[r]..rowS[r+1] {
        contain { resi[r] += data[c] * veci[cIdx[c]]; }
        check   { fault<fail>(c > prevC); }
        prevC = c;
      }
  }
  preserve { preserve_NV(matrix); }  // leaf
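The check clause here is a cheap control-flow detector: the inner loop counter c must increase monotonically across a CSR traversal, so a violation signals a soft control error before the domain commits. A rough C++ analogue of the leaf task (the CSR field names follow the slide; the array types and the bool return convention are assumptions):

  // Leaf SpMV over a CSR matrix with a per-iteration sanity check.
  // Returns false on a detected fault so the enclosing domain can
  // restore the preserved inputs and re-execute.
  bool spmv_leaf(int N, const int* rowS, const int* cIdx,
                 const double* data, const double* veci, double* resi) {
    int prevC = -1;
    for (int r = 0; r < N; r++) {
      for (int c = rowS[r]; c < rowS[r + 1]; c++) {
        resi[r] += data[c] * veci[cIdx[c]];  // contain { ... }
        if (!(c > prevC)) return false;      // check { fault<fail>(c > prevC); }
        prevC = c;
      }
    }
    return true;  // commit
  }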

  22. (c) Mattan Erez, UT Austin. Summary
  • Hierarchy is basic engineering approach
    • Works for hardware and works for software
  • Hierarchy is inevitable
    • Minimize movement
    • Amortize control
  • Match explicit hierarchies in HW and SW
    • Lowest abstract common denominator
  • Natural domains and boundaries enable: co-design, co-tuning, dynamic rebalancing, resiliency
