Mattan Erez, The University of Texas at Austin, Salishan 2011. Explicit HW and SW Hierarchies: High-Level Abstractions for giving the system what it wants
Power and reliability bound performance • More and more components • Per-component improvement too slow • [Figure: projected power vs. machine scale (Tera, Peta, Exa), spanning roughly 1 kW to 1 GW]
What can we do? • Compute less and store less • Use better algorithms • Specialize more • But still innovate on algorithms • Waste less • Minimize movement • Dynamically rebalance hardware • Efficient resiliency for reliability • Minimize redundancy • Trade off inherent reliability and resiliency
Power is a zero-sum game • Trade off control, compute, storage, and communication • Dense algebra • Large sparse data • Building data structures
Hierarchy enables HW/SW co-tuning and co-design • Hierarchy as common abstraction for HW and SW • Basic engineering • Match abstractions • Portability to ensure progress • Co-design cycle • Portability to ensure efficiency • Co-tune for proportionality
Hardware hierarchy – locality • Communication and storage dominate energy • Closer and smaller == better • Amortize cost of global operations • [Figure: approximate energy per operation on a 28nm, 20mm chip: a 64-bit DP operation, 256-bit on-chip buses, a 256-bit access to an 8 kB SRAM, an efficient off-chip link, and a DRAM read/write, with costs ranging from tens of pJ on-chip to roughly 16 nJ for DRAM]
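To make the point concrete, here is a back-of-the-envelope C++ sketch (not from the talk) comparing a computation whose operands stream from DRAM against a blocked version that stages data in local SRAM and reuses it. The per-word energies and the reuse factor are assumptions loosely based on the figures above; no exact mapping of the slide's numbers is implied.

#include <cstdio>

int main() {
    // Assumed per-operation energies in picojoules (illustrative only).
    const double e_flop_dp   = 20.0;            // one 64-bit DP operation
    const double e_sram_word = 50.0;            // one 64-bit word from a small local SRAM
    const double e_dram_word = 16000.0 / 4.0;   // 16 nJ per 256-bit DRAM access, per 64-bit word

    const double n = 1e6;      // number of multiply-adds
    const double reuse = 32.0; // times each staged word is reused (e.g., tiled matmul)

    // Naive: both operands of every multiply-add come from DRAM.
    double naive = n * (2 * e_dram_word + e_flop_dp);
    // Blocked: each word is fetched from DRAM once per 'reuse' uses, then read from SRAM.
    double blocked = n * (2 * e_dram_word / reuse + 2 * e_sram_word + e_flop_dp);

    printf("naive:   %.0f uJ\n", naive / 1e6);
    printf("blocked: %.0f uJ\n", blocked / 1e6);
}

Even with generous assumptions, the arithmetic itself is a small fraction of the total; the budget is dominated by how far the operands travel, which is exactly why the locality hierarchy matters.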
Locality hierarchy "minimizes" hardware • Efficiency/performance tradeoffs • Efficiency goes up as bandwidth demand goes down
Hardware hierarchy – control • Specialization is a form of hierarchy • Amortize SW control decisions in HW • Sophisticated high-level control • Dynamic rebalancing • Simple low-level control • Minimize hardware waste • How far can we push this?
Hierarchical HW, hierarchical SW • Hierarchy is the least abstract common denominator • [Figure: memory hierarchies of four machines (a dual-core PC, a 4-node cluster of PCs, a system with a GPU, and a cluster of dual Cell blades), from aggregate cluster memory through main memory, caches, GPU memory, and local stores down to the ALUs, each mapped onto the same hierarchical decomposition of a large matrix multiply into 256x256 matmul_L2 tasks and 32x32 matmul_L1 tasks]
Task hierarchies

task matmul::inner( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  tunable int P, Q, R;
  mappar( int i=0 to M/P, int j=0 to N/R ) {
    mapseq( int k=0 to T/Q ) {
      matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
              B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
              C[P*i:P*(i+1);P][R*j:R*(j+1);R] );
    }
  }
}

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
{
  for (int i=0; i<M; i++)
    for (int j=0; j<N; j++)
      for (int k=0; k<T; k++)
        C[i][j] += A[i][k] * B[k][j];
}

[Figure: variant call graph for matmul, with variants matmul::inner and matmul::leaf]
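For readers who do not know the task language above (the code is written in a Sequoia-style notation), here is a minimal plain C++ sketch, not from the talk, of the same inner/leaf split: the inner routine partitions the problem into blocks and the leaf routine runs the flat triple loop on one block. The block sizes P, Q, R stand in for the tunables, and the parallel/sequential mapping of the loops is only noted in comments.

#include <algorithm>
#include <cstddef>

// Row-major matrices: A is MxT, B is TxN, C is MxN.

// "Leaf" task: flat triple loop over one block, run at the lowest level.
void matmul_leaf(const double* A, const double* B, double* C,
                 std::size_t M, std::size_t N, std::size_t T,
                 std::size_t ldA, std::size_t ldB, std::size_t ldC) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < T; ++k)
                C[i * ldC + j] += A[i * ldA + k] * B[k * ldB + j];
}

// "Inner" task: partition into P x R output blocks with Q-wide k panels
// (stand-ins for the tunables) and call the leaf on each block. In the task
// language the i/j loops are parallel (mappar) and the k loop is sequential (mapseq).
void matmul_inner(const double* A, const double* B, double* C,
                  std::size_t M, std::size_t N, std::size_t T,
                  std::size_t P, std::size_t Q, std::size_t R) {
    for (std::size_t i = 0; i < M; i += P)
        for (std::size_t j = 0; j < N; j += R)
            for (std::size_t k = 0; k < T; k += Q)
                matmul_leaf(A + i * T + k, B + k * N + j, C + i * N + j,
                            std::min(P, M - i), std::min(R, N - j), std::min(Q, T - k),
                            T, N, N);
}

Calling matmul_inner with P = Q = R = 32 corresponds to the 32x32 matmul_L1 leaves in the figure; the deeper hierarchies are obtained by letting an intermediate level call another inner variant on 256x256 blocks before reaching the leaf.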
Task hierarchies (mapping tasks to machine levels) • Same matmul::inner / matmul::leaf code as above • [Figure: the calling task matmul::inner is located at memory level X and holds A, B, and C; the callee task matmul::leaf is located at level Y and operates on subblocks of A, B, and C]
Hierarchical software enables efficiency • Portability • Hierarchy is least abstract common denominator • It's what systems want • Proportionality • Co-tune hardware and software • Path to true efficiency • Co-design cycles • Maintain efficiency with new technology • How strict is the hierarchy?
Hierarchical software enables co-tuning • Locality profiles drive dynamic rebalancing
Proportional and efficient resiliency • Resiliency principles: • Detect fault • Correct erroneous data if possible • Contain fault • Repair/reconfigure • Restore state and re-execute • Each step can be improved with co-tuning • Ignore certain faults (allow some errors) • Detect at coarse granularity • Contain where cheapest • Re-map application instead of repairing/reconfiguring hardware • Preserve and restore minimally and effectively
Hierarchical resiliency – containment domains • Containment domains enable proportionality • Match locality hierarchy with resiliency hierarchy • Efficient state preservation and restoration • Predictable (minimal) overhead • Hierarchy provides natural domains for managing faults (and rebalancing) • Co-tune resiliency scheme in HW and SW • Range of hardware error detection and correction mechanisms • Mechanisms introduce minimal overhead when not in use
Containment Domains: a full-system approach to resiliency • Hierarchy provides natural domains for containing faults • Containment domains enable software-controlled resilience • Preserve data on domain start • Detect faults before domain commits • Recover: restore data and re-execute when necessary • Arbitrary nesting • Tasks • Functions • Loop iterations • Instructions • Amenable to compiler analysis • Constructs for programmer tuning
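As an illustration only (this is not the Containment Domains runtime API, and run_in_domain, body, and check are hypothetical names), the following C++ sketch shows the preserve / execute / detect / restore-and-re-execute pattern that one containment domain wraps around a nested unit of work.

#include <exception>
#include <functional>

// Hypothetical sketch of one containment domain around a unit of work.
// How 'state' is preserved (a parent's copy, local memory, NV memory) is a
// separate tuning decision and is reduced here to a plain copy.
template <typename State>
bool run_in_domain(State& state,
                   const std::function<void(State&)>& body,         // nested work
                   const std::function<bool(const State&)>& check,  // domain-specific detection
                   int max_retries = 3) {
    const State preserved = state;             // preserve data on domain start
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        try {
            body(state);                       // execute the nested work
            if (check(state)) return true;     // detect faults before the domain commits
        } catch (const std::exception&) {
            // a fault surfaced as an exception: fall through to recovery
        }
        state = preserved;                     // recover: restore preserved data and re-execute
    }
    return false;                              // could not recover locally: escalate to parent
}

Nesting falls out naturally: body can itself call run_in_domain on its children, so a child that cannot recover simply returns false and lets its parent restore a larger preserved state and re-execute, which is the escalation behavior described on the fault-handling slide below.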
Tunable error protection • High AMTTI (application mean time to interrupt) requires strong error protection • Global redundancy overhead can be high • Hardware mechanisms can help • Can do even better with software control • Containment domains enable specialized protection • Each domain can have a unique detection routine • May even be scenario specific • Redundancy can be added at any granularity
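As one example of a domain-specific detection routine (an ABFT-style check of my own choosing, not taken from the talk), a domain wrapping a dense matrix-vector product y = A*x can compare the sum of y against the dot product of the column sums of A with x instead of duplicating the whole computation. The tolerance and names below are assumptions.

#include <cmath>
#include <cstddef>
#include <vector>

// Checksum detection for y = A * x, with A stored row-major as an n x n matrix.
// If the computation was error-free, sum(y) equals dot(colsum(A), x) up to
// rounding. In practice the column sums would be computed once and reused
// across many products with the same A, making the check far cheaper than
// duplicated execution.
bool matvec_checksum_ok(const std::vector<double>& A,
                        const std::vector<double>& x,
                        const std::vector<double>& y,
                        std::size_t n, double rel_tol = 1e-10) {
    double expected = 0.0, actual = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
        double colsum = 0.0;
        for (std::size_t i = 0; i < n; ++i) colsum += A[i * n + j];
        expected += colsum * x[j];
    }
    for (std::size_t i = 0; i < n; ++i) actual += y[i];
    return std::fabs(actual - expected) <= rel_tol * (std::fabs(expected) + 1.0);
}

Each domain can plug in whatever routine fits its computation: a residual check for a solver, a structural check for a data structure, duplicated execution where nothing cheaper exists, or no check at all where occasional errors are acceptable.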
State preservation and restoration • Match the storage hierarchy • Utilize non-volatile (NV) memory • Explicit software control • Trade off overheads: storage, local and global bandwidth, recomputation, complexity, and effort
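One simple way to reason about these trade-offs (a model of my own, not from the talk): if a domain preserves its inputs, executes, and on a failure with probability p restores and re-executes from the top, its expected time t satisfies t = t_preserve + t_exec + p*(t_restore + t), which gives the closed form used in the sketch below. Preserving to slower storage such as NV memory raises t_preserve and t_restore but may be the only level with enough capacity or persistence.

#include <cstdio>

// Expected time of a domain that preserves, executes, and on failure
// (probability p per attempt, 0 <= p < 1) restores and re-executes.
// Solves t = t_preserve + t_exec + p * (t_restore + t).
double expected_domain_time(double t_preserve, double t_exec,
                            double t_restore, double p) {
    return (t_preserve + t_exec + p * t_restore) / (1.0 - p);
}

int main() {
    // Illustrative numbers only: fast local preservation vs. NV preservation
    // for a domain that runs for 1 second and fails once per thousand attempts.
    printf("local preserve: %.4f s\n", expected_domain_time(0.01, 1.0, 0.01, 0.001));
    printf("NV preserve:    %.4f s\n", expected_domain_time(0.10, 1.0, 0.10, 0.001));
}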
Faults and default behavior encompass current approaches • Soft memory errors: Detect: hardware ECC; Recover: retry, and if that fails restore and re-execute • Hard memory fault: Detect: runtime liveness; Recover: map out the bad memory, and if enough space remains recover and re-execute, else escalate the failure • Soft arithmetic error: Detect: user-selectable (duplicated execution in HW or SW, other HW techniques, algorithm-specific assert); Recover: retry, and if that fails restore and re-execute • Soft control errors: Detect: user-selectable signatures, implicit exceptions; Recover: restore and re-execute • Hard compute fault: Detect: runtime liveness; Recover: map out the bad PE, and if the application is OK without the resource or a spare is available recover and re-execute, else escalate the failure • High-level unhandled faults: Detect: runtime heartbeat; Recover: escalate the failure
Containment domains example

void task<inner> SpMV( in matrix, in veci, out resi )
{
  forall(…) reduce(…)
    SpMV( matrix[…], veci[…], resi[…] );
}
preserve { preserve_NV(matrix); }   //inner
restore_for_child { … }

void task<leaf> SpMV(…)
{
  for r = 0..N
    for c = rowS[r]..rowS[r+1] {
      contain { resi[r] += data[c] * veci[cIdx[c]]; }
      check { fault<fail>(c > prevC); }
      prevC = c;
    }
}
preserve { preserve_NV(matrix); }   //leaf
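For readers unfamiliar with the task syntax, here is a rough plain C++ rendering (an illustration, not the Containment Domains API) of the leaf task: a CSR sparse matrix-vector product in which each row is treated as a small contained unit, its contribution is checked with a cheap sanity test on the column indices (in the spirit of the fault<fail>(c > prevC) check above), and only then committed. The CSR structure and the exact check are assumptions.

#include <cstddef>
#include <vector>

// CSR sparse matrix: row r's nonzeros live in data[rowS[r] .. rowS[r+1])
// with column indices cIdx[...]; rowS has n+1 entries.
struct CSRMatrix {
    std::size_t n;
    std::vector<std::size_t> rowS, cIdx;
    std::vector<double> data;
};

// Leaf-level SpMV with per-row containment: accumulate the row into a local
// temporary, check that the column indices are in range and strictly
// increasing, and commit to resi only if the check passes. Returning false
// lets an enclosing (hypothetical) containment domain restore and re-execute.
bool spmv_leaf(const CSRMatrix& m, const std::vector<double>& veci,
               std::vector<double>& resi) {
    for (std::size_t r = 0; r < m.n; ++r) {
        double acc = 0.0;
        std::size_t prevC = 0;
        bool first = true, ok = true;
        for (std::size_t c = m.rowS[r]; c < m.rowS[r + 1]; ++c) {
            const std::size_t col = m.cIdx[c];
            if (col >= m.n || (!first && col <= prevC)) { ok = false; break; }
            acc += m.data[c] * veci[col];
            prevC = col;
            first = false;
        }
        if (!ok) return false;   // detected: defer recovery to the enclosing domain
        resi[r] += acc;          // commit only after the check passes
    }
    return true;
}

In the task code above, the preserve_NV(matrix) clauses keep the read-only matrix in non-volatile memory, so a failed leaf or inner task can be restored and re-executed without escalating to higher levels.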
Summary • Hierarchy is a basic engineering approach • Works for hardware and works for software • Hierarchy is inevitable • Minimize movement • Amortize control • Match explicit hierarchies in HW and SW • Least abstract common denominator • Natural domains and boundaries enable: • Co-design • Co-tuning • Dynamic rebalancing • Resiliency