1. Exploiting Accidental Heterogeneity in Multicore Processors Raghuvardhan Moola
Varun Vats
ECE 753 (Spring 2011) University of Wisconsin, Madison
2. Introduction Exploiting Accidental Heterogeneity
Three key concepts:
Heterogeneity
Accidental Heterogeneity
Exploiting Accidental Heterogeneity
3. I - Heterogeneity Multicore processors with cores that have different instruction set architectures (ISAs).
General purpose cores and some specialized cores.
Reason:
General-purpose processors do not deliver the best power/performance for every workload.
Different applications require different types of cores.
Financial applications may require dual or triple modular redundancy (DMR or TMR).
4. II - Accidental Heterogeneity Heterogeneous cores are becoming popular, but homogeneous cores still dominate:
they are easier to program and provide more consistent performance.
Originally homogeneous cores can become heterogeneous not by design, but due to defects.
A defect can render a core unable to execute certain instructions.
Modern processors contain a large amount of redundant logic solely for performance gains.
Such redundant structures can be used to compensate for defective logic.
5. III - Exploiting Accidental Heterogeneity Traditional approach in failure scenario: abandon the core.
TMR/DMR approaches used in Tandem NonStop, IBM zSeries.
Drawbacks:
Waste of hardware resources.
Rapid performance degradation.
Reduced lifetime.
Cost inefficient.
However, defects are tolerable if properly managed.
It is possible to salvage the healthy components of a faulty core and keep the core functional.
Result: reduced performance and reduced execution ability, but extended lifetime and better utilization of resources.
6. III - Exploiting Accidental Heterogeneity Cores should have the ability to isolate defective units and reconfigure to keep core functional.
Key requirement: RECONFIGURABILITY
Three sub-tasks of reconfiguration:
Fault detection: detect presence of fault.
Fault diagnosis: identify faulty component.
Reconfiguration/Recovery: isolate faulty component and restore system to a functional state, leveraging some form of redundancy.
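The three sub-tasks can be sketched as three small functions. This is only an illustrative model, not any surveyed design: the `Core` class, the unit names, and the `redundancy` map (unit kind to interchangeable units) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    # Hypothetical model: unit name -> True if the unit is healthy.
    units: dict
    # Units that have been isolated by reconfiguration.
    disabled: set = field(default_factory=set)


def detect(core):
    """Fault detection: is there any faulty unit that is still in use?"""
    return any(not ok for name, ok in core.units.items()
               if name not in core.disabled)


def diagnose(core):
    """Fault diagnosis: name the faulty units that are still in use."""
    return sorted(name for name, ok in core.units.items()
                  if not ok and name not in core.disabled)


def reconfigure(core, redundancy):
    """Reconfiguration: isolate faulty units, then check that every kind of
    unit still has at least one healthy, enabled member to fall back on."""
    for name in diagnose(core):
        core.disabled.add(name)
    return all(
        any(core.units[m] and m not in core.disabled for m in members)
        for members in redundancy.values()
    )


core = Core(units={"alu0": False, "alu1": True, "fpu0": True})
redundancy = {"alu": ["alu0", "alu1"], "fpu": ["fpu0"]}
still_functional = reconfigure(core, redundancy)
```

Here the core stays functional because `alu1` covers for the isolated `alu0`; a fault in `fpu0`, which has no redundant partner in this model, would make `reconfigure` return `False`.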
7. Granularity of Reconfiguration Support for reconfiguration at various levels, from ultrafine grain systems (replace individual logic gates) to coarser designs (isolate entire processor cores).
Trend
Potential lifetime enhancement increases with finer granularity.
Complexity of implementation increases with finer granularity.
Trade-off between ease of implementation and lifetime extension.
8. Our Contribution Classification
Based on granularity of reconfiguration support.
Comparison and Evaluation
Area overhead
Performance cost
Lifetime throughput
Complexity of implementation
Targeted faults
Identify promising techniques
9. Our Contribution Classification
Comparison and Evaluation
Identify promising techniques
10. Classification Granularity levels (from finest to coarsest):
Gate level
Microarchitectural/Module level
Stage level
Architectural level
Core level
11. Reconfiguration Granularity: Gate Level System replaces individual logic gates as they fail.
Advantages
Highest lifetime extension
High production yield
Highly dependable
Drawbacks
Highly complicated to implement
Tremendous area overhead due to redundant gates and routing.
12. Reconfiguration Granularity: Microarchitectural/Module Level System replaces microarchitectural structures/modules.
ALU, branch predictor etc.
Advantages
Comparatively easier to implement
Drawbacks
Suitable primarily for superscalar cores, which have a good amount of inherent redundancy for performance gains.
Few opportunities exist for reconfiguration, as most modules are unique.
Performance loss.
13. Reconfiguration Granularity: Stage Level System replaces stages as they fail.
Advantages
Stages convenient boundary for reconfiguration as cores divide work at level of stages.
Easier to implement
Challenges
Pipeline stages are tightly coupled, so they are difficult to isolate or replace.
Maintaining spare stages is area intensive.
14. Reconfiguration Granularity: Architectural Level A defect in a core renders it unable to execute certain instructions.
Un-executable instructions are moved to a different core.
Advantages
Low area overhead.
Fairly easy to implement.
Drawbacks
Rapid performance degradation.
Low lifetime enhancement.
15. Reconfiguration Granularity: Core Level The degraded functionality of a faulty core is still used.
Advantages
Easy to implement
Low area overhead
Drawbacks
Poor lifetime enhancement
Poor utilization of hardware resources
16. Reconfigurable Multicore Processor Architectures Granularity levels:
Gate level approaches
Fine Grain Redundancy (FGR)
Microarchitectural/Module level approaches
Stage level approaches
Architectural level approaches
Core level approaches
17. FGR: Fine Grain Redundancy T. Nakura, K. Nose, and M. Mizuno, Fine-Grain Redundant Logic Using Defect-Prediction Flip-Flops, IEEE International Solid-State Circuits Conference 2007. Key Idea
Fine-grain redundant logic for switching out the defective portion
Defects can be killer or latent
Killer: defects apparent at fabrication
Latent: defects apparent only in actual use
18. FGR: Fine Grain Redundancy Latent Defects:
Partial insufficiency, cracked vias, extra metal, etc., which develop into open or shorted connections
Path Delay Increase:
Latent defects gradually appear in use as an increase in path delay
19. FGR: Fine Grain Redundancy Implementation
20. FGR: Fine Grain Redundancy Advantages
Raises production yield from 70% to 91%
Prevents 80% of in-field failures caused by one or two latent defects
Suited to markets that need highly dependable chips, such as the automotive industry
Drawbacks
If the area ratio of combinational logic to DFFs is 6:4, the area becomes about 2.5x larger
Area penalty would be 18% at 45nm
21. Reconfigurable Multicore Processor Architectures Reconfiguration levels
Gate level approaches
Microarchitectural/Module level approaches
Rescue
Structural Duplication
Stage level approaches
Architectural level approaches
Core level approaches
22. Rescue E. Schuchman and T. N. Vijaykumar, Rescue: A microarchitecture for testability and defect tolerance, ISCA 2005. Key Idea
Out-of-order multiple-issue superscalar cores may be thought of as two in-order half-pipelines
Frontend and backend connected by issue
Disable the entire half-pipeline way that is affected by the fault
23. Rescue Possible degradations tolerated:
Front end supports degraded fetch, decode and rename
Issue queue and the load/store queue can be degraded to half their original size
Backend supports faults in register read, execute and memory or writeback
Extra logic added
Shifter stage after fetch so that the instructions can be shifted around
Shifter stage after issue to route issued instructions to functional ways
24. Rescue Example: Fetch Stage
Instructions are fetched in parallel and passed in program order to the decode stage
If one or more of the frontend ways are faulty:
Assign the earliest instruction to the first fault-free way, the second instruction to the second fault-free way, and so on.
Stall fetch and keep assigning the remaining instructions until all fetched instructions are processed
The routing stage is composed of one mux per frontend way, which chooses an instruction for that way
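The routing rule in the fetch example can be written as a short sketch. It models only the assignment policy described on the slide (oldest instruction to the first fault-free way, stall while leftovers are drained); the function name and the list-based representation of ways are assumptions for illustration, not Rescue's actual hardware.

```python
def route_fetch_group(instructions, way_ok):
    """Assign a fetched group, in program order, to fault-free frontend ways.

    instructions: fetched instructions, oldest first.
    way_ok: one boolean per frontend way, True if the way is fault-free.
    Returns one list per cycle; each list has one slot per way holding the
    instruction routed to that way, or None if the way is idle or faulty.
    """
    healthy_ways = [i for i, ok in enumerate(way_ok) if ok]
    if not healthy_ways:
        raise ValueError("no fault-free frontend way left")

    pending = list(instructions)
    cycles = []
    while pending:
        slots = [None] * len(way_ok)
        # Oldest pending instruction goes to the first fault-free way, etc.
        for way in healthy_ways:
            if not pending:
                break
            slots[way] = pending.pop(0)
        cycles.append(slots)  # fetch stays stalled while len(cycles) grows
    return cycles


# A 4-wide frontend whose second way is faulty needs two cycles for 4 instructions.
schedule = route_fetch_group(["i0", "i1", "i2", "i3"], [True, False, True, True])
```

With way 1 disabled, `schedule` holds `["i0", None, "i1", "i2"]` for the first cycle and `["i3", None, None, None]` for the second, so the extra cycle is the cost of the degraded fetch width.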
25. Rescue IPC Degradation:
26. Rescue Advantages
Reduces IPC by only 4%
Improves instruction throughput over core sparing by 12% and 22% at 32nm and 18nm, respectively
Overhead
27. Structural Duplication J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, Exploiting Structural Duplication for Lifetime Reliability Enhancement, ISCA 2005. Key Idea
Exploit microarchitectural redundancy for reliability enhancement.
Three techniques:
Structural Duplication (SD)
Redundant structures added to processor and designated as spares.
Gradual Performance Degradation (GPD)
Based on inherent redundancy, which is not required for functional correctness.
Exploited to improve reliability or extend lifetime.
SD + GPD
Inherent redundancy as well as spares added for reliability enhancement.
28. Structural Duplication Cost incurred:
SD
Addition of cold spares causes area overhead.
Performance not affected.
GPD
Performance boosting redundant structures used for lifetime extension, so no area overhead.
Performance degrades on occurrence of fault.
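The difference between SD, GPD, and SD + GPD can be illustrated with a toy model of one group of identical structures. The function below is a sketch under simplifying assumptions (spares absorb the first failures, and the group is usable as long as one unit survives); it is not the evaluation model used in the paper.

```python
def active_units(n_units, n_spares, failures, allow_degradation):
    """Return how many units of a structure group remain in service.

    n_units: units the design uses for full performance.
    n_spares: cold spares added purely for reliability (SD).
    failures: number of units that have failed so far.
    allow_degradation: True if the core may keep running with fewer
        units than n_units (GPD).
    A result of 0 means the group, and hence the core, has failed.
    """
    # Cold spares replace the first failures at no performance cost.
    uncovered = max(0, failures - n_spares)
    if uncovered == 0:
        return n_units
    if not allow_degradation:
        return 0  # SD alone: out of spares, so the group fails
    return max(0, n_units - uncovered)


# Two units per group; compare the three techniques after two failures.
sd = active_units(2, n_spares=1, failures=2, allow_degradation=False)
gpd = active_units(2, n_spares=0, failures=2, allow_degradation=True)
sd_gpd = active_units(2, n_spares=1, failures=2, allow_degradation=True)
```

After two failures only the combined scheme still has a unit in service, which reflects the slide's point: SD pays in area, GPD pays in performance, and SD + GPD combines both kinds of redundancy.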
29. Structural Duplication Performance, Area and MTTFs
30. Structural Duplication Lifetimes
31. Structural Duplication Performance/Cost (P/C) metric
32. Structural Duplication Limitations
Not all structures are replicated. A failure in a non-replicated structure will cause the core to fail.
Few groups of structures are allowed to degrade, so the lifetime extension is limited.
33. Reconfigurable Multicore Processor Architectures Reconfiguration levels
Gate level approaches
Microarchitectural/Module level approaches
Stage level approaches
Core Cannibalization Architecture
StageNet
StageWeb
Architectural level approaches
Core level approaches
34. Core Cannibalization Architecture B. F. Romanescu and D. J. Sorin, Core cannibalization architecture: Improving lifetime chip performance for multicore processors in the presence of hard faults, PACT 2008. Two types of cores
Cannibalizable Cores (CC): cores whose stages can be cannibalized when a fault occurs.
Non-cannibalizable Cores (NC): cores whose stages cannot be cannibalized.
In the absence of faults, CCs function like normal cores.
When a fault occurs, the CCs' stages are cannibalized.
Stages are replaced only if the fault occurs in an NC.
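The borrowing rule can be sketched as a small allocation routine. This is an assumption-laden illustration rather than the paper's mechanism: the five stage names are hypothetical, any CC is allowed to donate to any NC (the real design restricts this by placement), and a CC that is faulty or has donated a stage is treated as no longer running threads.

```python
STAGES = ("fetch", "decode", "execute", "memory", "writeback")  # assumed names


def functional_cores(kind, faulty):
    """Decide which cores can still run after hard faults.

    kind: core id -> "NC" or "CC".
    faulty: set of (core id, stage) pairs that have failed.
    Returns (working core ids, {(nc, stage): donor cc}).
    """
    ccs = [c for c, k in kind.items() if k == "CC"]
    lent = set()     # (cc, stage) pairs already handed to an NC
    donors = set()   # CCs that have given up at least one stage
    borrow = {}
    working = []

    for nc in (c for c, k in kind.items() if k == "NC"):
        plan = {}
        for stage in (s for s in STAGES if (nc, s) in faulty):
            # Prefer CCs that are already cannibalized so intact CCs keep running.
            candidates = sorted(ccs, key=lambda c: c not in donors)
            donor = next((c for c in candidates
                          if (c, stage) not in faulty and (c, stage) not in lent),
                         None)
            if donor is None:
                plan = None  # no healthy copy of this stage is left
                break
            plan[stage] = donor
        if plan is None:
            continue
        for stage, donor in plan.items():
            lent.add((donor, stage))
            donors.add(donor)
            borrow[(nc, stage)] = donor
        working.append(nc)

    # A CC keeps working only if it is fault-free and has donated nothing.
    working += [c for c in ccs
                if c not in donors and not any((c, s) in faulty for s in STAGES)]
    return working, borrow


kind = {"nc0": "NC", "nc1": "NC", "cc0": "CC"}
working, borrow = functional_cores(kind, {("nc0", "execute")})
```

In this run `nc0` keeps working by borrowing `cc0`'s execute stage, at the cost of `cc0` itself; a second execute fault in `nc1` would find no donor left, which is why the number and placement of CCs matter.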
35. Core Cannibalization Architecture
36. Core Cannibalization Architecture Placement of CC critical.
37. Core Cannibalization Architecture Results
Area Overhead
38. Core Cannibalization Architecture Results
Lifetime Performance
39. Core Cannibalization Architecture Results
Cumulative Performance Advantage
40. StageNet S. Gupta, S. Feng, A. Ansari, and S. Mahlke, StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs, IEEE Transactions on Computers 2011. Key Idea
Multicore processor designed as a reconfigurable network of processor pipeline stages.
Pipeline stages are isolated processing elements that can be connected in arbitrary fashion to form a logical core.
Network formed by replacing pipeline registers with crossbar switches.
41. StageNet
42. StageNet Two major components
Configuration Manager
Interconnection Crossbars
Configuration Manager
Constructs logical cores from the pool of available stages at boot up.
Sets up the routing table on each pipeline stage.
Implemented in OS (higher flexibility).
Interconnection Crossbars
Direct incoming operations to the correct destination stage.
Destination stage identified using routing tables on each stage.
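What the configuration manager does at boot can be sketched as follows. The stage types in `STAGE_ORDER`, the stage ids, and the shape of the routing table (each stage id mapped to the id of its successor) are assumptions chosen for illustration; the slides do not specify them.

```python
STAGE_ORDER = ("fetch", "decode", "issue", "execute")  # assumed stage types


def build_logical_cores(pool):
    """Group healthy stages into logical cores and derive routing tables.

    pool: stage type -> list of ids of healthy stages of that type.
    Returns (cores, routing) where each core maps stage type -> stage id
    and routing maps a stage id to the stage id it forwards to.
    """
    # The scarcest stage type limits how many logical cores can be formed.
    n_cores = min(len(pool.get(t, [])) for t in STAGE_ORDER)
    cores = [{t: pool[t][i] for t in STAGE_ORDER} for i in range(n_cores)]

    routing = {}
    for core in cores:
        for src, dst in zip(STAGE_ORDER, STAGE_ORDER[1:]):
            # Entry consulted by the crossbar after stage `src`.
            routing[core[src]] = core[dst]
    return cores, routing


# Stages F1, I0 and E2 have failed, yet two logical cores can still be built.
pool = {
    "fetch": ["F0", "F2"],
    "decode": ["D0", "D1", "D2"],
    "issue": ["I1", "I2"],
    "execute": ["E0", "E1"],
}
cores, routing = build_logical_cores(pool)
```

The second logical core here is `F2 -> D1 -> I2 -> E1`, built from stages that originally belonged to different pipelines, which is the flexibility a per-core design cannot offer.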
43. StageNet Fault Tolerance
StageNet Islands
44. StageNet Results
Area Overhead
45. StageNet Results
Lifetime Performance
46. StageNet Results
Cumulative Performance
47. StageNet Limitations
Does not scale well (fully connected pipeline stages increase delay)
Crossbar switches vulnerable to failure (single points of failure)
Cold spares increase area drastically
Process variations not addressed
48. StageWeb S. Gupta, A. Ansari, S. Feng, and S. Mahlke, StageWeb: Interweaving Pipeline Stages into a Wearout and Variation Tolerant CMP Fabric, DSN 2010. Builds upon StageNet
Addresses three of its limitations:
Scalability
Network failure tolerance
Resilience to process variations
49. StageWeb: Scalability and Reliability Single crossbar configuration
Island 1 is unable to form any logical SNS
Island 2 forms one logical SNS (SNS 0)
50. StageWeb: Scalability and Reliability Overlapping crossbar configuration
Island 1 is not able to form any logical SNS
Island 2 and 3 form one logical SNS each.
51. StageWeb: Scalability and Reliability Overlapping and front-back crossbar configuration
One more logical SNS (SNS 2) added over the overlapping crossbar configuration, resulting in three SNSs.
52. StageWeb Throughput cost
53. StageWeb Cumulative work
54. StageWeb: Interconnect Reliability Simple crossbar
A single crossbar switch is used at each interconnection spot.
No redundancy is maintained.
Simple Crossbar with spares
One cold spare maintained for every crossbar in the system.
Brought into use when the original develops a certain number of faults.
Fault-Tolerant Crossbar (no spares)
Multiple paths exist from each input to output port.
Nearly eliminates chances of crossbar failures.
Most expensive option (2x to 3x of simple crossbar).
55. Reconfigurable Multicore Processor Architectures Reconfiguration levels
Gate level approaches
Microarchitectural/Module level approaches
Stage level approaches
Core level approaches
ElastIC
Necromancer
Architectural level approaches
56. ElastIC* D. Sylvester, D. Blaauw, and E. Karl, ElastIC: An Adaptive Self-Healing Architecture for Unpredictable Silicon, IEEE Design and Test 2006. Key Idea
Employs run-time self-diagnosis to keep track of performance.
Four key components:
Processing Elements (PE): simple processors that contain reliability, performance and power monitors.
Diagnostic and Adaptive Processing Unit (DAP): performs detailed diagnostics of PEs (parametric variation and wear-out).
Memory and Interconnect Systems: Use ECC and redundancy to tackle functional failures.
Scheduler: examines the state of each PE and distributes workload accordingly.
*No simulation studies available.
57. ElastIC DAP
Conducts power and performance characterization of PE by testing its operation at different frequencies and voltages.
Can initiate active healing of damaged components by taking advantage of reversibility of several reliability effects like NBTI and electromigration.
Made immune to failures by using aggressive redundancy.
Scheduler:
Uses data produced by the DAP to maximize performance by controlling each PE's voltage and frequency.
Also steers processor traffic based on this data.
58. ElastIC Limitations
Not scalable to massively multicore architectures as area and power overhead will be high.
Possibly very complex to implement.
59. Necromancer A. Ansari, S. Feng, S. Gupta, and S. Mahlke, Necromancer: Enhancing System Throughput by Animating Dead Cores, ISCA 2010. Key idea
Execution traces on a defective core resemble fault-free execution
Partition the cores in a CMP into multiple groups
Each group shares a lightweight core
60. Necromancer Relax the correct execution constraint on a faulty core
Define a Similarity Index (SI) that measures similarity between the PC
For an SI of 90%: at least 100K instructions execute before execution differs by 10%
Leverage high-level execution information (hints) from the faulty core to accelerate the animator core
Disable the hints if they are not profitable
61. Necromancer Resynchronize the faulty core whenever it strays too far from the correct execution path
Takes about 100 cycles
At least 100K committed instructions in 85% of cases
Less synchronization overhead
62. Necromancer High-level Architecture
63. Necromancer Can achieve 87.6% of the performance of a fully functioning core
Area and power overheads are 5.3% and 8.5%, respectively
64. Reconfigurable Multicore Processor Architectures Reconfiguration levels
Gate level approaches
Microarchitectural/Module level approaches
Stage level approaches
Core level approaches
Architectural level approaches
Architectural Core Salvaging
65. Architectural Core Salvaging M. Powell, A. Biswas, S. Gupta, and S. Mukherjee, Architectural Core Salvaging in a Multi-Core Processor for Hard Error Tolerance, ISCA 2009. Key Idea
Even if individual cores cannot execute certain operations, the CPU die can still be ISA compliant
Migrate the offending thread to another core that can execute the instruction
Find a stable thread that does not use the un-executable instructions and assign it to the defective core
66. Architectural Core Salvaging Relax the requirement
Each core need not be fully functional
Non-replicated structures non-essential
Potential of the method
67. Architectural Core Salvaging Implementation:
Detecting the presence of un-executable instructions
Programmable lookup table
Transferring the architectural state to and from the core
Similar to deep-sleep power state
Migration and Overhead
Thread migration is a thread swap
Done over the existing interconnect
Order of tens to a few hundred cycles
Can be amortized as long as they are infrequent
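The detection and swap steps can be sketched in a few lines. This is a simplified software model under stated assumptions: the per-core lookup table is a set of opcode names, `next_opcode` stands in for the instruction each thread is about to issue, and the swap is instantaneous (the real cost is the tens to hundreds of cycles noted above). None of these names come from the paper.

```python
def salvage_step(assignment, next_opcode, unexecutable):
    """Swap threads so that no thread sits on a core that cannot run it.

    assignment: core id -> thread id currently running there.
    next_opcode: thread id -> opcode the thread is about to execute.
    unexecutable: core id -> set of opcodes that core cannot execute
        (the programmable lookup table of a defective core).
    Returns (new assignment, number of thread swaps performed).
    """
    assignment = dict(assignment)  # do not mutate the caller's mapping
    swaps = 0

    def can_run(core, thread):
        return next_opcode[thread] not in unexecutable.get(core, set())

    for core in list(assignment):
        thread = assignment[core]
        if can_run(core, thread):
            continue
        # Look for a partner core that can run this thread and whose own
        # thread is safe to run on the defective core.
        for other in assignment:
            if other == core:
                continue
            partner = assignment[other]
            if can_run(other, thread) and can_run(core, partner):
                assignment[core], assignment[other] = partner, thread
                swaps += 1
                break
    return assignment, swaps


# Core c0 has lost its divider; thread tA is about to issue a divide.
new_assignment, swaps = salvage_step(
    {"c0": "tA", "c1": "tB"},
    {"tA": "fdiv", "tB": "add"},
    {"c0": {"fdiv"}},
)
```

After the step, `tA` runs on the healthy core and `tB` occupies the defective one, so both cores stay busy; if no suitable partner exists, the thread is simply left where it is in this sketch.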
68. Architectural Core Salvaging Advantages
Covers a significant fraction of the core area
30% of the vulnerable area
Requires only small changes to the microarchitecture
69. Our Contribution Classification
Comparison and Evaluation
Identify promising techniques
70. Comparison and Evaluation
71. Our Contribution Classification
Comparison and Evaluation
Promising Approaches
72. Promising Approaches Microarchitectural techniques
Fairly low area overhead
Achieve a performance level quite close to the base processor
Stage Level techniques
Low area overhead
Have a high lifetime throughput
Architectural Level techniques
Low area overhead
High performance
Low lifetime throughput
73. Conclusion Studied and classified reconfigurable multicore processor architectures based on granularity of reconfiguration.
Architectures compared and evaluated based on area overhead, performance cost, implementation complexity and targeted faults.
Techniques employing reconfigurability at Microarchitectural level, Stage level and Architectural level identified to be efficient.
74. Questions?