AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs
Presenter: Lin Huang
Lin Huang and Qiang Xu
CUhk REliable computing laboratory (CURE), The Chinese University of Hong Kong

Lifetime Reliability Becomes A Serious Concern
• Failure mechanisms: electromigration, NBTI, TDDB
• Reliability-related factors: temperature, supply voltage, frequency
[Figure: failure rate vs. time, with infant mortality, useful life, and wearout regions; curves labeled 90nm, 130nm, 180nm; lifetime annotations < 7 years, ~ 7 years, ~ 10 years [T. M. Mak]]

Design-Stage Decisions Affect Lifetime Reliability
• DPM / DTM: DVFS, timeout, thermal throttling, power gating, …
• Redundancy: level, quantity, …
• Task allocation: round-robin, optimized, …
[Diagram labels: SPECIFICATION, IC]
• Functionality
• Power consumption
• Area constraint
• Thermal issue
• Expected service life
• …
Without an efficient yet accurate lifetime reliability simulation framework, making good decisions is extremely difficult, if not impossible!

The Challenges in Simulation-Based Lifetime Reliability Analysis
• Increasing failure rate
• Exponential distribution assumption in previous work
[Figure: failure rate vs. time, with infant mortality, useful life, and wearout regions]
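
The mismatch named on this slide can be illustrated with a minimal sketch: an exponential distribution has a constant failure (hazard) rate, while a Weibull distribution with shape parameter above 1 has a rate that grows with time, as in the wearout region. The parameter values are arbitrary and chosen only for illustration; they are not taken from the slides.

```python
def weibull_hazard(t, alpha, beta):
    """Weibull hazard rate h(t) = (beta / alpha) * (t / alpha) ** (beta - 1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)


def exponential_hazard(t, lam):
    """Exponential hazard rate: constant, independent of t."""
    return lam


# Arbitrary illustrative values (assumptions, not from the slides).
times = [1.0, 2.0, 5.0, 10.0]
wearout = [weibull_hazard(t, alpha=10.0, beta=2.0) for t in times]
constant = [exponential_hazard(t, lam=0.1) for t in times]

for t, w, c in zip(times, wearout, constant):
    print(f"t={t:5.1f}  weibull={w:.3f}  exponential={c:.3f}")
```

With shape parameter 1 the Weibull hazard reduces to the constant 1/alpha, which is the exponential special case.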

The Challenges in Simulation-Based Lifetime Reliability Analysis
• Operational temperature varies significantly and rapidly
[Figure: temperature over time, obtained with HotSpot 4.0 [Huang-ieeetc08]]
How can we achieve efficient yet accurate lifetime reliability simulation with such limited information, when failure mechanisms follow arbitrary failure distributions?

Key Idea
• General failure distribution with a general scale parameter, by which time is divided
  • Example: Weibull failure distribution
• Suppose we can express the reliability function as [equation lost in extraction], where [symbol lost in extraction] can be computed according to limited tracing information
  • Example: reliability function [equation lost in extraction]
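
As a hedged illustration of a scale parameter "by which time is divided", the standard two-parameter Weibull reliability function is R(t) = exp(-(t/alpha)^beta), with mean time to failure alpha * Gamma(1 + 1/beta). The sketch below uses this textbook form; whether it matches the exact expression on the original slide is an assumption, since the slide's equations did not survive extraction.

```python
import math


def weibull_reliability(t, alpha, beta):
    """Probability of surviving past time t; time enters only as t / alpha."""
    return math.exp(-((t / alpha) ** beta))


def weibull_mttf(alpha, beta):
    """Mean time to failure of a Weibull distribution."""
    return alpha * math.gamma(1.0 + 1.0 / beta)


# Arbitrary illustrative values (assumptions, not from the slides).
print(weibull_reliability(5.0, alpha=10.0, beta=2.0))
print(weibull_mttf(alpha=10.0, beta=2.0))
```

Because time appears only through t / alpha, halving the scale parameter has the same effect on reliability as running twice as long.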

Key Idea
• Aging rate
  • Captures the impact of a given usage strategy
• Reliability-related usage strategy
  • A combination of …
    • Dynamic power/thermal management
    • Trigger mechanism
    • Load-sharing strategy
    • …
  • … given an application flow with certain characteristics

Key Idea
[Figure: labels USAGE STRATEGY, temperature, supply voltage, frequency, aging rate; a representative workload shown on a past / future timeline]

Proposed Simulation Framework: AgeSim – Step One: Simulation and Tracing
[Figure: block diagram with labels power state machine, trigger mechanism, application flow, load-sharing strategy, redundancy scheme, power / thermal manager, execution mode, power simulator, power (data), temperature simulator, temperature (data), time step, reliability-related factors trace file]
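
The tracing loop in this step can be sketched roughly as follows. This is a toy stand-in under stated assumptions, not the AgeSim implementation: the power simulator is replaced by two fixed power levels, the temperature simulator by a first-order thermal model, and every name and parameter value (`simulate_trace`, `r_th`, `tau`, and so on) is hypothetical.

```python
def simulate_trace(busy_flags, dt=0.001, t_amb=45.0, r_th=0.8, tau=0.05,
                   p_run=40.0, p_idle=10.0, vdd=1.0, freq_ghz=2.0):
    """Step through an application flow and record reliability-related factors.

    busy_flags: one boolean per time step (True = a task is running).
    All physical parameters are illustrative assumptions.
    """
    temperature = t_amb
    trace = []
    for step, busy in enumerate(busy_flags):
        # Stand-in for the power simulator: two fixed power levels.
        power = p_run if busy else p_idle
        # Stand-in for the temperature simulator: first-order lag toward
        # the steady-state temperature for the current power.
        steady_state = t_amb + r_th * power
        temperature += (dt / tau) * (steady_state - temperature)
        trace.append({
            "time": (step + 1) * dt,
            "temperature": temperature,
            "voltage": vdd,
            "frequency": freq_ghz,
            "power": power,
        })
    return trace


trace = simulate_trace([True] * 100 + [False] * 100)
print(trace[99]["temperature"], trace[-1]["temperature"])
```

In a real flow the two stand-ins would be replaced by an architectural power simulator and a thermal simulator such as HotSpot, and the power / thermal manager would change voltage and frequency between steps.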

Proposed Simulation Framework: AgeSim – Step Two: Aging Rate Calculation
[Figure: the reliability-related factors trace file is processed to obtain the aging rate]
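
One plausible way to turn a trace into an aging rate is sketched below. It rests on two assumptions that are not confirmed by the slides: the scale parameter follows an Arrhenius-style temperature dependence, and the aging rate is the time-weighted average of the inverse scale parameter over the trace. The constants `a0` and `ea_ev` are hypothetical.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def scale_parameter(temp_kelvin, a0=1.0e-4, ea_ev=0.9):
    """Assumed Arrhenius-style scale parameter: hotter means shorter life."""
    return a0 * math.exp(ea_ev / (BOLTZMANN_EV * temp_kelvin))


def aging_rate(trace):
    """Assumed definition: time-weighted mean of 1 / scale_parameter.

    trace: list of (duration, temp_kelvin) samples.
    """
    total_time = sum(duration for duration, _ in trace)
    aged = sum(duration / scale_parameter(temp) for duration, temp in trace)
    return aged / total_time


def mttf_from_aging_rate(omega, beta=2.0):
    """MTTF when R(t) = exp(-(omega * t) ** beta)."""
    return math.gamma(1.0 + 1.0 / beta) / omega


cool = aging_rate([(1.0, 330.0), (1.0, 340.0)])
hot = aging_rate([(1.0, 360.0), (1.0, 370.0)])
print(mttf_from_aging_rate(cool), mttf_from_aging_rate(hot))
```

Under these assumptions the aging rate is a single number per workload and usage strategy, so the long-term reliability function follows without simulating the whole service life.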

Model Validation
• By average temperature: 28.3% error in MTTF
• By AgeSim: almost identical results
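
Why an average-temperature estimate can be far off may be seen in a small sketch, assuming an Arrhenius-style temperature dependence with a hypothetical activation energy. Because that dependence is strongly nonlinear, evaluating it at the mean temperature is not the same as averaging it over the trace. The sketch does not reproduce the 28.3% figure; it only shows the direction of the error.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def relative_aging(temp_kelvin, ea_ev=0.9):
    """Assumed Arrhenius-style aging factor (arbitrary units)."""
    return math.exp(-ea_ev / (BOLTZMANN_EV * temp_kelvin))


def mttf_overestimate(temps_kelvin):
    """Relative MTTF error of the average-temperature method.

    MTTF is taken as inversely proportional to the aging factor, so the
    error is (MTTF_avg_temp - MTTF_trace) / MTTF_trace.
    """
    fine = sum(relative_aging(t) for t in temps_kelvin) / len(temps_kelvin)
    coarse = relative_aging(sum(temps_kelvin) / len(temps_kelvin))
    return fine / coarse - 1.0


print(mttf_overestimate([350.0, 350.0]))  # no variation
print(mttf_overestimate([330.0, 370.0]))  # large swings
```

For a trace with no temperature variation the two methods agree; with swings around the same mean, the average-temperature method reports a longer lifetime than the per-sample calculation.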

Case Study I: Dynamic Voltage and Frequency Scaling
• Configurations: No DVFS; DVFS1 (low voltage: 90% Vdd); DVFS2 (low voltage: 80% Vdd)
[Figure: power state machine with states HV Run, HV Idle, LV Run, LV Idle; transitions labeled task arrival, task departure, T>TH, T<TL]
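
The four-state machine on this slide might be encoded as below. The transition structure is an assumed reading of the flattened figure labels: task arrival and departure switch between Run and Idle, T>TH drops to low voltage, and T<TL returns to high voltage, with the thermal transitions allowed from both Run and Idle. The function name and event strings are hypothetical.

```python
def next_state(state, event):
    """Return the next (voltage, activity) state for a given event.

    state: ("HV" or "LV", "Run" or "Idle")
    event: "task_arrival", "task_departure", "T>TH", or "T<TL"
    """
    voltage, activity = state
    if event == "task_arrival":
        activity = "Run"
    elif event == "task_departure":
        activity = "Idle"
    elif event == "T>TH":   # temperature above the high threshold
        voltage = "LV"
    elif event == "T<TL":   # temperature below the low threshold
        voltage = "HV"
    else:
        raise ValueError(f"unknown event: {event}")
    return (voltage, activity)


state = ("HV", "Idle")
for event in ["task_arrival", "T>TH", "task_departure", "T<TL"]:
    state = next_state(state, event)
    print(event, "->", state)
```

Using two thresholds (TH above TL) gives hysteresis, so the voltage does not toggle on every small temperature fluctuation.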

Case Study I: Dynamic Voltage and Frequency Scaling
• System load
  • The ratio between task arrival rate and service rate
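
The system load definition amounts to a single division; a minimal sketch follows. The function name and the example rates are hypothetical.

```python
def system_load(arrival_rate, service_rate):
    """System load: task arrival rate divided by service rate."""
    if service_rate <= 0:
        raise ValueError("service_rate must be positive")
    return arrival_rate / service_rate


# Example: 3 tasks arrive per second and the processor can serve 4 per second.
load = system_load(arrival_rate=3.0, service_rate=4.0)
print(load)
```

For a single processor with load below 1, this ratio is also the long-run fraction of time the processor is busy, which is why it is a natural sweep parameter for the DVFS comparison.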

Case Study II: Task Allocation on Multi-Core Processors
[Figure: example chip frequency map]
• Random allocation
• Performance-aware allocation
  • Always choose the available core with the highest frequency [Sarangi-ieeetsm08]
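
The two allocation policies can be sketched in a few lines. The core identifiers and the frequency map values are made up for illustration; the slide's actual frequency map is not preserved.

```python
import random


def allocate_random(available_cores, rng):
    """Random allocation: pick any available core uniformly."""
    return rng.choice(sorted(available_cores))


def allocate_performance_aware(available_cores, frequency_map):
    """Performance-aware allocation: pick the fastest available core."""
    return max(sorted(available_cores), key=lambda core: frequency_map[core])


# Hypothetical per-core maximum frequencies in GHz (assumed values).
frequency_map = {0: 2.0, 1: 2.4, 2: 1.8, 3: 2.2}
available = {0, 2, 3}  # core 1 is busy

print(allocate_performance_aware(available, frequency_map))
print(allocate_random(available, random.Random(0)))
```

Performance-aware allocation concentrates work on the fastest cores, which is the kind of uneven stress distribution whose lifetime impact the case study evaluates.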

Case Study II: Task Allocation on Multi-Core Processors
• System load
  • The ratio between task arrival rate and service rate

Discussion on the Flexibility of AgeSim
• Task allocation and scheduling for MPSoC under lifetime reliability constraints
• Multiprocessor with different redundancy schemes
  • Example: gracefully degrading redundancy, standby redundancy

Conclusion
• Lifetime reliability has become a serious concern for high-performance ICs
• Design-stage decisions significantly affect system reliability
• We propose an efficient yet accurate simulation framework to evaluate system reliability under various usage strategies
  • Arbitrary failure distribution
  • Fine-grained tracing for representative workloads
• AgeSim is effective and flexible

AgeSim: A Simulation Framework for Evaluating the Lifetime Reliability of Processor-Based SoCs
Thank you for your attention!

Backup Slides
• Multiple representative workloads
• Aging rate
• Accuracy
• Key idea

Multiple Representative Workloads
• The proposed method can easily be extended to analyze a system with multiple representative workloads
• We can organize the workloads into a hyper-workload with their occurrence probabilities
• We can extract the aging rate and occurrence probability for each workload and then compute the unified aging rate by [equation lost in extraction]
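
The combining formula did not survive extraction. A natural reading, offered here as an assumption rather than the authors' confirmed expression, is a probability-weighted sum of the per-workload aging rates. The function name and the example values are hypothetical.

```python
import math


def unified_aging_rate(workloads):
    """Assumed form: sum of occurrence_probability * aging_rate.

    workloads: list of (occurrence_probability, aging_rate) pairs whose
    probabilities must sum to 1.
    """
    total_probability = sum(p for p, _ in workloads)
    if not math.isclose(total_probability, 1.0, rel_tol=1e-9):
        raise ValueError("occurrence probabilities must sum to 1")
    return sum(p * omega for p, omega in workloads)


# Two hypothetical workloads: a light one seen 25% of the time and a
# heavy one seen 75% of the time (aging rates in arbitrary units).
print(unified_aging_rate([(0.25, 2.0), (0.75, 4.0)]))
```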

Aging Rate
• Aging rate is independent of time
[Figure: failure rate vs. time]
Accuracy

Key Idea
[Figure: processor usage strategy (power state machine, trigger mechanism, application flow, load-sharing strategy, redundancy scheme), aging rate, reliability function]