Fault Tolerance and Performance Enhancement using Chip Multi-Processors Işıl ÖZ
Outline • Introduction • Related Work • Dual-Core Execution (DCE) • DCE for Fault Tolerance • DCE with Energy Optimization • Experimental Results • Conclusion
CMP • Single-chip multi-core or chip multiprocessor • High system throughput • Exploits only explicit parallelism • No single-thread performance gain • Idle processor cores if there are insufficient parallel tasks • Dual-core execution • Utilizes multiple cores to improve the performance of single-thread workloads • Computation redundancy
Run-Ahead Execution • Pipeline blocked by a long-latency cache miss • Checkpoint the processor state • Enter run-ahead mode • Blocking miss completes • Return to normal mode • Re-execution using warmed-up caches • Limitation • Re-execution even when the run-ahead results are correct • Multiple executions for miss-dependent misses
CFP (Continual Flow Pipelines) • Similar to run-ahead execution • Set aside miss-dependent (slice) instructions • Execute independent instructions speculatively • Commit speculative results • Limitation • Requires a large centralized load/store queue
Leader-Follower Architectures • Running a program on two processors • One leader • One follower using the leader’s results to make faster progress • Limitation • If the leader is slower, the follower cannot use its results • If the follower is slower, the leader has to wait to retire
Front Superscalar Core • Executes instructions in the normal way, except • For long-latency cache misses (L2 misses) • Substitute an invalid value for the fetched data • INV bit is set in the physical register • Invalidate the dependent instructions • Propagate the INV flag through data dependencies • Retires instructions in order, except • Store instructions • No data cache or memory update • Update the run-ahead cache for use by subsequent loads • Exceptions
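The INV propagation described on this slide can be illustrated with a minimal Python sketch. It is not the simulator code from the referenced papers: the `Instr` record, the `l2_miss` field, and the `propagate_inv` function are hypothetical names introduced only for illustration, and the sketch assumes that a valid write to a register clears its INV bit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Instr:
    # Hypothetical instruction record, for illustration only.
    op: str                       # e.g. "load" or "alu"
    dest: Optional[str]           # destination register, if any
    srcs: Tuple[str, ...] = ()    # source registers
    l2_miss: bool = False         # assumed to be known per load in this sketch


def propagate_inv(instrs):
    """Return one flag per instruction: True if the front core marks it INV."""
    inv_regs = set()              # registers whose INV bit is currently set
    flags = []
    for ins in instrs:
        # An instruction is invalid if it is an L2-missing load
        # or if any of its source registers carries the INV bit.
        invalid = (ins.op == "load" and ins.l2_miss) or any(
            s in inv_regs for s in ins.srcs
        )
        if ins.dest is not None:
            if invalid:
                inv_regs.add(ins.dest)
            else:
                # Assumption: a valid result overwrites and clears the INV bit.
                inv_regs.discard(ins.dest)
        flags.append(invalid)
    return flags


program = [
    Instr("load", "r1", ("r0",), l2_miss=True),  # L2 miss: r1 becomes INV
    Instr("alu", "r2", ("r1",)),                 # depends on r1: INV
    Instr("alu", "r3", ("r4",)),                 # independent: valid
    Instr("alu", "r1", ("r3",)),                 # valid overwrite of r1
    Instr("alu", "r5", ("r1",)),                 # r1 is valid again
]
inv_flags = propagate_inv(program)
```

In this sketch only the missing load and its direct dependent are invalidated; the independent instructions keep executing, which is what lets the front core run ahead.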
Result Queue • First-in first-out structure • Keeps the retired instruction stream from the front processor • Provides continuous instruction stream to the back processor
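As a rough illustration of the result queue, the following Python sketch models it as a bounded FIFO. The class name `ResultQueue`, the `capacity` parameter, and the choice to make the front processor stall when the queue is full are assumptions made for this sketch, not details stated on the slide.

```python
from collections import deque


class ResultQueue:
    """Hypothetical bounded FIFO between the front and back processors."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = deque()

    def push(self, entry):
        """Front processor retires an instruction into the queue.

        Returns False when the queue is full (assumption: the front
        processor then has to stall until the back processor catches up).
        """
        if len(self._entries) >= self.capacity:
            return False
        self._entries.append(entry)
        return True

    def pop(self):
        """Back processor fetches the oldest entry, or None if empty."""
        return self._entries.popleft() if self._entries else None

    def flush(self):
        """Empty the queue, e.g. on a misprediction detected by the back core."""
        self._entries.clear()

    def __len__(self):
        return len(self._entries)
```

The first-in first-out order is what preserves the front processor's retirement order for the back processor.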
Back Superscalar Core • Instructions are fetched from the result queue • Processes instructions in the normal way, except • Mispredicted branches • All instructions are squashed in both the back and front processors • The result queue is emptied • The back processor’s register values are copied into the front processor’s physical registers • The run-ahead cache is invalidated • Retires instructions in order • Store instructions update data caches • Precise state for exception handling
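The recovery steps listed for a mispredicted branch can be sketched as follows. The dictionary layout (`in_flight`, `regs`, `runahead_cache`) and the function name are hypothetical and stand in for hardware structures; the sketch only mirrors the order of steps on the slide.

```python
from collections import deque


def recover_from_misprediction(front, back, result_queue):
    """Apply the slide's recovery steps to simple dictionary models.

    `front` and `back` are hypothetical dicts standing in for the two cores.
    """
    # 1. Squash all in-flight instructions in both processors.
    front["in_flight"].clear()
    back["in_flight"].clear()
    # 2. Empty the result queue.
    result_queue.clear()
    # 3. Copy the back processor's register values into the front processor.
    front["regs"] = dict(back["regs"])
    # 4. Invalidate the run-ahead cache.
    front["runahead_cache"].clear()


front = {"in_flight": ["i7", "i8"], "regs": {"r1": 99}, "runahead_cache": {0x40: 1}}
back = {"in_flight": ["i3"], "regs": {"r1": 5, "r2": 6}}
rq = deque(["i4", "i5"])
recover_from_misprediction(front, back, rq)
```

After the call, the front processor restarts from the back processor's register state, so no speculative front-core value survives the recovery.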
Memory Hierarchy • Separate L1 data caches for the back and front processors • Shared L2 cache • L1 D-cache miss in the front processor -> prefetch request for the L1 D-cache in the back processor • The back processor updates both L1 D-caches at store instruction retirement
Simulation Methodology • Simulator infrastructure • SimpleScalar toolset • Baseline • MIPS-R10000-style superscalar processor • SPEC CPU 2000 benchmarks • Memory-intensive benchmarks
DCE_R • DCE for Transient Fault Tolerance • DCE with redundancy check • Compare the front processor’s results that are not invalid with the back processor’s results • In case of discrepancy • The branch misprediction recovery mechanism provides fault tolerance by rewinding the processors • Only partial redundancy coverage
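A minimal sketch of the DCE_R check at retirement, assuming each retired instruction carries the front processor's value, its INV flag, and the back processor's value. The function names and the outcome labels are hypothetical; `checked_fraction` illustrates why coverage is only partial.

```python
def check_at_retire(front_value, front_inv, back_value):
    """Classify one retired instruction under DCE_R (illustrative labels)."""
    if front_inv:
        # The front result was invalidated, so there is nothing to compare.
        return "unchecked"
    # A mismatch would trigger the misprediction-style rewind.
    return "match" if front_value == back_value else "rewind"


def checked_fraction(outcomes):
    """Fraction of retired instructions that received a redundancy check."""
    if not outcomes:
        return 0.0
    return sum(o != "unchecked" for o in outcomes) / len(outcomes)


# (front value, front INV flag, back value) for three retired instructions
retired = [(5, False, 5), (None, True, 7), (3, False, 4)]
outcomes = [check_at_retire(*entry) for entry in retired]
```

Here the second instruction was invalidated in the front processor and therefore retires without a check, which is the coverage gap that a full-redundancy scheme has to close.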
Redundancy Checking Results • The percentage of retired instructions with redundancy checking
DCE_FR • DCE_R with Full Redundancy Coverage • F_INV flag added to each instruction to show whether it was validated by the front processor • If invalidated, the back processor fetches the same instruction twice, once for normal and once for redundant execution • If validated, the front processor’s result is used as the redundant copy • Changes in renaming logic • Redundant execution • Source operands access the rename table as usual • Destination registers obtain a new physical register but do not update the rename table • At the retire stage, destination registers are freed after the comparison
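The fetch policy of DCE_FR can be sketched as a small generator. The tuple layout `(instr, f_inv)` and the role labels are assumptions for illustration; only the rule itself (fetch twice when the front processor invalidated the instruction) comes from the slide.

```python
def back_fetch_stream(entries):
    """Yield (instruction, role) pairs as the back processor would fetch them.

    `entries` is a list of (instr, f_inv) pairs taken from the result queue,
    where f_inv is True if the front processor invalidated the instruction.
    """
    for instr, f_inv in entries:
        yield (instr, "normal")
        if f_inv:
            # No usable front result exists, so the back processor
            # executes the instruction a second time for redundancy.
            yield (instr, "redundant")


stream = list(back_fetch_stream([("a", False), ("b", True), ("c", False)]))
```

Validated instructions appear once because the front processor's result already serves as the redundant copy.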
DCE_FR_t • DCE_FR with Renaming Scheme • Additional renaming table (A_table) alongside the original renaming table (R_table) • Invalidated normal execution: accesses and updates R_table • Invalidated redundant execution: accesses and updates A_table • Validated execution: accesses R_table • Validated execution: updates both R_table and A_table
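The four access and update rules can be illustrated with a small Python model. The class `DualRename`, the `kind` labels, the physical register naming, and the assumption that both tables start out identical are introduced here for illustration only.

```python
class DualRename:
    """Hypothetical model of the R_table / A_table renaming rules."""

    def __init__(self, arch_regs):
        # Assumption: both tables start with the same mapping.
        self.r_table = {r: f"p{i}" for i, r in enumerate(arch_regs)}
        self.a_table = dict(self.r_table)
        self._next = len(arch_regs)

    def _alloc(self):
        phys = f"p{self._next}"
        self._next += 1
        return phys

    def rename(self, dest, srcs, kind):
        """kind is one of "inv_normal", "inv_redundant", "valid"."""
        # Redundant copies of invalidated instructions read A_table;
        # everything else reads R_table.
        read = self.a_table if kind == "inv_redundant" else self.r_table
        phys_srcs = [read[s] for s in srcs]
        phys_dest = self._alloc()
        if kind in ("inv_normal", "valid"):
            self.r_table[dest] = phys_dest
        if kind in ("inv_redundant", "valid"):
            self.a_table[dest] = phys_dest
        return phys_dest, phys_srcs


rn = DualRename(["r1", "r2"])
step1 = rn.rename("r1", ["r2"], "inv_normal")     # updates R_table only
step2 = rn.rename("r1", ["r2"], "inv_redundant")  # updates A_table only
step3 = rn.rename("r2", ["r1"], "inv_normal")     # reads r1 from R_table
step4 = rn.rename("r2", ["r1"], "inv_redundant")  # reads r1 from A_table
step5 = rn.rename("r1", ["r2"], "valid")          # reads R_table, updates both
```

The point of the second table in this sketch is that a redundant instruction reads the result of the earlier redundant instruction (step4 reads the register allocated in step2), so the two execution streams stay independent until a validated instruction resynchronizes both tables.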
Performance Impact • DCE_R and DCE_FR perform better than Base, except for benchmarks with many branch mispredictions • DCE_R and DCE_FR are not much better than DCE • DCE_FR: 23.5% performance improvement
Energy Consumption • DCE_R and DCE_FR have high energy overhead
Energy Overhead Problems • Wrong-path instructions • Large instruction window • A branch misprediction results in fetching and executing a large number of wrong-path instructions • Redundant execution for invalidated instructions • They need to access some structures (register file, access table, etc.) although they produce no useful results • DCE_FR has to execute them twice
Energy Overhead Solutions-1 • FR_rs • Adapting the instruction window size • Reduce it for workloads with high misprediction rates • Keep it large for others to exploit large-window benefits • FR_rs_tl • Selective invalidation • Do not invalidate traversal address loads • Only the special “load ra, x(ra)” instructions • Because deciding load types in general requires compiler support
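The special “load ra, x(ra)” form, where the destination register is also the base address register, can be recognized without compiler support. The sketch below does this on assembly text; the textual syntax, the regular expression, and the function name `is_traversal_load` are assumptions for illustration, since real hardware would compare register fields in the decoded instruction.

```python
import re

# Assumed textual form: "load <dest>, <offset>(<base>)"
_LOAD_PATTERN = re.compile(r"^\s*load\s+(\w+)\s*,\s*(-?\w+)\((\w+)\)\s*$")


def is_traversal_load(asm_line):
    """True if the line is a load whose destination equals its base register."""
    match = _LOAD_PATTERN.match(asm_line)
    if match is None:
        return False
    dest, _offset, base = match.groups()
    return dest == base
```

For example, `is_traversal_load("load r4, 8(r4)")` is true, as in a linked-list walk where the loaded pointer becomes the next address, while `is_traversal_load("load r4, 8(r5)")` is false.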
Energy Overhead Solutions-2 • FR_rs_tl_in • Adaptively enable/disable invalidation • Based on the workload’s dynamic behavior • Invalidate if the workload is • Memory-intensive with moderate mispredictions, or • Memory-intensive with low mispredictions, or • Moderately memory-intensive with extremely low mispredictions • Otherwise do not invalidate
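The decision rule above can be written as a small function. The slide gives no numeric thresholds, so every threshold constant below is a placeholder, as are the metric choices (L2 misses per 1000 instructions and branch misprediction rate) and the function names. The sketch also assumes that "extremely low" mispredictions count as "low" for memory-intensive workloads.

```python
# Placeholder thresholds: NOT taken from the slides or the papers.
MEM_HIGH = 20.0        # L2 misses per 1000 instructions: memory-intensive
MEM_MODERATE = 5.0     # moderately memory-intensive
MISP_EXTREMELY_LOW = 0.005
MISP_LOW = 0.02
MISP_MODERATE = 0.05


def memory_level(l2_mpki):
    if l2_mpki >= MEM_HIGH:
        return "high"
    if l2_mpki >= MEM_MODERATE:
        return "moderate"
    return "low"


def mispredict_level(rate):
    if rate < MISP_EXTREMELY_LOW:
        return "extremely_low"
    if rate < MISP_LOW:
        return "low"
    if rate < MISP_MODERATE:
        return "moderate"
    return "high"


def should_invalidate(l2_mpki, mispredict_rate):
    """Return True if invalidation should be enabled for this workload phase."""
    mem = memory_level(l2_mpki)
    misp = mispredict_level(mispredict_rate)
    if mem == "high":
        # Memory-intensive with moderate or lower mispredictions.
        return misp in ("extremely_low", "low", "moderate")
    if mem == "moderate":
        # Moderately memory-intensive only with extremely low mispredictions.
        return misp == "extremely_low"
    return False
```

The asymmetry in the rule reflects the trade-off on the earlier slides: invalidation pays off when there are many long-latency misses to run ahead of, and costs energy when mispredictions send the large window down wrong paths.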
Performance Impact • Not much performance improvement over DCE_FR
Energy Consumption • Significantly reduce the energy overhead
Energy Overhead Solutions-3 • Reducing redundant execution • No redundant execution for instructions that were not invalidated • Re-execute only loads and invalidated instructions • Switching between DCE and single-core • Workloads with high misprediction rates • Switch from dual-core mode to single-core mode
Performance Impact • Executed instructions / retired instructions in the back processor • 41% on average
Conclusion • DCE • Improves the performance of single-threaded applications using CMPs • Works best with memory-intensive workloads with a low misprediction rate • Dynamic scheme which enables/disables DCE • DCE with full redundancy checking • 24.9% speedup, 87% energy overhead • DCE without reliability requirement • 34% speedup, 31% energy overhead
References • H. Zhou, “A Case for Fault-Tolerance and Performance Enhancement Using Chip Multiprocessors,” Computer Architecture Letters, Sept. 2005. • H. Zhou, “Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window,” Proc. 14th Int’l Conf. Parallel Architectures and Compilation Techniques (PACT ’05), 2005. • Y. Ma, H. Gao, M. Dimitrov, and H. Zhou, “Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery,” IEEE Transactions on Parallel and Distributed Systems, vol. 18, 2007.