IFRA: Instruction Footprint Recording & Analysis for Post-Silicon Bug Localization
Sung-Boem Park, Subhasish Mitra
Robust Systems Group, Departments of Electrical Eng. & Computer Sc., Stanford University
Key Message
• Post-silicon bug localization – Major bottleneck
  • Pinpoint from system failure: bug location, exposing stimulus
  • Existing schemes – Expensive & not scalable
• IFRA – New technique for processors
  • Eliminates limitations of existing techniques
  • 96% accuracy
  • 1% area, ~0% performance impact
Outline
• Motivation
• IFRA Overview
• Simulation Results
• Conclusion
Microprocessor Development Flow
• Pre-silicon: Design, Pre-Silicon Verification
• Post-silicon: POST-SILICON VALIDATION, Manufacturing Test
• Post-silicon validation costs: 35% of development time, 25% of design resources
• "Post-silicon cost & complexity is rising faster than design cost" (S. Yerramilli, VP, Intel, ITC06 Invited Address)
Post-Silicon Validation Steps
• Detect – Run test content in system
  • e.g., OS, games, functional tests
• Localize – Pinpoint from system failure (e.g., crash)
  • Bug location – e.g., ALU, decoder, scheduler
  • Exposing stimulus – e.g., instruction sequence
  • Dominates cost [Josephson DAC06]
• Root cause & Fix
  • Optical probing, patch / circuit edit / respin
Post-Silicon Bug Types [Josephson DAC06]
• Functional bugs – Incorrect logic implementation
  • e.g., design errors
  • Short localization time – e.g., hours to days
• Electrical bugs / circuit marginalities (our focus)
  • e.g., speed-path, noise, races, hold time
  • Some voltage / temp / frequency corners
  • LONG localization time – e.g., days to weeks
Existing Post-Silicon Bug Localization Flows
• System-based: detect in system, then localize failure in system
• Tester-based: detect in system, then reproduce failure on tester, then localize on tester
• Annotations on the flow diagram: "Not always possible", "1 to 4 weeks", "2 days", "3 days"
• Major problems: failure reproduction, system-level simulation
Instruction Footprint Recording & Analysis
• Design phase: insert recorders inside chip design
• Post-Si validation
  • Run tests and record special info. in recorders (non-intrusive)
  • Failure detected? If no, keep running tests; if yes, scan out recorder contents
  • Post-analyze offline: self-consistency against test program binary
  • Result: localized bug (location, stimulus)
• No failure reproduction: single test run sufficient
• No system simulation
Outline
• Motivation
• IFRA Overview
• Hardware Support
• Automated Post-Analysis Techniques
• Simulation Results
• Conclusion
IFRA Hardware in Superscalar Processor
[Block diagram of an Alpha 21264 pipeline; each stage ends in pipeline registers with an attached recorder.]
• FETCH: Branch Predictor, I-TLB, I-Cache, ID assignment, Fetch Queue
• DECODE: Decoders
• DISPATCH: Reg Map, Reg Free, Reg Rename
• ISSUE: Instruction Window, Phys Regfile
• EXECUTE: MUL, 2xALU, 2xBr, FPU, 2xLSU, D-Cache, D-TLB
• COMMIT: Reorder Buffer, Reg Map
• Recorders are part of the scan chain
• Post-Trigger Generator reaches the recorders over a slow wire: no at-speed routing
Recording Operation Example
[Figure: INST1 and INST2 moving through the FETCH and DECODE stages.]
• FETCH: the special ID assignment rule tags INST1 with ID1 and INST2 with ID2
  • Recorder 1 stores ID1 with auxiliary info PC1, and ID2 with auxiliary info PC2
• DECODE: the IDs travel with the instructions through the pipeline registers
  • Recorder 2 stores ID1 with auxiliary info decoded bits1, and ID2 with auxiliary info decoded bits2
• Each stored (ID, auxiliary info) entry is an instruction footprint
Special Rule for Instruction ID Assignment
• Simplistic ID assignment inadequate
  • Speculation + flushes, out-of-order execution
  • PC does not work for loops
• Special ID assignment rule – formal proof in paper
  • ID width: log2(4n) bits, where n = max. instructions in flight
  • e.g., 8 bits for Alpha-like processor (n=64)
• No timestamp or global synchronization required
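
The ID width formula above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name `id_width_bits` is made up for illustration, and rounding up to a whole bit when 4n is not a power of two is an assumption, since the slide only gives the power-of-two case.

```python
def id_width_bits(max_in_flight: int) -> int:
    """Bits needed for log2(4n) IDs, rounded up to a whole bit (assumption)."""
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    # For any integer x >= 1, (x - 1).bit_length() equals ceil(log2(x)).
    return (4 * max_in_flight - 1).bit_length()


print(id_width_bits(64))  # 8, matching the slide's Alpha-like example (n=64)
```

Integer arithmetic is used instead of `math.log2` so that the result is exact for every n.
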
Instruction Footprint Recorder Design
• Stores instruction ID + auxiliary info.
• Dominated by memory (circular buffer)
• Simple control logic
  • Idle cycle compaction
  • Circular buffer control
  • Serialization
  • Stop / Start recording, driven by the post-trigger signal
• No high-speed global routing
• Contents scanned out after failure detection, over the slow scan chain
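
A software model may make the recorder behavior easier to picture. The sketch below is not the hardware design from the paper; the class `FootprintRecorder` and its method names are hypothetical, and it models only three of the listed features: the circular buffer, idle cycle compaction, and stopping on a post-trigger.

```python
from collections import deque
from typing import Optional


class FootprintRecorder:
    """Toy model of one recorder (hypothetical names, not the paper's RTL)."""

    def __init__(self, capacity: int) -> None:
        # A bounded deque acts as a circular buffer: once full, each new
        # entry overwrites the oldest one.
        self._buffer = deque(maxlen=capacity)
        self._recording = True

    def clock(self, footprint: Optional[tuple]) -> None:
        """Call once per cycle; None models a cycle with no instruction."""
        if not self._recording or footprint is None:
            return  # stopped, or idle cycle compacted away
        self._buffer.append(footprint)

    def post_trigger(self) -> None:
        """Freeze the contents so later cycles cannot overwrite them."""
        self._recording = False

    def scan_out(self) -> list:
        """Return the stored footprints, oldest first."""
        return list(self._buffer)
```

With a capacity of 3, feeding four footprints and one idle cycle leaves only the three most recent footprints, and anything clocked in after `post_trigger()` is ignored.
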
What to Record? Total required storage for all recorders: 60 KBytes
Post-Trigger Generation
• Code execution timeline: starts at t=0; error after a billion cycles (e.g., speedpath); failure after 2 billion cycles (e.g., crash)
• Too much storage overhead to store 1 billion cycles
• Early failure detection necessary: need to capture in recorder storage
• Early failure detection techniques (post-triggers)
  • Classical error detection – residue, parity
  • Deadlock & segfault detection
  • Special early warnings to pause recording
  • Details in paper
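
The classical checks named above, residue and parity, can be sketched in a few lines. Everything specific here is an assumption for illustration: the modulus 7, the function names, and the choice of an adder as the checked unit are not taken from the slides.

```python
def residue(value: int, modulus: int = 7) -> int:
    """Residue of a non-negative value (modulus 7 is an assumption)."""
    return value % modulus


def add_residue_ok(a: int, b: int, result: int, modulus: int = 7) -> bool:
    """Check an adder result: residue(a + b) must equal the sum of residues."""
    expected = (residue(a, modulus) + residue(b, modulus)) % modulus
    return residue(result, modulus) == expected


def parity(word: int) -> int:
    """Even-parity bit of a non-negative word: 1 if the count of 1s is odd."""
    return bin(word).count("1") % 2


good = 5 + 7
bad = good ^ (1 << 2)              # single bit-flip in the adder output
print(add_residue_ok(5, 7, good))  # True
print(add_residue_ok(5, 7, bad))   # False
```

A single bit-flip changes a value by plus or minus a power of two, and no power of two is divisible by 7, so this residue check catches every single-bit error in the result. A mismatch in either check could serve as a post-trigger.
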
IFRA Area Impact
• 1% chip-level area impact
• Synopsys Design Compiler synthesis
• Alpha 21264-like processor: 2MB L2 cache
• TSMC 130nm technology
• No global at-speed routing
• Area dominated by circular buffers in recorders
• Total recorder storage: 60 KBytes
Outline
• Motivation
• IFRA Overview
• Hardware Support
• Post-Analysis Techniques
• Simulation Results
• Conclusion
Post-Analysis Overview
• Inputs: test program binary, footprints from recorders
• Link footprints
• Run high-level analysis: control-flow analysis, data-dependency analysis, decoding analysis, load/store analysis
• Run low-level analysis: residue consistency check
• Output: list of bug location-stimulus pairs
• Slide note: (Not covered today – Details in paper)
Linking Footprints from Recorder Contents
[Figure: the test program binary (PC0 to PC7, INST0 to INST7) shown next to the contents of the fetch-stage, execution-stage, and commit-stage recorders over time. Each recorder entry pairs an instruction ID (e.g., ID: 0, ID: 4 to ID: 7) with auxiliary info (a PC or an AUX value), and entries are linked across recorders by ID.]
• Special ID assignment rule ensures:
  • Uncommitted instructions uniquely identified
  • Relative orders of identical IDs maintained
  • Even under flushes & out-of-order execution
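
The second guarantee above (relative orders of identical IDs maintained) suggests a simple way to picture linking: the k-th entry with a given ID in one recorder corresponds to the k-th entry with that ID in another. The sketch below is a deliberately simplified illustration, not the paper's algorithm. It assumes every instruction reaches every recorder, so it does not handle flushes, and the function name and record layout are hypothetical.

```python
from collections import defaultdict, deque


def link_footprints(reference, others):
    """Link (id, aux) entries across recorders by per-ID arrival order.

    reference: list of (id, aux) in program order (e.g., fetch-stage).
    others: dict mapping a recorder name to its list of (id, aux).
    Simplification: assumes no flushed instructions.
    """
    queues = {}
    for name, entries in others.items():
        per_id = defaultdict(deque)
        for instr_id, aux in entries:
            per_id[instr_id].append(aux)  # keep arrival order per ID
        queues[name] = per_id

    linked = []
    for instr_id, aux in reference:
        record = {"id": instr_id, "reference": aux}
        for name, per_id in queues.items():
            pending = per_id[instr_id]
            record[name] = pending.popleft() if pending else None
        linked.append(record)
    return linked


fetch = [(0, "PC0"), (1, "PC1"), (0, "PC2")]        # ID 0 is reused
execute = [(1, "AUX1"), (0, "AUX0"), (0, "AUX2")]   # out-of-order arrival
for row in link_footprints(fetch, {"execute": execute}):
    print(row)
```

Even though the execute-stage entries arrive out of order, each fetch-stage footprint is paired with the right execute-stage footprint because the two entries sharing ID 0 keep their relative order.
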
Debug Example
[Figure: decision diagram. Link footprints, then high-level analysis, then low-level analysis, each step posing a series of questions, ending in bug locations + exposing stimulus.]
Debug Example – Decision 1
• Inputs: test program binary, fetch-stage recorder
• Serial execution trace:
    …
    R0 ← R1 + R2
    R0 ← R3 + R6
    R5 ← R0 + R6
    …
Debug Example – Question 1
• Residue of values mismatch?
• Serial execution trace:
    …
    R0 ← R1 + R2
    R0 ← R3 + R6   (producer of R0)
    R5 ← R0 + R6   (consumer of R0; RAW hazard with the producer)
    …
• Values compared, from the issue-stage and execute-stage recorders: R0=3 and R0=5
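
The producer and consumer lookup behind this question can be sketched as follows. The `Instr` type and both function names are hypothetical, and the residue comparison simply treats the two recorded values from the slide (3 and 5) as the residues seen on the producer and consumer sides; how the real analysis reads them out of the recorders is not shown here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Instr:
    dest: str      # architectural destination register
    srcs: tuple    # architectural source registers


def find_producer(trace: Sequence[Instr], consumer_index: int,
                  reg: str) -> Optional[int]:
    """Index of the most recent earlier instruction that writes reg."""
    for i in range(consumer_index - 1, -1, -1):
        if trace[i].dest == reg:
            return i
    return None  # reg was not written earlier in this trace


def residue_mismatch(produced: int, consumed: int) -> bool:
    """True when producer and consumer disagree on the register's residue."""
    return produced != consumed


trace = [
    Instr("R0", ("R1", "R2")),
    Instr("R0", ("R3", "R6")),   # producer of R0
    Instr("R5", ("R0", "R6")),   # consumer of R0
]
print(find_producer(trace, 2, "R0"))  # 1
print(residue_mismatch(3, 5))         # True
```

The backward scan picks the second write to R0, not the first, which is what the RAW hazard in the serial execution trace requires.
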
Debug Example – Question 2
• Residue of phys. reg. names mismatch?
• Serial execution trace:
    …
    R0 ← R1 + R2
    R0 ← R3 + R6   (producer of R0)
    R5 ← R0 + R6   (consumer of R0; RAW hazard with the producer)
    …
• Mappings compared, from the dispatch-stage recorder: R0=P5 and R0=P2
Debug Example – Question 3
• Residue of phys. reg. name match with previous producer?
• Serial execution trace:
    …
    R0 ← R1 + R2   (previous producer)
    R0 ← R3 + R6   (producer of R0)
    R5 ← R0 + R6   (consumer of R0; RAW hazard with the producer)
    …
• Mappings compared, from the dispatch-stage recorder: R0=P5 and R0=P5
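Debug Example – Result
• Exposing stimulus: R0 ← R1 + R2, R0 ← R3 + R6, R5 ← R0 + R6, …
[Figure: the decoder feeds a pipeline register (arch. dest. reg field plus the rest of the pipeline reg.), which feeds the reg. mapping (write circuit, read circuit) and the rest of the modules in the dispatch stage. Labels on the figure: "Stimulates Bug", "Bug Location", "Propagates to failure".]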
Experimental Setup
• SimpleScalar architectural simulator
  • Alpha 21264 configuration
  • Augmented with ~1K error injection points
• Error model – single bit-flips
  • Hard-to-repeat electrical bugs
  • Both flip-flops & combinational logic
• Stimulus
  • SPECint 2000 benchmarks
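
The single bit-flip error model is easy to state in code. This is a generic sketch, not the injection logic used in the experiments; the function names and the use of a seeded `random.Random` are assumptions made for illustration.

```python
import random


def flip_bit(value: int, bit: int) -> int:
    """Return value with exactly one bit inverted."""
    return value ^ (1 << bit)


def inject_single_bit_flip(value: int, width: int,
                           rng: random.Random) -> tuple:
    """Flip one randomly chosen bit of a width-bit signal.

    Returns (corrupted_value, flipped_bit_position).
    """
    bit = rng.randrange(width)
    return flip_bit(value, bit), bit


rng = random.Random(0)  # seeded so a run can be repeated
corrupted, position = inject_single_bit_flip(0b1010, 4, rng)
print(bin(corrupted), position)
```

Flipping the same bit twice restores the original value, which is handy when an injected error must be removed again after one cycle.
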
Experimental Flow
• Warm up for a million cycles, then inject error
• Decision points: "Any failure detected?" (if not, masked/silent error) and "Short error latency?"
• Post-analyze, with three possible outcomes: exact localization, localization with candidates, complete miss
• 100K simulation runs, 800 post-analysis runs
IFRA Bug Localization Results
• Correct localization (96%)
• Exact localization (78%)
• Localization with avg. 6 candidates (22%)
• Complete miss (4%)
• Localization resolution
  • Bug exposing stimulus
  • One of 200 erroneous design blocks
  • Avg. block size: 10K 2-input NAND gates
Conclusion
• IFRA is inexpensive
  • 1% area, no expensive logic analyzers
  • No failure reproduction or system simulation
• Effective
  • 96% accuracy
• Practical
  • Alpha processor demonstration
Acknowledgement • Bob Gottlieb, Intel • Nagib Hakim, Intel • Ted Hong, Stanford University • Doug Josephson, Intel • Onur Mutlu, Microsoft Research • Priyadarshan Patra, Intel • Eric Rentschler, AMD • Jason Stinson, Intel