Learn how a design with 42 million transistors and 1 million lines of RTL code is validated, including how subtle bugs such as the FMUL sticky-bit error are found. Covers formal verification, running design and validation in parallel, structural RTL models, the model release process at the cluster and full-chip levels, and the management of validation teams and compute resources.
A Billion Cycles a Day: Industrial Verification • Matthew Heath • Presentation to Synthesis & Verification Class, May 8, 2003 • Based on “Validating the Intel Pentium 4 Microprocessor” by Bob Bentley, DAC 2001
How do you verify a design with... • 42 million transistors • 1 million lines of RTL code • 600 – 1000 people working on it • A 3-year design time • Daily design changes
How do you verify a design which has bugs like this? • The FMUL instruction, when the rounding mode is set to “round up”, incorrectly sets the sticky bit when the source operands are: src1[67:0] = X*2^(i+15) + 1*2^i and src2[67:0] = Y*2^(j+15) + 1*2^j, where i+j = 54 and {X,Y} are integers
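To give a feel for how narrow this trigger condition is, the following Python sketch builds operand pairs of the stated form, X*2^(i+15) + 1*2^i with i + j = 54. It only constructs bit patterns and does not model the floating-point unit. The function names and the ranges chosen for i, X, and Y are assumptions, picked so that each value fits in the 68-bit field.

```python
import random

WIDTH = 68  # src1[67:0] and src2[67:0] are 68-bit fields

def make_operand(coeff: int, shift: int) -> int:
    """Return coeff*2^(shift+15) + 1*2^shift (hypothetical helper)."""
    return (coeff << (shift + 15)) + (1 << shift)

def fmul_trigger_pair(rng: random.Random) -> tuple[int, int, int, int]:
    """Pick i + j = 54 and small X, Y; return (src1, src2, i, j).

    Assumed ranges: i and j in [8, 46] and X, Y in [1, 127], so that
    every operand stays below 2^68.
    """
    i = rng.randint(8, 46)
    j = 54 - i
    x = rng.randint(1, 127)
    y = rng.randint(1, 127)
    return make_operand(x, i), make_operand(y, j), i, j

if __name__ == "__main__":
    src1, src2, i, j = fmul_trigger_pair(random.Random(0))
    print(f"i={i} j={j}")
    print(f"src1 = {src1:068b}")
    print(f"src2 = {src2:068b}")
```

Each operand has a single 1 at bit i (or j), fourteen zero bits above it, and an arbitrary integer above that, which is why random testing is unlikely to hit the pattern by chance.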
And the answer is... • Hire 70+ validation engineers • Buy several thousand compute servers • Write 12,000 validation tests • Run up to 1 billion simulation cycles per day for 200 days • Check 2,750,000 manually-defined properties • Find, diagnose, track, and resolve 7,855 bugs • Apply formal verification with 10,000 proofs to the instruction decoder and FP units • This found that obscure FMUL bug!
We know why validation is hard for tools. Why is it hard for people who run them? • To meet an aggressive tapeout schedule, design and validation must occur in parallel without one blocking the other. • Validation starts before the design is done • Design changes occur while validation tests are running • Both design and validation must continue in the presence of known, unfixed bugs
The design team • 300 designers write RTL code • Refer to architectural spec, textbooks, research papers, conversations • Start with basic functionality and progressively add features according to project staging plan • Do simple self-checks along the way
The validation team • 100 validators write RTL tests • Refer to same sources as designers, plus the RTL implementation itself • Write functional tests to exercise features as they’re implemented • Run tests on RTL simulator • Diagnose failures • File bug reports in central database
The management • Collect and analyze data • Pass/fail status of tests • Bug database statistics (counts, priority, age, discovery rate, fix rate, etc.) • RTL feature implementation progress • Compare trends with project schedule • Respond if necessary • Re-allocate resources to high-risk areas • Prioritize work
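As an illustration of the trend data mentioned above, this Python sketch derives weekly discovery rate, fix rate, and open backlog from two lists of week numbers. The function name and the input format are assumptions; the slides do not describe how the statistics were actually computed.

```python
from collections import Counter

def weekly_rates(found_weeks, fixed_weeks):
    """Return {week: (found, fixed, open_backlog)} in week order.

    found_weeks / fixed_weeks hold one week number per bug found or
    fixed (hypothetical input format).
    """
    found = Counter(found_weeks)
    fixed = Counter(fixed_weeks)
    backlog = 0
    report = {}
    for week in sorted(set(found) | set(fixed)):
        backlog += found[week] - fixed[week]
        report[week] = (found[week], fixed[week], backlog)
    return report

if __name__ == "__main__":
    # Made-up data: four bugs found, three fixed.
    for week, row in weekly_rates([1, 1, 2, 3], [2, 3, 3]).items():
        print(week, row)
```

A backlog that keeps growing while the schedule assumes it should shrink is the kind of signal that would prompt re-allocating resources.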
SRTL = “Structural” RTL • Boolean equations; no behavioral syntax • State-accurate • RTL state maps directly to schematic state • High-level constructs supported • Macros, constants, loops, vectors • Design hierarchy • Full-chip has 6 clusters • Each cluster has several units • Each unit has tens of functional blocks • Each block has O(10^4) transistors • Each designer owns several functional blocks
SRTL models • Cluster and full-chip level • Full-chip models consume ~1GB of disk space • Compiled, executable SRTL code • Source code • Test environments • Include emulation of external logic • Direct control over interface signals • Pre-defined sets of signals commonly selected for tracing during test debug • Library of useful test fragments
Most design work at cluster level • Decouples cluster and full-chip validation • Designers “graft” to latest cluster models • Check-out and edit selected source files • Incremental model build • Run validation tests • Revision control system • Designers check-in edited source files • Log messages include change descriptions, author, timestamp
Cluster model release process • Designers periodically turn-in selected checked-in versions of source files • Coordinated turn-ins sometimes necessary • Cluster model builders process turn-ins • Merge changes from different versions of the same source file included in multiple turn-ins • Compile an executable cluster SRTL model • Run tests provided by the validators • Report test failures to validators and designers for debug • Acceptable models released to design team for future grafts
Full-chip model release process • Same process, different hierarchy • Cluster model builders don designer’s hat • Graft to full-chip model • Edit based on changes to recent cluster models • Incremental full-chip model build • Run full-chip validation tests • Debug failures, full-chip turn-in • Now full-chip model builders take over... • Process turn-ins from all clusters • Run full-chip validation tests again! • Release full-chip models to design team
Netbatch • 10^9 simulation cycles / day = 10 Hz * 10^5 sec/day * 10^3 computers • Netbatch manages compute server workload • For a given SRTL model and set of tests, create a job file and send it to netbatch • Each sub-team has a netbatch allocation • Jobs exceeding allocation enter wait queue • Wait times of 24+ hrs not uncommon • Test results • Pass/fail statistics • Failure time and meaningful error message • Traces of user-selected system state
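The throughput estimate on this slide is a back-of-the-envelope product: about 10 simulated cycles per second per machine, times roughly 10^5 seconds per day, times about 10^3 machines. The Python sketch below reproduces that arithmetic and compares it with the exact number of seconds in a day. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400; the slide rounds this to 10^5

def cycles_per_day(sim_hz, servers, seconds=SECONDS_PER_DAY):
    """Simulated cycles per day across all servers (hypothetical helper)."""
    return sim_hz * seconds * servers

slide_estimate = cycles_per_day(10, 10**3, seconds=10**5)
exact = cycles_per_day(10, 10**3)

if __name__ == "__main__":
    print(f"slide estimate: {slide_estimate:.2e} cycles/day")
    print(f"with 86,400 s:  {exact:.2e} cycles/day")
```

With the exact day length the product is slightly below one billion, which is consistent with the slide treating the figure as an order-of-magnitude estimate.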
Efficiency improvements • An SRTL change made by a designer... • Appears in a cluster model 1 week later • Appears in a full-chip model 2 weeks later • Validators find bugs in released models which the designer has already fixed • “Onion peeling” vs. “whack-a-mole” debug • Temporarily disabling failing properties • Releasing models which fail some tests • System state capture and restore
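One way to read “releasing models which fail some tests” is as a release gate with an explicit list of known, temporarily waived failures. The Python sketch below shows that idea only. The waiver list, the function name, and the test names are assumptions, since the slides do not state the actual release criteria.

```python
def release_decision(results, waived):
    """Decide whether a model can be released.

    results: dict mapping test name -> True if the test passed.
    waived:  set of test names whose failures are known and tolerated
             for now (hypothetical mechanism).
    Returns (releasable, blocking_failures).
    """
    blocking = sorted(
        name for name, passed in results.items()
        if not passed and name not in waived
    )
    return (not blocking, blocking)

if __name__ == "__main__":
    results = {"fp_basic": True, "fp_round_up": False, "decode_01": False}
    print(release_decision(results, waived={"fp_round_up"}))
    print(release_decision(results, waived={"fp_round_up", "decode_01"}))
```

Keeping the waivers explicit means a known failure does not block the next model release, while any new failure still does.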
Central bug database • Released model version • Failing validation test & symptoms • Root cause • Requested design change • Priority • Log of discussion among designers, validators, and managers • Status / disposition • New, ETA, test fixed, design fixed (& version), validated, dropped
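The fields listed on this slide map naturally onto a record type. The following Python sketch is one possible shape for such a record, not the schema of the actual database. The class, field, and method names are assumptions, and the status values are taken from the list above.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    NEW = "new"
    ETA = "ETA"
    TEST_FIXED = "test fixed"
    DESIGN_FIXED = "design fixed"
    VALIDATED = "validated"
    DROPPED = "dropped"

@dataclass
class BugReport:
    model_version: str            # released model the bug was found in
    failing_test: str
    symptoms: str
    priority: int
    root_cause: str = ""
    requested_change: str = ""
    status: Status = Status.NEW
    fixed_in_version: str = ""    # recorded when the design is fixed
    discussion: list = field(default_factory=list)

    def update(self, status, note, fixed_in_version=""):
        """Change status and append a note to the discussion log."""
        self.status = status
        if fixed_in_version:
            self.fixed_in_version = fixed_in_version
        self.discussion.append(note)

if __name__ == "__main__":
    # All values below are made up for illustration.
    bug = BugReport("cluster-model-A", "fp_round_up", "sticky bit set", 1)
    bug.update(Status.DESIGN_FIXED, "fix checked in", "cluster-model-B")
    print(bug.status.value, bug.fixed_in_version, bug.discussion)
```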
Schematic formal verification • Use formal techniques because schematic simulation takes too long • Schematic design starts long before SRTL design is done • Bottom-up • Verify SRTL macros vs. library cells first • Black-box macrocells & verify block • Because SRTL is state-accurate, verification is combinational only!
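Because the state elements line up one-to-one, checking a block reduces to comparing the combinational logic between them. The Python sketch below shows the idea on a toy example by exhaustively comparing an SRTL-style equation with a gate-level version of it. Both functions are made up for illustration, and production tools typically rely on symbolic techniques rather than enumeration.

```python
from itertools import product

def equivalent(f, g, n_inputs):
    """Exhaustively compare two Boolean functions of n_inputs inputs.

    Returns (True, None) if they agree everywhere, otherwise
    (False, counterexample_inputs).
    """
    for bits in product((False, True), repeat=n_inputs):
        if f(*bits) != g(*bits):
            return False, bits
    return True, None

def srtl_z(x, y):
    """SRTL-style equation: Z = X & Y."""
    return x and y

def schematic_z(x, y):
    """Hypothetical schematic: a NAND gate followed by an inverter."""
    nand = not (x and y)
    return not nand

if __name__ == "__main__":
    print(equivalent(srtl_z, schematic_z, 2))
    print(equivalent(srtl_z, lambda x, y: x or y, 2))
```

When the check fails, the returned input assignment is a concrete counterexample that can be handed back to the designer.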
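One SRTL state may map to multiple functionally equivalent schematic states • SRTL: Z = X & Y; MSFF (Z, W, CLK) • [Figure: the SRTL circuit with signals X, Y, Z, W, and CLK, shown beside a schematic in which the state element is duplicated as flops W1 and W2, with inputs Z1 and Z2]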
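Retiming must be back-annotated into SRTL • Exception: Inverters • SRTL: Z = X & Y; MSFF (Z, W, CLK) • SRTL: MSFF (X, Y, CLK); Z = ~Y • [Figure: schematic flop diagrams for the two SRTL fragments, with signals X, Y, Z1, Z2, W, and CLK in the first and X, Y, Z, and CLK in the second]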
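The inverter exception can be illustrated with a small cycle-by-cycle simulation. The Python sketch below assumes that MSFF(d, q, CLK) denotes a flop with input d and output q, and that the schematic moves the inverter from the flop output to the flop input. Both points are interpretations of the slide, not statements from it. Under those assumptions the stored bit is complemented but the visible output Z is the same in every cycle.

```python
def srtl_outputs(xs, reset_state=False):
    """SRTL order: MSFF(X, Y, CLK) followed by Z = ~Y."""
    y = reset_state
    zs = []
    for x in xs:
        zs.append(not y)  # Z seen during this cycle
        y = x             # flop captures X at the clock edge
    return zs

def schematic_outputs(xs, reset_state=False):
    """Assumed schematic: inverter in front of the flop, flop drives Z."""
    q = not reset_state   # stored state is the complement of Y
    zs = []
    for x in xs:
        zs.append(q)      # Z is the flop output directly
        q = not x         # flop captures ~X at the clock edge
    return zs

if __name__ == "__main__":
    stimulus = [True, False, False, True]
    print(srtl_outputs(stimulus))
    print(schematic_outputs(stimulus))
```

Since the two state bits differ only by a fixed inversion, the state mapping stays one-to-one, which is a plausible reason why this case needs no back-annotation.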
Conclusion • Efficient verification of large-scale designs is a daunting management challenge • Design and validation are concurrent, not iterative • Possible with adequate resources and powerful tools to use the resources efficiently • Methodology constraints keep the problem tractable • Clear communication among team • Careful documentation • Progress tracking is key to staying on schedule • Motto: “If it hasn’t been verified, it doesn’t work.”
How NOT to do verification... Arnold was unhappily aware that the complete Jurassic Park program contained more than half a million lines of code, most of it undocumented, without explanation... “What are you doing, John?” “Checking the code.” “By inspection? That’ll take forever.” - Michael Crichton, Jurassic Park