A Billion Cycles a Day: Industrial Verification

Matthew Heath. Presentation to Synthesis & Verification Class, May 8, 2003. Based on “Validating the Intel Pentium 4 Microprocessor” by Bob Bentley, DAC 2001.

How do you verify a design with...
• 42 million transistors
• 1 million lines of RTL code
• 600 – 1000 people working on it
• A 3-year design time
• Daily design changes

How do you verify a design which has bugs like this??
• The FMUL instruction, when the rounding mode is set to “round up”, incorrectly sets the sticky bit when the source operands are:
  src1[67:0] = X*2^(i+15) + 1*2^i
  src2[67:0] = Y*2^(j+15) + 1*2^j
  where i+j = 54 and {X,Y} are integers
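To get a feel for how narrow this operand pattern is, here is a minimal sketch that builds operand pairs of the stated form. The helper name `fmul_bug_operands` and the random choice of X and Y are illustrative assumptions; the code does not model the FMUL unit or its sticky-bit logic.

```python
import random

WIDTH = 68  # operands are src[67:0]

def fmul_bug_operands(i, rng):
    """Return (src1, src2) = (X*2^(i+15) + 2^i, Y*2^(j+15) + 2^j) with i + j = 54."""
    j = 54 - i
    # Both high terms must fit in 68 bits, which restricts i to 2..52.
    if not 2 <= i <= 52:
        raise ValueError("i must be in 2..52 so both operands fit in 68 bits")
    x = rng.randrange(1 << (WIDTH - (i + 15)))  # hypothetical random X
    y = rng.randrange(1 << (WIDTH - (j + 15)))  # hypothetical random Y
    src1 = (x << (i + 15)) + (1 << i)
    src2 = (y << (j + 15)) + (1 << j)
    return src1, src2

rng = random.Random(0)
src1, src2 = fmul_bug_operands(20, rng)
product = src1 * src2
# The low 69 bits of the product hold a single 1 at bit 54:
# nothing below it, and 14 zero bits directly above it.
low_bits = product & ((1 << 69) - 1)
print(hex(src1), hex(src2), low_bits == 1 << 54)
```

Whatever X and Y are, the product always carries that isolated bit at position 54, which suggests why random stimulus is unlikely to hit the case by chance.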

And the answer is...
• Hire 70+ validation engineers
• Buy several thousand compute servers
• Write 12,000 validation tests
• Run up to 1 billion simulation cycles per day for 200 days
• Check 2,750,000 manually-defined properties
• Find, diagnose, track, and resolve 7,855 bugs
• Apply formal verification with 10,000 proofs to the instruction decoder and FP units
  • This found that obscure FMUL bug!
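The simulation figures above put an upper bound on the total number of cycles simulated. The sketch below works that out and, purely for scale, converts it to run time on real silicon; the 1 GHz clock is an assumed round number, not a figure from the slides.

```python
CYCLES_PER_DAY = 1_000_000_000    # "up to" figure from the slide, so an upper bound
DAYS = 200
ASSUMED_CLOCK_HZ = 1_000_000_000  # hypothetical 1 GHz clock, not from the slide

total_cycles = CYCLES_PER_DAY * DAYS
seconds_on_silicon = total_cycles / ASSUMED_CLOCK_HZ
print(f"{total_cycles:.1e} cycles, {seconds_on_silicon:.0f} s at the assumed clock")
```

Under that assumption, the whole simulation campaign covers only a few minutes of real chip operation.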

We know why validation is hard for tools. Why is it hard for the people who run them?
• To meet an aggressive tapeout schedule, design and validation must occur in parallel without one blocking the other.
  • Validation starts before the design is done
  • Design changes occur while validation tests are running
  • Both design and validation must continue in the presence of known, unfixed bugs

The design team
• 300 designers write RTL code
• Refer to architectural spec, textbooks, research papers, conversations
• Start with basic functionality and progressively add features according to project staging plan
• Do simple self-checks along the way

The validation team
• 100 validators write RTL tests
• Refer to same sources as designers, plus the RTL implementation itself
• Write functional tests to exercise features as they’re implemented
• Run tests on RTL simulator
• Diagnose failures
• File bug reports in central database

The management
• Collect and analyze data
  • Pass/fail status of tests
  • Bug database statistics (counts, priority, age, discovery rate, fix rate, etc.)
  • RTL feature implementation progress
• Compare trends with project schedule
• Respond if necessary
  • Re-allocate resources to high-risk areas
  • Prioritize work
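The bug database statistics mentioned above (discovery rate, fix rate, age) can be sketched as follows. The record layout and the sample dates are hypothetical and only illustrate the bookkeeping; they are not data from the project.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (bug id, date opened, date fixed or None if still open)
bugs = [
    (1, date(2003, 1, 6), date(2003, 1, 10)),
    (2, date(2003, 1, 7), None),
    (3, date(2003, 1, 14), date(2003, 1, 20)),
    (4, date(2003, 1, 15), None),
]

def week_of(d):
    """ISO (year, week) bucket for a date."""
    iso = d.isocalendar()
    return (iso[0], iso[1])

# Discovery rate and fix rate, counted per ISO week.
found_per_week = Counter(week_of(opened) for _, opened, _ in bugs)
fixed_per_week = Counter(week_of(fixed) for _, _, fixed in bugs if fixed)

def open_bug_ages(records, today):
    """Age in days of every bug that has no fix date yet."""
    return {bug_id: (today - opened).days
            for bug_id, opened, fixed in records if fixed is None}

ages = open_bug_ages(bugs, today=date(2003, 1, 21))
print(dict(found_per_week), dict(fixed_per_week), ages)
```

Plotting the two weekly counters against the project schedule is one way to compare trends as the slide describes.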

SRTL = “Structural” RTL
• Boolean equations; no behavioral syntax
• State-accurate
  • RTL state maps directly to schematic state
• High-level constructs supported
  • Macros, constants, loops, vectors
• Design hierarchy
  • Full-chip has 6 clusters
  • Each cluster has several units
  • Each unit has tens of functional blocks
  • Each block has O(10^4) transistors
• Each designer owns several functional blocks

SRTL models
• Cluster and full-chip level
• Full-chip models consume ~1GB of disk space
  • Compiled, executable SRTL code
  • Source code
• Test environments
  • Include emulation of external logic
  • Direct control over interface signals
  • Pre-defined sets of signals commonly selected for tracing during test debug
  • Library of useful test fragments

Most design work at cluster level
• Decouples cluster and full-chip validation
• Designers “graft” to latest cluster models
  • Check-out and edit selected source files
  • Incremental model build
  • Run validation tests
• Revision control system
  • Designers check-in edited source files
  • Log messages include change descriptions, author, timestamp

Cluster model release process
• Designers periodically turn-in selected checked-in versions of source files
  • Coordinated turn-ins sometimes necessary
• Cluster model builders process turn-ins
  • Merge changes from different versions of the same source file included in multiple turn-ins
  • Compile an executable cluster SRTL model
  • Run tests provided by the validators
  • Report test failures to validators and designers for debug
• Acceptable models released to design team for future grafts

Full-chip model release process
• Same process, different hierarchy
• Cluster model builders don designer’s hat
  • Graft to full-chip model
  • Edit based on changes to recent cluster models
  • Incremental full-chip model build
  • Run full-chip validation tests
  • Debug failures, full-chip turn-in
• Now full-chip model builders take over...
  • Process turn-ins from all clusters
  • Run full-chip validation tests again!
  • Release full-chip models to design team

Netbatch
• 10^9 simulation cycles / day = 10 Hz * 10^5 sec/day * 10^3 computers
• Netbatch manages compute server workload
  • For a given SRTL model and set of tests, create a job file and send it to netbatch
• Each sub-team has a netbatch allocation
  • Jobs exceeding allocation enter wait queue
  • Wait times of 24+ hrs not uncommon
• Test results
  • Pass/fail statistics
  • Failure time and meaningful error message
  • Traces of user-selected system state
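The billion-cycles figure is an order-of-magnitude estimate. A short sketch of the arithmetic, using the slide's round numbers and, for comparison, the exact number of seconds in a day:

```python
SIM_SPEED_HZ = 10          # simulated cycles per second on one server (slide figure)
SECONDS_PER_DAY = 10**5    # slide's round number; a real day has 86,400 s
SERVERS = 10**3

cycles_per_day = SIM_SPEED_HZ * SECONDS_PER_DAY * SERVERS
exact_cycles_per_day = SIM_SPEED_HZ * 86_400 * SERVERS
print(cycles_per_day, exact_cycles_per_day)
```

With 86,400 seconds the product is somewhat below 10^9, which is consistent with the “up to 1 billion” wording used earlier in the talk.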

Efficiency improvements
• An SRTL change made by a designer...
  • Appears in a cluster model 1 week later
  • Appears in a full-chip model 2 weeks later
• Validators find bugs in released models which the designer has already fixed
• “Onion peeling” vs. “whack-a-mole” debug
• Temporarily disabling failing properties
• Releasing models which fail some tests
• System state capture and restore

Central bug database
• Released model version
• Failing validation test & symptoms
• Root cause
• Requested design change
• Priority
• Log of discussion among designers, validators, and managers
• Status / disposition
  • New, ETA, test fixed, design fixed (& version), validated, dropped
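The fields listed above map naturally onto a record type. The following is a minimal sketch of such a record; the class, field, and method names are hypothetical, and nothing here reflects the actual database schema used on the project.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Status(Enum):
    # Status values taken from the slide.
    NEW = "new"
    ETA = "ETA"
    TEST_FIXED = "test fixed"
    DESIGN_FIXED = "design fixed"
    VALIDATED = "validated"
    DROPPED = "dropped"

@dataclass
class BugReport:
    model_version: str                       # released model the bug was found in
    failing_test: str
    symptoms: str
    priority: int
    root_cause: Optional[str] = None
    requested_change: Optional[str] = None
    status: Status = Status.NEW
    fixed_in_version: Optional[str] = None   # the "(& version)" attached to design fixed
    discussion: list = field(default_factory=list)

    def mark_design_fixed(self, version, note):
        """Record the model version containing the fix and log a discussion note."""
        self.status = Status.DESIGN_FIXED
        self.fixed_in_version = version
        self.discussion.append(note)

# Hypothetical example entry.
bug = BugReport("cluster-model-A", "fmul_round_up_test", "sticky bit mismatch", priority=1)
bug.mark_design_fixed("cluster-model-B", "designer: fix turned in")
print(bug.status.value, bug.fixed_in_version)
```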

Bug root causes

Schematic formal verification
• Use formal techniques because schematic simulation takes too long
• Schematic design starts long before SRTL design is done
• Bottom-up
  • Verify SRTL macros vs. library cells first
  • Black-box macrocells & verify block
• Because SRTL is state-accurate, verification is combinational only!
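When every SRTL state element corresponds to a schematic state element, equivalence checking reduces to comparing the combinational logic that feeds each one. The sketch below illustrates the idea by brute-force enumeration on a toy two-input function. The function names and the NAND-plus-inverter schematic are hypothetical, and exhaustive enumeration stands in for the symbolic methods (such as BDDs or SAT) that a formal tool would use on real blocks.

```python
from itertools import product

def equivalent(f, g, num_inputs):
    """Exhaustively compare two Boolean functions; return (ok, counterexample)."""
    for bits in product((0, 1), repeat=num_inputs):
        if f(*bits) != g(*bits):
            return False, bits
    return True, None

def srtl_next_z(x, y):
    # SRTL equation: Z = X & Y
    return x & y

def schem_next_z(x, y):
    # Hypothetical gate-level version: NAND followed by an inverter.
    nand = 1 - (x & y)
    return 1 - nand

def buggy_schem_next_z(x, y):
    # Hypothetical wiring error: OR gate instead of AND.
    return x | y

print(equivalent(srtl_next_z, schem_next_z, 2))
print(equivalent(srtl_next_z, buggy_schem_next_z, 2))
```

No clock cycles are simulated: each check covers a single cone of logic between state elements, which is what keeps the problem tractable.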

One SRTL state may map to multiple functionally equivalent schem states
[Figure: the SRTL fragment "Z = X & Y; MSFF (Z, W, CLK)" shown beside a schematic in which the flip-flop is duplicated (signals Z1/W1 and Z2/W2, common CLK)]

Retiming must be back-annotated into SRTL
• Exception: Inverters
[Figure: schematics for the SRTL fragments "Z = X & Y; MSFF (Z, W, CLK)" and "MSFF (X, Y, CLK); Z = ~Y", with intermediate signals Z1 and Z2]
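One reading of the inverter exception is that moving an inverter across a flip-flop only flips the polarity of the stored bit, so the state mapping survives. The sketch below simulates the SRTL fragment `MSFF (X, Y, CLK); Z = ~Y` against a hypothetical schematic with the inverter moved in front of the flop. The function names and the initial values are assumptions chosen so that the two states start as complements of each other.

```python
from itertools import product

def srtl_model(xs, y0=0):
    """MSFF (X, Y, CLK); Z = ~Y. Returns (outputs, states) sampled before each clock edge."""
    y, outs, states = y0, [], []
    for x in xs:
        outs.append(1 - y)   # Z = ~Y
        states.append(y)
        y = x                # flop captures X
    return outs, states

def schem_model(xs, q0=1):
    """Hypothetical retimed schematic: the flop stores ~X and drives Z directly."""
    q, outs, states = q0, [], []
    for x in xs:
        outs.append(q)       # Z is the flop output
        states.append(q)
        q = 1 - x            # inverter sits before the flop
    return outs, states

# Compare the two models on every input sequence of length 4.
all_match = True
for xs in product((0, 1), repeat=4):
    z_srtl, y_states = srtl_model(xs)
    z_schem, q_states = schem_model(xs)
    same_outputs = z_srtl == z_schem
    complemented = all(q == 1 - y for q, y in zip(q_states, y_states))
    all_match = all_match and same_outputs and complemented
print(all_match)
```

The outputs agree cycle by cycle while the stored bits are always complements, so each schematic flop still corresponds to exactly one SRTL state.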

Conclusion
• Efficient verification of large-scale designs is a daunting management challenge
• Design and validation are concurrent, not iterative
• Possible with adequate resources and powerful tools to use the resources efficiently
• Methodology constraints keep the problem tractable
• Clear communication among team
• Careful documentation
• Progress tracking is key to staying on schedule
• Motto: “If it hasn’t been verified, it doesn’t work.”

How NOT to do verification... Arnold was unhappily aware that the complete Jurassic Park program contained more than half a million lines of code, most of it undocumented, without explanation... “What are you doing, John?” “Checking the code.” “By inspection? That’ll take forever.” - Michael Crichton, Jurassic Park
