
Why do so many chips fail?

Ira Chayut, Verification Architect (opinions are my own and do not necessarily represent the opinion of my employer)


Presentation Transcript


  1. Why do so many chips fail? Ira Chayut, Verification Architect (opinions are my own and do not necessarily represent the opinion of my employer)

  2. Failure rate of first silicon is rising
     • “… research by Collett International revealed that 52% of complex application specific integrated circuits (ASICs) required a respin and the reason was largely due to functional errors.” (http://www.techonline.com/community/ed_resource/feature_article/36655)
     • Who is to blame? (There must be someone to blame!)
       • Management – they didn’t provide enough resources
       • HW Engineering – they created the functional errors
       • Verification – they didn’t catch the functional errors
       • Architecture – they didn’t focus on testability
       • Marketing – they kept changing the specs

  3. People don’t kill chips, complexity kills chips
     • http://www.cs.utexas.edu/users/dburger/teaching/cs395t-s99/papers/2_src.pdf (1999)
     • Projected numbers are a bit lower than current reality: a dual-core AMD Opteron has 233 million transistors and the Intel Itanium 2 has 592 million transistors

  4. Complexity increases exponentially
     • Chip component count increases exponentially over time (Moore’s law)
     • Interactions increase super-exponentially
     • IP reuse and parallel design teams allow more functions per chip, with fewer HW engineers per function
     • Verification effort gets combinatorially more difficult as functions are added
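
The combinatorial growth described on this slide can be illustrated with a little arithmetic. The sketch below is not from the presentation: it assumes, purely for illustration, that any group of two or more functions on a chip is a potential interaction that verification might need to cover, and the name `interaction_counts` is hypothetical.

```python
from math import comb


def interaction_counts(num_functions: int) -> dict:
    """Count potential interactions among num_functions on-chip functions.

    Assumption (illustrative only): every pair of functions may interact,
    and more generally every group of two or more functions may interact.
    """
    pairwise = comb(num_functions, 2)
    # All subsets, minus the empty set and the single-function subsets.
    multiway = 2 ** num_functions - num_functions - 1
    return {"functions": num_functions, "pairwise": pairwise, "multiway": multiway}


if __name__ == "__main__":
    # Doubling the function count far more than doubles the interactions.
    for n in (4, 8, 16, 32):
        print(interaction_counts(n))
```

Under this counting, the number of functions only has to grow linearly for the multiway interactions to grow exponentially; if the function count itself grows exponentially over time, the interaction count grows faster still.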

  5. Why verification is not able to keep up
     • Verification effort gets combinatorially more difficult as functions are added
     BUT
     • Verification staffing/time cannot be made combinatorially larger to compensate
     AND
     • Chip lifetimes are too short to allow for complete testing
     THUS
     • Chips will continue to have ever-increasing functional errors as chips get more complex

  6. Limiting the number of architectural and functional errors
     • Thorough unit-level verification testing
       • Small simulations run faster
       • Avoids combinatorial explosion of interactions
     • Well defined interfaces between blocks with assertions and formal verification techniques to reduce inter-block problems
     • Emulation or FPGA prototyping to accelerate testing
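
As a rough illustration of what an assertion on a block interface checks, the sketch below monitors a recorded trace of a valid/ready handshake. The protocol rules, the `Cycle` record, and `check_handshake` are assumptions made for this example and are not taken from the presentation; in practice such checks are usually written in a hardware assertion language rather than Python.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Cycle:
    """One clock cycle observed on a hypothetical valid/ready interface."""
    valid: bool
    ready: bool
    data: int


def check_handshake(trace: List[Cycle]) -> List[Tuple[int, str]]:
    """Return (cycle index, message) for each violation of the assumed rules.

    Assumed rules: once valid is asserted without ready, the sender must
    keep valid high and hold data stable until ready is seen.
    """
    violations = []
    pending: Optional[int] = None  # data offered but not yet accepted
    for i, cycle in enumerate(trace):
        if pending is not None:
            if not cycle.valid:
                violations.append((i, "valid dropped before ready"))
            elif cycle.data != pending:
                violations.append((i, "data changed while waiting for ready"))
        # A transfer is still outstanding only if it was offered and not accepted.
        pending = cycle.data if (cycle.valid and not cycle.ready) else None
    return violations


if __name__ == "__main__":
    good = [Cycle(True, False, 5), Cycle(True, True, 5), Cycle(False, False, 0)]
    bad = [Cycle(True, False, 5), Cycle(True, True, 7)]
    print(check_handshake(good))
    print(check_handshake(bad))
```

A checker like this sits on the boundary between two blocks, so a violation points at the block that broke the contract instead of surfacing later as a hard-to-trace chip-level failure.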

  7. How to live with functional errors
     • Successful companies have learned how to ship chips with functional and architectural errors: time-to-market pressures and chip complexity force the delivery of chips that are not perfect (even if perfection were possible). How can this be done better?
     • For a long while, DRAMs have been made with extra components to allow a less-than-perfect chip to provide full device function and to ship
     • How to do the same with architectural features? How can full device function exist in the presence of architectural or implementation omissions or errors?
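
The DRAM approach mentioned above can be sketched as a remapping table that redirects defective rows to spares. This is a simplified software model under assumed names (`RepairableMemory`, spare rows placed after the normal rows); real devices implement the remap in hardware, typically configured at test time.

```python
class RepairableMemory:
    """Toy model of a memory array with spare rows (illustrative assumption)."""

    def __init__(self, num_rows: int, num_spares: int, defective_rows: set):
        if len(defective_rows) > num_spares:
            raise ValueError("more defective rows than spares: part cannot ship")
        self.num_rows = num_rows
        # Spare rows live at physical indices num_rows .. num_rows + num_spares - 1.
        self._remap = {
            bad: num_rows + k for k, bad in enumerate(sorted(defective_rows))
        }
        self._cells = [0] * (num_rows + num_spares)

    def _physical(self, row: int) -> int:
        """Translate a logical row to the physical row that actually stores it."""
        if not 0 <= row < self.num_rows:
            raise IndexError("row out of range")
        return self._remap.get(row, row)

    def write(self, row: int, value: int) -> None:
        self._cells[self._physical(row)] = value

    def read(self, row: int) -> int:
        return self._cells[self._physical(row)]


if __name__ == "__main__":
    mem = RepairableMemory(num_rows=8, num_spares=2, defective_rows={3})
    mem.write(3, 42)    # lands in a spare row, not the defective one
    print(mem.read(3))  # the user still sees a fully working 8-row device
```

The user-visible interface is unchanged by the repair, which is the property the slide asks for at the architectural level.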

  8. Architecture support
     • Embrace Perl’s motto, “There's More Than One Way to Do It”: allow for multiple ways of accomplishing all critical specified functions
     • Analogous to Design for Test (DFT) and Design for Verification (DFV), we should start thinking about Architect for Verification (AFV) [Thanks to Dave Whipp for the AFV phrase and acronym]
     • In some problem domains, such as networking, upper-layer protocols can recover from some silicon errors, though at a performance penalty

  9. Architecture support, continued
     • A programmable abstraction layer between the real hardware and user’s API can hide functional warts: hardware catches specific operations and either directs them to one of multiple hardware implementations, or signals a software trap
     • Pyramid minicomputers hid the assembly language from users, so the compiler could work around problems
     • Transmeta maps standard machine language to a hidden processor architecture, so the translation software can work around problems
     • Soft hardware can allow chip redesign after silicon is frozen (and shipped!)
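
The programmable abstraction layer described on this slide can be sketched as a dispatch table that is reprogrammed once an erratum is found. Everything named here (`AbstractionLayer`, `HardwareTrap`, the two multiplier functions and the bug in the first one) is hypothetical and exists only to illustrate the idea.

```python
class HardwareTrap(Exception):
    """Raised when no working implementation exists for an operation."""


class AbstractionLayer:
    """Routes each operation to the first enabled implementation (a sketch)."""

    def __init__(self):
        self._impls = {}        # op name -> list of (impl name, callable)
        self._disabled = set()  # (op name, impl name) pairs known to be broken
        self._software = {}     # op name -> software fallback callable

    def register(self, op, impl_name, fn):
        self._impls.setdefault(op, []).append((impl_name, fn))

    def register_software_fallback(self, op, fn):
        self._software[op] = fn

    def disable(self, op, impl_name):
        """Post-silicon patch: stop using an implementation with a known bug."""
        self._disabled.add((op, impl_name))

    def execute(self, op, *args):
        for impl_name, fn in self._impls.get(op, []):
            if (op, impl_name) not in self._disabled:
                return fn(*args)
        if op in self._software:
            return self._software[op](*args)  # models the software trap path
        raise HardwareTrap(f"no implementation available for {op!r}")


def fast_multiply(a, b):
    # Hypothetical erratum: wrong result whenever the first operand is 255.
    return 0 if a == 255 else a * b


def shift_add_multiply(a, b):
    # Slower alternate path; assumes b is a non-negative integer.
    result = 0
    while b:
        if b & 1:
            result += a
        a <<= 1
        b >>= 1
    return result


if __name__ == "__main__":
    layer = AbstractionLayer()
    layer.register("mul", "fast", fast_multiply)
    layer.register("mul", "shift_add", shift_add_multiply)
    print(layer.execute("mul", 255, 2))  # hits the buggy fast path
    layer.disable("mul", "fast")         # workaround applied after tape-out
    print(layer.execute("mul", 255, 2))  # now served by the alternate path
```

The workaround costs performance on the affected operation, but callers of `execute` never see a change in the interface.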

  10. Summary
     • Ever-increasing chip complexity prevents total testing before tape-out (or even before shipping)
     • AFV techniques can make chip verification not subject to combinatorial explosion
     • We have to accept that there will be architectural and functional failures in every advanced chip that is built
     • Architecture support is needed to allow failures to be worked around or fixed post-silicon
