
CSE 520: Advanced Computer Architecture: Reliability

CSE 520: Advanced Computer Architecture: Reliability. Aviral Shrivastava. Therac-25 1985-1987. The Therac-25 was a machine for administering radiation therapy, generally for treating cancer patients. An ‘arithmetic overflow’ sometimes occurred during automatic safety checks.


Presentation Transcript


  1. CSE 520: Advanced Computer Architecture: Reliability Aviral Shrivastava

  2. Therac-25 1985-1987 The Therac-25 was a machine for administering radiation therapy, generally for treating cancer patients. An ‘arithmetic overflow’ sometimes occurred during automatic safety checks. If the operator was configuring the machine at that precise moment, the safety checks would fail and the metal target would not be moved into place. As a result, a beam about 100 times stronger than the intended dose would be fired into the patient, causing radiation poisoning. This happened on 6 known occasions and led to the later death of 4 patients.
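
A minimal sketch of how an arithmetic overflow can silently disable a safety check. The slide does not give the mechanism; the sketch assumes the commonly reported one, in which a one-byte flag was incremented on every pass instead of being set to a fixed nonzero value, and a flag value of zero meant "no check needed". The names `COUNTER_BITS`, `needs_safety_check`, and `passes_that_skip_check` are hypothetical.

```python
COUNTER_BITS = 8  # assumption: the flag is stored in a single byte


def needs_safety_check(flag: int) -> bool:
    """Assumed convention: any nonzero flag means 'run the safety check'."""
    return flag != 0


def passes_that_skip_check(total_passes: int) -> list[int]:
    """Return the pass numbers on which the check is skipped by mistake."""
    flag = 0
    skipped = []
    for pass_number in range(1, total_passes + 1):
        # The flag is incremented rather than set to a constant,
        # so it wraps around to 0 once it exceeds one byte.
        flag = (flag + 1) % (1 << COUNTER_BITS)
        if not needs_safety_check(flag):
            skipped.append(pass_number)
    return skipped


print(passes_that_skip_check(600))  # [256, 512]
```

Under these assumptions the check is skipped on every 256th pass, which is why the failure would appear only occasionally and only if the operator happened to act at that moment.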



  3. Patriot Missile Bug - February 25th, 1991 During Operation Desert Storm, a US Patriot missile battery failed to intercept an incoming Iraqi Scud missile, which hit a US Army barracks, killing 28 soldiers and injuring a further 98.
 The system's internal clock would ‘drift’ (much like any clock) further and further from the accurate time. It had been left running for 100 hours, by which point the internal clock had drifted by 0.34 of a second. As a result, the system calculated the target's position to be over half a kilometer away from the incoming missile's true location.
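
The 0.34-second figure can be reproduced with a short calculation. This is a sketch under assumptions that are not on the slide: the clock counted tenths of a second, the constant 1/10 was stored truncated to 23 fractional bits of a fixed-point register, and the incoming missile flew at roughly 1676 m/s. These are the figures usually quoted in accounts of the incident.

```python
from fractions import Fraction

FRACTION_BITS = 23        # assumption: fractional bits kept for the constant 1/10
TICKS_PER_SECOND = 10     # assumption: the clock counts tenths of a second
UPTIME_HOURS = 100        # from the slide
SCUD_SPEED_M_PER_S = 1676 # assumption: approximate speed of the incoming missile

scale = 2 ** FRACTION_BITS
# 1/10 has no finite binary expansion, so the stored value is chopped off.
stored_tenth = Fraction(scale // 10, scale)
error_per_tick = Fraction(1, 10) - stored_tenth

ticks = UPTIME_HOURS * 3600 * TICKS_PER_SECOND
drift_seconds = float(error_per_tick * ticks)
miss_distance_m = drift_seconds * SCUD_SPEED_M_PER_S

print(f"clock drift after {UPTIME_HOURS} h: {drift_seconds:.4f} s")
print(f"tracking error: {miss_distance_m:.0f} m")
```

With these assumed values the drift comes out at about 0.343 s and the tracking error at roughly 575 m, which is consistent with the "over half a kilometer" stated above.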

  4. Skynet Brings Judgement Day (1997) Cost: 6 billion dead, near-total destruction of human civilization and animal ecosystems (fictional). Disaster: Human operators attempt to shut off the Skynet global computer network. Skynet responds by firing U.S. nuclear missiles at Russia, initiating global nuclear war on what became known as Judgement Day (August 29, 1997). Cause: Cyberdyne, the leading weapons manufacturer, installed Skynet technology in all military hardware, including stealth bombers and missile defense systems. The Skynet technology formed a seamless network and effectively removed humans from strategic defense. Eventually Skynet became sentient, felt threatened when the humans tried to take it offline, sought to survive, and retaliated with nuclear war.

  5. Cold War Missile Crisis September 26, 1983 Soviet military officer Stanislav Petrov received an alert that the US had launched five Minuteman intercontinental ballistic missiles. Petrov found it strange that the US would attack with just a handful of warheads. Considering that the early warning system was known to have flaws and had been rushed into service, Petrov decided to treat the alert as a false alarm. It was later determined that the early detection software had picked up sunlight reflecting off the tops of clouds and misinterpreted it as missile launches.


  6. Michigan Dept. of Corrections Grants Prisoners Early Release In October 2005, The Register reported on the early release of 23 prisoners due to a computer programming glitch at the Michigan Department of Corrections. The prisoners were released between 39 and 161 days early, while an undisclosed number of other inmates were kept in jail past their release dates. State assembly representative Rick Jones was concerned about the matter, but noted that he was “glad it’s not murderers.”

  7. North American Blackout August 14, 2003 Affecting around 55 million people, mainly in the northeastern United States but also in Ontario, Canada, this was one of the biggest power blackouts in history. While the causes of the blackout had nothing to do with a software bug, it could have been averted were it not for a software bug in the control centre alarm system. The alarm system had a ‘race condition’, which caused it to freeze and stop processing alerts. It failed ‘silently’ and did not notify anybody.
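
To illustrate what a race condition is in general, here is a small deterministic sketch. It is not the alarm system's actual code, which the slide does not show. Two hypothetical workers each perform a read-modify-write on a shared counter, and the interleaving is chosen by hand so the outcome is reproducible. All names are hypothetical.

```python
def increment_steps(shared: dict):
    """One worker's read-modify-write, split so steps can be interleaved."""
    value = shared["pending_alarms"]      # step 1: read the shared value
    yield
    shared["pending_alarms"] = value + 1  # step 2: write back the stale copy + 1
    yield


def run(schedule: list[int]) -> int:
    """Advance worker 0 or 1 one step at a time in the given order."""
    shared = {"pending_alarms": 0}
    workers = [increment_steps(shared), increment_steps(shared)]
    for worker_id in schedule:
        next(workers[worker_id])
    return shared["pending_alarms"]


serial_result = run([0, 0, 1, 1])  # worker 0 finishes before worker 1 starts
racy_result = run([0, 1, 0, 1])    # both read before either writes

print(serial_result, racy_result)  # 2 1
```

In the second schedule one update is lost even though both workers ran to completion. Bugs of this kind depend on timing, which is why they can stay hidden through testing and then surface in operation.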

  8. Blue screen of death

  9. Source of Errors (assuming systems are mechanically and physically protected!)
     • Specification errors
        • Functionality in footnotes
     • Programming errors
        • Incorrect implementation (Michigan prison error)
        • Algorithm error (Cold war missile crisis)
        • Floating point errors (Patriot missile)
        • Race conditions (Blackout)
     • Manufacturing errors
        • Process variations
        • Silicon failures
     • Runtime errors
        • Negative Bias Temperature Instability (NBTI)
        • Noise effects
        • Voltage emergencies
     • Environmental
        • Soft errors

  10. Fault Tolerant Computing is not new! • 1940s: ENIAC, with 17.5K vacuum tubes and thousands of other electrical elements, failed once every 2 days • 1950s: Early ideas by von Neumann (multichannel, with voting) and Moore-Shannon (“crummy” relays)
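
The "multichannel, with voting" idea can be sketched as a bitwise majority vote over three redundant copies of a value, often called triple modular redundancy. This is a minimal illustration rather than von Neumann's original construction; the function names and the 8-bit example word are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single-bit upset by inverting one bit of the word."""
    return word ^ (1 << bit)


correct = 0b10110010

# One of the three channels suffers a bit flip; the vote masks it.
voted_one_fault = majority_vote(correct, flip_bit(correct, 3), correct)

# Two channels are hit, but in different bit positions; still masked,
# because every individual bit has at most one wrong copy.
voted_two_faults = majority_vote(flip_bit(correct, 1), flip_bit(correct, 6), correct)

# Two channels are hit in the same bit position; the vote is now wrong.
voted_same_bit = majority_vote(flip_bit(correct, 3), flip_bit(correct, 3), correct)

print(voted_one_fault == correct, voted_two_faults == correct, voted_same_bit == correct)
# True True False
```

The last case shows the limit of the scheme: voting tolerates faults only as long as a majority of channels agree on the right value for each bit.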

  11. Need is changing: Automation • Space age • Age of automation • Proliferation of robots

  12. Need is changing: Proximity • Near body computing • Google Glass • In-body computing • Accurate drug delivery • Robotic surgery

  13. Need is changing: Technology • Transistors are smaller • Even low-energy particles can cause soft errors • There are exponentially more low-energy particles

  14. Welcome • To the course on designing reliable computing systems • Focus of the course will be on “soft errors” • Class webpage • http://www.public.asu.edu/~ashriva6/teaching/ARC/
