
the e-risks of e-commerce



Presentation Transcript


  1. the e-risks of e-commerce Professor Ken Birman Dept. of Computer Science Cornell University

  2. Reliability • If it stops ticking when it takes a licking… your e-commerce company could tank • So you need to know that your technology base is reliable • It does what it should do, does it when needed, does it correctly, and is accessible to your customers.

  3. A Quiz • Q: When and why did Sun Microsystems have a losing quarter? Ken Birman: "Mr. Birman, Sun experienced a loss in Q4 FY89 (June 1989). This was the quarter in which we transitioned to a new manufacturing, order processing and inventory control systems." Andrew Casey, Manager, Investor Relations, Sun Microsystems, Inc., (650) 336-0761, andrew.casey@corp.sun.com

  4. Typical Web Session [Diagram: request "get http://www.cs.cornell.edu/People/ken" passing through a firewall; labels: "where", "what"]

  5. Typical Web Session [Diagram: resolve “www.cs.cornell.edu” to IP address 128.64.31.77. Components shown: DNS root, node, and leaf servers; caching proxies; firewall; load-balancing proxy; web servers]

  6. The Web’s dark side Netscape error: web server www.cs.cornell.edu... not responding. Server may have crashed or is overloaded. OK

  7. Right URL, but the request times out. Why? • The web server could be down • Your network connection may have failed • There could be a problem in the “DNS” • There could be a network routing problem • The Internet may be experiencing an overload • Your web caching proxy may be down • Your PC might have a problem, or your version of Netscape (or Explorer), or the file system you are using, or your LAN • The URL itself may be wrong • A router or network link may have failed and the Internet may not yet have rerouted around the problem
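
The list above can be narrowed down step by step. A minimal sketch, assuming a plain TCP check is an acceptable stand-in for a full web request: it tries name resolution first, then a connection, and reports the first step that fails. The function name `diagnose`, its labels, and the injectable `resolve`/`connect` parameters are illustrative assumptions, not part of the original talk.

```python
import socket


def diagnose(host, port=80, timeout=3.0,
             resolve=socket.getaddrinfo,
             connect=socket.create_connection):
    """Return a coarse label for the first step that fails.

    `resolve` and `connect` default to the real socket calls; they are
    parameters only so the logic can be exercised without a network.
    """
    # Step 1: can the name be resolved at all? (a "DNS" problem)
    try:
        resolve(host, port)
    except socket.gaierror:
        return "dns-failure"

    # Step 2: can we open a TCP connection to the server?
    try:
        conn = connect((host, port), timeout=timeout)
    except socket.timeout:
        # No answer in time: server overloaded, link down, or routing problem.
        return "connect-timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening: server process is down.
        return "connection-refused"
    except OSError:
        # Any other network-level error (for example, no route to host).
        return "network-error"

    conn.close()
    return "reachable"
```

A call such as `diagnose("www.cs.cornell.edu")` only separates broad categories; it cannot tell an overloaded server from a failed router, which is part of why these outages seem inexplicable to users.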

  8. E-Trade computers crash again -- and again The computer system of online securities firm E-Trade crashed on Friday for the third consecutive day. "It was just a software glitch. I think we were all frustrated by it," says an E-Trade executive. Industry analyst James Mark of Deutsche Bank commented “…it's the application on a large scale. As soon as E-Trade's volumes started spiking up, they had the same problems as others….” Edupage Editors <edupage@franklin.oit.unc.edu> Sun, 07 Feb 1999 10:28:30 -0500

  9. Reliable Distributed Computing: Increasingly urgent, yet unsolved • Distributed computing has swept the world • Impact has become revolutionary • Vast wave of applications migrating to networks • Already as critical a national infrastructure as water, electricity, or telephones • Yet distributed systems remain • Unreliable, prone to inexplicable outages • Insecure, easily attacked • Difficult (and costly) to program, bug-prone

  10. A National Imperative • Potential for catastrophe cited by • Presidential Commission on Critical Infrastructure Protection (PCCIP) • National Academy of Sciences Study on Trust in Cyberspace • These experts warn that we need a quantum improvement in technologies • Meanwhile, your e-commerce venture is at grave risk of stumbling – just like many others

  11. A Business Imperative

  12. A Business Imperative

  13. E-projects Often Fail • e-commerce revolves around computing • Even business and marketing people are at the mercy of these systems • When your company’s computing systems aren’t running, you’re out of business

  14. Big and Little Pictures • It is too easy to understand “reliability” as a narrow technical issue • In fact, many systems and companies stumble by building • Unreliable technologies, because of • A mixture of poor management and poor technical judgment • Reliable systems demand a balance between good management and good technology

  15. A Glimpse of Some “Unreliable Systems” • Quick review of some failed projects • These were characterized by poor reliability of the final product • But the issues were not really technical • As future managers you need to understand this phenomenon!

  16. Tales from the Software Crypt Source: Jerry Saltzer, Keynote address, SOSP 1999

  17. Tales from the Software Crypt Source: Jerry Saltzer, Keynote address, SOSP 1999

  18. 1995 Standish Group Study • “Success” 20%: on time, on budget, on function • “Challenged” 50%: over budget, missed schedule, lacks functions (2x budget, 2x completion time, 2/3 planned functionality) • “Impaired” 30%: scrapped Source: Jerry Saltzer, Keynote address, SOSP 1999

  19. A strange picture • Many technology projects fail • For lots of reasons • But some succeed • Today we do web-based hotel reservations all the time, yet “Confirm” failed • French air traffic project was a success yet US project lost $6 billion • Is there a pattern?

  20. Recurring Problems • Incommensurate scaling • Too many ideas • Mythical man-month • Bad ideas included • Modularity is hard • Bad-news diode • Best people are far more productive than average employees • New is better, not-even-available yet is best • Magic bullet syndrome Source: Jerry Saltzer, Keynote address, SOSP 1999

  21. 1995 Study of Tandem Computer Systems • 77% of failures are software problems. • Software fault-tolerance techniques can overcome about 75% of detected faults. • Loose coupling between primary and backup is important for software fault tolerance. • Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Source: Jerry Saltzer, Keynote address, SOSP 1999

  22. A Buggy Aside • Q: What are the two main categories of software bugs called? • A: Bohrbugs and Heisenbugs • Q: Why?

  23. Bohr Model of Atom • Bohr argued that the nucleus was a little ball

  24. Bohr Model of Atom • Bohr argued that the nucleus was a little ball • Bohr bug is a nasty but well defined thing

  25. Bohr Model of Atom • Bohr argued that the nucleus was a little ball • Bohr bug is a nasty but well defined thing • Your technical people can reproduce it, so they can nail it

  26. Heisenbug • Heisenberg modeled atom as a cloud of electrons and a cloud-like nucleus • The closer you look, the more it wiggles • A Heisenbug moves when your people try to pin it down. They won’t find it easy to fix.

  27. Why? • Bohrbugs tend to be deterministic errors – outright mistakes in the code • Once you understand what triggers them they are easy to search for and fix • Heisenbugs are often delayed side-effects of an old error. Like a bad tank of gas, effect may happen long after the bug first “occurs”. Hard to fix because at the time the mistake happened, nothing obvious went wrong
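
To make the distinction concrete, here is a small sketch following the slide's own definitions. The function `mean` and the class `PriceTable` are made-up examples, not code from the talk. The first fails the same way every time it is triggered; the second stores a bad value silently and only fails later, far from the original mistake.

```python
def mean(values):
    # Bohrbug: an outright mistake. Every call with an empty list fails
    # in exactly the same place, so it is easy to reproduce and fix.
    return sum(values) / len(values)


class PriceTable:
    """Hypothetical catalog that hides a latent error."""

    def __init__(self):
        self.prices = {}

    def load(self, item, text):
        # The old error happens here: a malformed price is silently
        # stored as None, and nothing obvious goes wrong at this point.
        try:
            self.prices[item] = float(text)
        except ValueError:
            self.prices[item] = None

    def total(self, items):
        # The failure surfaces here, possibly much later and only for
        # orders that happen to include the damaged entry.
        return sum(self.prices[i] for i in items)
```

In this toy the delayed failure is still repeatable once the bad entry is known. In real systems the trigger often also depends on timing or load, which is what makes such bugs seem to move when examined.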

  28. Why Systems fail • Mostly, because something crashes • Usually, software or a human error • Mean time to failure improves with age but software problems remain prevalent • Every kind of software system is prone to failures. Failure to plan for failures is the most common way for e-systems to fail.

  29. E-reliability • We want e-commerce solutions to be reliable… but what should this mean? • Fault-tolerant? • Secure? • Fast enough? • Accessible to customers? • Deliver critical services when needed, where needed, in a correct, timely manner

  30. Costs of a Failure

  31. Minimizing Downtime • Idea is to design critical parts of your system to survive failures • Two basic approaches • Recoverable systems are designed to restart without human intervention – but may wait until outage is repaired • Highly available systems are designed to keep running during failure

  32. Recoverability • The technology is called “transactions” • We’ll discuss this next time, but… • Main issue is time needed to restart the service • For a large database, half an hour or more is not at all unusual • Faster restart requires a “warm standby”

  33. High Availability • Idea is to have a way to keep the system running even while some parts are crashed • For example, a backup that takes over if primary fails • Backup is kept “warm” • This involves replicating information • As changes occur, backup may lag behind
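
The last two bullets can be illustrated with a toy model. This is a minimal sketch, assuming updates are shipped to the backup asynchronously through a queue; the class `ReplicatedStore` and its method names are hypothetical and do not describe any particular product.

```python
from collections import deque


class ReplicatedStore:
    """Toy primary/backup pair with asynchronous replication."""

    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.pending = deque()  # updates applied to primary, not yet to backup

    def write(self, key, value):
        # The primary acknowledges immediately; the backup hears about it later.
        self.primary[key] = value
        self.pending.append((key, value))

    def ship(self, n=1):
        # Apply up to n queued updates to the warm backup.
        for _ in range(min(n, len(self.pending))):
            key, value = self.pending.popleft()
            self.backup[key] = value

    def fail_over(self):
        # Simulate a primary crash: promote the backup and return the
        # updates that were still in flight and are therefore lost.
        lost = list(self.pending)
        self.pending.clear()
        self.primary = self.backup
        self.backup = {}
        return lost

    def read(self, key):
        return self.primary.get(key)
```

If two orders are written and only one is shipped before `fail_over()`, the second order is missing after the switch. Closing that gap means waiting for the backup before acknowledging a write, which trades speed for safety.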

  34. Complexity • The looming threat to your e-commerce solution, no matter what it may be • Even simple systems are hard to make reliable • Complex systems are almost impossible to make reliable • Yet innovative e-commerce projects often require fairly complex technologies!

  35. Two Side-by-Side Case Studies • American Advanced Automation System • Intended as replacement for air traffic control system • Needed because Pres. Reagan fired many controllers in 1981 • But project was a fiasco, lost $6B • French Phidias System • Similar goals, slightly less ambitious • But rolled out, on time and on budget, in 1999

  36. Background • Air traffic control systems are using 1970’s technology • Extremely costly to maintain and impossible to upgrade • Meanwhile, load on controllers is rising steadily • Can’t easily reduce load

  37. Air Traffic Control system (one site) [Diagram components: onboard radar; X.500 directory; team of controllers; air traffic database (flight plans, etc.)]

  38. Politics • Government wanted to upgrade the whole thing, solve a nagging problem • Controllers demanded various simplifications and powerful new tools • Everyone assumed that what you use at home can be adapted to the demands of an air traffic control center

  39. Technology • IBM bid the project, proposed to use its own workstations • These aren’t super reliable, so they proposed to adapt a new approach to “fault-tolerance” • Idea is to plan for failure • Detect failures when they occur • Automatically switch to backups
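
The "detect, then switch" idea is commonly built on heartbeats. The sketch below is a generic illustration under that assumption and is not IBM's actual design; `HeartbeatMonitor`, `choose_active`, and the timeout value are all hypothetical. Time is passed in explicitly so the logic is easy to follow.

```python
class HeartbeatMonitor:
    """Suspect a node if it has been silent for longer than `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of last heartbeat

    def beat(self, node, now):
        # Record that `node` reported in at time `now`.
        self.last_seen[node] = now

    def suspected(self, node, now):
        # A node never heard from, or silent for too long, is suspected.
        last = self.last_seen.get(node)
        return last is None or now - last > self.timeout


def choose_active(monitor, nodes, now):
    """Return the first node, in preference order, that is not suspected."""
    for node in nodes:
        if not monitor.suspected(node, now):
            return node
    return None  # everything looks dead: no automatic switch is possible
```

With a 2-second timeout, a primary that last reported at time 0 is suspected at time 4, so `choose_active` returns the backup if it has reported recently. A slow node looks the same as a crashed one, so the timeout is always a trade-off between quick switching and false alarms.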

  40. Core Technical Issue? • Problem revolves around high availability • Waiting for restart not seen as an option: goal is 10 sec downtime in 10 years • So IBM proposed a replication scheme much like the “load balancing” approach • IBM had primary and backup simply do the same work, keeping them in the same state
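
To see how demanding a goal of 10 seconds of downtime in 10 years is, it helps to turn it into an availability figure. A short worked calculation, assuming a year of 365.25 days; the helper names are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds


def availability(downtime_seconds, period_years):
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime_seconds / (period_years * SECONDS_PER_YEAR)


def downtime_per_year(avail):
    """Seconds of downtime per year implied by an availability fraction."""
    return (1.0 - avail) * SECONDS_PER_YEAR


goal = availability(10, 10)               # the goal quoted on the slide
five_nines = downtime_per_year(0.99999)   # a common commercial target
print(f"goal availability: {goal:.10f}")
print(f"'five nines' allows {five_nines / 60:.2f} minutes of downtime per year")
```

The goal works out to an unavailability of about 3 parts in 100 million, or one second of downtime per year. By comparison, "five nines" (99.999%) still allows a little over five minutes per year, so the target was hundreds of times stricter.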

  41. Technology [Diagram: conceptual flow of system, with stages radar, find tracks, identify flight, look up record, plan actions, human action. IBM’s fault-tolerant process pair concept: the same chain of stages drawn twice, side by side]

  42. Why is this Hard? • The system has many “real-time” constraints on it • Actions need to occur promptly • Even if something fails, we want the human controller to continue to see updates • IBM’s technology • Based on a research paper by Flaviu Cristian • But had never been used except for proof of concept purposes, on a small scale in the laboratory

  43. Politics • IBM’s proposal sounded good… • … and they were the second lowest bidder • … and they had the most aggressive schedule • So the FAA selected them over alternatives • IBM took on the whole thing all at once

  44. Disaster Strikes • Immediate confusion: all parts of the system seemed interdependent • To design part A I need to know how part B, also being designed, will work • Controllers didn’t like early proposals and insisted on major changes to design • Fault-tolerance idea was one of the reasons IBM was picked, but made the system so complex that it went on the back burner

  45. Summary of Simplifications • Focus on some core components • Postpone worry about fault-tolerance until later • Try and build a simple version that can be fleshed out later … but the simplification wasn’t enough. Too many players kept intruding with requirements

  46. Crash and Burn • The technical guys saw it coming • Probably as early as one year into the effort • But they kept it secret (“bad news diode”) • Anyhow, management wasn’t listening (“they’ve heard it all before – whining engineers!”) • The fault-tolerance scheme didn’t work • Many technical issues unresolved • The FAA kept out of the technical issues • But a mixture of changing specifications and serious technical issues was at the root of the problems

  47. What came out? • In the USA, nothing. • The entire system was useless – the technology was of an all-or-nothing style and nothing was ready to deploy • British later rolled out a very limited version of a similar technology, late, with many bugs, but it does work…

  48. Contrast with French • They took a very incremental approach • Early design sought to cut back as much as possible • If it isn’t “mandatory” don’t do it yet • Focus was on console cluster architecture and fault-tolerance • They insisted on using off-the-shelf technology

  49. Contrast with French • Managers intervened in technology choices • For example, the vendor wanted to do a home-brew fault-tolerance technology • French insisted on a specific existing technology and refused to bid out the work until vendors accepted • A critical “good call” as it worked out

  50. Learning by Doing • To gain experience with technology • They tested, and tested, and tested • Designed simple prototypes and played with them • Discovered that large cluster would perform poorly • But found a “sweet spot” and worked within it • This forced project to cut back on some goals
