the e-risks of e-commerce Professor Ken Birman Dept. of Computer Science Cornell University
Reliability • If it stops ticking when it takes a licking… your e-commerce company could tank • So you need to know that your technology base is reliable • It does what it should do, does it when needed, does it correctly, and is accessible to your customers.
A Quiz • Q: When and why did Sun Microsystems have a losing quarter? • A (reply to Ken Birman from Sun Investor Relations): "Mr. Birman, Sun experienced a loss in Q4 FY89 (June 1989). This was the quarter in which we transitioned to new manufacturing, order processing and inventory control systems." (Andrew Casey, Manager, Investor Relations, Sun Microsystems, Inc., (650) 336-0761, andrew.casey@corp.sun.com)
Typical Web Session • [Diagram: a browser behind a firewall issues "get http://www.cs.cornell.edu/People/ken"; the URL encodes where (the server) and what (the page)]
Typical Web Session • [Diagram: the browser resolves "www.cs.cornell.edu" through caching proxies and the DNS hierarchy (roots, intermediate nodes, leaves), obtaining the IP address 128.64.31.77; the request then passes through the firewall and a load-balancing proxy to one of several web servers]
The Web’s dark side • Netscape error dialog: "Web server www.cs.cornell.edu not responding. Server may have crashed or be overloaded."
Right URL, but the request times out. Why? • The web server could be down • Your network connection may have failed • There could be a problem in the “DNS” • There could be a network routing problem • The Internet may be experiencing an overload • Your web caching proxy may be down • Your PC might have a problem, or your version of Netscape (or Explorer), or the file system you are using, or your LAN • The URL itself may be wrong • A router or network link may have failed and the Internet may not yet have rerouted around the problem
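Several of the causes in this checklist can actually be told apart in code. A minimal sketch (the diagnostic strings are my own, and real diagnosis needs many more cases than this):

```python
import socket
from urllib.error import URLError

def diagnose(exc):
    """Map a urllib URLError to a likely cause from the checklist above."""
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, socket.gaierror):
        return "DNS problem: the name would not resolve"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout: server down, overloaded, or a routing problem"
    if isinstance(reason, ConnectionRefusedError):
        return "host reachable, but no web server answering on that port"
    return "other network error: {}".format(reason)
```

In practice you would wrap `urllib.request.urlopen(url, timeout=...)` in a `try/except URLError` and pass the caught exception here. Note how many of the causes (proxy down, bad LAN, Internet overload) still collapse into one "timeout" bucket: from the client's side they are indistinguishable.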
E-Trade computers crash again -- and again The computer system of online securities firm E-Trade crashed on Friday for the third consecutive day. "It was just a software glitch. I think we were all frustrated by it," says an E-Trade executive. Industry analyst James Mark of Deutsche Bank commented “…it's the application on a large scale. As soon as E-Trade's volumes started spiking up, they had the same problems as others…." Edupage Editors <edupage@franklin.oit.unc.edu> Sun, 07 Feb 1999 10:28:30 -0500
Reliable Distributed Computing:Increasingly urgent, yet unsolved • Distributed computing has swept the world • Impact has become revolutionary • Vast wave of applications migrating to networks • Already as critical a national infrastructure as water, electricity, or telephones • Yet distributed systems remain • Unreliable, prone to inexplicable outages • Insecure, easily attacked • Difficult (and costly) to program, bug-prone
A National Imperative • Potential for catastrophe cited by • Presidential Commission on Critical Infrastructure Protection (PCCIP) • National Academy of Sciences Study on Trust in Cyberspace • These experts warn that we need a quantum improvement in technologies • Meanwhile, your e-commerce venture is at grave risk of stumbling – just like many others
E-projects Often Fail • e-commerce revolves around computing • Even business and marketing people are at the mercy of these systems • When your company’s computing systems aren’t running, you’re out of business
Big and Little Pictures • It is too easy to understand “reliability” as a narrow technical issue • In fact, many systems and companies stumble by building • Unreliable technologies, because of • A mixture of poor management and poor technical judgment • Reliable systems demand a balance between good management and good technology
A Glimpse of Some “Unreliable Systems” • Quick review of some failed projects • These were characterized by poor reliability of the final product • But the issues were not really technical • As future managers you need to understand this phenomenon!
Tales from the Software Crypt Source: Jerry Saltzer, Keynote address, SOSP 1999
1995 Standish Group Study • "Success" (20%): on time, on budget, on function • "Challenged" (50%): over budget, missed schedule, lacks functions; typically 2x budget, 2x completion time, 2/3 of planned functionality • "Impaired" (30%): scrapped Source: Jerry Saltzer, Keynote address, SOSP 1999
A strange picture • Many technology projects fail • For lots of reasons • But some succeed • Today we do web-based hotel reservations all the time, yet “Confirm” failed • French air traffic project was a success yet US project lost $6 billion • Is there a pattern?
Recurring Problems • Incommensurate scaling • Too many ideas • Mythical man-month • Bad ideas included • Modularity is hard • Bad-news diode • Best people are far more productive than average employees • New is better, not-even-available yet is best • Magic bullet syndrome Source: Jerry Saltzer, Keynote address, SOSP 1999
1995 Study of Tandem Computer Systems • 77% of failures are software problems. • Software fault-tolerance techniques can overcome about 75% of detected faults. • Loose coupling between primary and backup is important for software fault tolerance. • Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Source: Jerry Saltzer, Keynote address, SOSP 1999
A Buggy Aside • Q: What are the two main categories of software bugs called? • A: Bohrbugs and Heisenbugs • Q: Why?
Bohr Model of Atom • Bohr argued that the nucleus was a little ball • A Bohrbug is a nasty but well-defined thing • Your technical people can reproduce it, so they can nail it
Heisenbug • Heisenberg modeled the atom as a cloud of electrons and a cloud-like nucleus • The closer you look, the more it wiggles • A Heisenbug moves when your people try to pin it down. They won’t find it easy to fix.
Why? • Bohrbugs tend to be deterministic errors – outright mistakes in the code • Once you understand what triggers them they are easy to search for and fix • Heisenbugs are often delayed side-effects of an old error. Like a bad tank of gas, the effect may happen long after the bug first “occurs”. Hard to fix because at the time the mistake happened, nothing obvious went wrong
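The distinction can be shown with two contrived Python bugs. The Bohrbug gives the same wrong answer on every run; the Heisenbug is a data race whose effect depends on thread timing, so the very act of instrumenting it (prints, debuggers, locks) shifts the timing and the bug seems to move.

```python
import threading

# Bohrbug: deterministic and reproducible -- the same input always
# fails in the same way.
def average(xs):
    return sum(xs) / (len(xs) - 1)   # off-by-one: should divide by len(xs)

# Heisenbug: an unsynchronized read-modify-write. Updates can be lost
# depending on exactly when the threads interleave.
counter = 0

def bump(n):
    global counter
    for _ in range(n):
        counter += 1   # not atomic: load, add 1, store

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# counter may be 400_000, or less -- the outcome is timing-dependent
```

`average([2, 4])` returns 6.0 instead of 3 every single time, so once spotted it is trivially fixed. The counter race may pass a thousand test runs and then fail in production under load, which is exactly the pattern the Tandem study's recurring failures suggest.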
Why Systems fail • Mostly, because something crashes • Usually, software or a human error • Mean time to failure improves with age but software problems remain prevalent • Every kind of software system is prone to failures. Failure to plan for failures is the most common way for e-systems to fail.
E-reliability • We want e-commerce solutions to be reliable… but what should this mean? • Fault-tolerant? • Secure? • Fast enough? • Accessible to customers? • Deliver critical services when needed, where needed, in a correct, timely manner
Minimizing Downtime • Idea is to design critical parts of your system to survive failures • Two basic approaches • Recoverable systems are designed to restart without human intervention – but may wait until outage is repaired • Highly available systems are designed to keep running during failure
Recoverability • The technology is called “transactions” • We’ll discuss this next time, but… • Main issue is time needed to restart the service • For a large database, half an hour or more is not at all unusual • Faster restart requires a “warm standby”
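The all-or-nothing guarantee that makes restart safe can be seen with SQLite, whose transactions roll back work that never committed. The account names and the simulated crash below are purely illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # one transaction: both updates commit, or neither does
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transfer")
        # the matching credit to bob never runs
except RuntimeError:
    pass

# After "recovery", the half-done transfer has been rolled back:
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# alice still has 100 and bob still has 0 -- no money vanished
```

This is the safety property; the slide's point about restart *time* is the cost: replaying or rolling back the logs of a large production database can take half an hour or more.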
High Availability • Idea is to have a way to keep the system running even while some parts are crashed • For example, a backup that takes over if primary fails • Backup is kept “warm” • This involves replicating information • As changes occur, backup may lag behind
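A warm-standby pair can be sketched as two copies of the service, with the primary forwarding each update to the backup. Everything here (the class, the synchronous forwarding) is a simplification: real systems replicate asynchronously, which is exactly why the backup may lag.

```python
class KVService:
    """A trivially replicable service: state is just a key-value map."""
    def __init__(self):
        self.state = {}
    def apply(self, key, value):
        self.state[key] = value

primary, backup = KVService(), KVService()

def update(key, value):
    primary.apply(key, value)
    backup.apply(key, value)   # real systems forward this asynchronously,
                               # so the backup can lag slightly behind

def failover():
    # backup is already "warm": it holds (nearly) current state,
    # so takeover is fast -- no half-hour database restart
    return backup

update("cart:42", ["book", "plush bear"])
standby = failover()
```

The design trade-off is visible even in this toy: synchronous forwarding keeps the backup exact but slows every update; asynchronous forwarding is fast but risks losing the last few changes on failover.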
Complexity • The looming threat to your e-commerce solution, no matter what it may be • Even simple systems are hard to make reliable • Complex systems are almost impossible to make reliable • Yet innovative e-commerce projects often require fairly complex technologies!
Two Side-by-Side Case Studies • American Advanced Automation System • Intended as replacement for air traffic control system • Needed because Pres. Reagan fired many controllers in 1981 • But project was a fiasco, lost $6B • French Phidias System • Similar goals, slightly less ambitious • But rolled out, on time and on budget, in 1999
Background • Air traffic control systems still use 1970s technology • Extremely costly to maintain and impossible to upgrade • Meanwhile, the load on controllers is rising steadily • Can’t easily reduce load
Air Traffic Control System (one site) • [Diagram: onboard radar feeds a team of controllers, supported by an X.500 directory and an air traffic database (flight plans, etc.)]
Politics • Government wanted to upgrade the whole thing, solve a nagging problem • Controllers demanded various simplifications and powerful new tools • Everyone assumed that what you use at home can be adapted to the demands of an air traffic control center
Technology • IBM bid the project, proposed to use its own workstations • These aren’t super reliable, so they proposed to adapt a new approach to “fault-tolerance” • Idea is to plan for failure • Detect failures when they occur • Automatically switch to backups
Core Technical Issue? • Problem revolves around high availability • Waiting for restart not seen as an option: goal is 10 seconds of downtime in 10 years • So IBM proposed a replication scheme much like the “load balancing” approach • IBM had primary and backup simply do the same work, keeping them in the same state
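The "same work, same state" idea works when the service is a deterministic state machine: feed both replicas every input in the same order and their states stay identical. A toy sketch of the concept (mine, not IBM's design; the flight data is made up):

```python
class Tracker:
    """A deterministic state machine: same inputs, same order, same state."""
    def __init__(self):
        self.tracks = {}
    def handle(self, flight, position):
        self.tracks[flight] = position

primary, backup = Tracker(), Tracker()
events = [("AA100", (40.1, -74.2)), ("BA212", (51.5, -0.1)),
          ("AA100", (40.3, -74.0))]
for flight, pos in events:
    primary.handle(flight, pos)
    backup.handle(flight, pos)   # backup does the same work in the same order

# At every step the backup's state matches the primary's, so it can
# take over with close to zero downtime
```

The hard part, and a source of IBM's trouble, is everything this sketch hides: delivering every input to both replicas in the same order despite failures, detecting that the primary is actually dead, and keeping nondeterminism (timers, thread scheduling) out of the state machine.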
Technology • [Diagram: conceptual flow of the system: radar → find tracks → identify flight → look up record → plan actions → human action] • [Diagram: IBM’s fault-tolerant process-pair concept runs two copies of this same pipeline, a primary and a backup, side by side]
Why is this Hard? • The system has many “real-time” constraints on it • Actions need to occur promptly • Even if something fails, we want the human controller to continue to see updates • IBM’s technology • Based on a research paper by Flaviu Cristian • But had never been used except for proof of concept purposes, on a small scale in the laboratory
Politics • IBM’s proposal sounded good… • … and they were the second lowest bidder • … and they had the most aggressive schedule • So the FAA selected them over alternatives • IBM took on the whole thing all at once
Disaster Strikes • Immediate confusion: all parts of the system seemed interdependent • To design part A I need to know how part B, also being designed, will work • Controllers didn’t like early proposals and insisted on major changes to design • Fault-tolerance idea was one of the reasons IBM was picked, but made the system so complex that it went on the back burner
Summary of Simplifications • Focus on some core components • Postpone worry about fault-tolerance until later • Try and build a simple version that can be fleshed out later … but the simplification wasn’t enough. Too many players kept intruding with requirements
Crash and Burn • The technical guys saw it coming • Probably as early as one year into the effort • But they kept it secret (“bad news diode”) • Anyhow, management wasn’t listening (“they’ve heard it all before – whining engineers!”) • The fault-tolerance scheme didn’t work • Many technical issues unresolved • The FAA kept out of the technical issues • But a mixture of changing specifications and serious technical issues were at the root of the problems
What came out? • In the USA, nothing. • The entire system was useless – the technology was of an all-or-nothing style and nothing was ready to deploy • British later rolled out a very limited version of a similar technology, late, with many bugs, but it does work…
Contrast with French • They took a very incremental approach • Early design sought to cut back as much as possible • If it isn’t “mandatory” don’t do it yet • Focus was on console cluster architecture and fault-tolerance • They insisted on using off-the-shelf technology
Contrast with French • Managers intervened in technology choices • For example, the vendor wanted to do a home-brew fault-tolerance technology • French insisted on a specific existing technology and refused to bid out the work until vendors accepted • A critical “good call” as it worked out
Learning by Doing • To gain experience with technology • They tested, and tested, and tested • Designed simple prototypes and played with them • Discovered that large cluster would perform poorly • But found a “sweet spot” and worked within it • This forced project to cut back on some goals