Systems Thinking 2 Systems Engineering and Socio-technical Systems. Partial source: John Rooksby, University of St Andrews
Content • Systems engineering • Socio-technical systems • Failure examples • System testing: bugs, performance • T5 Example • Systems failures and recovery
Systems Engineering • How systems should be planned, designed, implemented, built, and maintained. • Need to identify and manipulate the properties of the system as a whole. • May not be straightforward to do, even when we know the component properties. • If the system has complicated collective behaviour, then a simplistic, modular approach may fail. • Some projects are too big for 'agile' methods, yet lack the control or resources for top-down planning or waterfall management. • So how to do it?
Remember the Apollo Program (1960-70s) discussed in the last lecture. • A huge triumph for engineering at scale. • Highest quality. • Still looks like that now.
Software and Computer Systems • Modern software systems are massively complex. • Lots of systems engineering ideas have been adapted to software engineering (and vice versa). • Lots of belief in modular development of code, interfaces of components, modular modelling, UML, … • To some people in CS, 'systems' means the hardware + maybe the OS • That is an example • Often interested in more general problems.
Names, names, names • Large scale IT systems: • MS Excel: several million lines of code; many programmers but one coherent owner; clear environment. • iPhone: similar comments, many programs, also hardware. • Large scale complex IT systems • Large scale IT systems, but unpredictable behaviour, multiple stakeholders, complex (dynamic, unpredictable) environment. • Socio-technical systems • Interaction between IT systems and society. • Systems of systems, Ultra-large scale systems
Large-scale Systems • Networks of computers put to ever more complex uses at ever larger scales, with ever larger user bases. • Flaws become more frequent and more costly at scale, but we increasingly rely on large-scale systems. • Policymakers and managers find it harder to define what the requirements are. Designers and developers find it harder to engineer, and learn. Users, managers, engineers find it harder to say whether the system is performing well or not. • How can a management feedback cycle (e.g. Deming - plan, do, check, act) work here?
Socio-technical Systems • Many stakeholders with many different goals. • Medical records systems. • No single owner of system components. • System often evolved, not designed as a whole. • supply chains • ``An audit at BMW showed that they used 4500 separate programs/systems in their business’’ • (I. Sommerville, L1-IntroToLSCITS.pptx) • System supports multiple, ongoing tasks - no single problem that system is solving. • Which are critical to me? • What depends upon what? • When/how to do maintenance, or upgrade? • What to protect?
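The question "What depends upon what?" can be made concrete with a small dependency map. The sketch below is illustrative only: the system names and the `DEPENDS_ON` table are hypothetical, not taken from any real audit. It walks the map to list everything that would be affected, directly or indirectly, if one component failed.

```python
from collections import deque

# Hypothetical dependency map: each system lists the systems it relies on.
DEPENDS_ON = {
    "appointments": ["patient_records", "network"],
    "patient_records": ["database", "network"],
    "billing": ["patient_records"],
    "database": [],
    "network": [],
}


def affected_by(failed, depends_on):
    """Return every system that directly or indirectly depends on `failed`."""
    # Invert the map: for each system, which systems rely on it?
    dependants = {name: [] for name in depends_on}
    for system, needs in depends_on.items():
        for need in needs:
            dependants.setdefault(need, []).append(system)

    # Breadth-first walk outwards from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for user in dependants.get(current, []):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen


print(sorted(affected_by("database", DEPENDS_ON)))
```

In a real socio-technical system the hard part is building the map at all: with no single owner, nobody may hold a complete picture of the dependencies.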
Socio-technical systems [Figure labels: Software-intensive system; System users; Business processes; Organizational policies and culture; Organizational strategies and goals; Laws, regulations, custom & practice; Social and political environment] Adapted from slides LSCITS and Socio-technical Systems, I Sommerville
Failures • Software horror stories (older) http://www.cs.tau.ac.il/~nachumd/horror.html • LOTS of more modern examples.
``An example of poor development practices causing a system failure can be found in the experience of the Pentagon’s National Reconnaissance Office (NRO). The inadequate testing of the delivery system of Titan IV rocket. Two Titan rockets were lost, meaning that expensive military equipment necessary to the U.S. Governments defence program (namely early warning satellites) were unable to be deployed. The head of the N.R.O. has attributed this error to “a misplaced decimal point” in software, which controlled the rocket.’’
Electronics startup transient kills spacecraft craig@deforest.org Tue, 29 Jun 1999 10:27:17 -0700 http://catless.ncl.ac.uk/Risks/20.47.html#subj1 A recent (25 Jun 1999) press release from NASA HQ identifies the untimely demise of the Wide Field Infrared Explorer (WIRE) spacecraft as due to a design flaw in the pyro control logic board. WIRE's detector was to be cooled with a block of solid hydrogen. The telescope cover's explosive release mechanism fired immediately when the instrument was powered up, exposing the detector to direct sunlight and sublimating all of the solid hydrogen on board within 48 hours of its 5-Mar-99 launch. Unbalanced thrust from the hydrogen venting gave the spacecraft an uncontrolled 60 RPM spin.
Probe after systems failure on Tube (UKPA) –(9th Nov. 2011) • Thousands of Tube passengers were stuck on trains which stalled in tunnels after a "total communications and systems failure" on a London Underground (LU) line, a union has revealed. • The Rail Maritime and Transport (RMT) union said a "major emergency" affected the Jubilee Line after screens in the control room went blank for almost an hour during the evening rush hour last Friday. • LU said an investigation was under way after a software problem, but stressed that at "no point" were passengers at risk, as the system prevented trains from moving into close proximity of each other. • The RMT obtained a copy of the control log for last Friday evening, which reported a "total loss of visuals" at 1853 on the Jubilee Line. • The log read: "At the same time, Desk 5 and Desk 6 signallers reported their servers were also getting slower and slower, until both confirmed no control at 1855. • "From 1856, all signalling control sites were lost, and all detail on over view diagram was lost as well. Instructions were issued for line to be suspended, and Code Amber was issued via train radio accordingly. All parties updated, and line effectively suspended end to end." • The log said limited signalling control was restored, but 10 trains were "stalled" and had to be emptied of passengers. Over an hour later, empty trains were run to test the system and services resumed at 1935. • RMT General Secretary Bob Crow said: "This major emergency last Friday evening exposes the lethal consequences of removing the drivers from the trains. The Jubilee Line is already heavily automated but this incident shows that you still need drivers to move into manual mode and take over when something goes wrong." • Howard Collins, chief operating officer at London Underground, said: "These claims are without foundation. 
Driverless trains have been in operation across the world for decades, including on the DLR, one of the most efficient railways anywhere in Europe. Automated train operation has been used on the Victoria line for the last 40 years, and on the Central line since the 1990s. • "We apologise to passengers affected by disruption on the Jubilee line on Friday evening. This was caused by a software problem, and a thorough investigation is under way. We are pressing our contractors Thales to provide assurances that this will not happen again. At no point were passengers at risk, as the system prevented trains from moving into close proximity of each other."
Flash Crash • http://en.wikipedia.org/wiki/2010_Flash_Crash • Stock market crash • Systemic risk from apparent correlated behaviour of high-frequency trading platforms.
London Ambulance Service • Introduced a new computer-aided despatch system in 1992 which was intended to automate the system that despatched ambulances in response to calls from the public and the emergency services. • 2500 calls per day, 1500 emergency. • Could not cope with load. Response time several hours. • 3 weeks after its introduction, it failed completely. • Chaos, ambulances redirected. • Estimates that up to 30 people may have died as a consequence. • LAS eventually reverted to the previous manual system. • The systems failure was not just due to technical issues but to a failure to consider human and organisational factors in the design of the system. • Errors were made in the procurement, design, implementation, and introduction of the system.
System failure? (Guardian, 9th July 2009) • The £12.7bn NHS computer programme is five years behind schedule and beset by criticism, viruses and fears over patient privacy. So should the world's biggest IT project be scrapped? • At some point last November, an infection began to spread unnoticed through the three hospitals that make up Barts and The London NHS Trust in east London. This was not MRSA but the Mytob worm, a common but potent computer virus. It steadily slowed and choked the 4,700 PCs of the trust's network. By noon on 17 November, a Monday, the network was effectively crippled. • The following day, the trust declared an "internal major incident". Ambulances carrying accident and emergency patients were diverted to other hospitals. Operations were postponed. The appointments system was suspended. Access to clinical information - usually quick and electronic - was maintained only by the slowest and most old-fashioned of methods: "runners" drafted in from the trust's administrative departments pounded the hospitals' endless twisting corridors with paper notes and printouts. • Scores of computer technicians from the private sector and from other London NHS trusts were brought in to eradicate the virus, but the PCs had to be decontaminated one by one. It was a week before the crisis was officially declared over, and a fortnight before the hospitals, some of the busiest in the capital, returned to normal. Afterwards, an official report found the virus had been able to infiltrate them because their anti-virus software "did not reach all [their] PCs and ... was configured incorrectly on some". The whole episode, the report concluded, had been "entirely avoidable". …
IBM takes blame for massive bank system failure By Sumner Lemon, July 13, 2010 07:33 AM ET, www.computerworld.com • IBM took responsibility for a major IT system failure suffered by one of Singapore's largest banks on July 5, saying an employee's error caused the outage. • In a statement released Tuesday, IBM said problems started when software monitoring tools detected "instability" within DBS Bank's storage system. While the storage system remained "fully functional," IBM employees initiated a recovery process to fix the issue. • "Unfortunately, a failure to apply the correct procedure inadvertently caused the service outage," IBM said, adding that no data was lost. • The outage knocked DBS' IT systems offline for seven hours, leaving customers unable to withdraw money from automatic teller machines. All of the bank's commercial and consumer banking systems were affected, although no data was lost, the bank said at the time. • Much of DBS' IT systems are managed by IBM under a S$1.2 billion (US$868 million) outsourcing agreement.
2-Oct-2011 • The Colombo Stock Exchange (CSE), whose systems crashed on September 19 is awaiting a comprehensive analysis from Millennium IT (MIT), its service provider, on the flaw while gearing to invest in a sophisticated system, according to CSE sources. • “We expect the MIT report this week and plan to upgrade and invest in our system accordingly to provide an uninterrupted service,” a source told the Business Times, adding that this is the first time that such a hardware failure occurred in the database servers shortly after trading opened. Trading was then halted for the rest of the day and all earlier trades cancelled.
Lack of Code of Connections • In Public Private Partnerships (PPPs) to design and deploy EPR systems the private supplier requires code of connections approval in order to enable access to the NHS network, through which the networks of individual hospitals can be gained. This technical security clearance is necessary for off-site access to the Trust’s networks and systems. In the case at Preston, code of connections approval was overlooked during initial project planning and preparation. With the private supplier being US based, and build going on simultaneously at two sites the lack of code of connections during the first six or so months of database build and configuration, work was hampered by the inability of paired US and UK analysts to share real-time up to date details of the system. Misunderstandings about the current configuration of the database delayed the project. This problem clearly stemmed from a lack of initial understanding about what a PPP for designing an EPR would entail concerning access to the network infrastructure, and the requirements for off to on-site collaboration. Code of connections should be approved early on in the project. http://archive.cs.st-andrews.ac.uk/STSE-Handbook/Other/warstories.html
Birth is neither an Accident nor an Emergency • A very pregnant woman was admitted to A&E following an accident. Probably induced by the trauma, she gave birth in the A&E ward. This gave the computer the following problems: (i) births could be registered only in the maternity ward. No other wards were given access to this facility. (ii) only patients admitted to A&E could be discharged or transferred from A&E; this therefore excluded the baby. The obvious solution, to arrange a virtual transfer for the mum to maternity from A&E so that she could give a virtual birth in maternity followed by a virtual transfer of mum and baby from maternity back to A&E was denied by the computer on the grounds that there were no spare beds for new admissions to maternity that night (true, as it happens). (Fortunately, a bed became available in maternity the next day, after mum was well enough to be moved out of A&E). The bed manager in maternity had to increase the number of available beds by one so that the virtual transfer for the purposes of giving birth could occur, then decrease the number of beds by one. Of course, both of these were reported by the computer to the director of clinical resources the next day. http://archive.cs.st-andrews.ac.uk/STSE-Handbook/Other/warstories.html
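The story above is easy to reproduce in miniature. The following is a hypothetical reconstruction, not the hospital's actual software: the `Ward` class, the function names, and the bed counts are all assumptions made for illustration. It encodes the three rules from the story (births only in maternity, transfers only for patients admitted to the source ward, no admission without a spare bed) and shows how each sensible-looking rule combines to block a legitimate case until someone edits the bed count by hand.

```python
class Ward:
    """A ward with a fixed number of beds (hypothetical model)."""

    def __init__(self, name, capacity, registers_births=False):
        self.name = name
        self.capacity = capacity
        self.registers_births = registers_births
        self.patients = set()

    def admit(self, patient):
        # Rule: no admission without a spare bed.
        if len(self.patients) >= self.capacity:
            raise ValueError(f"no spare beds in {self.name}")
        self.patients.add(patient)


def transfer(src, dst, patient):
    # Rule: only patients admitted to a ward can be transferred out of it.
    if patient not in src.patients:
        raise ValueError(f"{patient} is not a patient in {src.name}")
    dst.admit(patient)
    src.patients.remove(patient)


def register_birth(ward, mother, baby):
    # Rule: births can be registered only where the facility is enabled.
    if not ward.registers_births:
        raise ValueError(f"births cannot be registered in {ward.name}")
    if mother not in ward.patients:
        raise ValueError(f"{mother} is not a patient in {ward.name}")
    # Assumption: the newborn is recorded without needing a spare bed.
    ward.patients.add(baby)


def attempt(action, *args):
    """Run an action and report whether the system allowed it."""
    try:
        action(*args)
        return "ok"
    except ValueError as err:
        return f"refused: {err}"


ae = Ward("A&E", capacity=5)
maternity = Ward("Maternity", capacity=0, registers_births=True)  # full that night
ae.admit("mum")

birth_in_ae = attempt(register_birth, ae, "mum", "baby")
virtual_transfer = attempt(transfer, ae, maternity, "mum")

# The workaround: temporarily add a bed, do the virtual round trip, remove it.
maternity.capacity += 1
workaround = [
    attempt(transfer, ae, maternity, "mum"),
    attempt(register_birth, maternity, "mum", "baby"),
    attempt(transfer, maternity, ae, "mum"),
    attempt(transfer, maternity, ae, "baby"),
]
maternity.capacity -= 1

print(birth_in_ae)
print(virtual_transfer)
print(workaround)
```

None of the three rules is wrong in isolation; the failure is in the interaction between them and a situation the designers did not anticipate.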
Systems Testing • Failure to test adequately causes many operational system failures • Done properly, huge proportion of total project effort − at least 50% • Should not be done just at the end − lots of dependencies along the way; can be too late to fix even minor bugs • Should be done throughout the project
Analysis, test, quality • Testing should be just one component of the overall quality-control process: • Importance of specification, and planning for change of specification • Significant role for modelling and simulation • Critical review (by experienced peers) • Inspection regimes • Training! • Note the extent to which very many different groups of people are involved: designers, architects, developers, dedicated testers, users, managers, consultants
Load (or Performance) Testing • Basic question: will the system work under a range of possible loads (e.g., numbers of people using it)? • Typically, the experimental technique employed is simulation • Also a role for testing: once it’s built, test at varying loads, and make any adjustments that are needed and possible
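A minimal sketch of load testing by simulation, under stated assumptions: arrivals are random (exponential gaps), every job takes the same fixed service time, and jobs are served first come, first served. The function name `simulate` and all the figures are hypothetical; a real study would calibrate them against measured data.

```python
import random


def simulate(arrivals_per_hour, servers, service_minutes, hours=24, seed=1):
    """Simulate a first-come-first-served queue; return mean wait in minutes."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    free_at = [0.0] * servers          # minute at which each server is next free
    now = 0.0
    waits = []
    while True:
        # Time to the next arrival, drawn from an exponential distribution.
        now += rng.expovariate(arrivals_per_hour / 60.0)
        if now > hours * 60:
            break
        # Give the job to whichever server becomes free soonest.
        i = min(range(servers), key=free_at.__getitem__)
        start = max(now, free_at[i])
        waits.append(start - now)
        free_at[i] = start + service_minutes
    return sum(waits) / len(waits) if waits else 0.0


# Ten servers at ten minutes per job can clear about 60 jobs per hour.
for rate in (20, 40, 55, 80):
    print(rate, "per hour -> mean wait", round(simulate(rate, 10, 10), 1), "min")
```

The point of sweeping the arrival rate is to find where waiting time stops growing gently and starts to run away, which happens as the load approaches and then exceeds what the servers can clear.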
Testing: socio-economic-technical issues • So, systems testing faces many socio-economic-technical issues • Most of these issues are about inter-operation of component systems: people, process, and technology are all implicated • Testing, at least for large and complex systems, is a team activity • Issues occur at all phases: R&D, design, specification, construction, testing, commissioning, operation. Even highly regarded organizations can mess up: Apple, Toyota, MoD, … .
Heathrow Terminal 5 • Opened 27 March 2008 • Very complex composite system; years of planning and construction, well-funded, no serious space constraints, lots of well-oiled experience; IT not radically new • T5 a £4.3 bn project, largest building project in western Europe, 176 x 396 x 40 m, 7500 construction workers, 16 major projects, 147 sub-projects • 96 check-in desks, 140 service desks, 90 bag-drops
Success, project risk • Mostly a very successful project, although building was finished late; new road, new tunnel, diversion of two rivers • BAA strategy: accept all the risk, in order to encourage contractors to concentrate on problem-solving
BA, Baggage Failure • BA, the airline concerned, sole tenant of BAA • Most of BA’s short-haul ops moved to T5 on its first day, around 400 flights and 34,000 passengers • Major failure of the baggage handling system • BA apparently slow to realize extent of problem.
Computer Weekly, 27 March 14:00 The opening of Heathrow Airport's Terminal 5 has been overshadowed by technical problems with the baggage system. A spokeswoman for airport owner BAA said British Airways ground-handling staff were experiencing problems with logging onto the system, resulting in delays for arriving passengers, who were left waiting for their bags. The terminal opened at 4am today after being beset by problems yesterday, when BAA was forced to temporarily pull its biometrics fingerprinting system after the Information Commissioner raised concerns about its data protection implications. BAA said it was working to correct the problem. "The problems with the baggage system affected a small number of flights this morning," said a spokeswoman. "Things are performing better now. There's been an impact on some services, but it will hopefully be resolved. We are waiting to see how things pan out." British Airways hopes the new baggage system will improve its track record on baggage handling - currently the worst in Europe. BAA designed the system, which uses barcodes to track bags as they are moved around the airport. The system was designed with Dutch company Vanderlande and IBM. The IBM software works out where the bags are supposed to be going, and logistics software works out the best way to get there.
Baggage, Delays, Cancellations • So, a problem with the baggage handling system • More bags arrived than left: system clogged up − by end of opening week-end, 28,000 in temporary storage (BA say half that) • Inbound passengers delayed up to 4 hrs • 500 flights cancelled, more delayed • Schedule not fully recovered until 8 April (11 days)
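"More bags arrived than left" is simple arithmetic: whenever arrivals exceed what the system can clear, the difference is carried forward as backlog. A minimal sketch follows; the hourly figures are invented for illustration and are not T5 data.

```python
def backlog_over_time(arrivals, capacity_per_step, start=0):
    """Return the backlog after each step, given arrivals and a fixed capacity."""
    backlog = start
    history = []
    for arriving in arrivals:
        # Anything not cleared in this step is carried into the next one.
        backlog = max(0, backlog + arriving - capacity_per_step)
        history.append(backlog)
    return history


# Hypothetical bags arriving per hour, against a capacity of 1000 bags per hour.
print(backlog_over_time([900, 1200, 1500, 1200, 800, 500], 1000))
# → [0, 200, 700, 900, 700, 200]
```

Note that the backlog keeps growing for as long as arrivals exceed capacity and only drains once arrivals drop below it, which is why a few bad hours can take days to recover from.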
BA’s response • Drafted in 400 volunteer staff • Baggage handling contractors back in • Bags sent to Italy for sorting • don’t ask, I don’t know why Italy • Cost to BA around £25 M • Reputation damage probably larger.
Contributing Factors (from Rooksby, again) • Testing • Failed to test baggage system at high enough loading • Compromises upon testing programme • this was a calculated risk • Cascade of system failures • Mis-configuration of data feed between baggage handling and reconciliation; other IT failures • Training • Training supervisors absent from some key areas on opening day • BA staff not trained to drive the jetways • Other training gaps
Contributing Factors (from Rooksby, again) • Building • Construction late • 10% of lifts not working • Poor communications between BA and BAA, no crisis plan • Planning/staffing • Contingency plans not available to the right people • Staff delayed at security (50% more people than ‘expected’) • Staff system log in problems; other IT problems • IT • Data transmission errors related to reconciliation • Insufficient server capacity!
Some information • Largest baggage handling system in Europe, designed to handle 70,000 items per day • Designed by BAA, Vanderlande, and IBM • Makes use of RFID/barcodes • Not everywhere does as yet … • Needs enough space for the kit
Outline Development Plan • Construction of building, with utilities • Install physical systems: scanners, cranes, conveyer belts, etc • Install electrics • Install system controls (computer systems) • Configure and test controls • Systems integration • Testing and into operation What do you think about the testing regime here?
Symptoms/Causes of Systems Failure • Technology • Coping with Technology • Requirements • Insufficient Resources • Malicious (Technical) • Malicious (Industrial) • Internal politics / Inertia • Government policy • Grandiosity Source (modified): SYSTEMS FAILURES: An approach to understanding what can go wrong, John Donaldson & John Jenkins, Middlesex University, John.Donaldson@dial.pipex.com, J.Jenkins@mdx.ac.uk. Presented at European Software Day of EuroMicr'00, Maastricht, NL, September 2000. Published in Proceedings of the European Software Day of EuroMicr'00, 2000, ISBN 0-7695-0872-4
Effects of Systems Failure • Financial loss • Depletion of assets and closure • Job losses • Lowering of morale ⇒ performance down • Loss of shareholder confidence • Bad press/media publicity ... • Civil and criminal lawsuits against company, executives. Source (modified): SYSTEMS FAILURES: An approach to understanding what can go wrong, John Donaldson & John Jenkins, Middlesex University, John.Donaldson@dial.pipex.com, J.Jenkins@mdx.ac.uk. Presented at European Software Day of EuroMicr'00, Maastricht, NL, September 2000. Published in Proceedings of the European Software Day of EuroMicr'00, 2000, ISBN 0-7695-0872-4
Failures and Recovery • If we can’t completely design a system or control its evolution, and its parts, their interactions, and its environment keep changing, then it is very difficult (if not impossible) to prevent failures. • A single stakeholder has to anticipate failures and: • try to figure out what is most critical to it • try to understand what the threats and vulnerabilities are • including for other parties it depends upon (!) • design, deploy, and maintain the components it controls for recovery • share information with appropriate other parties • invest in business continuity, disaster recovery, etc. • Methodologies needed!
Uncertainty and Imperfect Control: Security Example • Some people say that vulnerabilities are just 'software flaws', and argue that vendors should always be responsible. BUT • Developer/vendor can’t possibly anticipate all circumstances in which the product will be used. • Environment keeps changing. • In practice, baddies can always create new attack 'vectors' • Baddies are like opponents in a game: both sides evolve and anticipate each other. • Developer can’t anticipate every possible move by opponents. • [My opinion: In reality, developers and vendors should only be liable where they have been negligent in dealing with known vulnerabilities and types of such, doing reasonable threat analysis, or communicating limitations.]
Situational Awareness ``[T]here are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.’’ • Former United States Secretary of Defense Donald Rumsfeld • ``A subject is certain of something when he knows that thing; he is uncertain when he does not know it, but he knows he does not: he is consciously uncertain. On the other hand, he is unaware of something when he does not know it, and he does not know he does not know, and so on ad infinitum: he does not perceive, does not have in mind, the object of knowledge. The opposite of unawareness is awareness.’’ • Awareness and Partitional Information Structures, Salvatore Modica; Aldo Rustichini, Theory and Decision 37 (1)
Some sources • Roger Derksen, Huub van der Wouden, Paul Heath (2007) Testing The Heathrow Terminal 5 Baggage Handling System – Before it is Built. Paper Presented at Eurostar 2007, Manchester UK. • House of Commons Transport Select Committee (2008) The Opening of Heathrow Terminal 5. Published by The Stationery Office Ltd. • Ian Sommerville, Software Engineering, 7th Edition • Computer Weekly, 27 March 2008 • John Rooksby, University of St Andrews, notes on T5. • Slides: http://www.software-engin.com/teaching/systems-engineering-for-lscits