Backward and forward looking at dependable and secure computing

Yinghua Min, Fellow of IEEE • Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China • At PRDC09, 2009/11/16

Outline • Historical review of dependable computing • FTCS • DSN • IFIP WG10.4 • PRDC • New challenges of dependable and secure computing • Old techniques facing new environments • Concentrated on practical problems, rather than conceptual games

FTCS • Established in 1970 • FTC for critical applications • Aviation • Spaceflight • Railway transportation • A highly academic symposium

Dependable computing • People understood that our area needed some extension. • A. Avizienis and Jean-Claude Laprie proposed the concept of Dependable Computing at FTCS-15 in 1985. • Human beings were included in systems from then on. • Malicious faults • FTCS + DCCA → DSN in 2000

DSN • Since 2000 • DSN has pioneered the fusion between security and dependability. • Understanding the need to simultaneously fight against cyber attacks, accidental faults, design errors, and unexpected operating conditions.

PRDC • 1989 Joint Symposium on Fault-Tolerant Computing, Chongqing, China, July 18-20, 1989 • 1991 Pacific Rim International Symposium on FTS, Kawasaki, Japan • 1999 Pacific Rim International Symposium on Dependable Computing, Hong Kong, China • Keynote: Computer Crime in Hong Kong (Mr. Anthony Fung) • From the HK police department • Computer Crime and Internet Fraud • Its evidence for litigation support

Trusted Computing • Trusted Computing Platform Alliance (TCPA) in 1999 • Trusted Computing Group (TCG) since 2003 • TPM → TCM (Trusted Cryptography Module) 2008 • Trusted root → security chip → trusted BIOS → trusted OS → trusted systems • Basically for PCs in the area of secure computing

IEEE Transactions on Dependable and Secure Computing • Since 2004 • Separate dependable computing from secure computing

System dependability • The system dependability situation has been getting worse rather than improving in recent years. Quoting the roadmap of AMSD, the European Commission’s Accompanying Measure on System Dependability: “the availability that was typically achievable by (wired) telecommunication services, and computer systems in the 1990s was 99.999 percent to 99.9 percent. Now cellular phone services, and web-based services, typically achieve an availability of only 99 per cent to 90 per cent” (AMSD Roadmap 2003, p. 31).
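
To make the quoted percentages concrete, the following minimal sketch converts an availability figure into expected downtime per year. The function name and the 365-day year are assumptions made for illustration; they are not part of the AMSD Roadmap.

```python
# Convert an availability percentage into expected downtime per year.
# Assumption: a 365-day year and downtime spread uniformly over it.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the expected unavailable minutes per year."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for availability in (99.999, 99.9, 99.0, 90.0):
        minutes = downtime_minutes_per_year(availability)
        print(f"{availability:>7}% available: "
              f"{minutes:10.1f} min/year ({minutes / 60 / 24:6.2f} days)")
```

Under these assumptions, 99.999 percent corresponds to roughly five minutes of downtime per year, while 90 per cent corresponds to 36.5 days.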

New challenges • Three key requirements for computers • High performance • Low power • Dependability • Nano-ICs, more vulnerable • to transient (or soft) errors • to permanent malfunctions due to materials aging or wearout mechanisms. • Nano-scale IC reliability • Counterfeit ICs • Dependability and security in cloud computing • Signal integrity • Dependable software needs evidence.

Nano-scale IC reliability • The "International Technology Roadmap for Semiconductors" [SIA] estimates that by 2019 the feature size of process technology will reach 7nm, but only between 10% and 20% of chips will be defect free. • Power densities will skyrocket and on-chip temperatures will increase. • Unreliability induced by small delay defects, adjacent-line coupling, crosstalk, and process variation • Variability-tolerant design • Appropriate measures must be taken, such as fault tolerance, redundancy, repair, and reconfiguration.

Counterfeit Electronic Components • Device lead condition shows parts were used • Marking indicates an Op Amp from ADI… but contains die for a Voltage Reference from PMI • Evidence of prior marking for a part with inferior performance • Part number indicates a CLCC package, but this package is a CDP… accompanied by bogus test report • These are incidents that jeopardize the performance and reliability of electronics.

Baofeng.com incident in China • Network outages in Jiangsu, Anhui, Guangxi, Henan, Gansu, and Zhejiang in China, May 19, 2009 • The network failure was caused by the domain name system (DNS) failure of Baofeng.com, the website of the Chinese music player provider. • The failure in turn caused a surge of requests to DNS servers and degraded the processing performance of the network. • The servers of DNSPod were attacked by a malicious virus. • Was the incident caused by a software fault or by an attack? Maybe both.

Bohrbugs and Mandelbugs • Bohrbugs • An unusual software bug that consistently makes its presence known under conditions that are either well-defined, possibly unknown or both. • Mandelbugs • A bug whose behavior doesn't appear malicious, but has such a high level of complexity that it appears when errors are accumulated for some time. • Bohrbugs behaving like Mandelbugs • Becoming an attack

Dependability in the Cloud • On April 26, 2008, Amazon’s Elastic Compute Cloud (EC2) had an outage • due to a single customer applying a very large set of unusual firewall rules • triggering a performance degradation bug in Amazon’s distributed firewall. • Availability and privacy are serious challenges for applications hosted on cloud infrastructure.

Challenges on cloud infrastructure • Cloud applications increase risk levels • Sharing of cloud resources by entities that engage in a wide range of behaviors and employ best practices to varying degrees • An environment with a few large cloud infrastructure providers • increases the risk of common mode outages affecting a large number of applications • provides highly visible targets for attackers. • Multiple administrative domains between the application and infrastructure operators reduce end-to-end system visibility and error propagation information, thus making problem detection and diagnosis very difficult. • A cloud provider's economies of scale allow levels of investment in redundancy and dependability that smaller operators may not be able to afford.

Old FTC techniques facing new environments • Checkpointing • Redundancy • Software fault-tolerance in middleware • ECC in mass storage systems • Fault detection and diagnosis in virtual machines • Assessment of dependability and security

Checkpointing for supercomputers • Periodic checkpointing → cooperative checkpointing • At runtime, the application requests a checkpoint. • The system grants or denies the checkpoint (to skip some of them) • based on various system-wide heuristics, including disk or network usage and reliability information. • Using cooperative checkpointing in one instance • reduced bounded slowdown by a factor of nine, • improved system utilization, and lost no more work to failures than periodic checkpointing • even when event prediction had a 90% false negative rate.
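
The request/grant protocol described on this slide can be sketched as follows. This is a minimal illustration only: the `SystemState` fields, the thresholds, and the rule that at most `max_skips` consecutive requests are denied are all assumptions, not the heuristics of any particular cooperative checkpointing system.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    """Hypothetical system-wide view used by the gatekeeper."""
    io_load: float              # fraction of disk/network bandwidth in use, 0..1
    failure_predicted: bool     # output of some event predictor (may be wrong)
    skipped_in_a_row: int = 0   # consecutive denied checkpoint requests


def grant_checkpoint(state: SystemState,
                     max_io_load: float = 0.8,
                     max_skips: int = 3) -> bool:
    """Decide whether an application's checkpoint request is granted."""
    if state.failure_predicted:
        grant = True    # reliability information says a failure may be near
    elif state.skipped_in_a_row >= max_skips:
        grant = True    # bound the amount of work that can be lost
    else:
        grant = state.io_load <= max_io_load  # skip while I/O is congested
    state.skipped_in_a_row = 0 if grant else state.skipped_in_a_row + 1
    return grant


if __name__ == "__main__":
    busy = SystemState(io_load=0.95, failure_predicted=False)
    print([grant_checkpoint(busy) for _ in range(5)])
```

Bounding the number of consecutive skips is one simple way to keep lost work limited even when the failure predictor misses most events.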

Checkpointing at micro-operation level • [Figure: processor-state timeline showing committed, noise-speculative, and noise-verified states, with the points where a violation occurs and where it is detected] • Sliding window based on sensor delay • Delayed commit: completed results are held in buffers until verified to be correct • Noise-speculative • Noise-verified • Rollback to a previous noise-verified state when a violation is detected
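
The delayed-commit idea can be modelled in software as a buffer whose entries become verified only after the sensor delay has elapsed without a violation. The class below is a simplified sketch under that assumption; the class name, the cycle-based timing, and the policy of squashing every speculative entry on a violation are hypothetical and do not describe a specific processor design.

```python
from collections import deque


class DelayedCommitBuffer:
    """Toy model: results stay speculative until the sensor window has passed."""

    def __init__(self, sensor_delay: int):
        self.sensor_delay = sensor_delay  # cycles the noise sensor needs to report
        self.speculative = deque()        # (completion_cycle, result), unverified
        self.committed = []               # noise-verified results

    def complete(self, cycle: int, result) -> None:
        """Record a finished micro-operation as noise-speculative."""
        self.speculative.append((cycle, result))

    def tick(self, cycle: int, violation_detected: bool) -> list:
        """Advance one cycle; return the results squashed by a rollback."""
        if violation_detected:
            # The violation may have occurred anywhere inside the sliding
            # window, so every unverified result is discarded.
            squashed = [result for _, result in self.speculative]
            self.speculative.clear()
            return squashed
        # Entries older than the sensor delay are now known to be clean.
        while self.speculative and cycle - self.speculative[0][0] >= self.sensor_delay:
            self.committed.append(self.speculative.popleft()[1])
        return []


if __name__ == "__main__":
    buf = DelayedCommitBuffer(sensor_delay=2)
    buf.complete(0, "a"); buf.tick(0, False)
    buf.complete(1, "b"); buf.tick(1, False)
    buf.complete(2, "c"); buf.tick(2, False)
    print(buf.tick(3, True), buf.committed)
```

After a rollback, execution would resume from the last entry in `committed`, which plays the role of the previous noise-verified state.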

Redundancy • At the application level and at the hardware level. • Byzantine fault tolerance • Algorithms that are robust to arbitrary types of failures in distributed systems. • They do not require any centralized control that has some guarantee of always working correctly. • Data integrity • Redundancy in different places • RAID (redundant array of independent disks), a fault-tolerant storage device that uses data redundancy. • Synchronization is a big challenge.
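
As one concrete instance of data redundancy, the sketch below shows XOR parity, the scheme used by parity-based RAID levels such as RAID 5, recovering a lost block. The block contents and the helper name `xor_blocks` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    # Three data blocks striped across three disks (illustrative contents).
    data = [b"depe", b"ndab", b"le!!"]
    parity = xor_blocks(data)           # stored on a fourth disk

    # Suppose the disk holding data[1] fails: XOR of the survivors and the
    # parity block reproduces the missing block.
    recovered = xor_blocks([data[0], data[2], parity])
    print(recovered == data[1])
```

A single parity block tolerates the loss of any one block in the stripe; it does not by itself detect silent corruption or tolerate two simultaneous losses.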

Software fault-tolerance in middleware • Optimal fault tolerance strategy for both stateless and stateful Web services • Retry • Recovery block • N-version programming • Network characteristics: • Freedom • Dynamic • Multi-tier service • Debugging performance problems of multi-tier services built from black boxes.
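
The three strategies named on this slide can be sketched in a few lines each. This is a minimal illustration, assuming that a service is any Python callable, that failures surface as exceptions, and that N-version results are hashable; the function names are hypothetical and not taken from any middleware product.

```python
from collections import Counter


def retry(service, request, attempts=3):
    """Retry: call the same service again after a failure (attempts >= 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            return service(request)
        except Exception as exc:  # treat any exception as a failed attempt
            last_error = exc
    raise last_error


def recovery_block(alternates, acceptance_test, request):
    """Recovery block: try alternates in order until one passes the test."""
    for alternate in alternates:
        try:
            result = alternate(request)
        except Exception:
            continue
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version(versions, request):
    """N-version programming: run all versions and take the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(request))
        except Exception:
            pass  # a crashed version simply casts no vote
    if not results:
        raise RuntimeError("every version failed")
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return value


if __name__ == "__main__":
    square_versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x + 1]
    print(n_version(square_versions, 3))
```

Retry suits transient failures of a single service, the recovery block needs a trustworthy acceptance test, and N-version programming trades extra invocations for not needing one.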

Soft errors • Soft errors involve changes to data • Cosmic rays creating energetic neutrons and protons • The importance of soft errors increases as chip technology advances. • chip-level soft error • the radioactive atoms in the chip's material decay and release alpha particles into the chip. • Built-in Soft Error Resilience (BISER) Cell • system-level soft error • the data being processed is hit with a noise phenomenon

Transient Faults • Program replication • N-version programming • Time-redundant techniques • Virtual duplex systems • Tandem NonStop Cyclone is a custom system designed to use process replicas for transaction processing workloads. • Transient Fault Tolerance for Multi-core Architectures • Redundancy at the process level • Ensuring correct hardware execution or ensuring correct software execution
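
A time-redundant technique can be illustrated by executing the same computation twice and comparing the results, with a third run as a tie-breaker. The sketch below assumes a deterministic computation whose results can be compared with `==`; the function name is hypothetical.

```python
def run_with_time_redundancy(compute, *args):
    """Run `compute` twice; on disagreement, run once more and vote."""
    first = compute(*args)
    second = compute(*args)
    if first == second:
        return first
    # A transient fault corrupted one of the two runs; a third run decides.
    third = compute(*args)
    if third == first or third == second:
        return third
    raise RuntimeError("three runs disagree; fault may not be transient")


if __name__ == "__main__":
    print(run_with_time_redundancy(sum, [1, 2, 3]))
```

This masks faults that affect only one execution. A permanent fault that corrupts every run identically would pass the comparison undetected, which is why such techniques target transient faults.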

Assessment of dependability and security • The original definition of dependability is the ability to deliver service that can justifiably be trusted. • Justification • Evaluation • Benchmarking • Standardization • There is a dependability and security gap that is often perceived by users as a lack of trustworthiness in computer applications, and that is in fact undermining the network and service infrastructures that constitute the very core of the knowledge-based society.

Difficulties for assessment • The assessment of dependability in a standard and comparable way, considering all • Component failures • Software bugs • Human mistakes • Interaction mistakes • Malicious attacks • The quality of measurements • The assessment of dependability in component based, dynamic and adaptive systems and networks • The integration with the development process

Denial of service (DoS) • Effects of DoS attacks are experienced by users as a severe slowdown, service quality degradation, or service disruption. • We need accurate, quantitative, and versatile DoS impact metrics regardless of the underlying mechanism for service denial, attack dynamics, legitimate traffic mix, or network topology. • Measuring DoS through selected legitimate traffic parameters: • packet loss, • traffic throughput or goodput, • request/response delay, • transaction duration, and • allocation of resources.
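
Several of the legitimate-traffic parameters listed above can be computed directly from a packet trace. The sketch below is a minimal illustration; the `Packet` record, the measurement window, and the choice of metrics are assumptions and do not reproduce any published DoS metric definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    """Hypothetical trace record for one legitimate packet."""
    sent_at: float                 # seconds
    received_at: Optional[float]   # None if the packet was lost
    payload_bytes: int


def dos_metrics(packets, window_seconds: float) -> dict:
    """Summarize legitimate traffic observed during one measurement window."""
    if not packets:
        raise ValueError("trace is empty")
    delivered = [p for p in packets if p.received_at is not None]
    delays = [p.received_at - p.sent_at for p in delivered]
    return {
        "packet_loss": 1 - len(delivered) / len(packets),
        "goodput_bytes_per_s": sum(p.payload_bytes for p in delivered) / window_seconds,
        "mean_delay_s": sum(delays) / len(delays) if delays else None,
    }


if __name__ == "__main__":
    trace = [
        Packet(0.0, 0.1, 100),
        Packet(0.5, None, 100),   # dropped during the attack
        Packet(1.0, 1.2, 100),
        Packet(2.0, 2.3, 100),
    ]
    print(dos_metrics(trace, window_seconds=2.0))
```

Comparing these values with the same metrics measured on an attack-free baseline gives a quantitative view of the impact that is independent of the attack mechanism.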

Conceptual games • Trustworthy computing • Trusted computing • Secure computing • Robustness • Survivability • Adoptability • Dependable computing • Availability • Maintainability • Reliability • Confident computing • Controllability • Cybersecurity • Manageability • Assurance • Usability • Integrity • Safety

Concluding remarks • Dependable computing is a perennial topic for information technology. • Dependability is as important as high performance and low power. • New challenges are coming with the advance of IT. • The gap between academia and industry • Concentrate on practical problems, rather than conceptual games

Thank you for your attention!
