1 / 32

PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

PhD Course: Performance & Reliability Analysis of IP-based Communication Networks. Henrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen. Day 1 Basics & Simple Queueing Models(HPS) Day 2 Traffic Measurements and Traffic Models (MC)

kirkan
Download Presentation

PhD Course: Performance & Reliability Analysis of IP-based Communication Networks

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. PhD Course: Performance & Reliability Analysis of IP-based Communication Networks Henrik Schiøler, Hans-Peter Schwefel, Mark Crovella, Søren Asmussen • Day 1 Basics & Simple Queueing Models(HPS) • Day 2 Traffic Measurements and Traffic Models (MC) • Day 3 Advanced Queueing Models & Stochastic Control (HPS) • Day 4 Network Models and Application (HS) • Day 5 Simulation Techniques (HPS, SA) • Day 6 Reliability Aspects (HPS) Organized by HP Schwefel & H Schiøler

  2. Content • Basic concepts • Fault, failure, fault-tolerance, redundancy types • Availability, reliability, dependability • Life-times, hazard rates • Mathematical models • Availability models, Life-time models, Repair-Models • Performability • Availability and Network Management • Rolling upgrade, planned downtime, etc. • Methods & Protocols for fault-tolerant systems • SW Fault-Tolerance • Network Availability • IP Routing, HSRP/VRRP, SCTP • Server/Service Availability • Server Farms, Cluster solution, distributed reliability Demonstration: Fault-tolerant call control systems

  3. Motivation: Failure Types in Communication NWs WANs PSTNs • IP networks designed to be ‘robust’ but not highly available • End-to-end availability • Public Internet: below 99% (see e.g. http://www.netmon.com/IRI.asp). • PSTN: 99.94% • Operation, Administration, and Maintenance (OAM) one major source of outage times

  4. Data HA • backup and restore • regional redundancy • rolling data upgrade Startup & Shutdown • system startup • graceful shutdown Rolling Upgrade Supervision & Auditing Error Handling • rolling upgrade • patch procedure • migration • resource supervision • health/lifetime supervision • overload protection • detection and recovery • tracing and logging • escalation and alarming • operational modes Overview: Role of OAM OAM System High Availability • repair and replacement • Fault Management • HA control & verification • system diagnosis Network Element HA Network HA • network design • network redundancy /fail-over NE-Interface overcome outages keep the system stable • multi-homing • stacks and protocols • IP fail-over Continuous Execution • redundancy and diversity • state replication • message distribution Adapted from S. Uhle, Siemens ICM N

  5. OAM Concepts I: Startup/Upgrade/Shutdown • Startup • Put components into stable operational state • Caution: potential for high-load (synchronisation etc.) in start-up phase • Concept for start-up of larger sub-systems needed (e.g. Mutual dependence) • Upgrades • Rolling Upgrade • Allow upgrades of single components while its replicates are in operation • Consistency/data compatibility problems • Patch concept (SW) • Instead of full SW re-installation, incremental changes • Goal: zero or minimal outage of components • Graceful shutdown • Take components safely out of operation • Possible steps: • Stop accepting/redirect new tasks • Finish existing tasks • Synchronize data and isolate component • HW: Hot plug/swap ability • E.g. Interface cards • Life Testing

  6. OAM Concepts II:Monitoring/Supervision • Resource Supervision and Overload Protection • E.g. CPU load, queue-lengths, traffic volumes • Alarming  operator intervention, e.g. upgrades • Signalling to reduce overload at source • Graceful degradation • Logging & Tracing • For off-line analysis of incidents • Correlation of traces for system analysis • Problem: adequate granularity of logging data • Service concepts/contracts • Reaction to alarms • System recovery modes • Spare-parts handling • Qualified technicians, availability (24/7 ?) • High-availability of OAM system • Redundancy (frequently separate OAM network(s)) • Prioritisation of OAM traffic • Handling/Storing of OAM/logging data

  7. Content • Basic concepts • Fault, failure, fault-tolerance, redundancy types • Availability, reliability, dependability • Life-times, hazard rates • Mathematical models • Availability models, Life-time models, Repair-Models • Performability • Availability and Network Management • Rolling upgrade, planned downtime, etc. • Methods & Protocols for fault-tolerant systems • SW Fault-Tolerance • Network Availability • IP Routing, HSRP/VRRP, SCTP • Server/Service Availability • Server Farms, Cluster solution, distributed reliability Demonstration: Fault-tolerant call control systems

  8. General Approaches: Fault-tolerance • Basic requirements for fault-tolerance • Number of fault-types and number of faults is bounded • Existence of redundancy (structural: equal/diversity; functional; information; time-redundancy/retries) • Functional Parts of fault-tolerant systems • Fault detection (& diagnosis) • Replication and Comparison (if identical realisations  not suitable for design errors!) • Inversion (e.g. Mathematical functions) • Acceptance (necessary conditions on result, e.g. ranges) • Timing behavior (time-outs) • Fault isolation: prevent spreading • Isolation of functional components, e.g. Atomic actions, layering model • Fault recovery • Backward: Retry, Journalling (Restart), Checkpointing & Rollback (non-trivial in distributed cooperating systems!) • Forward: move to consistent, acceptable, safe new state; but loss of result • Compensation, e.g. TMR, FEC

  9. Software fault-tolerance • Mainly: Design errors & user interaction (as opposed to production errors, wear-out, etc.) • Observations/Estimates (experience in computing centers, numbers a bit old however) • 0.25...10 errors per 1000 lines of code • Only about 30% of error reports by users accepted as errors by vendor • Reaction times (updates/patches): weeks to months • Reliability not nearly as improved as hardware errors, various reasons: • immensely increased software complexity/size • Faster HWmore operations per time-unit executed  higher hazard rate • However, relatively short down-times (as opposed to HW): ca 25 min • Approaches for fault-tolerance • Software diversity (n-version programming, back-to-back tests of components) • However, same specification often leads to similar errors • Forced diversity as improvement (?) • Majority votes for complex result types not trivial

  10. Content • Basic concepts • Fault, failure, fault-tolerance, redundancy types • Availability, reliability, dependability • Life-times, hazard rates • Mathematical models • Availability models, Life-time models, Repair-Models • Performability • Availability and Network Management • Rolling upgrade, planned downtime, etc. • Methods & Protocols for fault-tolerant systems • SW Fault-Tolerance • Network Availability • IP Routing, HSRP/VRRP, SCTP • Server/Service Availability • Server Farms, Cluster solution, distributed reliability Demonstration: Fault-tolerant call control systems

  11. Application TCP/UDP IP Link-Layer Network Connectivity: Basic Resilience • Loss of network connectivity • Link Failures (cable cut) • Router/component failures along the path • Basic resilience features on (almost) every protocol layer • L1+L2: FEC, Link-Layer Retransmission, Resilient Switching (e.g. Resilient Packet Ring) • L3, IP: Dynamic Routing • TCP: Retransmissions • Application: application-level retransmissions, loss-resilient coding (e.g. VoIP, Video) • IP Layer Network Resilience: Dynamic Routing, e.g. OSPF • ’Hello’ packets used to determine adjacencies and link-states • Missing hello packets (typically 3) indicate outages of links or routers • Link-states propagated through Link-state advertisements (LSA) • Updated link-state information (adjacencies) lead to modified path selection L5-7 L4 L3 L2

  12. Dynamic Routing: Improvements • Drawbacks of dynamic routing • Long Duration until newly converged routing tables (30sec up to several minutes) • Rerouting not possible if first router (gateway) fails • Improvements • Speed-up: Pre-determined secondary paths (e.g. Via MPLS) • ’local’ router redundancy: Hot Standby (HSRP, RFC2281) & Virtual Router Redundancy Protocol (VRRP,RFC2338) • Multiple routers on same LAN • Master performs packet routing • Fail-over by migration of ’virtual’ MAC address

  13. SCTP Association Host A IPb1 Host B IPa1 Separate Networks IPa2 IPb2 Streaming Control Transmission Protocol (SCTP) • Defined in RFC2960 (see also RFC 3257, 3286) • Purpose initially: Signalling Transport • Features • Reliable, full-duplex unicast transport (performs retransmissions) • TCP-friendly flow control (+ many other features of TCP) • Multi-streaming, in sequence delivery within streams Avoid head of line blocking (performance issue) • Multi-homing: hosts with multiple IP addresses, path monitoring (heart-beat mechanism), transparent failover to secondary paths • Useful for provisioning of network reliability

  14. Content • Basic concepts • Fault, failure, fault-tolerance, redundancy types • Availability, reliability, dependability • Life-times, hazard rates • Mathematical models • Availability models, Life-time models, Repair-Models • Performability • Availability and Network Management • Rolling upgrade, planned downtime, etc. • Methods & Protocols for fault-tolerant systems • SW Fault-Tolerance • Network Availability • IP Routing, HSRP/VRRP, SCTP • Server/Service Availability • Server Farms, Cluster solution, distributed reliability Demonstration: Fault-tolerant call control systems

  15. Client Firewall / Switch Internet Resilience for state-less applications • State-less applications • Server response only depends on client request • Principle in most IETF protocols, e.g. http • If at all, state stored in client, e.g. ’cookies’ • Redundancy in ’server farms’ (same LAN) • Multiple servers • IP address mapped to (replicated) ’load balancing’ units • HW load balancers • SW load balancers • Load balancers forward requests to specific Server MAC address according to Server Selection Policy, e.g. • Random Selection • Round-Robin • Shortest Response-Time First 1 5 2 4 Gateway Node Gateway NodeBackup Service Nodes 3 • Server failure • Load-Balancer does not select failed server any more • Examples: Web-services, Firewalls

  16. Cluster Framework • SW Layer across several servers within same subnetwork • Functionality • Load Balancer/dispatching functionality • Node Management/Node Supervision/Node Elimination • Network connection: IP aliasing (unsolicited/gratuitous ARP reply)  fail-over times few seconds (up to 10 minutes, if router/switch does not support unsolicited ARP!) • Single Image view (one IP address for cluster) • For communication partners • For OAM • Possible Platform: Server Blades

  17. Reliability middleware: Example RTP • Software Layer on top of cluster framework • All fault-tolerant operations are done within the cluster • One virtual IP address for the whole cluster (UDP dispatcher redirects incoming messages)  “one system image” • Transparency for the client • Redundancy at the level of processes: • Each process is replicated among the servers of the cluster • Only a faulty process is failed over • Notion of node disappears: • Use of logical name (replicated processes can be in the same node) • Middleware Functionality (Example: Reliable Telco Platform, FSC) • Context management • Reliable inter-process communication • Load Balancing for Network traffic (UDP/SS7 dispatcher) • Ticket Manager • Events/Alarming • Support of rolling upgrade • Node Manager • Supervision of processes • Communication of process status to other nodes • Trace Manager (Debugging) • Administrative Framework (CLI, GUI, SNMP)

  18. OS (e.g. Solaris) HW OS (e.g. Solaris) HW OS (e.g. Solaris) HW Example Architecture: Cluster middleware Parlay/ OSA Pay ment GSMHLR Third party applications Manage- ment Agents IN basic Apps-APIs Common Application Framework single image view Component Framework Reliant Telco Platform (RTP) Signalware OPS/ RAC Net worker Corba common basic functions ... Cluster Framework (Primecluster, SUNcluster)

  19. Fail-over Distributed Redundancy: Reliable Server Pooling Server Pool [State-sharing] • Distributed architecture • Servers need only IP address • Entities offering same service grouped into pools accessible by pool handle • Pool User (PU) contact servers (PE) after receiving the response to a name resolution request sent to a name server (NS) • Name Server monitors PEs • Messages for dynamic registration and De-registration • Flat Name Space • Architecture and pool access protocols (ASAP, ENRP) defined in IETF RSerPool WG • Failure detection and fail-over performed in ASAP layer of client PE (A) PE (B) (de-) registration Monitoring Name Server(s) Name Server(s) Name Pool User ASAP=Aggregate Server Access Protocol ENRP=Endpoint Name Resolution Protocol

  20. Rserpool: more details • RSerPool Scenario • Each PE contains full implementation of functionality, no distributed sub-processes (different to RTP)  reduced granularity for possible load balancing More Details: • RSerPool Name Space • Flat Name Space  easier management, performance (no recursive requests) • All name servers in operational scope hold same info (about all pools in operational scope) • Load Balancing • Load factors sent by PE to Name Server (initiated by PE) • In resolution requests, Name Server communicates load factors to PU • PU can use load factors in selection policies

  21. RSerPool Protocols: Functionality (Current status in IETF) • ENRP • State-Sharing between Name Servers • ASAP • (De -)Registration of Pool-elements (PE, Name Server) • Supervision of Pool-elements by a Name Server • Name Resolution (PU, Name Server) • PE selection according to policy (& load factors) (PU) • Failure detection based on transport layer (e.g. SCTP timeout) • Support of Fail-over to other Pool-element (PU-PE) • Business cards (pool-to-pool communication) • Last will • Simple Cookie Mechanism (usable for Pool Element state information) (PU-PE) • Under discussion: Application Layer Acknowledgements (PU-PE)

  22. RserPool and ServerBlades • Server Blades • Many processor cards in one 19‘‘racks  space efficiency • Integrated Switch (possibly duplicated) for external communication • Backplane for internal communication (duplicated) • Management support: automatic configuration & code replication • Combination RSerPool on Server Blades • No additional load balancing SW necessary (but less granularity) • Works on any heterogeneous system without additional implementation effort (e.g. Server Blades + Sun Workstations) • Standardized protocol stacks, no implementation effort in clients (except for use of RSerPool API)

  23. References/Acknowledgments • M. Bozinovski, T. Renier, K. Larsen: ‘Reliable Call Control’, Project reports and presentations, CPK/CTIF--Siemens ICM N, 2001-2004. • Siemens ICM N, S. Uhle and colleagues • Fujitsu Siemens Computers (FSC), ‘Reliable Telco Platform’, Slides. • E. Jessen, ‘Dependability and Fault-tolerance of computing systems’, lecture notes (German), TU Munich, 1996.

  24. Summary • Basic concepts • Fault, failure, fault-tolerance, redundancy types • Availability, reliability, dependability • Life-times, hazard rates • Mathematical models • Availability models, Life-time models, Repair-Models • Performability • Availability and Network Management • Rolling upgrade, planned downtime, etc. • Methods & Protocols for fault-tolerant systems • SW Fault-Tolerance • Network Availability • IP Routing, HSRP/VRRP, SCTP • Server/Service Availability • Server Farms, Cluster solution, distributed reliability Demonstration: Fault-tolerant call control systems

  25. Course Overview & Feedback

  26. PhD Course: Overview • Day 1 Basics & Simple Queueing Models(HPS) • Day 2 Traffic Measurements and Traffic Models (MC) • Day 3 Advanced Queueing Models & Stochastic Control (HPS) • Day 4 Network Models and Application (HS) • Day 5 Simulation Techniques (SA) • Day 6 Reliability Aspects (HPS,HS)

  27. Day 1: Basics & Simple Queueing Models • Intro & Review of basic stochastic concepts • Random Variables, Exp. Distributions, Stochastic Processes, Poisson Process & CLT • Markov Processes: Discrete Time, Continuous Time, Chapman-Kolmogorov Eqs., steady-state solutions • Simple Qeueing Models • Kendall Notation • M/M/1 Queues, Birth-Death processes • Multiple servers, load-dependence, service disciplines • Transient analysis • Priority queues • Simple analytic models for bursty traffic • Bulk-arrival queues • Markov modulated Poisson Processes (MMPPs) • MMPP/M/1 Queues and Quasi-Birth Death Processes

  28. Day 2: Traffic Measurements and Models(Mark) • Morning: Performance Evaluation • Marginals, Heavy-Tails • Autocorrelation • Superposition of flows, alpha and beta traffic • Long-Range Dependence • Afternoon: Network Engineering • Single Link Analysis • Spectral Analysis, Wavelets, Forecasting, Anomaly Detection • Multiple Links • Principle Component Analysis

  29. Day 3: Advanced Queueing Models • Matrix Exponential (ME) Distributions • Definition & Properties: Moments, distribution function • Examples • Truncated Power-Tail Distributions • Queueing Systems with ME Distributions • M/ME/1//N and M/ME/1 Queue • ME/M/1 Queue • Performance Impact of Long-Range-Dependent Traffic • N-Burst model, steady-state performance analysis • Transient Analysis • Parameters, First Passage Time Analysis (M/ME/1) • Stochastic Control: Markov Decision Processes • Definition • Solution Approaches • Example

  30. Day 4: Network Models (Henrik) • Queueing Networks • Jackson Networks • Flow-balance equations, Product Form solution • Deterministic Network Calculus • Arrival Curves • Service Curves: Concatenation, Prioritization • Queue-Lengths, Output Flow • Loops • Examples: Leaky Buckets,

  31. Day 5: Simulation Techniques (Hans, Soeren) • Basic concepts • Principles of discrete event simulation • Simulation tools • Random Number Generation • Output Analysis • Terminating simulations • Non-terminating/steady state case • Rare Event Simulation I: Importance Sampling • Rare Event Simulation II: Adaptive Importance Sampling

  32. Day 6: Reliability Aspects • Basic concepts • Fault, failure, fault-tolerance, redundancy types • Availability, reliability, dependability • Life-times, hazard rates • Mathematical models • Availability models, Life-time models, Repair-Models • Performability Models • Availability and Network Management • Rolling upgrade, planned downtime, etc. • Methods & Protocols for fault-tolerant systems • SW Fault-Tolerance • Network Availability • IP Routing, HSRP/VRRP, SCTP • Server/Service Availability • Server Farms, Cluster solution, distributed reliability Demonstration: Fault-tolerant call control systems

More Related