Internet Quality-of-Service (QoS). Henning Schulzrinne Columbia University Fall 2003. Quality of Service. Motivation Service availability Elementary queueing theory Traffic characterization & control Integrated services (RSVP, NSIS) Differentiated services (DiffServ).
Internet Quality-of-Service (QoS) Henning Schulzrinne Columbia University Fall 2003
Quality of Service • Motivation • Service availability • Elementary queueing theory • Traffic characterization & control • Integrated services (RSVP, NSIS) • Differentiated services (DiffServ)
What is quality of service? • Many applications are sensitive to the effects of delay (+ jitter) and packet loss • may have a “floor” below which utility drops to zero • The existing Internet architecture provides a best-effort service • All traffic is treated equally (generally, FIFO queuing) • No mechanism for distinguishing between delay-sensitive and best-effort traffic • Original IP architecture (IPv4) has a TOS (type-of-service) byte in the packet header • RFC 795: defined multiple axes (delay, throughput, reliability) • rarely used outside some (rumor) military networks • [Figure: utility ($) as a function of bandwidth]
Motivation • QoS service availability • not good enough if all but 2 minutes of my phone call sound perfect • Support mission-critical applications that can’t tolerate disruption • VoIP • VPNs (LAN emulation) • high-availability computing • Charge more for business applications vs. consumer applications
Service availability • Users do not care about QoS • at least not about packet loss, jitter, delay • rather, it’s service availability: how likely is it that I can place a call and not get interrupted? • availability = MTBF / (MTBF + MTTR) • MTBF = mean time between failures • MTTR = mean time to repair • availability = successful calls / first call attempts • equipment availability: 99.999% (“5 nines”) → about 5 minutes of downtime per year • AT&T (2003): • Sprint IP frame relay SLA: 99.5%
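The availability formula above can be turned into yearly downtime with a few lines of arithmetic. The following is a minimal sketch; the function names and the 365-day year are assumptions made for illustration, not part of the slides.

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def availability(mtbf, mttr):
    """availability = MTBF / (MTBF + MTTR); both arguments in the same time unit."""
    return mtbf / (mtbf + mttr)


def downtime_minutes_per_year(avail):
    """Expected minutes per year during which the service is unavailable."""
    return (1.0 - avail) * MINUTES_PER_YEAR


five_nines_minutes = downtime_minutes_per_year(0.99999)   # roughly 5.3 minutes
sla_995_hours = downtime_minutes_per_year(0.995) / 60.0   # roughly 43.8 hours
```

Under these assumptions, "5 nines" allows a little over 5 minutes of downtime per year, while a 99.5% SLA allows almost two days.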
Availability – PSTN metrics • PSTN metrics (Worldbank study): • fault rate • “should be less than 0.2 per main line” • fault clearance (~ MTTR) • “next business day” • call completion rate • during network busy hour • “varies from about 60% - 75%” • dial tone delay
Example PSTN statistics Source: Worldbank
Measurement setup • Active measurements • call duration 3 or 7 minutes • UDP packets: • 36 bytes alternating with 72 bytes (FEC) • 40 ms spacing • September 10 to December 6, 2002 • 13,500 call hours
Call success probability • 62,027 calls succeeded, 292 failed → 99.53% availability • roughly constant across I2, I2+, commercial ISPs
Overall network loss • PSTN: once connected, call usually of good quality • exception: mobile phones • compute periods of time below loss threshold • 5% causes degradation for many codecs • others acceptable till 20%
Network outages • sustained packet losses • arbitrarily defined at 8 packets • far beyond any recoverable loss (FEC, interpolation) • 23% outages • make up significant part of 0.25% unavailability • symmetric: A→B ⇔ B→A • spatially correlated: A→B ⇔ A→X • not correlated across networks (e.g., I2 and commercial)
Outage-induced call abortion probability • Long interruption → user likely to abandon call • from E.855 survey: P[holding] = e^(-t/17.26) (t in seconds) • half the users will abandon call after 12 s • 2,566 calls have at least one outage • 946 of 2,566 expected to be dropped → 1.53% of all calls
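The holding-time model above is a plain exponential, so the "half abandon after 12 s" figure and the dropped-call share can be checked directly. This is a small sketch; the function names are hypothetical, and the call counts are the ones quoted on the slides.

```python
import math

TAU_SECONDS = 17.26  # time constant of the exponential fit quoted on the slide


def p_holding(t_seconds):
    """Probability that a user is still holding after an outage of t seconds."""
    return math.exp(-t_seconds / TAU_SECONDS)


def median_abandon_time():
    """Outage duration at which half of the users have given up (P[holding] = 0.5)."""
    return TAU_SECONDS * math.log(2)


half_time = median_abandon_time()      # about 12 seconds

# Share of calls expected to be dropped, relative to the 62,027 successful calls.
dropped_share = 946 / 62027            # about 1.53%
```

The 1.53% figure is reproduced when the 946 expected drops are divided by the successfully established calls; dividing by all call attempts gives a slightly smaller value.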
Conclusions from measurement • Availability in space is (mostly) solved; availability in time restricts usability for new applications • initial investigation into service availability for VoIP • need to define metrics for, say, web access • unify packet loss and “no Internet dial tone” • far less than “5 nines” • working on identifying fault sources and locations • looking for additional measurement sites
What’s next? • Existing SLAs are mostly useless • too many exceptions • wrong time scales: month vs. minutes • no guarantees for interconnects • Existing measurements similarly dubious • Limited ability to learn from mistakes • what are the primary causes of service unavailability? • what can I do to protect myself – multi-homing via same fiber? diverse access mechanisms? • Consumers of services have no good ways to compare service availability • only some very large customers may get access to carrier-internal data • Thus, market failure • Need published metrics • similar to switch availability reporting
What's hard to scale (and not) • Signaling does not have to be hard: • one message, on a reliable peering channel or IP router alert option • NSIS effort in the IETF? • YESSIR: RTCP-based signaling • 700 MHz Celeron processor • 10,000 flow setups/second, 300,000 soft-state flows • If scaling matters, sink-tree based reservation (BGRP)
Diversity is good • Unlike routing, no need for single signaling protocol: • multicast is much harder • dumb end devices • edge "pop-up" only show up in edge nodes
AAA • Signaling can easily be done in ASIC (no harder than IP), but • need cryptographic verification of request • need interface to Authentication, Authorization, Accounting (AAA) • cross-domain authentication hard, but 3G networks will do it anyway • easier if both sides ask their own access router • see also: iPass for dial-up, OSP (open settlement protocol)
AAA example • [Figure: source and destination attach to the Internet through access routers AR1 and AR2; labels: "signs request", "reserves for both directions"] • Cell phone model: both sides pay
Reservation scaling • Example: every long-distance call in the US uses VoIP with per-flow resource reservation • 2000: 567.4 billion minutes @ 10 minutes each → 1,800 calls/second • single MySQL server can sustain 500 to 2,000 queries+updates/second
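The call-rate estimate above is straightforward arithmetic. The sketch below reproduces it, assuming calls are spread evenly over a 365-day year (an assumption for illustration; busy-hour rates would be higher). The concurrent-call figure is an added illustration using Little's law, not a number from the slides.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600     # assumption: 365-day year

minutes_per_year = 567.4e9             # US long-distance minutes in 2000 (from the slide)
avg_call_minutes = 10                  # average call length (from the slide)

calls_per_year = minutes_per_year / avg_call_minutes
calls_per_second = calls_per_year / SECONDS_PER_YEAR   # about 1,800 setups/second

# Little's law: calls in progress = arrival rate x mean holding time.
concurrent_calls = calls_per_second * avg_call_minutes * 60
```

At roughly 1,800 setups per second, the load is within the range the slide quotes for a single database server, which is the point of the example.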
Business models don't work • Most of the time, "tin" service is no worse than "platinum" service • can't impress others with platinum AmEx card • no frequent flyer bonuses • everybody switches only when the network is in bad shape
[Figure: router QoS architecture (source: USC EE-S 555). Resource control & reservation: Application (Tspec), Reservation Protocol, Admission Control (Y/N), Routing Protocols & DBs, Traffic Control DB. Data path: Classifier & route selection, Packet Scheduler, with QoS queuing and best-effort queuing]
RED (Random Early Detection) • TCP synchronization effect: during overload, many connections lose packets and go into slow start • RED: start dropping based on average queue occupancy (vs. instantaneous queue occupancy) • Parameter setting critical and non-trivial • See also RFC 2309
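The idea of dropping on the average rather than the instantaneous queue length can be sketched as follows. This is a simplified illustration, not a full RED implementation: it omits the inter-drop count correction and idle-time handling, and the class name and parameter values (`min_th`, `max_th`, `max_p`, `weight`) are assumptions chosen only for the example.

```python
import random


class RedQueue:
    """Simplified RED drop decision based on an EWMA of the queue length."""

    def __init__(self, min_th=5.0, max_th=15.0, max_p=0.1, weight=0.002, rng=None):
        self.min_th = min_th      # below this average: never drop
        self.max_th = max_th      # at or above this average: always drop
        self.max_p = max_p        # drop probability just below max_th
        self.weight = weight      # EWMA weight given to the newest sample
        self.avg = 0.0            # average queue occupancy in packets
        self.rng = rng or random.Random()

    def update_average(self, queue_len):
        """Fold the instantaneous queue length into the moving average."""
        self.avg = (1.0 - self.weight) * self.avg + self.weight * queue_len
        return self.avg

    def drop_probability(self):
        """Linear ramp from 0 at min_th up to max_p at max_th, then 1."""
        if self.avg < self.min_th:
            return 0.0
        if self.avg >= self.max_th:
            return 1.0
        return self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)

    def should_drop(self, queue_len):
        """Call once per arriving packet with the current queue length."""
        self.update_average(queue_len)
        return self.rng.random() < self.drop_probability()
```

With a small weight, a short burst barely moves the average, so it is not penalized; only sustained overload raises the drop probability. Because drops are randomized, connections back off at different times, which is what counters the synchronization effect. Choosing the thresholds and weight is the non-trivial part the slide mentions.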
ECN (Explicit Congestion Notification) • Extension of RED: mark instead of drop • RFC 2481 (“A Proposal to add Explicit Congestion Notification (ECN) to IP”) • IP TOS bit 6 (ECT) indicates support for the mechanism • IP TOS bit 7 (CE, labeled ECN here) indicates congestion • Needs cooperation of TCP (or similar protocols) • TCP should act almost as if packet was dropped • halve the congestion window • but don’t do slow start • [Figure: sender transmits with ECT=1, ECN=0; congested router sets ECT=1, ECN=1; receiver returns ECN echo in TCP ACK]
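The mark-instead-of-drop behavior and the sender's reaction can be sketched as below. This is an illustrative model only: the function names are hypothetical, the congestion window is counted in segments, and the bit masks follow the RFC 2481 layout (ECT is bit 6 and CE is bit 7 of the TOS octet, counting from the most significant bit).

```python
ECT_MASK = 0x02  # TOS bit 6: sender's transport is ECN-capable
CE_MASK = 0x01   # TOS bit 7: congestion experienced (set by a router)


def router_mark(tos, congested):
    """Return (new_tos, drop). Mark ECN-capable packets instead of dropping them."""
    if not congested:
        return tos, False
    if tos & ECT_MASK:
        return tos | CE_MASK, False   # signal congestion, keep the packet
    return tos, True                  # non-ECN traffic falls back to a drop


def on_ecn_echo(cwnd_segments):
    """Sender reaction to an ECN echo: halve the window, no slow start."""
    return max(cwnd_segments // 2, 1)
```

For example, `router_mark(0x02, True)` yields a marked packet that is still forwarded, and a sender with a 20-segment window continues at 10 segments instead of restarting from one segment as it would after a timeout.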
Next steps in signaling (NSIS) • RSVP not widely used for resource reservation • but is used for MPLS path setup • design heavily biased by multicast needs • marginal and after-the-fact security • limited support for IP mobility • Thus, IETF NSIS working group developing new framework for general state management protocol • resource reservation • NAT and firewall control • traffic and QoS measurement • MPLS and lambda path setup • Split into two components: • NSLP: services • NTLP: transport
NSIS • On-path vs. off-path • off-path bandwidth brokers • Discovery of next NTLP or NSLP hop • use router alert option • [Figure: protocol stack with NSLP applications (QoS, NAT/FW, measurement) on top of NTLP, which runs over SCTP, UDP, or TCP]