QoS Support in Operating Systems Banu Özden Bell Laboratories ozden@research.bell-labs.com
Vision • Service providers will offer storage and computing services • through their distributed data centers • connected by high-bandwidth networks • to globally distributed clients. • Clients will access these services via diverse devices and networks, e.g.: • mobile devices and wireless networks, • high-end computer systems and high-bandwidth networks. • These services will become utilities (e.g., storage utility, computing utility). • Eventually, resources will be exchanged and traded between geographically dispersed data centers to address fluctuating demand.
Eclipse/BSD:an Operating System with Quality of Service Support Banu Özden ozden@research.bell-labs.com
Motivation • QoS support for (server) applications: • web servers • video servers • Isolation and differentiation of: • entities serviced on the same platform • applications running on the same platform • QoS requirements: • client-based • service-based • content-based
Design Goals • QoS support in a general purpose operating system • Remain compatible with the underlying operating system • QoS parameters: • Isolation • Differentiation • Fairness • (Cumulative) throughput • Flexible resource management • capable of implementing a large set of provisioning needs • supports a large set of server applications without imposing significant changes to their design
Talk Outline • Schedulers • Reservation File System (reservfs) • Tagging • Web Server Experiments • Access Control and Profiles • Eclipse/BSD Status • Related Work • Future Work
Proportional sharing • Generalized processor sharing (GPS): let $\phi_i$ denote the weight of flow $i$, $W_i(t_1,t_2)$ the service received by flow $i$ in $[t_1,t_2]$, and $B(t_1,t_2)$ the set of flows continuously backlogged in $[t_1,t_2]$. • For any flow $i$ continuously backlogged in $[t_1,t_2]$:
$$\frac{W_i(t_1,t_2)}{W_j(t_1,t_2)} \ge \frac{\phi_i}{\phi_j} \qquad \text{for every flow } j$$
• Thus, the rate of flow $i$ in $[t_1,t_2]$ is:
$$r_i = \frac{\phi_i}{\sum_{j \in B(t_1,t_2)} \phi_j}\; r$$
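A quick worked example (ours, not from the slides) to make the rate formula concrete: three backlogged flows with weights $\phi_1 = 1$, $\phi_2 = 2$, $\phi_3 = 3$ on a link of rate $r = 60$ Mb/s receive
$$r_1 = \tfrac{1}{6} \cdot 60 = 10, \quad r_2 = \tfrac{2}{6} \cdot 60 = 20, \quad r_3 = \tfrac{3}{6} \cdot 60 = 30 \ \text{Mb/s},$$
and if flow 3 goes idle, GPS redistributes its share so that flows 1 and 2 receive 20 and 40 Mb/s.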
QoS Guarantees • Fairness • Throughput • Packet delay
Schedulers in Eclipse • Resource characteristics differ • Different hierarchical proportional-share schedulers for resources • Link scheduler: WF2Q • Disk scheduler: YFQ • CPU scheduler: MTR-LS • Network input: SRP
Hierarchical GPS Example [Figure: two link-sharing trees. Left, hierarchical proportional sharing: the server divides bandwidth 0.8/0.2 between company A and company B, and company A divides its share 0.5/0.5 between page 1 and page 2. Right, plain proportional sharing: the server divides bandwidth directly among company A page 1 (0.4), company A page 2 (0.4), and company B (0.2).]
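The arithmetic (ours) shows why the two trees allocate the same shares when all leaves are backlogged: under the hierarchical tree, page 1 of company A receives $0.8 \times 0.5 = 0.4$ of the link, page 2 likewise $0.4$, and company B $0.2$, exactly the flat tree's weights. The trees differ when a leaf goes idle: if company A's page 1 is idle, hierarchical GPS gives its share to page 2 (company A still receives 0.8 in total), whereas flat GPS redistributes it among all backlogged leaves, including company B.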
Schedulers • Hierarchical proportional sharing (GPS): let $\phi_n$ denote the weight of node $n$, $W_n(t_1,t_2)$ the service received by scheduler node $n$ (the total service of the descendant queue nodes of node $n$) in $[t_1,t_2]$, and $S_n$ the set of immediate descendant nodes of the parent of node $n$. • For any node $n$ continuously backlogged in $[t_1,t_2]$:
$$\frac{W_n(t_1,t_2)}{W_m(t_1,t_2)} \ge \frac{\phi_n}{\phi_m} \qquad \text{for every } m \in S_n$$
Link Aggregation • Need to scale bandwidth incrementally • Resource aggregation is emerging as a solution: • Grouping multiple resources into a single logical unit • QoS over such aggregated links? [Figure: flows fanning out across multiple link schedulers that feed one aggregated link.]
Multi-Server Model • Multi-Server Fair Queuing (MSFQ): a packetized algorithm for a system with N links, each with a bandwidth of r, that approximates a GPS system with a single link of bandwidth Nr [Figure: the reference model, a single GPS server of rate Nr, next to the packetized MSFQ scheduler with N servers of rate r each.]
Multi-Server Model (contd.) • Goals: • Guarantee bandwidth and packet delay bounds that are independent of the number of flows • Allow flows to arrive and depart dynamically • Be work-conserving • Algorithm: • When a server is idle, schedule the packet that would complete transmission earliest under a single-server GPS system with a bandwidth of Nr (see the sketch below) [Sigcomm 2001]
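The dispatch rule can be made concrete with a minimal C sketch. Everything here is hypothetical: we assume each packet's finishing time under the single-server GPS reference (rate Nr) has already been computed, and the structures and function names are ours, not Eclipse/BSD's.

/* Minimal MSFQ dispatch sketch (hypothetical, not the Eclipse/BSD code).
 * Assumes each packet's finishing time under a single GPS server of
 * rate N*r has been precomputed and stored in gps_finish. */
#include <stddef.h>

struct packet {
    double gps_finish;      /* finishing time under single-server GPS at rate N*r */
    size_t len;             /* packet length in bytes */
    struct packet *next;    /* per-flow FIFO link */
};

struct flow {
    double weight;          /* phi_i */
    struct packet *head;    /* head-of-line packet, NULL if the flow is empty */
};

/* Pick the packet MSFQ sends next on an idle server: among all
 * head-of-line packets, the one that would finish earliest in GPS. */
struct packet *msfq_next(struct flow *flows, int nflows)
{
    struct packet *best = NULL;
    for (int i = 0; i < nflows; i++) {
        struct packet *p = flows[i].head;
        if (p && (!best || p->gps_finish < best->gps_finish))
            best = p;
    }
    return best;   /* NULL when every flow is empty (all servers go idle) */
}

Whenever one of the N physical servers goes idle while some flow is backlogged, the scheduler transmits msfq_next()'s pick on that server, which is what keeps MSFQ work-conserving.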
MSFQ Preliminary Properties [Figure: example timelines. Top: arrivals a1, a2 under GPS, WFQ, and two-server MSFQ over time 0 to 4. Bottom: arrivals a1 through a7 under GPS and three-server MSFQ over time 0 to 10, with packets finishing out of GPS order.] Multi-server-specific properties: • Ordering: a pair of packets scheduled in the order of their GPS finishing times may complete in reverse order • GPS busy implies MSFQ busy, but the converse is not true • Non-coinciding busy periods • Work backlog?
MSFQ Properties [Figure: cumulative service curves. Top: GPS service vs. MSFQ service over time, marking the packet delay and the service discrepancy. Bottom: per-flow curves GPS_i vs. MSFQ_i, marking the per-flow service discrepancy.] • Maximum service discrepancy (buffer requirement) • Maximum packet delay • Maximum per-flow service discrepancy
Schedulers (contd.) • Disk scheduling with QoS • tradeoffs between QoS and total disk performance • driver queue management • queue depth • queue ordering • fragmentation • Hierarchical YFQ • CPU scheduling with QoS • lengths of CPU phases are not known a priori • cumulative throughput • Hierarchical MTR-LS
Eclipse’s Key Elements • Hierarchical, proportional share resource schedulers • Reservation, reservation file system (reservfs) • Tagging mechanism • Access and admission control, reservation domain
Reservations and Schedulers • (Resource) reservations • unit for QoS assignment • similar to the concept of a flow in packet scheduling • Hierarchical schedulers • a tree with two kinds of nodes: • scheduler nodes • queue nodes • each node corresponds to a reservation • Schedulers are dynamically reconfigurable
Web Server Example • Hosting two companies' web sites, each with two web pages [Figure: three identical reservation trees, one each for network bandwidth, disk bandwidth, and CPU cycles. Each tree splits 0.8/0.2 between company A and company B, and each company splits 0.5/0.5 between page 1 and page 2.]
Reservfs • We built the reservation file system • to create and manipulate reservations • to access and configure resource schedulers [Figure: applications (web server, video server) sit on the reservation file system, the application interface; a scheduler interface below it connects to the CPU scheduler (CPU 1), the link scheduler (Net 1, Net 2), and disk schedulers 1 and 2 (Disk 1, Disk 2, Disk 3).]
Reservfs • Hierarchical • Each reservation directory corresponds to a node at a scheduler • Each resource is represented by a reservation directory under /reserv [Figure: the /reserv directory with one subdirectory per resource: cpu, fxp0, fxp1, da0.]
Reservfs • Two types of reservation directories: • scheduler directories • queue directories • Scheduler directories are hierarchically expandable • Queue directories are not expandable
Reservfs • Scheduler directory: • share • newqueue • newreserv • special queue: q0 • Queue directory: • share • backlog [Figure: the /reserv tree (cpu, fxp0, fxp1, da0): each scheduler directory holds share, newqueue, newreserv, the special queue q0, and possibly further nodes such as q1 or the scheduler subdirectory r1; each queue directory holds share and backlog.]
Reservfs [Figure: the architecture diagram again, highlighting the application interface: the reservation file system between applications (web server, video server) and the schedulers.]
Reservfs API • Creation of a new queue/scheduler reservation: • fd = open(newqueue/newreserv, O_CREAT) • returns the fd of the newly created share file (see the sketch below)
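A minimal user-level sketch of both creation calls, assuming the /reserv layout from the previous slides; the paths, flags, and mode bits are illustrative, not taken from the Eclipse/BSD sources.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* New queue node: the kernel allocates the next free queue
     * directory (q1, q2, ...) under the fxp0 link scheduler. */
    int qfd = open("/reserv/fxp0/newqueue", O_CREAT | O_RDWR, 0600);
    if (qfd < 0) { perror("newqueue"); return 1; }

    /* New scheduler node: a hierarchically expandable subdirectory
     * (r0, r1, ...) with its own share, newqueue, newreserv, and q0. */
    int sfd = open("/reserv/fxp0/newreserv", O_CREAT | O_RDWR, 0600);
    if (sfd < 0) { perror("newreserv"); return 1; }

    /* Both fds refer to the new node's share file; QoS parameters
     * can be written to them directly (next slides). */
    close(qfd);
    close(sfd);
    return 0;
}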
Creating a Queue Reservation fd = open("newqueue", O_CREAT) [Figure: the /reserv tree before and after the call; a new queue directory q1, containing share and backlog, appears in da0 next to q0.]
Creating a Scheduler Reservation fd = open("newreserv", O_CREAT) [Figure: the /reserv tree before and after the call; a new scheduler directory r0, containing share, newqueue, newreserv, and the special queue q0, appears in da0.]
Reservfs API • Changing QoS parameters: • write a weight and min value to the share file (see the sketch below) • Getting QoS parameters: • read the share file • Getting/setting queue parameters: • read/write the backlog file
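A sketch of setting and reading back a queue's parameters; the "weight min" text format is inferred from the command-line demo that follows ("50 1000000"), and the path is illustrative.

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/reserv/fxp0/q1/share", O_RDWR);
    if (fd < 0) { perror("open share"); return 1; }

    /* Set QoS parameters: weight 50, min 1000000 (format as in the demo). */
    const char *params = "50 1000000";
    if (write(fd, params, strlen(params)) < 0)
        perror("write share");

    /* Read the parameters back. */
    char buf[64];
    lseek(fd, 0, SEEK_SET);
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("share: %s\n", buf);
    }
    close(fd);
    return 0;
}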
Reservfs API Command line output:
killerbee$ cd /reserv
killerbee$ ls -al
total 5
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:37 .
drwxr-xr-x 20 root  wheel  512 Sep 12 21:54 ..
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:37 cpu
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:37 fxp0
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:37 fxp1
killerbee$ cd fxp0
killerbee$ ls -alR
total 6
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 ..
-rw-------  1 root  wheel    1 Sep 15 11:39 newqueue
-rw-------  1 root  wheel    1 Sep 15 11:39 newreserv
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 q0
-r--------  1 root  wheel    1 Sep 15 11:39 share
./q0:
total 4
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 ..
-rw-------  1 root  wheel    1 Sep 15 11:39 backlog
-rw-------  1 root  wheel    1 Sep 15 11:39 share
Reservfs API
killerbee$ cd r0
killerbee$ ls -al
total 6
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 ..
-rw-------  1 root  wheel    1 Sep 15 11:39 newqueue
-rw-------  1 root  wheel    1 Sep 15 11:39 newreserv
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 q0
-r--------  1 root  wheel    1 Sep 15 11:39 share
killerbee$ echo "50 1000000" > newqueue
killerbee$ ls -al
total 6
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 ..
-rw-------  1 root  wheel    1 Sep 15 11:39 newqueue
-rw-------  1 root  wheel    1 Sep 15 11:39 newreserv
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 q0
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 q1
-r--------  1 root  wheel    1 Sep 15 11:39 share
killerbee$ cd q1
killerbee$ ls -al
total 4
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 .
dr-xr-xr-x  0 root  wheel  512 Sep 15 11:39 ..
-rw-------  1 root  wheel    1 Sep 15 11:39 share
-rw-------  1 root  wheel    1 Sep 15 11:39 backlog
killerbee$ cat share
50 1000000
killerbee$
Reservfs [Figure: the architecture diagram again, now highlighting the scheduler interface between reservfs and the CPU, link, and disk schedulers.]
Reservfs Scheduler Interface • Schedulers register by providing the following interface routines via reservfs_register(): • init(priv) • create(priv, parent, type) • start(priv, parent, type) • delete(priv, node) • get/set(priv, node, values, type)
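One plausible shape for this registration interface, sketched in C: only the routine names and the reservfs_register() entry point come from the slide; the struct, parameter types, and comments are our assumptions.

/* Hypothetical operations vector for a scheduler registering with
 * reservfs; parameter types are guesses ("priv" = scheduler-private
 * state, "node"/"parent" = scheduler tree nodes). Note that "delete"
 * is a legal identifier in C. */
struct reservfs_sched_ops {
    int (*init)(void *priv);                            /* one-time setup */
    int (*create)(void *priv, void *parent, int type);  /* add a queue or scheduler node */
    int (*start)(void *priv, void *parent, int type);   /* activate a created node */
    int (*delete)(void *priv, void *node);              /* remove a node */
    int (*get)(void *priv, void *node, long *values, int type); /* read share/backlog values */
    int (*set)(void *priv, void *node, long *values, int type); /* write share/backlog values */
};

/* A scheduler would then announce itself to reservfs roughly as: */
int reservfs_register(const char *name, struct reservfs_sched_ops *ops, void *priv);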
Reservfs Implementation • Built via the vnode/vfs interface • A reserv{} structure represents each reservfs file • A reserv{} representing a directory contains a pointer to the corresponding node at the scheduler • Scheduler independent • Implements a garbage-collection mechanism
Talk Outline • Introduction • Schedulers • Reservation File System (reservfs) • Tagging • Web Server Experiments • Access Control and Profiles • Eclipse/BSD Status • Related Work • Future Work
Tagging • A request arriving at a scheduler must be associated with the appropriate reservation • Each request is tagged with a pointer to a queue node • mbuf{}, buf{} and proc{} are augmented • How is a request tagged?
Tagging (contd.) • For a file, its file descriptor is tagged with a disk reservation • For a connected socket, its file descriptor is tagged with a network reservation • For unconnected sockets, we provide a late tagging mechanism • Each process is tagged with a cpu reservation • We associate reservations with references to objects
Default List of a Process • Default reservations of a process, one for each resource • A list of tags (pointers to queue directories) • Used when a tag is otherwise not specified • Two new files are added for each process pid in /proc/pid • /proc/pid/default to represent the default list • /proc/pid/cdefault to represent the child default list
Default List of a Process (contd.) • Reading these files returns the names of the default queue directories, e.g.: /reserv/cpu/q1 /reserv/fxp0/r2/q1 /reserv/da0/r1/q3 • A process with the appropriate access rights can change the entries of the default files (see the sketch below)
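A small sketch of inspecting a process's own default list through procfs; the file format (one queue-directory path per resource) follows the example above, and the parsing is illustrative.

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char path[64], buf[256];

    /* Each process has /proc/<pid>/default (and cdefault for children). */
    snprintf(path, sizeof(path), "/proc/%d/default", (int)getpid());

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open default"); return 1; }

    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        /* e.g. /reserv/cpu/q1 /reserv/fxp0/r2/q1 /reserv/da0/r1/q3 */
        fputs(buf, stdout);
    }
    close(fd);
    return 0;
}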
Implicit Tagging • The file descriptor returned by open(), accept() or connect() is automatically tagged with the default • The tag of the file descriptor of an unconnected socket is set to the default at sendto() and sendmsg() • When a process forks, the child process is tagged with the default cpu reservation
Explicit Tagging • The tag of a file descriptor can be set/read with new commands to fcntl(): • F_SET_RES • F_GET_RES • A new system call chcpures() to change the cpu reservation of a process
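A sketch of explicit tagging. Only the fcntl() command names F_SET_RES and F_GET_RES come from the slide; the calling convention assumed here (passing an open fd of the target queue's share file as the fcntl argument) is a guess, as is the path.

#include <fcntl.h>      /* F_SET_RES would come from Eclipse/BSD's headers */
#include <unistd.h>

/* Retag a connected socket onto a specific network reservation. */
int tag_socket(int sock)
{
    /* Target queue chosen for illustration. */
    int res = open("/reserv/fxp0/r2/q1/share", O_RDWR);
    if (res < 0)
        return -1;

    int rc = fcntl(sock, F_SET_RES, res);  /* assumed calling convention */
    close(res);
    return rc;
}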
Reservation Domains • Permissions of a process to use, create and manipulate reservations • The reservation domain of a process is independent of its protection domain
Reservations and Reservation Domains [Figure: three reservation trees, one each for network bandwidth, disk bandwidth, and CPU cycles. Each splits 0.8/0.2 between reserv A and reserv B, and each of those splits 0.5/0.5 into reserv 1 and reserv 2. Reservation domain 1 groups the reserv A subtrees across the three resources; reservation domain 2 groups the reserv B subtrees.]
Reservfs Garbage Collection • Based on reference counts • every application using a specific node adds a reference to it (to the vnode) • Triggered by the vnode layer • when the last application finishes using a node, the node is garbage collected • an fcntl() is available to keep a node alive even when no references to it exist
SRP Input Processing • Demultiplexes incoming packets • before network and higher-level protocol processing • Unprocessed input queue per socket • Processes input protocols in the context of the receiving process • Drops packets when the per-socket queue is full • Avoids receive livelock
Talk Outline • Introduction • Schedulers • Reservation File System (reservfs) • Tagging • Web Server Experiments • Access Control and Profiles • Eclipse/BSD Status • Related Work • Future Work
QoS Support for Web Server • Virtual hosting with the Apache server: • a separate Apache server for each virtual host • a single Apache server for all virtual hosts • Eclipse/BSD isolates and differentiates the performance of virtual hosts: • multiple Apache servers: implicit tagging • single Apache server: explicit tagging • We implemented an Apache module for explicit tagging (see the sketch below)
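The heart of such a module, sketched under the same assumed fcntl() convention as before: map the requested virtual host to its network reservation and retag the connection's descriptor. The function, paths, and vhost names are all illustrative; only the existence of an explicit-tagging Apache module is from the slide.

#include <fcntl.h>      /* F_SET_RES from Eclipse/BSD's headers */
#include <string.h>
#include <unistd.h>

/* After accept(), retag the connection onto the reservation of the
 * virtual host named in the request (illustrative mapping). */
static int tag_for_vhost(int conn_fd, const char *vhost)
{
    const char *queue =
        strcmp(vhost, "companyA") == 0 ? "/reserv/fxp0/r1/q0/share"
                                       : "/reserv/fxp0/r2/q0/share";
    int res = open(queue, O_RDWR);
    if (res < 0)
        return -1;

    int rc = fcntl(conn_fd, F_SET_RES, res);  /* assumed convention */
    close(res);
    return rc;
}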
Experimental Setup • Apache web server: • a multi-process server • (pre)spawns helper processes • a process handles one request at a time • each process calls accept() to service the next connection request • HTTP clients run on five different machines • The server runs FreeBSD 2.2.8 or Eclipse/BSD 2.2.8 on a PC (266 MHz Pentium Pro, 64 MB RAM, 9 GB Seagate ST39173W fast wide SCSI disk) • Machines are connected by a 10/100 Mbps Ethernet switch
Experiments • Hosting two sites with two servers [Figure: the /reserv trees for cpu, fxp0, and da0; under each resource, queue q1 belongs to the reservation domain of server 1 and queue q2 to the reservation domain of server 2, alongside the default queue q0.]