Towards High-Availability for IP Telephony using Virtual Machines

Devdutt Patnaik, Ashish Bijlani and Vishal K Singh

Outline • Virtualization • High Availability (HA) in Virtualized Platforms • Xen and Remus (HA solution for Xen) • Remus applied to IP Telephony (IPT) applications • Scalability and Reliability of IPT applications using Virtualization • Experimental Results • Conclusion

Virtualization and its Benefits • Abstraction layer (hypervisor) between the physical hardware and the OS • A single physical machine can host multiple virtual machines, each running a different OS + application stack • VMMs • Xen, VMware, Microsoft Hyper-V • Benefits • Server consolidation • Green computing • Cost savings – space and power • High-availability and reliability solutions; ease of upgrades with near-zero downtime

Virtualized hosting for IP Telephony • Virtualized hosting for IP Telephony is already available • Avaya, Cisco, Asterisk, etc. • IP Telephony in the cloud • Scalability: ability to elastically add or remove servers while supporting High Availability for all servers • Reliability: protection against hardware and software failures • HA features in virtualization platforms • Memory-state checkpointing

Virtualization and High Availability • Seamless failover: efficient and transparent migration of a VM to another physical machine • Live migration with very short downtime • Minimal or no impact on client nodes • Asynchronous checkpointing • Continuously syncs state between the primary and secondary hosts • We use • Remus: a High Availability solution for Xen
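The asynchronous checkpointing described above is incremental: at each epoch only the memory pages dirtied since the previous checkpoint need to be copied to the secondary. The Python sketch below illustrates that idea only. It models guest memory as a dictionary of page number to contents, the class and method names are hypothetical, the transfer is a plain function call rather than an asynchronous network send, and CPU and device state (which a real hypervisor must also capture) are ignored.

```python
from typing import Dict, Set


class PrimaryVM:
    """Toy model of a guest whose memory is a dict of page number -> contents."""

    def __init__(self, num_pages: int) -> None:
        self.memory: Dict[int, bytes] = {n: b"\x00" for n in range(num_pages)}
        # Everything counts as dirty before the first checkpoint (full copy).
        self.dirty: Set[int] = set(self.memory)

    def write(self, page: int, data: bytes) -> None:
        self.memory[page] = data
        self.dirty.add(page)  # stand-in for hypervisor write tracking

    def take_checkpoint(self) -> Dict[int, bytes]:
        """Copy only the dirty pages, then clear the dirty set for the next epoch."""
        delta = {page: self.memory[page] for page in sorted(self.dirty)}
        self.dirty.clear()
        return delta


class BackupVM:
    """Paused replica: it only applies checkpoints and never executes."""

    def __init__(self) -> None:
        self.memory: Dict[int, bytes] = {}

    def apply(self, delta: Dict[int, bytes]) -> None:
        self.memory.update(delta)


primary = PrimaryVM(num_pages=4)
backup = BackupVM()
backup.apply(primary.take_checkpoint())  # epoch 0: all 4 pages
primary.write(2, b"REGISTER state")
delta = primary.take_checkpoint()        # epoch 1: only page 2
backup.apply(delta)
print(len(delta), backup.memory == primary.memory)  # 1 True
```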

Remus on Xen • Remus is a High Availability solution available on the Xen VMM • Remus uses continuous checkpointing and keeps the client's view of network state consistent • The secondary machine hosts a paused replica of the primary VM • Uses a heartbeat mechanism • If the secondary stops receiving the periodic heartbeat, it unpauses the backup VM • The heartbeat timeout is configurable Fig. 1 Image: http://osnet.cs.nchu.edu.tw/powpoint/seminar/2008/Remus.pdf
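Heartbeat-driven failover on the backup host amounts to a deadline check. A minimal sketch, assuming a single monitor on the secondary and a hypothetical `on_failover` callback standing in for whatever actually unpauses the backup VM; the clock is injectable so the behaviour can be checked deterministically.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Runs on the secondary host and triggers failover when heartbeats stop.

    timeout_s corresponds to the configurable heartbeat timeout; on_failover is
    a hypothetical hook that would unpause the backup VM.
    """

    def __init__(self, timeout_s: float, on_failover: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.on_failover = on_failover
        self.clock = clock
        self.last_beat = clock()
        self.failed_over = False

    def heartbeat(self) -> None:
        """Record the arrival time of a heartbeat from the primary."""
        self.last_beat = self.clock()

    def poll(self) -> bool:
        """Fire failover at most once, when the timeout has elapsed since the last beat."""
        if not self.failed_over and self.clock() - self.last_beat > self.timeout_s:
            self.failed_over = True
            self.on_failover()
        return self.failed_over


# Usage with a fake clock (times in seconds).
now = [0.0]
events = []
mon = HeartbeatMonitor(timeout_s=0.5, on_failover=lambda: events.append("unpause backup"),
                       clock=lambda: now[0])
now[0] = 0.3
mon.heartbeat()
now[0] = 0.7
print(mon.poll())  # False: last heartbeat 0.4 s ago
now[0] = 0.9
print(mon.poll())  # True: 0.6 s without a heartbeat
print(events)      # ['unpause backup']
```

A short timeout detects failure sooner but risks a spurious failover when heartbeats are merely delayed, which is why the timeout is worth exposing as a setting.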

Remus on Xen (contd.) • Remus modes of operation • Net Mode – highly reliable • No-Net Mode – better performance, with negligible packet loss in case of failure • Tunable for reliability vs. performance Fig. 2: Disk writes and network writes • Net Mode: buffers outgoing network packets until the execution state is synced with the backup VM (on the secondary host) • Reliability at the cost of performance Image: http://osnet.cs.nchu.edu.tw/powpoint/seminar/2008/Remus.pdf
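The difference between the two modes comes down to when outgoing packets may leave the primary. The sketch below models only that ordering rule, not Remus's actual code, and its names are hypothetical. At each checkpoint it seals the output produced during the epoch and releases it only when the backup acknowledges that checkpoint; in No-Net mode packets go out immediately.

```python
from collections import deque
from typing import Deque, List


class OutputBuffer:
    """Outgoing-packet handling on the primary in Net vs. No-Net mode."""

    def __init__(self, net_mode: bool) -> None:
        self.net_mode = net_mode
        self.current: List[bytes] = []                   # output of the running epoch
        self.awaiting_ack: Deque[List[bytes]] = deque()  # sealed epochs, oldest first
        self.released: List[bytes] = []                  # what actually reached the network

    def send(self, packet: bytes) -> None:
        if self.net_mode:
            self.current.append(packet)   # hold until its checkpoint is safe
        else:
            self.released.append(packet)  # No-Net: send right away

    def start_checkpoint(self) -> None:
        """Epoch boundary: output produced so far belongs to this checkpoint."""
        if self.net_mode:
            self.awaiting_ack.append(self.current)
            self.current = []

    def checkpoint_acked(self) -> None:
        """The backup has the oldest outstanding checkpoint, so release its output."""
        if self.net_mode and self.awaiting_ack:
            self.released.extend(self.awaiting_ack.popleft())


for mode in (True, False):
    buf = OutputBuffer(net_mode=mode)
    buf.send(b"200 OK")
    buf.start_checkpoint()
    buf.send(b"RTP #1")  # produced during the next epoch
    before = len(buf.released)
    buf.checkpoint_acked()
    print(mode, before, len(buf.released))
# True 0 1
# False 2 2
```

In Net mode the `200 OK` is held back until the acknowledgement, so every response waits at least until the next checkpoint completes. That added delay is one plausible reason latency-sensitive RTP media fares worse in Net mode than signaling does.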

Remus applied to IP Telephony: Scale with Reliability • Our work using HA in Xen extends the “architecture for fail-over and load sharing for IP Telephony” proposed by Kundan Singh et al. • Challenges: • Overhead of virtualization on IP Telephony performance • A co-hosted (co-located) media server causes interference because of its heavy I/O workload

Reliability and Scalability using Virtual Machines • Scalability using a load balancer (LB) • The LB can elastically add more VMs as demand grows • Reliability using Remus in Xen (Figure: stateless load balancer; reliability architecture using virtual machines) • For every primary VM there is a backup VM in the paused state • Since the backup VM is paused, other running VMs can be placed on the same physical machine • Provides an N-to-M elastic backup model (M backups for N primaries)
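The slide does not give a placement algorithm, so the following is only an illustrative greedy sketch under stated assumptions. Each host can run a fixed, hypothetical number of active primaries. A paused backup consumes no CPU slot. A backup must never share a host with its own primary. Host and VM names are made up.

```python
from typing import Dict, List, Tuple


def place(vms: List[str], hosts: List[str], slots_per_host: int) -> Dict[str, Tuple[str, str]]:
    """Greedily place each primary VM and its paused backup.

    Assumptions (illustrative, not from the slides): a host runs at most
    slots_per_host active primaries; paused backups use no slot but must not
    share a host with their own primary. Returns {vm: (primary_host, backup_host)}.
    """
    if len(hosts) < 2:
        raise ValueError("need at least two hosts to separate a backup from its primary")
    running = {h: 0 for h in hosts}
    backups = {h: 0 for h in hosts}
    placement: Dict[str, Tuple[str, str]] = {}
    for vm in vms:
        # Primary: least-loaded host that still has a free slot.
        candidates = [h for h in hosts if running[h] < slots_per_host]
        if not candidates:
            raise RuntimeError(f"no free slot for {vm}; add another host")
        primary = min(candidates, key=lambda h: running[h])
        running[primary] += 1
        # Backup: host with the fewest backups, excluding the primary's host.
        backup = min((h for h in hosts if h != primary), key=lambda h: backups[h])
        backups[backup] += 1
        placement[vm] = (primary, backup)
    return placement


print(place(["sip1", "sip2", "media1"], ["hostA", "hostB"], slots_per_host=2))
# {'sip1': ('hostA', 'hostB'), 'sip2': ('hostB', 'hostA'), 'media1': ('hostA', 'hostB')}
```

The `RuntimeError` raised when no slot is free marks the point where an elastic deployment would add another host.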

Reliability and Scalability using Virtual Machines (contd.) • Reliability • Provided by Xen + Remus • Failure of the primary starts execution of the secondary, which takes over the primary's IP address • Clients continue unaffected • Signaling and media server: • Co-located on the same VM • allows better utilization • no overhead of inter-VM communication • Placed on different VMs • elastic scaling of media and signaling VMs
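The slides do not say how the IP address takeover is announced to the network. On an Ethernet LAN a common technique is a gratuitous ARP, which lets neighbours update their ARP caches and switches relearn where the sender's MAC address lives. The sketch below only builds such a frame: an ARP request whose sender and target IP are both the taken-over address. The MAC and IP values are placeholders. Actually sending the frame would need a raw `AF_PACKET` socket on Linux with root privileges, which is not shown.

```python
import socket
import struct


def gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build a broadcast gratuitous ARP frame announcing that `ip` is at `mac`."""
    mac_b = bytes.fromhex(mac.replace(":", ""))
    if len(mac_b) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip_b = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    eth = broadcast + mac_b + struct.pack("!H", 0x0806)
    # ARP header: hardware type Ethernet, protocol IPv4, sizes 6/4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, then target MAC (zeros) and target IP equal to the sender IP.
    arp += mac_b + ip_b + b"\x00" * 6 + ip_b
    return eth + arp


frame = gratuitous_arp("00:16:3e:12:34:56", "192.0.2.10")
print(len(frame), frame.hex()[:12])  # 42 ffffffffffff
```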

Studying Performance Implications • Experimental setup • Primary/backup servers • Intel Core 2 Quad processors, 2.5 GHz, 8 GB RAM, 4 MB L2 cache • Hypervisor – Xen 3.2.1 + Remus • Default Credit Scheduler configuration • Guest OS: paravirtualized Linux 2.6.18 • IP Telephony workload • Workload modeled on SIPStone • Measured % success of registrations during failover • Used UDP and TCP as transport for registrations • OpenSIPS as the SIP server • RTPproxy as the media server • SIPp for generating signaling and media traffic
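In these experiments SIPp generated the REGISTER traffic from scenario files that the slides do not include. To show what the signaling workload consists of, the sketch below builds a minimal REGISTER request with the headers RFC 3261 requires. The user, domain and contact address are hypothetical. The `transport` argument shows where UDP versus TCP appears in the message (the `Via` header and the Contact URI parameter). With tcp_1 many such requests share one TCP connection; with tcp_n each one opens a new connection.

```python
import uuid


def build_register(user: str, domain: str, contact_host: str, contact_port: int,
                   transport: str = "UDP", cseq: int = 1, expires: int = 3600) -> str:
    """Build a minimal SIP REGISTER request with no body."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 branch prefix ("magic cookie")
    tag = uuid.uuid4().hex[:8]
    call_id = f"{uuid.uuid4().hex}@{contact_host}"
    lines = [
        f"REGISTER sip:{domain} SIP/2.0",
        f"Via: SIP/2.0/{transport} {contact_host}:{contact_port};branch={branch}",
        "Max-Forwards: 70",
        f"To: <sip:{user}@{domain}>",
        f"From: <sip:{user}@{domain}>;tag={tag}",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} REGISTER",
        f"Contact: <sip:{user}@{contact_host}:{contact_port};transport={transport.lower()}>",
        f"Expires: {expires}",
        "Content-Length: 0",
    ]
    # SIP lines end in CRLF; an empty line terminates the header section.
    return "\r\n".join(lines) + "\r\n\r\n"


print(build_register("alice", "example.com", "192.0.2.20", 5060, transport="TCP"))
```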

Analysis and Results: Signaling • Both the guest VM and Domain 0 have high CPU utilization with tcp_n (a new TCP connection for each REGISTER) • UDP and tcp_1 (one TCP connection for all REGISTERs) have similar overhead (Figure: registration overhead with Remus Net mode, CPU utilization in guest VM and dom0; udp = UDP transport, tcp_1 = same connection for all requests, tcp_n = new connection for each request)

Analysis and Results: Signaling • CPU overhead increases proportionately with signaling load • Dom0 incurs significant overhead from checkpointing • Net Mode gives good results for signaling • A failure was induced at a load of 1400 registrations/sec • All registrations completed (100%) after failover to the backup
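Registration success during failover can be reported over the whole run or over a window around the induced failure, where any loss would concentrate. The following is a small sketch, assuming per-request records of (send time in seconds, whether a 200 OK came back). The function name, record format and window size are illustrative, and the data in the usage lines is synthetic, not taken from the experiments.

```python
from typing import List, Tuple


def registration_success(results: List[Tuple[float, bool]], failover_at: float,
                         window_s: float) -> Tuple[float, float]:
    """Return (% success overall, % success for REGISTERs sent within
    +/- window_s of the induced failure)."""
    def pct(rows: List[Tuple[float, bool]]) -> float:
        if not rows:
            raise ValueError("no registrations in range")
        return 100.0 * sum(ok for _, ok in rows) / len(rows)

    near = [r for r in results if abs(r[0] - failover_at) <= window_s]
    return pct(results), pct(near)


# Synthetic run: 10 s at 1400 REGISTER/s, with 7 failures injected at t = 5 s.
results = [(i / 1400, not (7000 <= i < 7007)) for i in range(14000)]
overall, near = registration_success(results, failover_at=5.0, window_s=0.5)
print(f"{overall:.2f}% overall, {near:.2f}% within 0.5 s of the failure")
```

Reporting both figures guards against a small, brief outage being hidden by a long run's overall average.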

Analysis and Results: Media • Media loads in Net Mode give poor results • Media in No-Net mode performs well even with 400 streams, with 2% packet loss • Loss can be reduced further by tuning scheduler parameters • 100% of calls in progress failed over successfully during the media experiments (Figures: Net Mode and No-Net Mode, each with 100, 200, 400, 600 and 800 streams)
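The slides report loss percentages per media load but do not say how loss was measured. One standard method, from RFC 3550 (Appendix A.3), computes the number of packets expected from the highest sequence number seen minus the first, then subtracts the packets received. A simplified per-stream sketch follows. It extends the 16-bit RTP sequence number across wrap-around but, unlike the RFC's full algorithm, does not handle duplicates, source restarts, or reordering that straddles a wrap.

```python
from typing import Iterable, Optional


def rtp_loss_percent(seqs: Iterable[int]) -> float:
    """Packet loss (%) for one RTP stream, given received sequence numbers in arrival order."""
    received = 0
    cycles = 0                     # multiples of 65536 added by wrap-arounds
    base: Optional[int] = None     # first sequence number seen
    max_seq: Optional[int] = None  # highest sequence number in the current cycle
    for seq in seqs:
        if base is None:
            base = max_seq = seq
        elif seq < max_seq and max_seq - seq > 0x8000:
            cycles += 0x10000      # large backwards jump: counter wrapped 65535 -> 0
            max_seq = seq
        elif seq > max_seq:
            max_seq = seq          # small backwards jumps are late packets; keep max
        received += 1
    if base is None:
        raise ValueError("no packets received")
    expected = cycles + max_seq - base + 1
    return 100.0 * (expected - received) / expected


print(rtp_loss_percent([s for s in range(100) if s not in (10, 50)]))  # 2.0
print(rtp_loss_percent([65534, 65535, 1, 2]))                          # 20.0 (seq 0 lost)
```

A per-load figure such as the one reported for 400 streams would then be an aggregate of these per-stream values.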

Conclusion • Using No-Net mode for media streams balances performance (loss and delay) against reliability (failover) while still migrating 100% of calls in progress (using TCP), which is a significant result • Net Mode is a good configuration for signaling, with 100% registration completion across failover • No-Net mode for the media server deployment significantly improves performance: loss and delay are reduced significantly • While the No-Net configuration performs better for media, it may not guarantee call completion during failover for signaling • Migration of user registration and call setup operations was 100% successful

Contributions • Extended the load-sharing and failover architecture using virtualization • Proposed the use of high-availability features in virtualized platforms to achieve reliability in IP Telephony • Proposed a placement scheme for signaling and media applications for scale (elasticity) and efficiency (utilization) • Systematic evaluation of the overheads involved in using virtualization for IP Telephony applications • Demonstrated that High Availability using virtual machines can be deployed for medium-scale IP Telephony infrastructure

Future Work • More detailed analysis of overheads • Overhead of checkpointing in the virtualization platform • Overhead of I/O in Domain 0 • Propose solutions to improve performance • Improve I/O handling in the Xen VMM • Propose a better VM placement algorithm for IP Telephony applications • Use fine-grained overhead measurements for resource allocation • Consider I/O (media) vs. memory (signaling state replication) optimizations • Elasticity with co-location of media and signaling servers on the same VM

Questions • vs2140@columbia.edu
