SPUD A Distributed High Performance Publish-Subscribe Cluster

Uriel Peled and Tal Kol
Guided by Edward Bortnikov
Software Systems Laboratory, Faculty of Electrical Engineering, Technion

Project Goal
• Design and implement a general-purpose Publish-Subscribe server
• Push traditional implementations to meet global-scale performance demands
  • 1 million concurrent clients
  • Millions of concurrent topics
  • High transaction rate
• Demonstrate server abilities with a fun client application

What is Pub/Sub?
[Diagram: one client publishes "accident in hashalom" to topic://traffic-jams/ayalon; a client subscribed to that topic receives "accident in hashalom".]

What Can We Do With It? Collaborative Web Browsing
[Diagram with two "others:" labels.]

What Can We Do With It? Instant Messaging
[Diagram: the message "Hi buddy!" shown twice.]

Seems Easy To Implement, But…
• “I’m behind a NAT, I can’t connect!”
  • Not all client setups are server friendly
• “Server is too busy, try again later?!”
  • 1 million concurrent clients is simply too much
• “The server is so slow!!!”
  • Service time grows exponentially with load
• “A server crashed, everything is lost!”
  • Single points of failure will eventually fail

Naïve Implementation (example 1)
• Simple UDP for client-server communication
• No need for sessions since we send messages
• Very low cost-per-client
• Sounds perfect?
[Diagram label: NAT]

NAT Traversal
• UDP hole punching
  • NAT will accept a UDP reply for a short window
  • Our measurements: 15-30 seconds
  • Keep UDP pinging from each client every 15s
• Days-long TCP sessions
  • NAT remembers current sessions for replies
  • If WWW works, we should work
  • Dramatically increases cost-per-client
  • Our research: all IMs do exactly this
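The UDP keep-alive rule can be sketched as a small scheduler. This is a minimal sketch under stated assumptions, not the SPUD client: the class `KeepAlive`, its methods, and the `simulate` helper are hypothetical names. Only the 15-second ping period and the 15-30 second NAT window come from the measurements listed on this slide.

```python
class KeepAlive:
    """Decides when a client should send its next UDP ping (hypothetical helper)."""

    def __init__(self, interval_s=15.0):
        # 15 s is the ping period from the slide; it sits at the low end of
        # the measured 15-30 s window in which the NAT still accepts replies.
        self.interval_s = interval_s
        self.last_sent = None

    def due(self, now):
        """True if no ping was sent yet or the interval has elapsed."""
        return self.last_sent is None or now - self.last_sent >= self.interval_s

    def mark_sent(self, now):
        """Record the time of the ping that was just sent."""
        self.last_sent = now


def simulate(duration_s, step_s=1.0):
    """Return the times (in seconds) at which pings would be sent."""
    ka = KeepAlive()
    sent = []
    t = 0.0
    while t <= duration_s:
        if ka.due(t):
            sent.append(t)      # a real client would send a UDP datagram here
            ka.mark_sent(t)
        t += step_s
    return sent
```

Timestamps are passed in rather than read from a clock, so the policy can be checked without waiting for real time to pass.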

Naïve Implementation (example 2)
• Blocking I/O with one thread per client
• Basic model for most servers (Java default)
• Traditional UNIX – fork for every client
• Sounds perfect?
[Diagram: three groups of 500 clients]

Network I/O Internals
• Blocking I/O – one thread per client
  • 2MB stack, 1GB virtual space enough for only 512 threads (!)
• Non-blocking I/O – select
  • Linear fd searches are very slow
• Asynchronous I/O – completion ports
  • Thread pool to handle request completion
  • Our measurements: 30,000 concurrent clients!
• What is the bottleneck?
  • Number of locked pages (zero-byte receives)
  • TCP/IP kernel driver non-paged pool allocations
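The 512 figure for the thread-per-client model follows from simple division. The sketch below only reproduces that arithmetic; the function name `max_threads` is an assumption, and the model ignores everything else that occupies the address space, so the real limit would be lower.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def max_threads(virtual_space_bytes, stack_bytes):
    """Upper bound on threads if each one reserves a full stack."""
    return virtual_space_bytes // stack_bytes


# Slide figures: 1 GB of virtual address space, 2 MB stack per thread.
limit = max_threads(1 * GIB, 2 * MIB)
print(limit)  # 512
```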

Scalability
• Scale up
  • Buy a bigger box
• Scale out
  • Buy more boxes
• Which one to do?
  • Both!
  • Push each box to its hardware maximum
  • Thousands of servers are impractical
  • Add relevant boxes as load increases
  • The Google way (cheap PC server farms)

Identify Our Load Factors
• Concurrent TCP clients
  • Scale up: async I/O, 0-byte recv, larger non-paged pool (NPP)
  • Scale out: dedicate boxes to handle clients => Connection Server (CS)
• High transaction throughput (topic load)
  • Scale up: software optimizations
  • Scale out: dedicate boxes to handle topics => Topic Server (TS)
• Design the cluster accordingly

Network Architecture

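Client Load Balancing
[Diagram: a client requests a CS from the CLB (client load balancer). The CLB balances load by user location and CS client load, and the client is given CS2. The client then logs in, publishes, and subscribes through CS2. Servers shown: CS1, CS2, CS3, TS1, TS2.]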
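The slide names two balancing criteria, user location and CS client load, but not how they are combined. The sketch below shows one plausible policy as an assumption: prefer Connection Servers in the user's region, then take the least-loaded one. The `ConnectionServer` type, the `region` field, the `pick_cs` function, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConnectionServer:
    name: str
    region: str    # hypothetical stand-in for "user location"
    clients: int   # current number of connected clients


def pick_cs(servers, user_region):
    """Prefer servers in the user's region, then choose the least loaded."""
    local = [s for s in servers if s.region == user_region]
    candidates = local or servers   # no local server: fall back to all of them
    return min(candidates, key=lambda s: s.clients)


# Made-up loads for illustration only.
servers = [
    ConnectionServer("CS1", "north", 900),
    ConnectionServer("CS2", "north", 400),
    ConnectionServer("CS3", "south", 100),
]
```

With these sample values, a user in "north" would be given CS2 even though CS3 carries fewer clients, because location is applied first in this sketch.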

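Topic Load Balancing: Static
[Diagram: a CS receives "subscribe: traffic". The request is labeled "subscribe: 923481 % 4 = 1" and is routed to TS1, one of four topic servers (TS0, TS1, TS2, TS3) in Room 0.]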
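Static topic balancing as drawn on this slide amounts to hashing the topic and taking the result modulo the number of topic servers. The slide does not say which hash function is used, so `zlib.crc32` below is an assumption chosen because it is stable across runs; `topic_server_index` is a hypothetical name.

```python
import zlib


def topic_server_index(topic, num_servers):
    """Map a topic name to a TS index with hash(topic) mod N."""
    # crc32 is an assumed hash; Python's built-in hash() for str is
    # randomized per process, so it would not give a stable mapping.
    return zlib.crc32(topic.encode("utf-8")) % num_servers


# The slide's own numbers: a hash value of 923481 over 4 topic servers.
slide_index = 923481 % 4
print(slide_index)  # 1, i.e. TS1
```

Because every CS computes the same index for the same topic, no coordination is needed to agree on where a topic lives. The trade-off is that changing the number of servers remaps most topics.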

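Topic Load Balancing: Dynamic
[Diagram: a CS sends a subscribe that passes through Room 0, Room 1, and Room 2, each shown with a TS1. The request carries a load table that fills in as it travels: "R0: 345K R1: ? R2: ?", then "R0: 345K R1: 278K R2: ?", then "R0: 345K R1: 278K R2: 301K". It ends as "subscribe R1: 278K", and Room 1 handles it.]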
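One reading of this diagram is that the subscribe request collects each room's load as it passes and is then handled by the least-loaded room. The sketch below follows that reading, which is an interpretation rather than a documented protocol. The function names are hypothetical, and "345K" is taken to mean 345,000 of some unspecified load unit.

```python
def collect_loads(rooms):
    """Visit rooms in order, recording each room's load as the request passes."""
    seen = {}
    for name, load in rooms:
        seen[name] = load
    return seen


def pick_room(loads):
    """Choose the room with the smallest recorded load."""
    return min(loads, key=loads.get)


# Loads from the slide, with K read as thousands (an assumption).
rooms = [("R0", 345_000), ("R1", 278_000), ("R2", 301_000)]
loads = collect_loads(rooms)
chosen = pick_room(loads)
print(chosen)  # R1
```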

Performance Pitfalls
• Data Copies
  • Single instance – reference counting (REF_BLOCK)
  • Multi-buffer messages (MESSAGE: header, body, tail)
• Context Switches
  • Flexible module exec foundation (MODULE)
  • Thread pools sized to the number of processors
• Memory Allocation
  • MM: custom memory pools (POOL, POOL_BLOCK)
  • Fine-grained locking, pre-allocation, batching, single-size
• Lock Contention
  • EVENT, MUTEX, RW_MUTEX, interlocked API
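The data-copy items can be illustrated with a message that keeps header, body, and tail as separate buffers, so one body instance is shared by every outgoing notification. This is a sketch of the idea only, not the SPUD `MESSAGE` or `REF_BLOCK` classes: the `Message` class and `fan_out` function are hypothetical, and Python's own object references stand in for explicit reference counting.

```python
class Message:
    """A message kept as separate buffers instead of one concatenated copy."""

    def __init__(self, header, body, tail=b""):
        self.buffers = (header, body, tail)

    def __len__(self):
        return sum(len(b) for b in self.buffers)


def fan_out(body, recipients):
    """Build one message per recipient; all of them share the same body object."""
    return [Message(f"to:{r};".encode("utf-8"), body) for r in recipients]


body = b"accident in hashalom"
messages = fan_out(body, ["alice", "bob", "carol"])

# Every message points at the single body instance; nothing was copied.
shared = all(m.buffers[1] is body for m in messages)
```

Only the small per-recipient header is allocated for each notification, so the cost of fanning out a large body to many subscribers stays close to constant per subscriber.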

Class Diagram (Application)

Class Diagram (TS, CS)

Stress Testing
• Measure publish-notify turnaround time
  • 1 ms resolution using MM timer, avg. of 30
  • Increasing client and/or topic load
  • Several room topologies examined
• Results:
  • Exponential-like climb
  • TS increase: better times
  • CS increase: higher maximum client count, but times not improved
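The measurement procedure, timing a publish-notify round trip and averaging over 30 runs, can be sketched as a small harness. It is an illustration under stated assumptions: `measure_turnaround` and `fake_round_trip` are hypothetical names, `time.perf_counter` replaces the MM timer used in the original tests, and the round trip is simulated with a short sleep instead of a real cluster.

```python
import time

calls = 0


def fake_round_trip():
    """Stand-in for 'publish, then wait for the notification to arrive'."""
    global calls
    calls += 1
    time.sleep(0.002)


def measure_turnaround(publish_and_wait, runs=30):
    """Average wall-clock time of `runs` round trips, in milliseconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        publish_and_wait()
        total += time.perf_counter() - start
    return total / runs * 1000.0


avg_ms = measure_turnaround(fake_round_trip)
```

Against a real server, `publish_and_wait` would publish on a topic and block until a subscribed client sees the notification. Repeating the measurement at increasing client and topic counts would give the load curve described on this slide.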
