
High-performance Multithreaded Producer-consumer Designs – from Theory to Practice


Presentation Transcript


  1. High-performance Multithreaded Producer-consumer Designs – from Theory to Practice Bill Scherer (University of Rochester) Doug Lea (SUNY Oswego) Rochester Java Users’ Group April 11, 2006

  2. java.util.concurrent • General purpose toolkit for developing concurrent applications • No more “reinventing the wheel”! • Goals: “Something for Everyone!” • Make some problems trivial to solve by everyone • Develop thread-safe classes, such as servlets, built on concurrent building blocks like ConcurrentHashMap • Make some problems easier to solve by concurrent programmers • Develop concurrent applications using thread pools, barriers, latches, and blocking queues • Make some problems possible to solve by concurrency experts • Develop custom locking classes, lock-free algorithms Scherer & Lea

  3. Overview of j.u.c
  • Concurrent Collections: ConcurrentMap, ConcurrentHashMap, CopyOnWriteArray{List,Set}
  • Synchronizers: CountDownLatch, Semaphore, Exchanger, CyclicBarrier
  • Locks (java.util.concurrent.locks): Lock, Condition, ReadWriteLock, AbstractQueuedSynchronizer, LockSupport, ReentrantLock, ReentrantReadWriteLock
  • Atomics (java.util.concurrent.atomic): Atomic[Type], Atomic[Type]Array, Atomic[Type]FieldUpdater, Atomic{Markable,Stamped}Reference
  • Executors: Executor, ExecutorService, ScheduledExecutorService, Callable, Future, ScheduledFuture, Delayed, CompletionService, ThreadPoolExecutor, ScheduledThreadPoolExecutor, AbstractExecutorService, Executors, FutureTask, ExecutorCompletionService
  • Queues: BlockingQueue, ConcurrentLinkedQueue, LinkedBlockingQueue, ArrayBlockingQueue, SynchronousQueue, PriorityBlockingQueue, DelayQueue
  Scherer & Lea

  4. Key Functional Groups • Executors, Thread pools and Futures • Execution frameworks for asynchronous tasking • Concurrent Collections: • Queues, blocking queues, concurrent hash map, … • Data structures designed for concurrent environments • Locks and Conditions • More flexible synchronization control • Read/write locks • Synchronizers: Semaphore, Latch, Barrier • Ready made tools for thread coordination • Atomic variables • The key to writing lock-free algorithms Scherer & Lea

  5. Part I: Theory Scherer & Lea

  6. Synchronous Queues • Synchronized communication channels • Producer awaits explicit ACK from consumer • Theory and practice of concurrency • Implementation of language synch. primitives (CSP handoff, Ada rendezvous) • Message passing software • java.util.concurrent.ThreadPoolExecutor Scherer & Lea

  7–8. Hanson’s Synch. Queue

  datum item;
  Semaphore sync(0), send(1), recv(0);

  void put(datum d) {
    send.acquire();
    item = d;
    recv.release();
    sync.acquire();
  }

  datum take() {
    recv.acquire();
    datum d = item;
    sync.release();
    send.release();
    return d;
  }

  Scherer & Lea
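The pseudocode above translates directly to Java; here is a minimal sketch on java.util.concurrent.Semaphore (the class name and generic signature are ours, not from the talk):

  import java.util.concurrent.Semaphore;

  // Sketch: Hanson's queue on j.u.c.Semaphore (names are illustrative).
  class HansonQueue<T> {
      private T item;
      private final Semaphore sync = new Semaphore(0); // consumer's ACK
      private final Semaphore send = new Semaphore(1); // slot free for a producer
      private final Semaphore recv = new Semaphore(0); // item ready for a consumer

      public void put(T d) throws InterruptedException {
          send.acquire();  // claim the slot
          item = d;
          recv.release();  // wake a consumer
          sync.acquire();  // wait for its ACK
      }

      public T take() throws InterruptedException {
          recv.acquire();  // wait for an item
          T d = item;
          sync.release();  // ACK the producer
          send.release();  // free the slot
          return d;
      }
  }

Note the three semaphore operations on each side; that is exactly the overhead the next slide points out.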


  9. Hanson’s Queue: Limitations • High overhead • 3 semaphore operations for put and take • Interleaved handshaking – likely to block • No obvious path to timeout support • Needed e.g. for j.u.c.ThreadPoolExecutor adaptive thread pool • Producer adds a worker or runs task itself • Consumer terminates if work unavailable Scherer & Lea

  10. Java 5 Version • Fastest known previous implementation • Optional FIFO fairness • Unfair mode stack-based → better locality • Big performance penalty for fair mode • Global lock covers two queues • (stacks for unfair mode) • One each for awaiting consumers, producers • At least one always empty Scherer & Lea

  11. Remainder of Part I • Introduction • Nonblocking Synchronization • Why use? • Nonblocking partial methods • Synchronous Queue Design • Conclusions Scherer & Lea

  12. Nonblocking Synchronization • Resilient to failure or delay of any thread • Optimistic update pattern: • Set-up operation (invisible to other threads) • Effect all at once (atomic) • Clean-up if needed (can be done by any thread) • Atomic compare-and-swap (CAS)

  bool CAS(word *ptr, word e, word n) {
    if (*ptr != e) return false;
    *ptr = n;
    return true;
  }

  Scherer & Lea
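In Java, CAS is exposed through the java.util.concurrent.atomic classes. A minimal sketch of the set-up/atomic-effect pattern described above (the counter class is ours, not from the slides):

  import java.util.concurrent.atomic.AtomicInteger;

  // Sketch: the optimistic update pattern via compareAndSet.
  class CasCounter {
      private final AtomicInteger value = new AtomicInteger(0);

      public int increment() {
          while (true) {
              int current = value.get();  // set-up: invisible to other threads
              int next = current + 1;
              if (value.compareAndSet(current, next))
                  return next;            // effect took place atomically
              // CAS failed: another thread got there first; retry
          }
      }
  }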

  13. Why Use Nonblocking Synch? • Locks • Performance (convoying, intolerance of page faults and preemption) • Semantic (deadlock, priority inversion) • Conceptual (scalability vs. complexity) • Transactional memory • Needs to support the general case • High overheads (currently) Scherer & Lea

  14. [Chart: synchronization techniques plotted by programmer effort vs. system performance: coarse locks, fine locks, canned NBS, ad hoc NBS, software TM (STM), HW TM] Scherer & Lea

  15. Linearizability [HW90] • Gold standard for correctness • Linearization point: where operations take place [Timeline, left to right: T1: Enqueue(a), T2: Enqueue(b); T3: Dequeue → a, T4: Dequeue → b] Scherer & Lea

  16. Linearizability [HW90] • Gold standard for correctness • Linearization point: where operations take place [Timeline, left to right: T1: Enqueue(a), T2: Enqueue(b); T3: Dequeue → b (!), T4: Dequeue → a (!)] Scherer & Lea

  17. Partial Operations • Totalized approach: return failure • Repeat until data retrieved (“try-in-a-loop”) • Heavy contention on data structures • Output depends on which thread retries first [Timeline: T3: Enqueue(a), T4: Enqueue(b); T1: Dequeue → b (!), T2: Dequeue → a (!)] Scherer & Lea
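In Java the totalized pattern looks roughly like the following sketch (our code, written over ConcurrentLinkedQueue); every failed poll() re-reads the queue head, which is where the contention comes from:

  import java.util.concurrent.ConcurrentLinkedQueue;

  class Totalized {
      // Sketch: "try-in-a-loop" turns the partial dequeue into a total one.
      static <T> T take(ConcurrentLinkedQueue<T> q) {
          T x;
          while ((x = q.poll()) == null)
              Thread.yield();  // spin until some producer supplies data
          return x;
      }
  }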

  18. Dual Linearizability • Break partial methods into two first-class halves: pre-blocking reservation, post-blocking follow-up [Timeline: T3: Enqueue(a), T4: Enqueue(b); T1: Dequeue → a, T2: Dequeue → b] Scherer & Lea

  19. Next Up: Synchronous Queues • Introduction • Nonblocking Synchronization • Synchronous Queue Design • Implementation • Performance • Conclusions Scherer & Lea

  20–21. Algorithmic Genealogy

                                          Fair mode     Unfair mode
  Source algorithm                        M&S Queue     Treiber’s Stack
  + Consumer blocking                     Dual Queue    Dual Stack
  + Producer blocking, timeout, cleanup   Fair SQ       Unfair SQ

  Scherer & Lea


  22. M&S Queue: Enqueue [Diagram: E1: CAS the last node’s next pointer from null to the new node; E2: swing the tail pointer to the new node] Scherer & Lea

  23. M&S Queue: Dequeue [Diagram: D1: read the data from the dummy node’s successor; D2: CAS the head pointer to that successor, which becomes the new dummy] Scherer & Lea

  24. The Dual Queue • Separate data, request nodes (flag bit) • Queue always holds either all data or all requests • Same behavior as M&S queue for data • Reservations are antisymmetric to data • dequeue enqueues a reservation node • enqueue satisfies oldest reservation • Tricky consistency checks needed • Dummy node can be datum or reservation • Extra state to watch out for (more corner cases) Scherer & Lea
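A hedged sketch of the node representation this implies (field names are ours; the real algorithm carries more state than shown): one node type covers both roles, and a reservation is satisfied by CASing its data slot.

  import java.util.concurrent.atomic.AtomicReference;

  // Sketch: a dual-queue node. The isRequest flag is the "flag bit";
  // a reservation starts with data == null and is satisfied when a
  // producer CASes data from null to an item.
  class Node {
      final boolean isRequest;
      final AtomicReference<Object> data;
      final AtomicReference<Node> next = new AtomicReference<Node>(null);

      Node(Object item, boolean isRequest) {
          this.isRequest = isRequest;
          this.data = new AtomicReference<Object>(item);
      }
  }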

  25–27. Dual Queue: Enq. (Requests) [Diagram, three steps: E1: read the dummy node’s next pointer; E2: CAS that reservation’s data pointer from nil to the satisfying data; E3: update the head pointer, making the satisfied reservation the new dummy] Scherer & Lea

  28. Synchronous Queue • Implementation extends dual queue • Consumers already block for producers • add blocking for the “other direction” • Add item ptr to data nodes • Consumers CAS from nil to “satisfying request” • Once non-nil, any thread can update head ptr • Timeout support • Producer CAS from nil back to self • Node reclaimed when it reaches head of queue: seen as fulfilled node Scherer & Lea
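Continuing the hedged node sketch from slide 24, the timeout idiom described above can be rendered roughly as follows; the node itself doubles as the cancellation sentinel:

  // Sketch: timeout support. A timed-out waiter CASes its data slot
  // from nil back to itself; if the CAS fails, another thread fulfilled
  // the node first and the operation actually succeeded.
  class Cancellation {
      static boolean tryCancel(Node node) {
          return node.data.compareAndSet(null, node);
      }
      static boolean isCancelled(Node node) {
          return node.data.get() == node;  // seen as fulfilled; reclaimed at head
      }
  }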

  29. The Test Environments • SunFire 6800 • 16 UltraSparc III processors @ 1.2 GHz • SunFire V40z • 4 AMD Opteron processors @ 2.4 GHz • Java SE 5.0 HotSpot JVM • Microbenchmark performance tests Scherer & Lea

  30. Synchronous Queue Performance [Chart: microbenchmark throughput on the 16-processor SunFire 6800; up to a 14x difference] Scherer & Lea

  31. ThreadPoolExecutor Impact [Chart: ThreadPoolExecutor throughput on the 16-processor SunFire 6800; up to a 10x difference] Scherer & Lea

  32. Next Up: Conclusions • Introduction • Nonblocking Synchronization • Synchronous Queue Design • Conclusions Scherer & Lea

  33. Conclusions • Low-overhead synchronous queues • Optional FIFO fairness • Fair mode extends dual queue • Unfair mode extends dual stack • No performance penalty • Up to 14x performance gain in SQ • Translates to 10x gain for TPE Scherer & Lea

  34. Future Work: Types of Scalability • Constant overhead for operations, irrespective of the number of threads • “Low-level” – doesn’t hurt scalability of apps • Spin locks (e.g. MCS), SQ • Overall throughput proportional to the number of concurrent threads • “High-level” – data structure itself • Can be obtained via elimination [ST95] • Stacks [HSY04]; queues [MNSS05]; exchangers Scherer & Lea

  35. Part II: Practice • Thread Creation Patterns • Loops, oneway messages, workers & pools • Executor framework • Advanced Topics • AbstractQueuedSynchronizer Scherer & Lea

  36. Autonomous Loops • Simple non-reactive active objects contain a run loop of the form: • public void run() { while (!Thread.interrupted()) doSomething(); } • Normally established with a constructor containing: • new Thread(this).start(); • Or by a specific start method • Perhaps also setting priority and daemon status • Normally also support other methods called from other threads • Requires standard safety measures • Common Applications • Animations, simulations, message buffer consumers, polling daemons that periodically sense the state of the world Scherer & Lea
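Assembled into a complete class, a minimal illustrative example (the name and body are ours):

  // Sketch: an autonomous-loop active object, per the pattern above.
  class Poller implements Runnable {
      private final Thread thread;

      Poller() {
          thread = new Thread(this);
          thread.setDaemon(true);  // optional daemon status
          thread.start();          // established in the constructor
      }

      public void run() {
          while (!Thread.interrupted())
              doSomething();
      }

      private void doSomething() {
          // e.g. poll the state of the world, advance an animation, ...
      }

      public void shutdown() { thread.interrupt(); }  // called from other threads
  }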

  37. Thread Patterns for Oneway Messages Scherer & Lea

  38. Thread-Per-Message Web Server

  class UnstableWebServer {
    public static void main(String[] args) throws IOException {
      ServerSocket socket = new ServerSocket(80);
      while (true) {
        final Socket connection = socket.accept();
        Runnable r = new Runnable() {
          public void run() { handleRequest(connection); }
        };
        new Thread(r).start();
      }
    }
  }

  • Potential resource exhaustion unless connection rate is limited • Threads aren’t free! • Don’t do this! Scherer & Lea

  39. Thread-Per-Object via Worker Threads • Establish a producer-consumer chain • Producer • Reactive method just places a message in a channel • Channel might be a buffer, queue, stream, etc • Message might be a Runnable command, event, etc • Consumer • Host contains an autonomous loop thread of form: • while (!Thread.interrupted()) { m = channel.take(); process(m); } • Common variants • Pools • Use more than one worker thread • Listeners • Separate producer and consumer in different objects Scherer & Lea

  40. Web Server Using Worker Thread

  class WebServer {
    BlockingQueue<Socket> queue = new LinkedBlockingQueue<Socket>();
    class Worker extends Thread {
      public void run() {
        try {
          while (!Thread.interrupted()) {
            Socket s = queue.take();
            handleRequest(s);
          }
        } catch (InterruptedException e) { }  // exit on interrupt
      }
    }
    public void start() throws Exception {
      new Worker().start();
      ServerSocket socket = new ServerSocket(80);
      while (true) {
        Socket connection = socket.accept();
        queue.put(connection);
      }
    }
    public static void main(String[] args) throws Exception {
      new WebServer().start();
    }
  }

  Scherer & Lea

  41. Channel Options • Unbounded queues • Can exhaust resources if clients faster than handlers • Bounded buffers • Can cause clients to block when full • Synchronous channels • Force client to wait for handler to complete previous task • Leaky bounded buffers • For example, drop oldest if full • Priority queues • Run more important tasks first • Streams or sockets • Enable persistence, remote execution • Non-blocking channels • Must take evasive action if put or take fail or time out Scherer & Lea
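Most of these options map directly onto j.u.c classes; a sketch (capacities are arbitrary):

  import java.util.concurrent.*;

  class Channels {
      // Unbounded queue: clients never block, but memory can be exhausted
      BlockingQueue<Runnable> unbounded = new LinkedBlockingQueue<Runnable>();
      // Bounded buffer: clients block when the buffer fills
      BlockingQueue<Runnable> bounded = new ArrayBlockingQueue<Runnable>(1024);
      // Synchronous channel: client waits for a handler to take each task
      BlockingQueue<Runnable> handoff = new SynchronousQueue<Runnable>();
      // Priority queue: more important tasks run first
      BlockingQueue<Runnable> priority = new PriorityBlockingQueue<Runnable>();
  }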

  42. Thread Pools • Use a collection of worker threads, not just one • Can limit maximum number and priorities of threads • Dynamic worker thread management • Sophisticated policy controls • Often faster than thread-per-message for I/O bound actions Scherer & Lea

  43. Web Server Using Executor Thread Pool • Executor implementations internalize the channel

  class PooledWebServer {
    Executor pool = Executors.newFixedThreadPool(7);
    public void start() throws IOException {
      ServerSocket socket = new ServerSocket(80);
      while (!Thread.interrupted()) {
        final Socket connection = socket.accept();
        Runnable r = new Runnable() {
          public void run() { handleRequest(connection); }
        };
        pool.execute(r);
      }
    }
    public static void main(String[] args) throws IOException {
      new PooledWebServer().start();
    }
  }

  Scherer & Lea

  44. Policies and Parameters for Thread Pools • The kind of channel used as task queue • Unbounded queue, bounded queue, synchronous hand-off, priority queue, ordering by task dependencies, stream, socket • Bounding resources • Maximum number of threads • Minimum number of threads • “Warm” versus on-demand threads • Keepalive interval until idle threads die • Later replaced by new threads if necessary • Saturation policy • Block, drop, producer-runs, etc • These policies and parameters can interact in subtle ways! Scherer & Lea
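These policies surface in ThreadPoolExecutor as constructor parameters plus a RejectedExecutionHandler; a sketch with arbitrary sizes (CallerRunsPolicy is the "producer-runs" option; AbortPolicy, DiscardPolicy, and DiscardOldestPolicy are the other standard handlers):

  import java.util.concurrent.*;

  class SaturationDemo {
      public static void main(String[] args) {
          // Sketch: bounded pool + bounded queue + "producer runs" on saturation.
          ThreadPoolExecutor pool = new ThreadPoolExecutor(
              2,                               // minimum (core) threads
              4,                               // maximum threads
              60L, TimeUnit.SECONDS,           // keepalive for idle extra threads
              new ArrayBlockingQueue<Runnable>(100),
              new ThreadPoolExecutor.CallerRunsPolicy());
          pool.execute(new Runnable() { public void run() { /* task */ } });
          pool.shutdown();
      }
  }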

  45. Pools in Connection-Based Designs • For systems with many open connections (sockets), but relatively few active at any given time • Multiplex the delegations to worker threads via polling • Requires underlying support for select/poll and nonblocking I/O • Supported in JDK1.4 java.nio Scherer & Lea

  46. The Executor Framework • Framework for asynchronous task execution • Standardize asynchronous invocation • Framework to execute Runnable and Callable tasks • Runnable: void run() • Callable<V>: V call() throws Exception • Separate submission from execution policy • Use anExecutor.execute(aRunnable) • Not new Thread(aRunnable).start() • Cancellation and shutdown support • Usually created via the Executors factory class • Configures flexible ThreadPoolExecutor • Customize shutdown methods, before/after hooks, saturation policies, queuing Scherer & Lea

  47. Executor • Decouple submission policy from task execution

  public interface Executor {
    void execute(Runnable command);
  }

  • Code which submits a task doesn't have to know in what thread the task will run • Could run in the calling thread, in a thread pool, in a single background thread (or even in another JVM!) • Executor implementation determines execution policy • Execution policy controls resource utilization, overload behavior, thread usage, logging, security, etc • Calling code need not know the execution policy Scherer & Lea
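For example, two trivial implementations behind the same interface; submitting code cannot tell them apart (class names are ours):

  import java.util.concurrent.Executor;

  // Sketch: execute() in the calling thread...
  class DirectExecutor implements Executor {
      public void execute(Runnable command) {
          command.run();
      }
  }

  // ...or in a brand-new thread per task.
  class ThreadPerTaskExecutor implements Executor {
      public void execute(Runnable command) {
          new Thread(command).start();
      }
  }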

  48. ExecutorService • Adds lifecycle management • ExecutorService supports both graceful and immediate shutdown

  public interface ExecutorService extends Executor {
    void shutdown();
    List<Runnable> shutdownNow();
    boolean isShutdown();
    boolean isTerminated();
    boolean awaitTermination(long timeout, TimeUnit unit)
        throws InterruptedException;
    // …
  }

  • Useful utility methods too • <T> T invokeAny(Collection<Callable<T>> tasks) • Executes the given tasks, returning the result of one that completed successfully (if any) • Others involving Future objects—covered later Scherer & Lea
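A small sketch of invokeAny, racing two illustrative tasks (the "replica" framing is ours):

  import java.util.*;
  import java.util.concurrent.*;

  class InvokeAnyDemo {
      public static void main(String[] args) throws Exception {
          ExecutorService exec = Executors.newFixedThreadPool(2);
          List<Callable<String>> tasks = new ArrayList<Callable<String>>();
          tasks.add(new Callable<String>() {
              public String call() { return "answer from replica 1"; }
          });
          tasks.add(new Callable<String>() {
              public String call() { return "answer from replica 2"; }
          });
          // Returns the result of whichever task completes successfully
          // first; the loser is cancelled.
          System.out.println(exec.invokeAny(tasks));
          exec.shutdown();
      }
  }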

  49. Creating Executors • Sample Executor implementations from Executors • newSingleThreadExecutor • A pool of one, working from an unbounded queue • newFixedThreadPool(int N) • A fixed pool of N, working from an unbounded queue • newCachedThreadPool • A variable size pool that grows as needed and shrinks when idle • newScheduledThreadPool • Pool for executing tasks after a given delay, or periodically Scherer & Lea
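A sketch of the last of these, scheduling a periodic task (the interval is arbitrary):

  import java.util.concurrent.*;

  class Heartbeat {
      public static void main(String[] args) {
          ScheduledExecutorService timer = Executors.newScheduledThreadPool(1);
          // Run every 5 seconds, starting immediately.
          timer.scheduleAtFixedRate(new Runnable() {
              public void run() { System.out.println("heartbeat"); }
          }, 0L, 5L, TimeUnit.SECONDS);
      }
  }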

  50. ThreadPoolExecutor • Numerous tuning parameters • Core and maximum pool size • New thread created on task submission until core size reached • New thread then created when queue full until maximum size reached • Note: unbounded queue means the pool won’t grow above core size • Maximum can be unbounded • Keep-alive time • Threads above the core size terminate if idle for more than the keep-alive time • Pre-starting of core threads, or else on demand Scherer & Lea
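These parameters map onto the general ThreadPoolExecutor constructor; a sketch with arbitrary values. Note the bounded queue: as the slide says, with an unbounded queue the pool would never grow past its core size.

  import java.util.concurrent.*;

  class TunedPool {
      public static void main(String[] args) {
          ThreadPoolExecutor pool = new ThreadPoolExecutor(
              4,                         // core pool size
              16,                        // maximum pool size
              30L, TimeUnit.SECONDS,     // keep-alive for threads above core
              new ArrayBlockingQueue<Runnable>(64));  // bounded task queue
          pool.prestartAllCoreThreads();  // "warm" threads instead of on-demand
          pool.shutdown();
      }
  }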
