Debunking then Duplicating Ultracomputer Performance Claims by Debugging the Combining Switches Eric Freudenthal and Allan Gottlieb {freudenthal, gottlieb}@nyu.edu
Talk Summary
• Review Ultracomputer combining networks
  • MIMD architecture expected to provide high performance for hot spot traffic & centralized coordination
• Duplicating & debunking
  • High hot spot latency, slow centralized coordination
  • Why?
• Minor improvements to architecture
  • Significantly reduced hot spot latency
  • Improved coordination performance
2³ PE computer with omega network
[Diagram: eight processing elements (PE0-PE7) connected through three stages of 2×2 switches (SW) to eight memory modules (MM0-MM7); routing at successive stages is decided by destination bits 2⁰, 2¹, 2².]
• NUMA connections
• “Dance Hall” layout: all processors equally distant from all memory modules
• “Boudoir” layout: processors & memory modules can be co-resident
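A one-line sketch of the destination-tag routing implied by the slide's 2⁰ 2¹ 2² annotation: the stage-k switch steers a request by bit k of the destination MM number. The port convention (0 = upper output) is our assumption:

    /* Destination-tag routing: the stage-k switch steers a request toward
     * MM number dest_mm using bit k of dest_mm (the slide's 2^0 2^1 2^2). */
    static inline int output_port(int dest_mm, int stage) {
        return (dest_mm >> stage) & 1;   /* 0 = upper output, 1 = lower (assumed) */
    }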
Network congestion due to polling of a single variable in MM3
[Diagram: the same omega network, with the switch queues in the “funnel” leading to MM3 highlighted in red.]
• Each PE has a single outstanding reference to the same variable
  • Low offered load
• These references serialize at MM3
• Switch queues in the “funnel” near MM3 fill
• High memory latency results
• If switches could “combine” references to a single variable
  • A single MM operation would satisfy multiple requests
  • Lower network congestion & latency
  • The NYU Ultracomputer does this
Fetch-and-add
• FAA(X, e) is an atomic operation
  • Fetches the old value of X and adds e to X
• Useful for busy-waiting coordination
• Ultracomputer switches combine FAAs
• FAA(X, 0) is equivalent to load X (sketch below)
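A minimal sketch of FAA's semantics using C11 atomics; the wrapper name faa is ours, but the mapping onto atomic_fetch_add is exact:

    #include <stdatomic.h>

    /* FAA(X, e): atomically fetch X's old value and add e to X.
     * FAA(X, 0) behaves exactly like an ordinary load of X. */
    static inline int faa(atomic_int *x, int e) {
        return atomic_fetch_add(x, e);   /* returns the value before the add */
    }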
Combining of Fetch & Add (and loads)
[Diagram: worked example. Start: X = 0. Four requests FAA(X,1), FAA(X,2), FAA(X,4), and FAA(X,8) enter the network. One switch combines FAA(X,1) and FAA(X,2) into FAA(X,3); another combines FAA(X,4) and FAA(X,8) into FAA(X,12); a third combines those into FAA(X,15), recording the first-serialized addend (lower port first, addend = 12) in its “wait buffer”. The MM performs the single FAA(X,15) and returns X's old value, 0; decombining in the switches produces the four replies 0, 4, 12, and 13. End: X = 15.]
• Semantics equivalent to some serialization (sketch of the combine/decombine logic below)
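A sketch (ours, not the Ultracomputer hardware logic) of how a switch combines two FAAs on one variable and later decombines the single reply; names and fields are illustrative:

    /* One FAA message as it traverses a switch (illustrative fields only). */
    typedef struct { int addr; int addend; } faa_msg;

    /* Combine: two FAAs on the same variable merge into one. The switch
     * forwards FAA(addr, e1 + e2) toward memory and saves the first
     * request's addend in its wait buffer for later decombining. */
    faa_msg combine(faa_msg first, faa_msg second, int *wait_buffer) {
        *wait_buffer = first.addend;
        faa_msg fwd = { first.addr, first.addend + second.addend };
        return fwd;
    }

    /* Decombine: v is the single reply from memory (X's prior value).
     * The first-serialized requester sees v; the second sees v plus the
     * first's addend, exactly as if the two FAAs ran back to back. */
    void decombine(int v, int wait_buffer, int *reply_first, int *reply_second) {
        *reply_first  = v;
        *reply_second = v + wait_buffer;
    }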
Coordination with fetch-and-add

Spin-locks:
    Shared int L = 1
    lock():
        while (faa(L, -1) < 1) {
            faa(L, +1)
            while (L < 1) ;
        }
    unlock():
        faa(L, +1)

Readers and Writers:
    constant int p = max readers
    Shared int C = p          // p resources
    Reader() {                // take 1 instance
        while (faa(C, -1) < 1) {
            faa(C, +1)
            while (C < 1) ;
        }
        read()
        faa(C, +1)
    }
    Writer() {                // take all p instances
        while (faa(C, -p) < p) {
            faa(C, +p)
            while (C < p) ;
        }
        write()
        faa(C, +p)
    }
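A runnable C11 rendering of the slide's spin-lock (a sketch; it assumes the polling read of L can be a relaxed atomic load):

    #include <stdatomic.h>

    typedef struct { atomic_int L; } faa_lock;   /* initialize L = 1 */

    void lock(faa_lock *k) {
        while (atomic_fetch_add(&k->L, -1) < 1) {   /* try to take the lock */
            atomic_fetch_add(&k->L, +1);            /* failed: undo the take */
            while (atomic_load_explicit(&k->L, memory_order_relaxed) < 1)
                ;                                   /* poll until it looks free */
        }
    }

    void unlock(faa_lock *k) {
        atomic_fetch_add(&k->L, +1);                /* release */
    }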
Characteristics of FAA Centralized Coordination Algorithms
• Many FAA coordination algorithms reference a small number of shared variables
  • Spin-locks and readers/writers reference one
• An uncontended spin- or readers/writers-lock acquisition generates one shared access
  • Including multiple readers in the absence of writers
• FAA barrier and queue algorithms have similar characteristics
Combining Queue Design
[Diagram: the Guibas & Liang systolic FIFO, a chain of cells each with “in”, “out”, and “chute” paths, beside the Ultracomputer combining queue, which adds an ALU per cell.]
• Background: Guibas & Liang systolic FIFO
• No associative memory required (functional sketch below)
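A functional sketch (ours, with illustrative names and a fixed capacity) of what the combining queue achieves: an arriving FAA merges with a queued FAA to the same address. The hardware obtains this effect systolically, as items slide past each other in the chutes, without the associative scan used here:

    #include <stddef.h>

    #define QCAP 16

    typedef struct { int addr; int addend; } req;

    typedef struct {
        req    slot[QCAP];
        size_t n;                  /* queued, not-yet-transmitted requests   */
        int    wait_addend[QCAP];  /* saved addends for decombining replies  */
        size_t n_wait;
    } combining_queue;

    /* Enqueue a request; if a queued request targets the same address,
     * merge the two into one combined FAA and save the earlier addend
     * so the eventual single reply can be decombined into two. */
    void enqueue(combining_queue *q, req r) {
        for (size_t i = 0; i < q->n; i++) {
            if (q->slot[i].addr == r.addr) {
                q->wait_addend[q->n_wait++] = q->slot[i].addend;
                q->slot[i].addend += r.addend;   /* forward FAA(addr, e1+e2) */
                return;
            }
        }
        if (q->n < QCAP)
            q->slot[q->n++] = r;   /* no match: ordinary enqueue (backpressure elided) */
    }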
Summary of Baseline Ultracomputer
• Architecture reasonable and well motivated
  • Switches not prohibitively expensive
  • Serialization-free coordination algorithms
  • Queues in switches permit high bandwidth
  • Low latency for random & mixed hot spot traffic
• NYU simulations (surprisingly) did not include 100% hot spot traffic
  • (Lee, Kruskal & Kuck did, but with different flow control)
• In fact combining is helpful, but not as good as expected
  • Queues near the hot memory fill; others stay nearly empty
  • Non-trivial queuing delays
  • Combining occurs only in the full queues
  • Low message “multiplicity”
Rest of this talk
• Debunking: High latency despite Ultra3 flow control
  • Algorithms that minimize hot spot traffic outperform centralized ones
• Deconstructing: Understanding the high latency
  • Reduced combining due to wait buffer exhaustion
  • Queuing delays in the network: reduced queue capacity helps
• Debugging: Improvements to combining switches
  • Larger wait buffer needed
  • Adaptive reduction of queue capacity when combining occurs
• Duplication: Centralized algorithms competitive
  • Much superior for concurrent-access locks
Ultra III “baseline” switches: Memory latency, one request per PE
[Plot: memory latency vs. accepted load, one curve per hot spot percentage: 100% with no combining (worst), 100% (roughly 4× ideal), 40% (roughly 2×), 20%, and 0-10% (near ideal).]
Two “Fixes” to the Ultra III Switch Design
• Problem: Full wait buffers reduce combining
  • “Sufficient” wait buffer capacity → 45% latency reduction
• Problem: Congestion in the “combining funnel”
  • Shortened queues → backpressure
    • Lower per-stage queuing delays
    • More non-empty queues → more combining, hence higher message “multiplicity”
  • Reduces latency another 30%
  • FAA algorithms now competitive
What is the “best” queue length?
• Problem
  • Non-hot-spot latency benefits from large queues
  • Hot spot latency benefits from small queues
• Solution (see the sketch after this list)
  • Detect switches engaged in combining
    • Multiple combined messages awaiting transmission
  • Adaptively reduce the queue capacity of these switches
  • Other switches unaffected
• Results
  • Reduced polling latency, good non-polling latency
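A minimal sketch of the adaptive rule; the constants are illustrative assumptions, not values from the paper:

    enum { FULL_CAP = 16, REDUCED_CAP = 4, COMBINING_THRESHOLD = 2 };

    /* combined_waiting: number of combined messages now awaiting
     * transmission in this switch. Returns the queue capacity to
     * enforce: shrink while the switch is visibly combining,
     * restore it otherwise. */
    int adaptive_capacity(int combined_waiting) {
        return (combined_waiting >= COMBINING_THRESHOLD) ? REDUCED_CAP : FULL_CAP;
    }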
Memory latency, 1024 PE systems, over a range of accepted load
[Plots: latency vs. accepted load for 100% hot spot, 20% hot spot, and uniform traffic, comparing four switch designs:]
• Baseline Ultra III switch: limited wait buffer, fixed queue size
• Waitbuf100: baseline + sufficient wait buffer
• Improved: Waitbuf100 + adaptive queue length
• Aggressive: Improved + combines from both ports & on first slice (potential clock rate reduction)
Mellor-Crummey & Scott (MCS): Local-spin coordination
• No hot spot polling
  • Each PE spins on a distinct shared variable in a co-located MM
  • Other parts of the algorithm may generate hot spot traffic
• Serialization-free barriers
  • Barrier satisfaction “disseminated” without generating hot spot traffic
  • Each processor has log2(N) rendezvous
• Locks: Global state in hot spot variables (sketch below)
  • Heads of linked lists (of blocked requestors)
  • Count of readers
  • Hot spot accesses benefit from combining
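For reference, the standard MCS queue lock rendered in C11 atomics; each thread spins only on its own node's flag, which on a NUMA (“boudoir”) machine can live in the PE's co-located MM. This is the textbook algorithm, not code from the talk:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct mcs_node {
        struct mcs_node *_Atomic next;
        atomic_bool locked;
    } mcs_node;

    typedef struct {
        mcs_node *_Atomic tail;   /* NULL when the lock is free */
    } mcs_lock;

    void mcs_acquire(mcs_lock *l, mcs_node *me) {
        atomic_store(&me->next, (mcs_node *)NULL);
        mcs_node *pred = atomic_exchange(&l->tail, me);  /* join the queue */
        if (pred != NULL) {                              /* lock is held */
            atomic_store(&me->locked, true);
            atomic_store(&pred->next, me);               /* link behind pred */
            while (atomic_load(&me->locked))             /* spin on OWN flag */
                ;
        }
    }

    void mcs_release(mcs_lock *l, mcs_node *me) {
        mcs_node *succ = atomic_load(&me->next);
        if (succ == NULL) {
            mcs_node *expected = me;
            if (atomic_compare_exchange_strong(&l->tail, &expected,
                                               (mcs_node *)NULL))
                return;                                  /* queue now empty */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                                        /* successor linking in */
        }
        atomic_store(&succ->locked, false);              /* hand lock over */
    }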
Synchronization: Barriers (MCS also serialization-free)
[Plots: barrier performance for two microbenchmarks; higher is better.]
• IntenseLoop: repeat { barrier }
• RealisticLoop: repeat { reference 15 or 30 shared vars; barrier }
Reader-Writer Experiment
• Loop (sketch below):
  • Determine if reader or writer
  • “Sleep” for 100 cycles
  • Lock
  • Reference 10 shared variables
  • Unlock
• Reader-writer mix
  • All readers; all writers
  • 1 expected writer: P(writer) = 1/N
• Plots on next slides
  • Rate at which reader and writer locks are granted (unit = rate/kc)
  • Greater values indicate greater progress
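A sketch of each PE's loop as the slide describes it; every function name here is a hypothetical placeholder for the experiment's actual primitives:

    #include <stdbool.h>

    /* Hypothetical placeholders for the experiment's primitives. */
    bool rand_is_writer(void);       /* true with the mix's P(writer) */
    void idle_cycles(int n);         /* "sleep" for n cycles          */
    void read_lock(void);  void read_unlock(void);
    void write_lock(void); void write_unlock(void);
    void touch_shared_vars(int n);   /* reference n shared variables  */

    void pe_benchmark_loop(void) {
        for (;;) {
            bool writer = rand_is_writer();   /* reader or writer?        */
            idle_cycles(100);                 /* "sleep" for 100 cycles   */
            if (writer) write_lock(); else read_lock();
            touch_shared_vars(10);            /* reference 10 shared vars */
            if (writer) write_unlock(); else read_unlock();
        }
    }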
All Readers / All Writers
[Plots: lock-grant rates; higher is better.]
• All readers
  • Combining helps MCS
  • Serialization-free FAA algorithm faster
• All writers
  • Essentially a highly contended semaphore
  • Only the aggressive design competes
1 Expected Writer
[Plots: reader and writer grant rates; higher is better.]
• Reader performance
  • FAA faster
  • MCS benefits from combining
• Writer performance
  • FAA generally faster
  • MCS benefits from combining
Conclusions
• “Improved” architecture superior
  • Large wait buffers decrease hot spot latency
  • Adaptive queue capacity decreases latency
    • General technique?
• Performance of FAA algorithms
  • Readers/writers competitive with MCS
    • Much superior when readers dominate
    • Requires combining
  • Barrier near MCS
    • Faster with the aggressive design
Relevance & Future Work
• Large shared memory systems are manufactured
• Combining is not restricted to omega networks
  • Return messages must be routed to the combining sites
• Combining demonstrated to be useful for inter-process coordination
• Applying adaptive queue capacity modulation to other domains
  • Such as responding to flash-flood & DoS traffic
• An analytic model of queuing delays for hot spot combining is under development
Difficulties with aggressive (2-input, coupled) queues
[Diagram: a dual-input combining queue built from two single-input combining queues, with decoupled ALUs and an output mux.]
• Single-input queues are simpler
  • Dual-input combining queue built from two single-input combining queues
  • Messages from different ports are ineligible for combining
• Decoupled ALUs
  • Idea: remove the ALU from the transmission path
  • Shorter clock interval: max(transmission, ALU) rather than their sum
  • But the head item cannot combine
    • Combining less likely: needs ≥ 3 enqueued messages
END • Questions?