Lazy Binary-Splitting: A Run-Time Adaptive Work-Stealing Scheduler Alexandros Tzannes, George C. Caragea, Rajeev Barua, Uzi Vishkin University of Maryland, College Park Tuesday Jan. 12th, 2010 PPoPP, Bangalore
Outline Introduction/Motivation Work-Stealing Background Lazy Binary Splitting Experimental Evaluation Conclusion
We present a dynamic algorithm, based on work-stealing, for scheduling parallel tasks to cores. Target: shared-memory UMA machines. Contribution: our work-stealer adapts the granularity of parallelism to run-time conditions by avoiding the creation of excessive parallelism when the system is heavily loaded.
Why Dynamic Scheduling?
• Static scheduling
 • Easy, e.g., split do-all iterations by the number of threads
 • Works well in some cases, e.g., a similar amount of work per iteration
 • Can cause load imbalance in other cases, e.g., nested do-alls or load-imbalanced iterations
• Dynamic scheduling
 • More complex (some overheads): the compiler or programmer must worry about parallelism overheads and grain size
 • Great load balance and performance, even/especially for irregular or nested parallelism
 • Dynamic coarsening of parallelism
Why Nested Parallelism?

    void quicksort(int A[], int start, int end) {
      int pivot = partition(A, start, end);
      spawn(0,1) {             // spawn two threads; $ is the spawned thread's ID (0 or 1)
        if ($ == 0) quicksort(A, start, pivot);
        else        quicksort(A, pivot+1, end);
      }
    }

• Ease of programming: it occurs naturally in many programs, e.g., divide-and-conquer (irregular parallelism)
• Outer parallelism doesn't create enough parallelism
• Outer parallelism creates load-imbalanced threads
• Modularity: a programmer should be able to call a function creating parallelism from sequential or parallel contexts alike.
Outline Introduction/Motivation Work-Stealing Background Lazy Binary Splitting Experimental Evaluation Conclusion
Work-Stealing Background
(Figure: each core owns a deque in a shared work-pool; the deques hold task descriptors.)
• Parallel do-all loops can introduce a huge number of potentially fine-grain iterations
• A Task Descriptor (TD) is a wrapper that contains multiple fine-grain iterations (sketched below)
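As a rough illustration (not the paper's actual data structure), a task descriptor for a do-all loop might simply record the remaining iteration range, the loop body, and a scheduling threshold; the field names below are hypothetical:

    // Sketch of a task descriptor (TD) for a do-all loop; all names are hypothetical.
    struct TaskDescriptor {
        int  first, last;                      // remaining iteration range [first, last)
        void (*body)(int iteration);           // the do-all loop body
        int  ppt;                              // profitable parallelism threshold (used by LBS later in the talk)
        int  iterations() const { return last - first; }
    };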
Work-Stealing Scheduling
Work-stealing overview:
• Scales
• Good locality [Acar et al., The data locality of work stealing]
• Good memory footprint [≤ P·S1]
• Provably efficient [expected time T1/P + O(T∞)]
• Low synchronization overheads
Stealing phase: randomized probing from all idle processors
Eager Binary Splitting (EBS)
Focusing now on parallel do-alls…
• When a TD with n iterations is created, it is recursively split: TDs with n/2, n/4, …, 1 iterations are pushed on the deque (see the sketch below)
• It may not be profitable to split all the way down to 1 iteration
• Splits require deque transactions and memory fences, which are expensive and degrade performance
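A minimal sketch of the eager-splitting idea (using the hypothetical TaskDescriptor above; deque helpers are also hypothetical, and this is not TBB's or the paper's actual implementation): the worker keeps halving its TD, pushing one half onto its deque, until a single iteration remains.

    // Sketch of eager binary splitting (EBS) down to single iterations.
    // TDs later popped or stolen from a deque are run with run_ebs as well, so they split further.
    void run_ebs(TaskDescriptor td) {
        while (td.iterations() > 1) {
            int mid = td.first + td.iterations() / 2;
            TaskDescriptor upper = td;       // upper half becomes stealable work
            upper.first = mid;
            push_on_local_deque(upper);      // hypothetical deque operation (requires a memory fence)
            td.last = mid;                   // keep splitting the lower half
        }
        for (int i = td.first; i < td.last; ++i) td.body(i);  // run the remaining iteration
    }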
EBS's Splitting: Overall View
(Figure: splitting tree for N = 4096 iterations on 64 cores. EBS splits all the way down to single iterations, so the tree reaches width N and depth log N + 1. Excessive splitting = performance loss.)
Solutions to Excessive Splitting
To reduce excessive splitting, TBB offers two options:
• Simple Partitioner (SP)
• Auto Partitioner (AP)
#1: Simple Partitioner EBS (SP)
Stop splitting a TD if it contains fewer than sst (stop-splitting-threshold) iterations, i.e., combine sst iterations.
(Flowchart: if TD.#it > TD.sst, split the TD and place half on the deque; otherwise execute the remaining iterations.)
SP's Splitting: Overall View
(Figure: splitting tree for N = 4096 iterations on 64 cores with sst = 2. Splitting stops at chunks of 2 iterations, so the tree stops at width N/2 and depth log N instead of width N and depth log N + 1.)
What determines a good sst?
• Small enough that enough parallelism is created
 • Keep all processors busy
 • Load balance
• Large enough to avoid excessive splitting
Goal: find a happy medium. How?
TBB: Suggested Procedure for Determining sst
TBB calls sst "grain size". This is verbatim from TBB's reference manual:
1. Set the grainsize parameter to 10,000. This value is high enough to amortize scheduler overhead sufficiently for practically all loop bodies, but may unnecessarily limit parallelism.
2. Run your algorithm on one processor.
3. Start halving the grainsize parameter and see how much the algorithm slows down as the value decreases. A slowdown of about 5-10% is a good setting for most purposes.
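For concreteness, this is roughly how a tuned grain size is supplied to TBB's simple partitioner (a generic illustration, not code from the talk; the loop body and the grainsize value 512 are arbitrary):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>
    #include <cstddef>

    void scale(float* a, std::size_t n) {
        // simple_partitioner splits the range until chunks reach the grain size (512 here),
        // so the grain size must be tuned manually per loop, per dataset, and per platform.
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, n, /*grainsize=*/512),
            [=](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i) a[i] *= 2.0f;
            },
            tbb::simple_partitioner());
    }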
SP's Grainsize Drawbacks
SP's suggested procedure for determining sst is:
• Manual for each do-all loop, and requires multiple re-executions
• Not performance portable
 • To different datasets
 • To different platforms
• Not adaptive to context, e.g., executing a do-all creating 10,000 iterations from the original serial thread vs. from a nested context
• A fixed grain size
Summary of SP Problems
• Manual sst: tedious, hurts productivity
• Fixed sst: code is not performance portable
• Excessive splitting: performance penalty
#2: Auto Partitioner EBS (AP)
When a TD with N iterations is created, we want it split into enough TDs to create enough parallelism, but not too much:
• Split it (recursively) into K·P chunks
• P is the number of cores, K is a small constant
(Flowchart: if TD.chunks > 1, split the TD and place half on the deque; otherwise execute the remaining iterations.)
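For comparison with the simple-partitioner snippet above, this is roughly how the auto partitioner is requested in TBB (again a generic illustration, not code from the talk):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>
    #include <cstddef>

    void scale_ap(float* a, std::size_t n) {
        // No grain size is supplied; auto_partitioner decides how far to split,
        // aiming for roughly a small multiple (K) of the number of workers (P).
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, n),
            [=](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i) a[i] *= 2.0f;
            },
            tbb::auto_partitioner());
    }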
AP's Splitting: Overall View
(Figure: splitting tree for N = 4096 iterations on P = 64 cores with chunks = 2·P, i.e., K = 2. Splitting stops once 2·P = 128 chunks of 32 iterations each are produced, at width 128 and depth 8, instead of width N and depth log N + 1.)
Comparing AP to SP
• No manual sst (grain size)
• Performance portable:
 • To different datasets (somewhat): not for small amounts of parallelism
 • To different platforms (#cores)
…but AP is NOT adaptive to context:
• For n levels of nesting, (K·P)^n TDs are created
• Excessive splitting! (worked out below)
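To make the growth concrete, using the figures from the previous slide (P = 64 cores, K = 2): one level of nesting creates K·P = 128 TDs, two levels create (K·P)² = 16,384 TDs, and three levels create (K·P)³ ≈ 2.1 million TDs, far more work descriptors than 64 cores can profitably use.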
Outline Introduction/Motivation Work-Stealing Background Lazy Binary Splitting Experimental Evaluation Conclusion
Our Approach: Lazy Binary Splitting (LBS)
• 1st insight: It is unlikely to be profitable to split a TD to create more work for other cores to steal if they are busy! How do we know if others are busy?
• 2nd insight: We can check whether a processor's local deque is empty as an approximation of whether the other processors are likely to be busy:
 deque non-empty → TDs not being stolen → others are busy
• LBS: if the local deque is not empty, postpone splitting by executing a few (ppt) iterations of the TD locally, then check again.
• Run-time granularity adaptation based on load
Our Approach: Lazy Binary Splitting (LBS) [2]
• The profitable parallelism threshold (ppt) ensures that extremely fine-grain parallelism is coarsened
• ppt also ensures deque checks are not performed too frequently, which could harm performance
• ppt is determined by the compiler (statically)
(Flowchart: if TD.#it ≤ TD.ppt, execute the remaining iterations; otherwise check the local deque: if it is empty, split the TD and place half on the deque; if it is non-empty, execute ppt iterations and check again. A sketch follows below.)
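A minimal sketch of that loop (reusing the hypothetical TaskDescriptor and deque helpers from the earlier sketches; not the paper's actual implementation):

    // Sketch of lazy binary splitting (LBS); helper names are hypothetical.
    // As with run_ebs, TDs popped or stolen from a deque are run with run_lbs as well.
    void run_lbs(TaskDescriptor td) {
        while (td.iterations() > td.ppt) {
            if (local_deque_is_empty()) {
                // No stealable work left locally: others may be idle, so split.
                int mid = td.first + td.iterations() / 2;
                TaskDescriptor upper = td;
                upper.first = mid;
                push_on_local_deque(upper);    // expose the upper half for stealing
                td.last = mid;
            } else {
                // Deque non-empty: others are likely busy, so postpone splitting
                // and execute ppt iterations locally before checking again.
                for (int i = 0; i < td.ppt; ++i) td.body(td.first + i);
                td.first += td.ppt;
            }
        }
        // Few iterations left (<= ppt): not profitable to split, run them serially.
        for (int i = td.first; i < td.last; ++i) td.body(i);
    }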
LBS: High Level Picture
(Figure: splitting under LBS for N = 4096 iterations on P = 64 cores with ppt = 8 grain size. Splitting proceeds only while local deques are empty, so the tree stays narrow; once deques are non-empty, cores execute iterations locally in chunks of ppt = 8 instead of splitting further.)
Why LBS Is Better
• Automatic
• Performance portable
 • Across datasets
 • Across platforms
• Adaptive to context
• Adapts the granularity of parallelism based on load during execution
Outline Introduction/Motivation Work-Stealing Background Lazy Binary Splitting Experimental Evaluation Conclusion
Evaluation Platform: XMT
XMT's goals:
• Ease of programming
 • Easy workflow from PRAM algorithm to XMT program (taught to undergrads and high-school students)
 • Backwards compatibility with serial code
• 1 heavy core for serial code (the Master TCU)
• A plurality of simple parallel cores (TCUs)
• Good performance
 • High-bandwidth, low-latency interconnect
 • Efficient HW scheduling of outer parallelism
 • Fast & scalable global synchronization
Evaluation Platform
We chose XMT because it:
• Is easy to program: productivity is important for general-purpose parallelism
• Does not compromise on performance
• Has more than a few (4 or 8) cores, which lets us demonstrate the scalability of LBS
We ran our benchmarks on a 75MHz XMT FPGA prototype: 64 TCUs in 8 clusters, 8 shared 32K L1 cache modules, one mult/div unit per cluster, 4 prefetch buffers per TCU, and 32 integer registers. No floating-point support.
Comparing LBS to AP
(Figure: per-benchmark performance of the three configurations — APdefault, APxmt, and LBS.)
Comparing LBS to AP: Results
Overall:
• LBS is 16.2% faster than APxmt
• LBS is 38.9% faster than APdefault
• LBS is faster on fine-grain iterations (FW, bfs, SpMV)
• LBS & AP are comparable on the rest
Comparing LBS to SP
• SP needs manual tuning (sst); LBS doesn't
• Two training scenarios:
 • Common: SPtr/ex — train on one dataset, execute on a different one
 • Uncommon: SPex/ex — train and execute on the same dataset
• The gap between SPtr/ex and SPex/ex shows SP's lack of performance portability across datasets.
LBS vs. SP: Results
Overall (vs. SPtr/ex):
• LBS is 19.5% faster on average, and up to 65.7% faster (SpMV)
• LBS falls behind only on tsp, and only by 2.2%
Overall (vs. SPex/ex):
• LBS is 3.8% faster on average
• LBS only falls behind on tsp (by 2.2%)
So even in the unrealistic case LBS is preferable.
Additional Comparisons
• Experimental comparisons vs. other work-stealing algorithms:
 • SWS, EBS1, LBS1
 • Serializing inner parallelism
• Quantitative comparison of schedulers in terms of:
 • # deque transactions
 • # synchronization points needed
…read the paper!
LBS's Scalability: Speedups vs. 1 TCU
• Super-linear speedups are explained by complex cache behavior
• The average speedup is linear: LBS is scalable
Performance Benefits: Speedups vs. Serial on the MTCU
• Good speedups even for irregular benchmarks (qs, tsp, queens, bfs)
• XMT & LBS is a promising combination
Conclusions
• LBS significantly reduces splitting overheads and delivers superior performance
• The combination of XMT & LBS seems promising for general-purpose parallel computing
• How will LBS perform on traditional multi-cores?