Lazy Binary-Splitting: A Run-Time Adaptive Work-Stealing Scheduler Alexandros Tzannes, George C. Caragea, Rajeev Barua, Uzi Vishkin University of Maryland, College Park Tuesday Jan. 12th, 2010 PPoPP, Bangalore
Outline Introduction/Motivation Work-Stealing Background Lazy Binary Splitting Experimental Evaluation Conclusion
We present a dynamic algorithm, based on work-stealing, for scheduling parallel tasks to cores. Target: shared-memory UMA machines. Contribution: our work-stealer adapts the granularity of parallelism to run-time conditions by avoiding the creation of excessive parallelism when the system is heavily loaded.
Why Dynamic Scheduling?
• Static scheduling
 • Easy, e.g., split do-all iterations by the number of threads
 • Works well in some cases, e.g., a similar amount of work per iteration
 • Can cause load imbalance in other cases, e.g., nested do-alls or load-imbalanced iterations
• Dynamic scheduling
 • More complex (some overheads): the compiler or programmer must worry about parallelism overheads and grain size
 • Great load balance and performance, even/especially for irregular or nested parallelism
 • Dynamic coarsening of parallelism
Why Nested Parallelism?

    void quicksort(int A[], int start, int end) {
      int pivot = partition(A, start, end);
      spawn(0,1) {             // spawn two threads; $ is the spawned thread's ID (0 or 1)
        if ($ == 0) quicksort(A, start, pivot);
        else        quicksort(A, pivot+1, end);
      }
    }

• Ease of programming: it occurs naturally in many programs, e.g., divide-and-conquer (irregular parallelism)
• Outer parallelism doesn't create enough parallelism
• Outer parallelism creates load-imbalanced threads
• Modularity: a programmer should be able to call a function creating parallelism from sequential or parallel contexts alike.
Outline Introduction/Motivation Work-Stealing Background Lazy Binary Splitting Experimental Evaluation Conclusion
Work-Stealing Background
(Figure: each core owns a deque in a shared work-pool; the deques hold task descriptors.)
• Parallel do-all loops can introduce a huge number of potentially fine-grain iterations
• A Task Descriptor (TD) is a wrapper that contains multiple fine-grain iterations (sketched below)
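As a rough illustration (not the paper's actual data structure), a task descriptor for a do-all loop might simply record the remaining iteration range, the loop body, and a scheduling threshold; the field names below are hypothetical:

    // Sketch of a task descriptor (TD) for a do-all loop; all names are hypothetical.
    struct TaskDescriptor {
        int  first, last;                      // remaining iteration range [first, last)
        void (*body)(int iteration);           // the do-all loop body
        int  ppt;                              // profitable parallelism threshold (used by LBS later in the talk)
        int  iterations() const { return last - first; }
    };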
Work-Stealing Scheduling
Work-stealing overview:
• Scales
• Good locality [Acar et al., The data locality of work stealing]
• Good memory footprint [≤ P·S1]
• Provably efficient [expected time T1/P + O(T∞)]
• Low synchronization overheads
Stealing phase: randomized probing from all idle processors
Eager Binary Splitting (EBS)
Focusing now on parallel do-alls…
• When a TD with n iterations is created, it is recursively split: TDs with n/2, n/4, …, 1 iterations are pushed on the deque (see the sketch below)
• It may not be profitable to split all the way down to 1 iteration
• Splits require deque transactions and memory fences, which are expensive and degrade performance
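A minimal sketch of the eager-splitting idea (using the hypothetical TaskDescriptor above; deque helpers are also hypothetical, and this is not TBB's or the paper's actual implementation): the worker keeps halving its TD, pushing one half onto its deque, until a single iteration remains.

    // Sketch of eager binary splitting (EBS) down to single iterations.
    // TDs later popped or stolen from a deque are run with run_ebs as well, so they split further.
    void run_ebs(TaskDescriptor td) {
        while (td.iterations() > 1) {
            int mid = td.first + td.iterations() / 2;
            TaskDescriptor upper = td;       // upper half becomes stealable work
            upper.first = mid;
            push_on_local_deque(upper);      // hypothetical deque operation (requires a memory fence)
            td.last = mid;                   // keep splitting the lower half
        }
        for (int i = td.first; i < td.last; ++i) td.body(i);  // run the remaining iteration
    }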
EBS's Splitting: Overall View
(Figure: splitting tree for N = 4096 iterations on 64 cores. EBS splits all the way down to single iterations, so the tree reaches width N and depth log N + 1. Excessive splitting = performance loss.)
Solutions to Excessive Splitting
To reduce excessive splitting, TBB offers two options:
• Simple Partitioner (SP)
• Auto Partitioner (AP)
#1: Simple Partitioner EBS (SP)
Stop splitting a TD if it contains fewer than sst (stop-splitting-threshold) iterations, i.e., combine sst iterations.
(Flowchart: if TD.#it > TD.sst, split the TD and place half on the deque; otherwise execute the remaining iterations.)
SP's Splitting: Overall View
(Figure: splitting tree for N = 4096 iterations on 64 cores with sst = 2. Splitting stops at chunks of 2 iterations, so the tree stops at width N/2 and depth log N instead of width N and depth log N + 1.)
What determines a good sst?
• Small enough that enough parallelism is created
 • Keep all processors busy
 • Load balance
• Large enough to avoid excessive splitting
Goal: find a happy medium. How?
TBB: Suggested Procedure for Determining sst
TBB calls sst "grain size". This is verbatim from TBB's reference manual:
1. Set the grainsize parameter to 10,000. This value is high enough to amortize scheduler overhead sufficiently for practically all loop bodies, but may unnecessarily limit parallelism.
2. Run your algorithm on one processor.
3. Start halving the grainsize parameter and see how much the algorithm slows down as the value decreases. A slowdown of about 5-10% is a good setting for most purposes.
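For concreteness, this is roughly how a tuned grain size is supplied to TBB's simple partitioner (a generic illustration, not code from the talk; the loop body and the grainsize value 512 are arbitrary):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>
    #include <cstddef>

    void scale(float* a, std::size_t n) {
        // simple_partitioner splits the range until chunks reach the grain size (512 here),
        // so the grain size must be tuned manually per loop, per dataset, and per platform.
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, n, /*grainsize=*/512),
            [=](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i) a[i] *= 2.0f;
            },
            tbb::simple_partitioner());
    }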
SP's Grainsize Drawbacks
SP's suggested procedure for determining sst is:
• Manual for each do-all loop, and requires multiple re-executions
• Not performance portable
 • To different datasets
 • To different platforms
• Not adaptive to context, e.g., executing a do-all creating 10,000 iterations from the original serial thread vs. from a nested context
• A fixed grain size
Summary of SP Problems
• Manual sst: tedious, hurts productivity
• Fixed sst: code is not performance portable
• Excessive splitting: performance penalty
#2: Auto Partitioner EBS (AP)
When a TD with N iterations is created, we want it split into enough TDs to create enough parallelism, but not too much:
• Split it (recursively) into K·P chunks
• P is the number of cores, K is a small constant
(Flowchart: if TD.chunks > 1, split the TD and place half on the deque; otherwise execute the remaining iterations.)
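For comparison with the simple-partitioner snippet above, this is roughly how the auto partitioner is requested in TBB (again a generic illustration, not code from the talk):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <tbb/partitioner.h>
    #include <cstddef>

    void scale_ap(float* a, std::size_t n) {
        // No grain size is supplied; auto_partitioner decides how far to split,
        // aiming for roughly a small multiple (K) of the number of workers (P).
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, n),
            [=](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i) a[i] *= 2.0f;
            },
            tbb::auto_partitioner());
    }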
AP's Splitting: Overall View
(Figure: splitting tree for N = 4096 iterations on P = 64 cores with chunks = 2·P, i.e., K = 2. Splitting stops once 2·P = 128 chunks of 32 iterations each are produced, at width 128 and depth 8, instead of width N and depth log N + 1.)
Comparing AP to SP
• No manual sst (grain size)
• Performance portable:
 • To different datasets (somewhat): not for small amounts of parallelism
 • To different platforms (#cores)
…but AP is NOT adaptive to context:
• For n levels of nesting, (K·P)^n TDs are created
• Excessive splitting! (worked out below)
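To make the growth concrete, using the figures from the previous slide (P = 64 cores, K = 2): one level of nesting creates K·P = 128 TDs, two levels create (K·P)² = 16,384 TDs, and three levels create (K·P)³ ≈ 2.1 million TDs, far more work descriptors than 64 cores can profitably use.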
Outline Introduction/Motivation Work-Stealing Background Lazy Binary Splitting Experimental Evaluation Conclusion
Our Approach: Lazy Binary Splitting (LBS)
• 1st insight: It is unlikely to be profitable to split a TD to create more work for other cores to steal if they are busy! How do we know if others are busy?
• 2nd insight: We can check whether a processor's local deque is empty as an approximation of whether the other processors are likely to be busy:
 deque non-empty → TDs not being stolen → others are busy
• LBS: if the local deque is not empty, postpone splitting by executing a few (ppt) iterations of the TD locally, then check again.
• Run-time granularity adaptation based on load
Our Approach: Lazy Binary Splitting (LBS) [2]
• The profitable parallelism threshold (ppt) ensures that extremely fine-grain parallelism is coarsened
• ppt also ensures deque checks are not performed too frequently, which could harm performance
• ppt is determined by the compiler (statically)
(Flowchart: if TD.#it ≤ TD.ppt, execute the remaining iterations; otherwise check the local deque: if it is empty, split the TD and place half on the deque; if it is non-empty, execute ppt iterations and check again. A sketch follows below.)
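A minimal sketch of that loop (reusing the hypothetical TaskDescriptor and deque helpers from the earlier sketches; not the paper's actual implementation):

    // Sketch of lazy binary splitting (LBS); helper names are hypothetical.
    // As with run_ebs, TDs popped or stolen from a deque are run with run_lbs as well.
    void run_lbs(TaskDescriptor td) {
        while (td.iterations() > td.ppt) {
            if (local_deque_is_empty()) {
                // No stealable work left locally: others may be idle, so split.
                int mid = td.first + td.iterations() / 2;
                TaskDescriptor upper = td;
                upper.first = mid;
                push_on_local_deque(upper);    // expose the upper half for stealing
                td.last = mid;
            } else {
                // Deque non-empty: others are likely busy, so postpone splitting
                // and execute ppt iterations locally before checking again.
                for (int i = 0; i < td.ppt; ++i) td.body(td.first + i);
                td.first += td.ppt;
            }
        }
        // Few iterations left (<= ppt): not profitable to split, run them serially.
        for (int i = td.first; i < td.last; ++i) td.body(i);
    }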
LBS: High Level Picture
(Figure: splitting under LBS for N = 4096 iterations on P = 64 cores with ppt = 8 grain size. Splitting proceeds only while local deques are empty, so the tree stays narrow; once deques are non-empty, cores execute iterations locally in chunks of ppt = 8 instead of splitting further.)
Why LBS Is Better
• Automatic
• Performance portable
 • Across datasets
 • Across platforms
• Adaptive to context
• Adapts the granularity of parallelism based on load during execution
Outline Introduction/Motivation Work-Stealing Background Lazy Binary Splitting Experimental Evaluation Conclusion
Evaluation Platform: XMT
XMT's goals:
• Ease of programming
 • Easy workflow from PRAM algorithm to XMT program (taught to undergrads and high-school students)
 • Backwards compatibility with serial code
• 1 heavy core for serial code (the Master TCU)
• A plurality of simple parallel cores (TCUs)
• Good performance
 • High-bandwidth, low-latency interconnect
 • Efficient HW scheduling of outer parallelism
 • Fast & scalable global synchronization
Evaluation Platform
We chose XMT because it:
• Is easy to program: productivity is important for general-purpose parallelism
• Does not compromise on performance
• Has more than a few (4 or 8) cores, which lets us demonstrate the scalability of LBS
We ran our benchmarks on a 75MHz XMT FPGA prototype: 64 TCUs in 8 clusters, 8 shared 32K L1 cache modules, one mult/div unit per cluster, 4 prefetch buffers per TCU, and 32 integer registers. No floating-point support.
Comparing LBS to AP
(Figure: per-benchmark performance of the three configurations — APdefault, APxmt, and LBS.)
Comparing LBS to AP: Results
Overall:
• LBS is 16.2% faster than APxmt
• LBS is 38.9% faster than APdefault
• LBS is faster on fine-grain iterations (FW, bfs, SpMV)
• LBS & AP are comparable on the rest
Comparing LBS to SP
• SP needs manual tuning (sst); LBS doesn't
• Two training scenarios:
 • Common: SPtr/ex — train on one dataset, execute on a different one
 • Uncommon: SPex/ex — train and execute on the same dataset
• The gap between SPtr/ex and SPex/ex shows SP's lack of performance portability across datasets.
LBS vs. SP: Results
Overall (vs. SPtr/ex):
• LBS is 19.5% faster on average, and up to 65.7% faster (SpMV)
• LBS falls behind only on tsp, and only by 2.2%
Overall (vs. SPex/ex):
• LBS is 3.8% faster on average
• LBS only falls behind on tsp (by 2.2%)
So even in the unrealistic case LBS is preferable.
Additional Comparisons
• Experimental comparisons vs. other work-stealing algorithms:
 • SWS, EBS1, LBS1
 • Serializing inner parallelism
• Quantitative comparison of schedulers in terms of:
 • # deque transactions
 • # synchronization points needed
…read the paper!
LBS's Scalability: Speedups vs. 1 TCU
• Super-linear speedups are explained by complex cache behavior
• The average speedup is linear: LBS is scalable
Performance Benefits: Speedups vs. Serial on the MTCU
• Good speedups even for irregular benchmarks (qs, tsp, queens, bfs)
• XMT & LBS is a promising combination
Conclusions
• LBS significantly reduces splitting overheads and delivers superior performance
• The combination of XMT & LBS seems promising for general-purpose parallel computing
• How will LBS perform on traditional multi-cores?