An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer
Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa
Department of Information Science, Faculty of Science, University of Tokyo
Background
• "Irregular" parallel applications
  • Tasks are not identified until runtime
  • Synchronization structure is complicated
• Languages with fine-grain threads
  • A promising approach to handling this complexity
Motivation
Q: Are fine-grain threads really effective?
• Easy to describe irregular parallelism?
• Scalable?
• Fast?
Many sophisticated designs and implementation techniques have been proposed so far, but case studies that answer this question are few.
Goal
• Case study to better understand the effectiveness of fine-grain threads, in terms of
  • program description cost
  • speed on 1 PE
  • scalability on a 64-PE SMP
• Approach without fine-grain threads (C + Solaris threads) vs. approach with fine-grain threads (our language, Schematic)
Overview
• Applications (RNA & CKY)
• Solutions without fine-grain threads
• Solutions with fine-grain threads
• Performance evaluation
Case Study 1: RNA (protein secondary structure prediction)
• Finding the path in an unbalanced tree that satisfies a certain condition and has the largest weight
• Algorithm: simple node traversal + pruning (a minimal sketch follows)
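A minimal sketch of such a search, in C. The Node layout and the can_prune / satisfies_condition helpers are hypothetical (the slides give no code); the point is the shape of the computation: one recursive visit per node, pruned subtrees cut off early, and a fan-out that varies wildly because the tree is unbalanced.

    typedef struct Node {
        int weight;                  /* weight contributed by this node */
        int nchildren;
        struct Node **children;
    } Node;

    /* hypothetical, application-specific predicates (definitions omitted) */
    int can_prune(const Node *n, int w, int best);
    int satisfies_condition(const Node *n);

    static int best_weight;          /* largest weight of a valid path so far */

    /* Sequential skeleton: depth-first traversal with pruning.
       The parallel versions create one task/thread per child instead. */
    void search_node(Node *n, int path_weight) {
        int w = path_weight + n->weight;
        if (can_prune(n, w, best_weight))
            return;                          /* whole subtree is hopeless */
        if (n->nchildren == 0) {             /* leaf: a complete path */
            if (satisfies_condition(n) && w > best_weight)
                best_weight = w;
            return;
        }
        for (int i = 0; i < n->nchildren; i++)   /* unbalanced fan-out */
            search_node(n->children[i], w);
    }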
Case Study 2: CKY (context-free grammar parser)
Example input: "She is a girl whose mother is a teacher."
• Parsing fills a triangular matrix; actual size ≒ 100
• Calculation of a matrix element depends on all the element pairs that split its span (sketch below)
• Calculation time significantly varies from element to element
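A minimal sketch of the standard CKY recurrence in C (the textbook algorithm, not the authors' code; CellSet and combine are hypothetical). It shows why each element depends on many earlier ones, and why per-element work varies: the number of split points grows with span length, and how much combine does depends on which grammar rules match.

    typedef struct { unsigned long nts; } CellSet;  /* hypothetical bitset of nonterminals */
    void combine(CellSet *dst, const CellSet *left, const CellSet *right);

    #define N 100   /* matrix size; "actual size ≒ 100" on the slide */

    /* cell[i][j] holds the nonterminals deriving words i .. j-1 */
    void cky_fill(CellSet cell[N + 1][N + 1]) {
        for (int len = 2; len <= N; len++)          /* span length */
            for (int i = 0; i + len <= N; i++) {    /* span start */
                int j = i + len;
                for (int k = i + 1; k < j; k++)     /* every split point */
                    combine(&cell[i][j], &cell[i][k], &cell[k][j]);
            }
    }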
Solution without Fine-grain Threads (RNA)
• Create a thread for each node
• (figure: PEs sharing one central task pool)
• Large overhead: communication with memory (see the sketch below)
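A minimal sketch of such a centralized task pool in C, assuming a hypothetical Task struct and pool bound (none of this is from the slides). Every push and pop goes through shared memory and contends on a single lock, which is exactly the overhead the slide points at.

    #include <pthread.h>

    typedef struct { void *node; int path_weight; } Task;  /* hypothetical */

    #define POOL_MAX 65536
    static Task pool[POOL_MAX];
    static int ntasks;
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

    void push_task(Task t) {
        pthread_mutex_lock(&pool_lock);      /* every PE contends here */
        pool[ntasks++] = t;
        pthread_mutex_unlock(&pool_lock);
    }

    int pop_task(Task *out) {
        pthread_mutex_lock(&pool_lock);
        int ok = (ntasks > 0);
        if (ok) *out = pool[--ntasks];
        pthread_mutex_unlock(&pool_lock);
        return ok;                           /* 0 when the pool is empty */
    }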
Solution without Fine-grain Threads (CKY)
• Calculating 1 element → 0–200 synchronizations
• How to implement each synchronization?
  • small delay → simple spin
  • large delay → blocking wait
• Decision strategy? (sketched below)
  • trial & error
  • prediction
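A minimal sketch of that spin-or-block decision, written with pthreads rather than the Solaris thread API of the slides; the Elem struct and SPIN_LIMIT are hypothetical. The waiter spins while the expected delay is small and falls back to a blocking wait otherwise, and SPIN_LIMIT is precisely the knob that needs trial & error or prediction to tune.

    #include <pthread.h>

    typedef struct {
        volatile int ready;          /* set once the element's value arrives */
        pthread_mutex_t m;
        pthread_cond_t  c;
    } Elem;

    #define SPIN_LIMIT 1000          /* the tuning knob: trial & error / prediction */

    void wait_for(Elem *e) {
        for (int i = 0; i < SPIN_LIMIT; i++)   /* small delay: simple spin */
            if (e->ready) return;
        pthread_mutex_lock(&e->m);             /* large delay: blocking wait */
        while (!e->ready)
            pthread_cond_wait(&e->c, &e->m);
        pthread_mutex_unlock(&e->m);
    }

    void mark_ready(Elem *e) {                 /* producer side */
        pthread_mutex_lock(&e->m);
        e->ready = 1;
        pthread_cond_broadcast(&e->c);
        pthread_mutex_unlock(&e->m);
    }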
Language with Fine-grain Threads
• Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]

    (define (fib x)
      (if (< x 2)
          1
          (let ((r1 (future (fib (- x 1))))   ; future: thread creation
                (r2 (future (fib (- x 2)))))  ; r1, r2: channels
            (+ (touch r1) (touch r2)))))      ; touch: synchronization
Thread Management in Schematic
• Lazy Task Creation [Mohr et al. 91]
(figure: stacks of future calls on PE A and PE B; a toy sketch follows)
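A toy sketch of the Lazy Task Creation idea in C, under heavy simplification (one global deque, continuations as explicit closures, no return values; every name here is made up, and this is not the Schematic runtime). A future runs its child inline on the creator's stack and merely publishes the continuation so an idle PE could steal it; in the common case nobody does, and the creator pops it back for roughly the cost of a function call.

    #include <pthread.h>

    typedef struct { void (*fn)(void *); void *arg; } Cont;  /* a closure */

    #define DEQ_MAX 1024
    static Cont deq[DEQ_MAX];
    static int top, bot;             /* valid entries: deq[top .. bot-1] */
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

    /* future: publish continuation k, then run the child inline */
    void future(void (*child)(void *), void *carg, Cont k) {
        pthread_mutex_lock(&lk);
        deq[bot++] = k;              /* continuation becomes stealable */
        pthread_mutex_unlock(&lk);
        child(carg);                 /* child runs on the creator's own stack */
        pthread_mutex_lock(&lk);
        int mine = (bot > top);      /* still there? then nobody stole it */
        if (mine) bot--;
        pthread_mutex_unlock(&lk);
        if (mine) k.fn(k.arg);       /* common case: continue sequentially */
    }

    /* an idle PE steals the oldest continuation from the other end */
    int steal(Cont *out) {
        pthread_mutex_lock(&lk);
        int ok = (bot > top);
        if (ok) *out = deq[top++];
        pthread_mutex_unlock(&lk);
        return ok;
    }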
Synchronization on Registers
• StackThreads [Taura 97]
(figure: values passed in registers within a PE, through memory between PE A and PE B)
Synchronization by Code Duplication
Source: work A; (touch r); work B

The compiler duplicates work B into a fast path and a blocked continuation:

    work A
    if (r has value) {
        work B                         /* ver. 1: value already there
                                          (plays the role of a simple spin) */
    } else {
        c = closure(cont, fv1, ...);   /* capture the continuation */
        put_closure(r, c);
        /* switch to another work */
        ...
    }
    cont(c, v) {
        work B                         /* ver. 2: resumed when r gets its
                                          value (plays the role of block wait) */
    }

• + heuristics to decide which part to duplicate
What description can be omitted in Schematic?
• Management of fine-grain tasks
  • one future in Schematic ⇔ manipulation of the task pool + load balancing in C + thread
• Synchronization details
  • one touch in Schematic ⇔ manipulation of the communication medium in C + thread
• + aggressive optimizations
Codes for Parallel Execution (RNA)

C:

    int search_node(...) {
        if (condition) {
            ...
        } else {
            child = ...;
            ...
            search_node(...);
            ...
        }
    }

Schematic:

    (define (search_node)
      (if condition
          'done
          (let ((child ...))
            ...
            (search_node)
            ...)))

• C: whole program 1566 lines; code for parallel execution 537 lines (34 %)
• Schematic: whole program 453 lines; code for parallel execution 29 lines (6.4 %)
Performance Evaluation (Conditions)
• Sun Ultra Enterprise 10000 (UltraSPARC 250 MHz × 64)
• Solaris 2.5.1
• Solaris threads (user-level threads)
• GC time not included
• Runtime type checks omitted
Related Work
• ICC++ [Chien et al. 97]
  • Similar study using 7 applications
  • Experiments on distributed memory machines
  • Focus on namespace management, data locality, and the object-consistency model
Conclusion
• We demonstrated the usefulness of fine-grain multithreaded languages
  • Task-pool-like execution with a simple description
  • Aggressive optimizations for synchronization
• We showed experimental results
  • A factor of 2.8 slower than C
  • Scalability comparable to C