Presentation Transcript


  1. An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, Akinori Yonezawa Department of Information Science, Faculty of Science, University of Tokyo

  2. Background
  • "Irregular" parallel applications
    • Tasks are not identified until runtime
    • Synchronization structure is complicated
  • Languages with fine-grain threads
    • Promising approach to handle the complexity

  3. Motivation
  Q: Are fine-grain threads really effective?
  • Easy to describe irregular parallelism?
  • Scalable?
  • Fast?
  Many sophisticated designs and implementation techniques have been proposed so far, but case studies that answer this question are few.

  4. Goal
  • Case study to better understand the effectiveness of fine-grain threads
  • Approach without fine-grain threads (C + Solaris threads) vs. approach with fine-grain threads (our language Schematic)
  • Compared in terms of:
    • program description cost
    • speed on 1 PE
    • scalability on a 64-PE SMP

  5. Overview
  • Applications (RNA & CKY)
  • Solutions without fine-grain threads
  • Solutions with fine-grain threads
  • Performance evaluation

  6. Case Study 1: RNA (protein secondary structure prediction)
  • Find the path that satisfies a certain condition and has the largest weight
  • Algorithm: simple node traversal + pruning
  • The search tree is unbalanced
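
  A minimal sequential sketch of this search pattern in C is shown below. The node layout, the weights, and the bound_below() estimate are illustrative assumptions, not the actual RNA code.

    /* Sketch of the RNA search pattern: depth-first traversal of an
       unbalanced tree, remembering the heaviest path and pruning subtrees
       that cannot beat it.  Node layout, weights, and the bound estimate
       are assumptions. */
    #include <stdio.h>

    typedef struct node {
        int weight;               /* weight contributed by this node */
        int nchildren;
        struct node **children;   /* unbalanced: varies per node     */
    } node_t;

    static int best = 0;          /* largest path weight found so far */

    /* assumed optimistic estimate of the weight still obtainable below n */
    static int bound_below(const node_t *n) { (void)n; return 10; }

    static void search_node(const node_t *n, int acc) {
        acc += n->weight;
        if (acc > best) best = acc;                /* better path found  */
        if (acc + bound_below(n) <= best) return;  /* prune this subtree */
        for (int i = 0; i < n->nchildren; i++)
            search_node(n->children[i], acc);
    }

    int main(void) {
        node_t leaf1 = { 5, 0, NULL }, leaf2 = { 2, 0, NULL };
        node_t *kids[] = { &leaf1, &leaf2 };
        node_t root = { 1, 2, kids };
        search_node(&root, 0);
        printf("best = %d\n", best);   /* prints 6 for this tiny tree */
        return 0;
    }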

  7. Case Study 2: CKY (context-free grammar parser)
  • Example input: "She is a girl whose mother is a teacher."
  • Parsing fills a matrix of partial results; actual size ≒ 100
  • Calculation of each matrix element depends on all the earlier elements in its row and column
  • Calculation time varies significantly from element to element
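
  For reference, the dependency structure behind these bullets is the standard CKY recurrence: the cell for a span combines every way of splitting that span, so it depends on many earlier cells, and the amount of work differs from cell to cell. A minimal sketch of that fill loop follows; the cell type and combine() are placeholders, not the grammar actually used in the slides.

    /* Minimal sketch of a CKY-style chart fill.  Only the loop structure
       and its dependencies follow the standard algorithm; the cell
       representation and combine() are assumptions. */
    #include <stdio.h>

    #define N 10                       /* number of input words (slide: ~100) */

    typedef struct { long nsyms; } cell_t;   /* assumed cell representation */
    static cell_t chart[N][N];

    /* merge derivations of spans (i,k) and (k+1,j) into (i,j); the amount
       of work here is what varies from element to element */
    static void combine(cell_t *dst, const cell_t *l, const cell_t *r) {
        dst->nsyms += l->nsyms * r->nsyms;
    }

    static void cky_fill(void) {
        for (int len = 2; len <= N; len++)            /* span length */
            for (int i = 0; i + len - 1 < N; i++) {   /* span start  */
                int j = i + len - 1;
                /* chart[i][j] depends on all shorter spans inside [i, j] */
                for (int k = i; k < j; k++)
                    combine(&chart[i][j], &chart[i][k], &chart[k + 1][j]);
            }
    }

    int main(void) {
        for (int i = 0; i < N; i++) chart[i][i].nsyms = 1;  /* one word each */
        cky_fill();
        printf("%ld\n", chart[0][N - 1].nsyms);
        return 0;
    }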

  8. Solution without Fine-grain Threads (RNA)
  • To create a "thread" for each node, tasks are put into a shared Task Pool and each PE (P) fetches work from it
  • Large overhead: communication with memory
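
  As a rough illustration of this style (not the authors' actual C code), a lock-protected shared task pool served by a fixed set of worker threads might look like the sketch below. POSIX threads stand in for the Solaris threads used in the slides; the task payload and pool size are assumptions.

    /* Sketch of the task-pool solution: each node becomes a task in a
       shared pool, and worker threads pop from it. */
    #include <pthread.h>

    #define POOL_MAX 1024
    #define NWORKERS 4

    typedef struct { int node_id; } task_t;      /* assumed task payload */

    static task_t pool[POOL_MAX];
    static int ntasks = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void push_task(task_t t) {
        /* every push/pop goes through shared memory and a lock:
           the slide's "large overhead / communication with memory" */
        pthread_mutex_lock(&lock);
        pool[ntasks++] = t;
        pthread_mutex_unlock(&lock);
    }

    static int pop_task(task_t *t) {
        int ok = 0;
        pthread_mutex_lock(&lock);
        if (ntasks > 0) { *t = pool[--ntasks]; ok = 1; }
        pthread_mutex_unlock(&lock);
        return ok;
    }

    static void *worker(void *arg) {
        task_t t;
        while (pop_task(&t)) {
            /* process node t.node_id; a real search would push children */
        }
        return arg;
    }

    int main(void) {
        pthread_t th[NWORKERS];
        for (int i = 0; i < 100; i++) push_task((task_t){ i });
        for (int i = 0; i < NWORKERS; i++) pthread_create(&th[i], NULL, worker, NULL);
        for (int i = 0; i < NWORKERS; i++) pthread_join(th[i], NULL);
        return 0;
    }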

  9. Solution without Fine-grain Threads (CKY)
  • Calculating 1 element → 0 to 200 synchronizations
  • How to implement the wait?
    • small delay → simple spin
    • large delay → block wait
  • How to decide the strategy?
    • trial & error
    • prediction
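
  A sketch of the spin-vs-block choice described above: spin briefly in case the producer is almost done, then fall back to a blocking wait. SPIN_LIMIT is exactly the knob that needs "trial & error" or prediction; the slot layout is an assumption, not the authors' CKY code.

    #include <pthread.h>
    #include <stdatomic.h>

    typedef struct {
        atomic_int      ready;   /* set once the element is computed */
        int             value;
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
    } slot_t;

    #define SPIN_LIMIT 1000      /* tuned by trial & error / prediction */

    int slot_get(slot_t *s) {
        /* small delay -> simple spin */
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (atomic_load_explicit(&s->ready, memory_order_acquire))
                return s->value;
        /* large delay -> block wait */
        pthread_mutex_lock(&s->mtx);
        while (!atomic_load_explicit(&s->ready, memory_order_acquire))
            pthread_cond_wait(&s->cv, &s->mtx);
        pthread_mutex_unlock(&s->mtx);
        return s->value;
    }

    void slot_put(slot_t *s, int v) {
        pthread_mutex_lock(&s->mtx);
        s->value = v;
        atomic_store_explicit(&s->ready, 1, memory_order_release);
        pthread_cond_broadcast(&s->cv);
        pthread_mutex_unlock(&s->mtx);
    }

  A slot would be initialized with ready = 0, PTHREAD_MUTEX_INITIALIZER, and PTHREAD_COND_INITIALIZER.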

  10. Language with Fine-grain Threads
  • Schematic [Taura et al. 96] = Scheme + future + touch [Halstead 85]

    (define (fib x)
      (if (< x 2)
          1
          (let ((r1 (future (fib (- x 1))))   ; future: thread creation
                (r2 (future (fib (- x 2)))))  ; r1, r2: channels
            (+ (touch r1) (touch r2)))))      ; touch: synchronization
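
  For contrast with the two-line future/touch version above, roughly the same computation written with explicit threads looks like the sketch below. POSIX threads are used here as a stand-in for the Solaris threads of the C version; creating one thread per recursive call is exactly the cost that fine-grain thread languages avoid.

    #include <pthread.h>
    #include <stdio.h>

    typedef struct { long x, result; } fib_arg_t;

    static void *fib_thread(void *p) {
        fib_arg_t *a = p;
        if (a->x < 2) { a->result = 1; return NULL; }
        fib_arg_t a1 = { a->x - 1, 0 }, a2 = { a->x - 2, 0 };
        pthread_t t1, t2;
        pthread_create(&t1, NULL, fib_thread, &a1);   /* "future" */
        pthread_create(&t2, NULL, fib_thread, &a2);   /* "future" */
        pthread_join(t1, NULL);                       /* "touch"  */
        pthread_join(t2, NULL);                       /* "touch"  */
        a->result = a1.result + a2.result;
        return NULL;
    }

    int main(void) {
        fib_arg_t a = { 10, 0 };
        fib_thread(&a);
        printf("%ld\n", a.result);   /* 89 with the slide's fib(0)=fib(1)=1 */
        return 0;
    }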

  11. Thread Management in Schematic
  • Lazy Task Creation [Mohr et al. 91]
  [Figure: future calls kept on the creating PE's stack (PE A); an idle PE (PE B) takes work from it]
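
  The deque discipline behind Lazy Task Creation can be sketched as follows. This is a conceptual illustration only (a lock-protected deque with made-up type names), not Schematic's actual implementation.

    #include <pthread.h>

    typedef struct { void (*run)(void *); void *env; } cont_t;

    #define DEQ_MAX 4096
    typedef struct {
        cont_t buf[DEQ_MAX];
        int top, bottom;          /* [top, bottom) holds pending work      */
        pthread_mutex_t lock;     /* a real LTC deque avoids this lock on  */
    } deque_t;                    /* the owner's fast path                 */

    /* owner side: called where the program says "future" */
    void push_future(deque_t *d, cont_t c) {
        pthread_mutex_lock(&d->lock);
        d->buf[d->bottom++] = c;
        pthread_mutex_unlock(&d->lock);
    }

    /* owner side: resume the most recent continuation, stack-like */
    int pop_own(deque_t *d, cont_t *c) {
        int ok = 0;
        pthread_mutex_lock(&d->lock);
        if (d->bottom > d->top) { *c = d->buf[--d->bottom]; ok = 1; }
        pthread_mutex_unlock(&d->lock);
        return ok;
    }

    /* thief side: an idle PE steals the oldest continuation, and only
       then does a "future" pay the cost of becoming a real task */
    int steal(deque_t *d, cont_t *c) {
        int ok = 0;
        pthread_mutex_lock(&d->lock);
        if (d->bottom > d->top) { *c = d->buf[d->top++]; ok = 1; }
        pthread_mutex_unlock(&d->lock);
        return ok;
    }

  A deque would be zero-initialized with its lock set to PTHREAD_MUTEX_INITIALIZER.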

  12. Synchronization on Registers
  • StackThreads [Taura 97]
  [Figure: synchronization values passed through registers and memory on PE A and PE B]

  13. Synchronization by Code Duplication
  Source program: work A; (touch r); work B
  The compiler duplicates work B into two versions:

      work A
      if (r has value) {
        work B                        /* ver. 1 (simple spin) */
      } else {
        c = closure(cont, fv1, ...);
        put_closure(r, c);
        /* switch to another work */
        ...
      }

      cont(c, v) {
        work B                        /* ver. 2 (block wait) */
      }

  + heuristics to decide which version to duplicate

  14. What Descriptions Can Be Omitted in Schematic? (Schematic vs. C + threads)
  • Management of fine-grain tasks: future ⇔ manipulation of the task pool + load balancing
  • Synchronization details: touch ⇔ manipulation of the communication medium + aggressive optimizations

  15. Code for Parallel Execution (RNA)
  C:
      int search_node(...) {
        if (condition) {
        } else {
          child = ...;
          ...
          search_node(...);
          ... ... ...
        }
      }
  Schematic:
      (define (search_node)
        (if condition
            'done
            (let ((child ..))
              ... ...
              (search_node)
              ... ... ...)))
  Lines of code:
  • C: whole 1566 lines, for parallel execution 537 lines (34 %)
  • Schematic: whole 453 lines, for parallel execution 29 lines (6.4 %)

  16. Performance Evaluation (Conditions)
  • Sun Ultra Enterprise 10000 (UltraSPARC 250 MHz × 64)
  • Solaris 2.5.1
  • Solaris threads (user-level threads)
  • GC time not included
  • Runtime type checks omitted

  17. Performance Evaluation (Sequential)

  18. Performance Evaluation (Parallel)

  19. Related Work
  • ICC++ [Chien et al. 97]
    • Similar study using 7 apps
    • Experiments on distributed-memory machines
    • Focus on:
      • namespace management
      • data locality
      • object-consistency model

  20. Conclusion
  • We demonstrated the usefulness of fine-grain multithreaded languages
    • Task-pool-like execution with a simple description
    • Aggressive optimizations for synchronization
  • We showed experimental results
    • A factor of 2.8 slower than C
    • Scalability comparable to C

  21. Performance Evaluation (Other Applications 1/2)

  22. Performance Evaluation (Other Applications 2/2)

  23. Identifying Overheads
