
Department of Computer Science and Information Engineering, Chung Hua University (csie.chu.tw)

Localization and Scheduling Techniques for Optimizing Communications on Heterogeneous Cluster Grid. Ching-Hsien Hsu (許慶賢). Department of Computer Science and Information Engineering, Chung Hua University, http://www.csie.chu.edu.tw. Outline: Introduction; Regular / Irregular Data Distribution, Redistribution; Category of Runtime Redistribution Problems


Presentation Transcript


  1. Localization and Scheduling Techniques for Optimizing Communications on Heterogeneous Cluster Grid. Ching-Hsien Hsu (許慶賢), Department of Computer Science and Information Engineering, Chung Hua University, http://www.csie.chu.edu.tw

  2. Outline • Introduction • Regular / Irregular Data Distribution, Redistribution • Category of Runtime Redistribution Problems • Processor Mapping Technique for Communication Localization • The Processor Mapping Technique • Localization on Multi-Cluster Grid System • Scheduling Contention Free Communications for Irregular Problems • The Two-Phase Degree Reduction Method (TPDR) • Extended TPDR (E-TPDR) • Conclusions

  3. Introduction Regular Parallel Data Distribution • Data Parallel Programming Language, e.g. HPF (High Performance Fortran), Fortran D, • BLOCK, CYCLIC, BLOCK-CYCLIC(c) • Ex. 18 Elements Array, 3 Logical Processors
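The three regular distributions named on this slide can be illustrated with a small ownership calculation. This is a minimal sketch for the slide's 18-element array on 3 logical processors; the function names are hypothetical, processor ids are 0-based, and the block size c = 2 for BLOCK-CYCLIC(c) is an assumption, since the slide does not fix c.

```python
def block_owner(i, n, p):
    """Owner of element i under BLOCK: contiguous chunks of ceil(n / p) elements."""
    chunk = -(-n // p)  # ceiling division
    return i // chunk


def cyclic_owner(i, p):
    """Owner of element i under CYCLIC: elements are dealt out round-robin."""
    return i % p


def block_cyclic_owner(i, p, c):
    """Owner of element i under BLOCK-CYCLIC(c): blocks of c elements, round-robin."""
    return (i // c) % p


n, p = 18, 3  # array size and processor count from the slide
block = [block_owner(i, n, p) for i in range(n)]
cyclic = [cyclic_owner(i, p) for i in range(n)]
bc2 = [block_cyclic_owner(i, p, 2) for i in range(n)]  # c = 2 is an assumed value

print("BLOCK          :", block)
print("CYCLIC         :", cyclic)
print("BLOCK-CYCLIC(2):", bc2)
```

BLOCK gives each processor one contiguous run of 6 elements, CYCLIC alternates owners element by element, and BLOCK-CYCLIC(2) alternates owners in pairs.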

  4. Introduction (cont.) Data Distribution • Two Dimension Matrices

  5. Introduction (cont.) Data Redistribution

  6. Introduction (cont.) Data Redistribution
       REAL, DIMENSION(18, 24) :: A
       !HPF$ PROCESSORS P(2, 3)
       !HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO P
       ... (computation)
       !HPF$ REDISTRIBUTE A(CYCLIC, CYCLIC(2)) ONTO P
       ... (computation)

  7. Introduction (cont.) Irregular Redistribution
       PARAMETER (S = /7, 16, 11, 10, 7, 49/)
       !HPF$ PROCESSORS P(6)
       REAL A(100), new (6)
       !HPF$ DISTRIBUTE A (GEN_BLOCK(S)) onto P
       !HPF$ DYNAMIC
       new = /15, 16, 10, 16, 15, 28/
       !HPF$ REDISTRIBUTE A (GEN_BLOCK(new))
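To make the GEN_BLOCK redistribution concrete, the messages it requires can be enumerated by overlapping the source and target block boundaries. The sketch below is illustrative only: `gen_block_messages` is a hypothetical helper, not part of HPF or of the author's implementation, and processor ids are 0-based here although the slide declares P(6).

```python
def gen_block_messages(src_sizes, dst_sizes):
    """List (sender, receiver, size) for a GEN_BLOCK -> GEN_BLOCK redistribution.

    Both size lists partition the same 1-D array into contiguous blocks, so
    each message is the overlap of one source block with one target block.
    """
    assert sum(src_sizes) == sum(dst_sizes), "both distributions must cover the array"
    messages = []
    s = t = 0
    s_left, t_left = src_sizes[0], dst_sizes[0]
    while s < len(src_sizes) and t < len(dst_sizes):
        size = min(s_left, t_left)  # overlap of the current source and target blocks
        if size > 0:
            messages.append((s, t, size))
        s_left -= size
        t_left -= size
        if s_left == 0:  # source block exhausted, move to the next one
            s += 1
            if s < len(src_sizes):
                s_left = src_sizes[s]
        if t_left == 0:  # target block exhausted, move to the next one
            t += 1
            if t < len(dst_sizes):
                t_left = dst_sizes[t]
    return messages


S = [7, 16, 11, 10, 7, 49]      # source distribution from the slide
new = [15, 16, 10, 16, 15, 28]  # target distribution from the slide
msgs = gen_block_messages(S, new)
for sender, receiver, size in msgs:
    print(f"P{sender} -> P{receiver}: {size} elements")
```

Because both distributions are contiguous partitions of the same array, the sweep visits each overlap once and the message sizes add up to the array length of 100.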

  8. Introduction (cont.) • Irregular Data Distribution (GEN_BLOCK) • Heterogeneous Processors • Application: … Algorithm P, Algorithm Q … • Data distribution for algorithm P: 7 16 11 10 7 49 • Data distribution for algorithm Q: 15 16 10 16 15 28

  9. Introduction (cont.) Problem Category • Benefits of runtime redistribution • Achieve Data Locality • Reduce Communication cost at runtime • Objectives • Indexing sets generation • Data Packing & Unpacking Techniques • Communication Optimizations • Multi-Stage Redistribution Method • Processor Mapping Technique • Communication Scheduling

  10. Outline • Introduction • Regular / Irregular Data Distribution, Redistribution • Category of Runtime Redistribution Problems • Processor Mapping Technique for Communication Localization • The Processor Mapping Technique • Multi-Cluster Grid System • Contention Free Communication Scheduling for Irregular Problems • The Two-Phase Degree Reduction Method (TPDR) • Extended TPDR (E-TPDR) • Conclusions

  11. Processor Mapping Technique The Original Processor Mapping Technique (Prof. Lionel M. Ni) • A mapping function is provided to generate a new sequence of logical processor ids • Increase data hits • Minimize the amount of data exchange

  12. Processor Mapping Technique (cont.) An Optimal Processor Mapping Technique (Hsu '05) • Example: BC86 over 11 • Traditional Method • Size Oriented Greedy Matching • Maximum Matching (Optimal)

  13. Processor Mapping Technique (cont.) Localize communications • Cluster Grid • Interior Communication • External Communication

  14. Processor Mapping Technique (cont.) Motivating Example

  15. Processor Mapping Technique (cont.) Communication Table Before Processor Mapping |I|=9 |E|=18

  16. Processor Mapping Technique (cont.) Communication links Before Processor Mapping

  17. Processor Mapping Technique (cont.) Communication table after Processor Mapping |I|=27 |E|=0

  18. Processor Mapping Technique (cont.) Communication links after Processor Mapping

  19. Processor Mapping Technique (cont.) Processor Reordering Flow Diagram • Mapping Function: F(X) = X' = +(X mod C) * K • Diagram labels: Partitioning Data, Source Data, Alignment/Dispatch, Master Node, SCA(x), Determine Target Cluster, Generate new Pid, SD(Px), Designate Target Node, Reordering, SD(Px'), DCA(x), DD(Py), Reordering Agent
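The mapping function on this slide appears to have lost its leading term in extraction. The sketch below assumes, and this is an assumption rather than something the transcript states, that the missing term is floor(X / C), that C is the number of clusters, and that K is the number of nodes per cluster. The value C = 3 matches the analysis slide; K = 3 and the function names are chosen only for illustration.

```python
def reorder(x, c, k):
    """Assumed mapping F(X) = X' = floor(X / C) + (X mod C) * K.

    The floor(X / C) term is an assumption; only '+(X mod C) * K' survives
    in the transcript.
    """
    return x // c + (x % c) * k


def cluster_of(pid, k):
    """Cluster index when each cluster holds k consecutive logical ids."""
    return pid // k


C, K = 3, 3  # C = 3 as in the analysis slide; K = 3 is an assumed cluster size
new_ids = [reorder(x, C, K) for x in range(C * K)]
for x, x_new in enumerate(new_ids):
    print(f"P{x} -> P{x_new} (cluster {cluster_of(x_new, K)})")
```

Under these assumptions the mapping is a permutation of the C * K ids, and every processor whose original id has the same remainder modulo C lands in the same cluster, which is the kind of grouping that can turn external communications into interior ones.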

  20. Processor Mapping Technique (cont.) Identical Cluster Grid vs. Non-identical Cluster Grid

  21. Processor Mapping Technique (cont.) Processor Replacement Algorithm for Non-identical Cluster Grid

  22. Processor Mapping Technique (cont.) Theoretical Analysis The number of interior communications when C=3.

  23. Processor Mapping Technique (cont.) Theoretical Analysis

  24. Processor Mapping Technique (cont.) Theoretical Analysis

  25. Processor Mapping Technique (cont.) Simulation Setting • Taiwan UniGrid • 8 campus clusters • SPMD Programs • C+MPI codes.

  26. Processor Mapping Technique (cont.) Topology • Sites: National Tsing Hua University 1, National Tsing Hua University 2, Academia Sinica, National Center for High-performance Computing, Chung Hua University, Providence University, Tunghai University, Hsing Kuo University, National Dong Hwa University • Map labels: Taipei, Hsinchu, Taichung, Tainan, Hualien

  27. Processor Mapping Technique (cont.) Hardware Infrastructure (clusters connected over the Internet)
       • NCHC: Dual AMD 2000+, 512M
       • SINICA: Dual Intel P3 1.0, 1G
       • CHU: Intel P4 2.8, 256M
       • NTHU: Dual Xeon 2.8, 1G
       • THU: Dual AMD 1.6, 1G
       • NDHU: AMD Athlon, 256M
       • PU: AMD 2400+, 1G
       • HKU: Intel P3 1.0, 256M

  28. Processor Mapping Technique (cont.) System Monitoring Webpage

  29. Processor Mapping Technique (cont.) Experimental Results

  30. Processor Mapping Technique (cont.) Experimental Results

  31. Processor Mapping Technique (cont.) Experimental Results

  32. Outline • Introduction • Regular / Irregular Data Distribution, Redistribution • Category of Runtime Redistribution Problems • Processor Mapping Technique for Communication Localization • The Processor Mapping Technique • Multi-Cluster Grid System • Scheduling Contention Free Communications for Irregular Problems • The Two-Phase Degree Reduction Method (TPDR) • Extended TPDR (E-TPDR) • Conclusions

  33. Scheduling Irregular Redistributions Example of GEN_BLOCK distributions • Enhance load balancing on heterogeneous environment • Application: … Algorithm P, Algorithm Q … • Data distribution for algorithm P: 7 16 11 10 7 49 • Data distribution for algorithm Q: 15 16 10 16 15 28

  34. Scheduling Irregular Redistributions (cont.) Example of GEN_BLOCK redistribution • Observation • Without cross communications

  35. Scheduling Irregular Redistributions (cont.) Convex Bipartite Graph • Legend: node, data communication • Nodes: SP1, SP2, SP3, TP1, TP2, TP3

  36. Scheduling Irregular Redistributions (cont.) Example of GEN_BLOCK redistribution • A simple result • Minimize the number of communication steps • Minimize the total message size summed over all steps
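The first objective has a simple lower bound: a processor that sends or receives d messages needs at least d steps, so no contention-free schedule can use fewer steps than the maximum node degree of the bipartite graph. The sketch below is a plain first-fit greedy schedule, not TPDR, E-TPDR, or any of the related methods named in this talk. The message list is worked out by hand from the GEN_BLOCK example (7, 16, 11, 10, 7, 49 redistributed to 15, 16, 10, 16, 15, 28) with 0-based processor ids, and all names are hypothetical.

```python
from collections import Counter

# (sender, receiver, size), derived by overlapping the two GEN_BLOCK layouts
messages = [
    (0, 0, 7), (1, 0, 8), (1, 1, 8), (2, 1, 8), (2, 2, 3), (3, 2, 7),
    (3, 3, 3), (4, 3, 7), (5, 3, 6), (5, 4, 15), (5, 5, 28),
]


def max_degree(msgs):
    """Largest number of messages any single processor sends or receives."""
    sends = Counter(s for s, _, _ in msgs)
    recvs = Counter(t for _, t, _ in msgs)
    return max(max(sends.values()), max(recvs.values()))


def greedy_schedule(msgs):
    """First-fit: put each message in the earliest step with no shared endpoint."""
    steps = []
    for msg in msgs:
        s, t, _ = msg
        for step in steps:
            if all(s != s2 and t != t2 for s2, t2, _ in step):
                step.append(msg)
                break
        else:
            steps.append([msg])  # no existing step is free of contention
    return steps


steps = greedy_schedule(messages)
print("lower bound on steps:", max_degree(messages))
for i, step in enumerate(steps, 1):
    print(f"S{i}: {step}  (largest message: {max(size for _, _, size in step)})")
```

On this example the greedy schedule reaches the lower bound on the number of steps, but it makes no attempt to balance message sizes within a step, which is the second objective on the slide.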

  37. Related Implementations • Coloring

  38. Related Implementations • LIST

  39. Related Implementations • DC1 & DC2 (a)DC1 (b)DC2

  40. Scheduling Irregular Redistributions (cont.) The Two-Phase Degree Reduction Method • The First Phase (for nodes with degree > 2) • Reduces the degree of the maximum-degree nodes by one in each reduction iteration. • The Second Phase (for nodes with degree 1 and 2) • Schedules messages between nodes with degree 1 and 2 using an adjustable coloring mechanism.

  41. Scheduling Irregular Redistributions (cont.) The Two-Phase Degree Reduction Method • The first phase
       S3: m11(6), m5(3) --- 6

  42. Scheduling Irregular Redistributions (cont.) The Two-Phase Degree Reduction Method • The second phase
       S1: m1(7), m3(7), m6(15), m8(4), m10(8), m13(18) --- 18
       S2: m2(3), m4(4), m7(3), m9(10), m12(12) --- 12
       S3: m11(6), m5(3) --- 6

  43. Scheduling Irregular Redistributions (cont.) Extended TPDR
       TPDR
       S1: m1(7), m3(7), m6(15), m8(4), m10(8), m13(18) --- 18
       S2: m2(3), m4(4), m7(3), m9(10), m12(12) --- 12
       S3: m11(6), m5(3) --- 6
       E-TPDR
       S1: m1(7), m3(7), m6(15), m9(10), m13(18) --- 18
       S2: m4(4), m7(3), m10(8), m12(12) --- 12
       S3: m11(6), m5(3), m2(3), m8(4) --- 6
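The number at the end of each schedule line matches the largest message in that step, which suggests a cost model where a step lasts as long as its longest message. The sketch below recomputes those values for the two schedules shown on this slide. The cost model is an inference from the slide's numbers rather than a stated definition, and `step_costs` is a hypothetical helper.

```python
# Schedules transcribed from the slide: message name -> message size per step
tpdr = [
    {"m1": 7, "m3": 7, "m6": 15, "m8": 4, "m10": 8, "m13": 18},
    {"m2": 3, "m4": 4, "m7": 3, "m9": 10, "m12": 12},
    {"m11": 6, "m5": 3},
]
e_tpdr = [
    {"m1": 7, "m3": 7, "m6": 15, "m9": 10, "m13": 18},
    {"m4": 4, "m7": 3, "m10": 8, "m12": 12},
    {"m11": 6, "m5": 3, "m2": 3, "m8": 4},
]


def step_costs(schedule):
    """Cost of each step, assumed to be the size of its largest message."""
    return [max(step.values()) for step in schedule]


for name, schedule in (("TPDR", tpdr), ("E-TPDR", e_tpdr)):
    costs = step_costs(schedule)
    print(f"{name}: step costs {costs}, total {sum(costs)}")
```

Both schedules contain the same 13 messages in three steps. In this example the per-step maxima are identical, and E-TPDR differs only in which steps carry m2, m8, m9, and m10.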

  44. Performance Evaluation Simulation of TPDR and E-TPDR algorithms on uneven cases.

  45. Performance Evaluation (cont.) • Simulation A is carried out to examine the performance of TPDR and E-TPDR algorithms on uneven cases.

  46. Performance Evaluation (cont.) • Simulation B is carried out to examine the performance of TPDR and E-TPDR algorithms on even cases.

  47. Performance Evaluation (cont.) • Simulation B is carried out to examine the performance of TPDR and E-TPDR algorithms on even cases.

  48. Summary • TPDR & E-TPDR for scheduling irregular GEN_BLOCK redistributions • Contention free • Optimal number of communication steps • Outperforms the D&C algorithm • TPDR(uneven) performs better than TPDR(even)

  49. Performance Evaluation (cont.) 1000 test cases

  50. Performance Evaluation (cont.) Average
