
Exploiting Vector Parallelism in Software Pipelined Loops



Presentation Transcript


  1. Exploiting Vector Parallelism in Software Pipelined Loops Sam Larsen Rodric Rabbah Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

  2. Multimedia Extensions • Short vector extensions in ILP processors • AltiVec, 3DNow!, SSE, etc. • Accelerate loops in multimedia & DSP codes • New designs have floating point support

  3. Multimedia Extensions • Vector resources do not overwhelm the scalar resources • Scalar: 2 FP ops / cycle • Vector: 4 FP ops / cycle • Full vectorization may underutilize scalar resources • ILP techniques do not target vector resources • Need both (image courtesy of International Business Machines Corporation; unauthorized use not permitted)

  4. Modulo Scheduling for (i=0; i<N; i++) { s = s + X[i] * Y[i]; } • Per-iteration operations: LOAD, LOAD, MULT, ADD • II = 2 mod sched

  5. Traditional Vectorization for (i=0; i<N; i+=2) { S[i:i+1] = X[i:i+1] * Y[i:i+1]; } for (i=0; i<N; i++) { s = s + S[i]; } • II = 2 mod sched • II = 3 traditional

  6. Vectorization without Distribution for (i=0; i<N; i+=2) { S = X[i:i+1] * Y[i:i+1]; s = s + S0; s = s + S1; } • II = 2 mod sched • II = 3 traditional • II = 1.5 no distrib

  7. Selective Vectorization for (i=0; i<N; i+=2) { S = X[i:i+1] * Y[i:i+1]; s = s + S0; s = s + S1; } • II = 2 mod sched • II = 3 traditional • II = 1.5 no distrib • II = 1 selective

  8. Complications • Complex scheduling requirements • Particularly in statically scheduled machines • Memory alignment • Example assumes no communication cost • In reality, explicit operations required • Often through memory • Reserve critical resources • Potential long latency • Performance improvement still possible

  9. Tomcatv main loop (50%)

  10. Tomcatv (SpecFP 95) 1.7x Speedup over Modulo Scheduling

  11. Tomcatv (SpecFP 95)

  12. Selective Vectorization • Balance computation among resources • Minimize II when loop is modulo scheduled • Carefully manage communication • Incorporate alignment information • Software pipelining hides latency • Adapt a 2-cluster partitioning heuristic • [Fiduccia & Mattheyses ’82] • [Kernighan & Lin ’70]

  13. Selective Vectorization (figure: the LOAD, LOAD, MULT, ADD dependence graph partitioned between scalar and vector slots, with a cost attached to each placement)

  14. Cost Function • Projected II due to resources (ResMII) • Bin-packing approach [Rau MICRO ’94] • With some modifications • Can ignore operation latency • Software pipelining hides latency • Vectorizable ops not on dependence cycles for (i=0; i<N; i++) { X[i+4] = X[i]; }

  15. (figure: compilation flow, C or Fortran → SUIF Front-end → Dependence Analysis → Dataflow Optimization → SUIF to Trimaran → Selective Vectorization → Modulo Scheduling → Simulation Binary → Evaluation) • SUIF front-end • Dependence analysis • Dataflow optimization • Trimaran back-end • Modulo scheduler • Register allocator • VLIW Simulator • Added vector ops

  16. Evaluation • Operands communicated through memory • Software responsible for realignment

  17. Evaluation • SpecFP 92, 95, 2000 • Easier to extract dependence information • Detectable data parallelism • 64-bit data means vector length of 2 • Considered amenable to vectorization & SWP • Apply selective vectorization to DO loops • No control flow, no function calls • Fully simulate with training sets

  18. Traditional Vectorization

  19. Vectorization without Distribution

  20. Vectorization + Free Communication

  21. Vectorization without Distribution

  22. Selective Vectorization

  23. Selective Vectorization (results for tomcatv, mgrid, su2cor, swim)

  24. Communication Support • Transfer through memory • Register to register copy • Uses fewer issue slots • Frees memory resources • Shared register file • Vector elements addressable in scalar ops • Requires no extra issue slots

  25. Through Memory (results for tomcatv, mgrid, su2cor, swim)

  26. Reg to Reg Transfer Support: 1.2x improvement (results for tomcatv, mgrid, su2cor, swim)

  27. Shared Register File: 1.28x improvement (results for tomcatv, mgrid, su2cor, swim)

  28. Related Work • Traditional vectorization • Allen & Kennedy, Wolfe • Software Pipelining • Rau’s iterative modulo scheduling • Clustered VLIW • [Aleta MICRO34], [Codina PACT01], [Nystrom MICRO31], [Sanchez MICRO33], [Zalamea MICRO34] • Partitioning among clusters similar • Ours is also an instruction selection problem • No dedicated communication resources

  29. Conclusion • Targeting all FUs improves performance • Selective vectorization • Vectorization better in the backend • Cost analysis more accurate • Software pipeline vectorized loops • Good idea anyway • Facilitates selective vectorization • Hides communication and alignment latency
