
Parallel Instruction Set Extension Identification




  1. Parallel Instruction Set Extension Identification
  {dshap092, mmont044, mbolic}@site.uottawa.ca
  http://carg.site.uottawa.ca/
  Daniel Shapiro, Michael Montcalm and Miodrag Bolic

  2. Overview
  • Introduction
  • Prior Art
  • Speedup Results
  • Speedup Analysis
  • Task Scheduling
  • Parallel ISE Enumeration Experiment
  • Compiler Execution Time Results
  • Performance Analysis
  • Conclusion & Future Work

  3. Introduction
  • Hardware/software partitioning
  • COINS compiler (Java)
  • Control Flow Graph (CFG)
  • Data Flow Graph (DFG) of each basic block
  • Three-address code using SSA
  • Hierarchy: module → function → basic block → statement → node
  • Now we have a directed acyclic graph for each basic block
  • Enumerate convex subgraphs of each SSA DFG, then select the "best" subset (a convexity check is sketched below)
  • Goal: perform this compilation procedure faster
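
The convexity requirement above is the core legality filter during enumeration: a candidate subgraph of the DFG can only become an ISE if no dataflow path between two of its nodes passes through a node outside the candidate. A minimal Java sketch of that check follows; the class and method names are hypothetical (this is not the COINS pass itself), and the DFG is assumed to be given as successor and predecessor adjacency lists over integer node ids.

```java
import java.util.*;

// Minimal sketch, not the authors' implementation: convexity test for a
// candidate ISE subgraph S inside a basic block's dataflow DAG.
public class ConvexityCheck {

    // Nodes reachable from any seed by following 'edges' (seeds excluded
    // unless re-reached through another node).
    static Set<Integer> reachable(Map<Integer, List<Integer>> edges, Set<Integer> seeds) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        for (int s : seeds) work.addAll(edges.getOrDefault(s, List.of()));
        while (!work.isEmpty()) {
            int n = work.pop();
            if (seen.add(n)) work.addAll(edges.getOrDefault(n, List.of()));
        }
        return seen;
    }

    // S is convex iff no node outside S is both reachable from S and able to
    // reach S, i.e. no path between two nodes of S ever leaves S.
    static boolean isConvex(Map<Integer, List<Integer>> succ,
                            Map<Integer, List<Integer>> pred,
                            Set<Integer> s) {
        Set<Integer> below = reachable(succ, s); // reachable from S
        Set<Integer> above = reachable(pred, s); // nodes that can reach S
        for (int n : below) {
            if (!s.contains(n) && above.contains(n)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Tiny DFG: 0 -> 1 -> 2 and 0 -> 2. {0, 2} is not convex (the path
        // 0 -> 1 -> 2 leaves the candidate through node 1); {0, 1, 2} is convex.
        Map<Integer, List<Integer>> succ = Map.of(0, List.of(1, 2), 1, List.of(2), 2, List.of());
        Map<Integer, List<Integer>> pred = Map.of(0, List.of(), 1, List.of(0), 2, List.of(0, 1));
        System.out.println(isConvex(succ, pred, Set.of(0, 2)));    // false
        System.out.println(isConvex(succ, pred, Set.of(0, 1, 2))); // true
    }
}
```

A candidate that fails this check can be discarded immediately, before any hardware cost estimation is attempted.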

  4. Introduction: Instruction Enumeration
  • Multicore desktops are common
  • Adapt the algorithm to run on a multicore machine
  • Thread pool of workers (see the sketch below)
  • In our case: an Intel Core i7 with 12 GB of DDR3 RAM
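
As a concrete illustration of the thread-pool idea, the sketch below submits one enumeration task per basic block to a fixed pool sized to the machine's core count, then gathers the per-block candidate lists. The class and the enumerateIses placeholder are hypothetical, standing in for the real enumeration pass.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch, not the COINS pass: one ISE-enumeration task per basic
// block, executed by a fixed pool of worker threads.
public class ParallelIseEnumeration {

    // Placeholder for the per-basic-block enumeration step.
    static List<String> enumerateIses(String basicBlock) {
        return List.of(basicBlock + ":candidate");
    }

    public static void main(String[] args) throws Exception {
        List<String> basicBlocks = List.of("bb0", "bb1", "bb2", "bb3");
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        List<Future<List<String>>> futures = new ArrayList<>();
        for (String bb : basicBlocks) {
            futures.add(pool.submit(() -> enumerateIses(bb)));
        }

        // Collect the per-block results; selecting the "best" subset of ISEs
        // would run afterwards over the combined candidate list.
        List<String> candidates = new ArrayList<>();
        for (Future<List<String>> f : futures) candidates.addAll(f.get());
        pool.shutdown();
        System.out.println(candidates);
    }
}
```

Because basic blocks vary widely in size, handing them out naively can leave some workers idle; the task-scheduling slide later in the deck addresses this.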

  5. Introduction
  • Always apply software optimizations before adding hardware
  • The enumeration algorithm changes the execution time of the compiler
  • The I/O constraint (e.g. 8,8) and the hardware size constraint (e.g. 10,000 LEs) also change the execution time (a filtering sketch follows below)
  • The benchmark workload affects the observed speedup, as we will see
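
For concreteness, here is a small sketch of filtering enumerated candidates against the constraints named above, using a hypothetical Candidate record; the (8,8) I/O limit and the 10,000 LE budget are the figures quoted on these slides.

```java
import java.util.*;
import java.util.stream.*;

// Minimal sketch, not the authors' code: discard candidates that violate the
// I/O constraint (8 inputs, 8 outputs) or the 10,000 LE hardware budget.
public class ConstraintFilter {
    record Candidate(String name, int inputs, int outputs, int areaLes) {}

    static List<Candidate> filter(List<Candidate> candidates) {
        return candidates.stream()
                .filter(c -> c.inputs() <= 8 && c.outputs() <= 8 && c.areaLes() <= 10_000)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Candidate> cs = List.of(
                new Candidate("madd", 3, 1, 450),
                new Candidate("bigblock", 12, 4, 22_000));
        System.out.println(filter(cs)); // only "madd" survives
    }
}
```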

  6. Prior Art
  • ISEs + threads in [2] to find maximal convex subgraphs of basic-block dataflow graphs
  • [2] applies inter-basic-block parallelism only to the selection of ISEs, not to their enumeration
  • We adapt this idea of a global solver to the approach of [1], which can find ISEs smaller than the maximal subgraphs identified by [5]
  • Many groups have used Integer Linear Programming (ILP) for ISE identification

  7. Prior Art
  • [6] used threads to perform parallel ISE enumeration within a basic block
  • We go further and apply this at the scope of the control flow graph
  • Our approach can be executed on as many processors as there are basic blocks
  • Our work can be combined with the existing approach of using multiple threads for ISE enumeration on a single basic block

  8. Speedup Results
  [Figure: speedup versus workload and hardware constraint]

  9. Speedup Analysis
  • Search-space-limited algorithm
  • Intra-basic-block scope
  • I/O constrained
  • No pointers in hardware (illegal nodes)
  • A 2x speedup is nice
  • If we just make everything into hardware, then we cannot update the program without a firmware update

  10. Task Scheduling
  • Use well-known thread creation techniques to accelerate the ISE enumeration part of ISE identification
  • Scheduling is performed with inexact information in order to save compilation time (statement count instead of node count); see the sketch below
  • Scheduling the parallel tasks quickly and intelligently is critical (see figure at right)
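
One way to read the "inexact information" point: the scheduler balances work using each basic block's statement count as a cheap proxy for its DFG node count, which avoids building the graphs just to estimate their cost. The sketch below is a hypothetical largest-first greedy assignment over that estimate, not the authors' scheduler.

```java
import java.util.*;

// Minimal sketch, not the authors' scheduler: assign basic blocks to worker
// threads using statement count as the cost estimate, largest blocks first.
public class StatementCountScheduler {

    static int[] assign(int[] stmtCounts, int workers) {
        Integer[] order = new Integer[stmtCounts.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // Largest estimated blocks first (longest-processing-time heuristic).
        Arrays.sort(order, (a, b) -> stmtCounts[b] - stmtCounts[a]);

        int[] load = new int[workers];            // estimated work per worker
        int[] assignment = new int[stmtCounts.length];
        for (int block : order) {
            int best = 0;                         // pick the least-loaded worker
            for (int w = 1; w < workers; w++) if (load[w] < load[best]) best = w;
            assignment[block] = best;
            load[best] += stmtCounts[block];
        }
        return assignment;
    }

    public static void main(String[] args) {
        int[] stmts = {40, 5, 22, 9, 31};
        System.out.println(Arrays.toString(assign(stmts, 2)));
    }
}
```

Sorting by the proxy cost and always handing the next block to the least-loaded worker keeps the estimated loads roughly balanced even when the estimate is imperfect.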

  11. Parallel ISE Enumeration Experiment
  • Used a greedy ISE enumeration algorithm
  • I/O constraint of (8,8)
  • Hardware size constraints of 10K LEs and 10M LEs (in practice we gathered much more data)
  • Compared the sequential and parallel approaches to ISE enumeration
  • Speedup was observed, but the algorithm execution time data were not as expected

  12. Compiler Execution Time Results
  [Figure: compiler execution time, showing a speedup reversal ranging from -6% to +53%]

  13. Performance Analysis
  • Compiler execution time
  • So far, the results are only sometimes positive
  • We expected much better numbers for such a powerful computer
  • Additional overhead was needed for creating, distributing, and then collecting the thread data (not the problem)
  • There is probably still a memory dependency

  14. Conclusion & Future Work
  Conclusions
  • Using multiple threads for ISE enumeration is beneficial on average
  • Peak speedup of 53.7%
  • To our knowledge this is the first use of this technique in the literature
  • The approach is applicable to many ISE enumeration algorithms
  Future work
  • Analyze the source of the overhead using VTune
  • Reduce the source of the overhead, once identified
  • Distribute the enumeration of ISEs across multiple computers, perhaps using Microsoft Solver Foundation

  15. References
  [1] K. Atasu, G. Dundar, and C. Ozturan, "An integer linear programming approach for identifying instruction-set extensions," in Third IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2005, pp. 172–177.
  [2] C. Galuzzi, E. M. Panainte, Y. Yankova, K. Bertels, and S. Vassiliadis, "Automatic selection of application-specific instruction-set extensions," in Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis, 2006, pp. 160–165.
  [3] K. Atasu, L. Pozzi, and P. Ienne, "Automatic application-specific instruction-set extensions under microarchitectural constraints," in Design Automation Conference, 2003, pp. 256–261.

  16. Questions?
