
Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Hideki Miwa†, Yasuhiro Dougo‡, Victor M. Goulart Ferreira†, Koji Inoue†, and Kazuaki J. Murakami†. †Dept. of Informatics, Kyushu Univ., Japan; ‡Dept. of Electronics Eng., Fukuoka Univ., Japan.


Presentation Transcript


  1. Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads. Hideki Miwa†, Yasuhiro Dougo‡, Victor M. Goulart Ferreira†, Koji Inoue†, and Kazuaki J. Murakami†. †Dept. of Informatics, Kyushu Univ., Japan; ‡Dept. of Electronics Eng., Fukuoka Univ., Japan. ICSeng 2005

  2. Outline • Background: memory wall problem; delinquent loads • Load data re-computation • Preliminary evaluations and results • Summary

  3. What is the Memory Wall Problem? • The performance gap between the microprocessor and main memory (DRAM) has been growing. • Intel Pentium 4: 3.8 GHz • DDR2 SDRAM: 533 MHz • Fetching the necessary data from off-chip memory takes many clock cycles. As a result, DRAM limits microprocessor performance: the "Memory Wall Problem".

  4. Who is responsible for cache misses? • A small number of static load instructions frequently cause data cache misses. These are Delinquent Load (DL) instructions. • We expect a large performance improvement from eliminating the stalls that DLs cause. [Figure: fraction of cache misses caused by 16 loads (%)]
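As an illustration of how DL instructions might be identified, the sketch below counts data-cache misses per static load (keyed by program counter) and reports the share of all misses covered by the top n loads. The trace format, the PC values, and the function name `top_delinquent_loads` are hypothetical and are not taken from the presentation.

```python
from collections import Counter

def top_delinquent_loads(miss_pcs, n=16):
    """Return the n static loads (by PC) with the most misses,
    plus the percentage of all misses they account for."""
    counts = Counter(miss_pcs)
    total = sum(counts.values())
    top = counts.most_common(n)
    covered = sum(count for _, count in top)
    share = 100.0 * covered / total if total else 0.0
    return [pc for pc, _ in top], share

# Hypothetical trace: one entry per data-cache miss, holding the PC
# of the static load instruction that missed.
trace = [0x400100] * 60 + [0x400200] * 30 + [0x400300] * 8 + [0x400400] * 2

dl_pcs, share = top_delinquent_loads(trace, n=2)
print([hex(pc) for pc in dl_pcs], f"{share:.1f}%")  # ['0x400100', '0x400200'] 90.0%
```

In this made-up trace, two static loads out of four account for 90% of the misses, which is the kind of skew the slide describes.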

  5. How to beat DL? • Data prefetching: Speculative Data-Driven Multithreading [Roth et al., 2001]; Speculative Pre-Computation [Collins et al., 2001]; many other approaches, including compile-time methods • Data re-computation: Load Data Re-Computation (LDRC)

  6. Concept of the Load Data Re-Computation (LDRC) • Conventional microprocessor: the data referred to by a DL is fetched from main memory (long reference time!). • Our method: re-compute the data referred to by the DL.

  7. Re-computation code generation procedure • Source code "c = a + b;" becomes the object code (static/dynamic): Load a; Load b; Add c, a, b; Store c. Data 'c' will be written back to main memory. • Later, source code "z = x + c;" becomes: Load c; Load x; Add z, x, c; Store z. Here 'Load c' is the DL: data 'c' is not in the cache, so it causes a cache miss. • In the LDRC-applied object code, 'Load c' is replaced by the RC (re-computation) code: Load a; Load b; Add c, a, b; followed by Load x; Add z, x, c; Store z. • We can reduce off-chip accesses if the other data are in the cache.
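The trade-off on this slide can be sketched with a toy cost model. It assumes strictly serial execution (no overlap between instructions, unlike a real out-of-order core), an L1 hit latency of 1 cycle, and an add latency of 1 cycle; only the 250-cycle memory latency comes from the evaluation setup. The function names are hypothetical.

```python
MEM_LATENCY = 250  # off-chip access latency in cycles (evaluation setup)
HIT_LATENCY = 1    # assumed cache-hit latency
ALU_LATENCY = 1    # assumed latency of one Add

def load_cost(name, cached):
    """Cycles to load a variable: a hit if it is cached, else a memory access."""
    return HIT_LATENCY if name in cached else MEM_LATENCY

def cost_original(cached):
    """z = x + c with the delinquent 'Load c': Load c, Load x, Add."""
    return load_cost("c", cached) + load_cost("x", cached) + ALU_LATENCY

def cost_ldrc(cached):
    """'Load c' replaced by RC code (Load a, Load b, Add c), then Load x, Add z."""
    rc_code = load_cost("a", cached) + load_cost("b", cached) + ALU_LATENCY
    return rc_code + load_cost("x", cached) + ALU_LATENCY

# 'c' is never cached here, so the original code always pays the miss.
print(cost_original({"a", "b", "x"}), cost_ldrc({"a", "b", "x"}))  # 252 5
print(cost_original({"x"}), cost_ldrc({"x"}))                      # 252 503
```

The second line of output shows the caveat from the slide: if the inputs of the RC code are themselves off-chip, re-computation costs more than the load it replaces.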

  8. Preliminary Evaluation • Objective: performance impact of LDRC • Assumptions: processor configuration (4-way out-of-order 32-bit microprocessor; iL1/dL1/uL2 caches = 32 KB/32 KB/2 MB; memory access latency = 250 clock cycles); DL instructions (the top 16 static load instructions that cause cache misses most frequently) • Environment: simulator: SimpleScalar 3.0d; benchmark set: SPEC CPU 2000

  9. Execution time reduction via LDRC [Figure: execution time reduction rate (%) per benchmark program]

  10. Execution time reduction via LDRC [Figure: execution time reduction rate (%) per benchmark program]

  11. Requirements to obtain a large performance improvement • (1) Large effect of DL on total execution time • (2) High replaceability of DL by re-computation • (3) Short re-computation time [Figure: execution time, original vs. LDRC applied, broken down into re-computable DLs, re-computation, and other instructions; labels (1), (2), (3) mark the three requirements]
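One way to see how the three requirements interact is a back-of-the-envelope model. This is an assumption for illustration, not the authors' evaluation method: it treats the DL share of execution time, the replaceable share of that time, and the per-DL saving as independent factors and ignores any overlap of miss latency with other work. The function name and the input values are hypothetical.

```python
def estimated_reduction(dl_time_fraction, replaceable_fraction,
                        rc_cycles, miss_cycles=250):
    """Estimated execution time reduction (%) under a simple product model.

    dl_time_fraction:     (1) share of execution time spent on DL misses
    replaceable_fraction: (2) share of that time whose DLs get RC code
    rc_cycles:            (3) average cycles to run one RC code sequence
    miss_cycles:          cycles one DL miss costs (250 in the evaluation setup)
    """
    if not 0.0 <= dl_time_fraction <= 1.0:
        raise ValueError("dl_time_fraction must be in [0, 1]")
    if not 0.0 <= replaceable_fraction <= 1.0:
        raise ValueError("replaceable_fraction must be in [0, 1]")
    saved_per_dl = 1.0 - rc_cycles / miss_cycles  # negative if RC is slower
    return 100.0 * dl_time_fraction * replaceable_fraction * saved_per_dl

# Hypothetical inputs, not measured results.
print(f"{estimated_reduction(0.6, 0.5, 25):.1f}%")   # 27.0%
print(f"{estimated_reduction(0.6, 0.5, 300):.1f}%")  # -6.0%
```

Under this model the reduction is a product of the three factors, so a weak value for any one of them caps the benefit, and an RC code sequence longer than the miss penalty turns the result into a slowdown.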

  12. (1) Effect of DL on total execution time [Figure: execution time reduction (%) per benchmark program]

  13. (2) Replaceability of DL with re-computation [Figure: DL instructions replaced by RC codes (%) per benchmark program]

  14. (3) Re-computation time [Figure: average execution time for RC codes (clock cycles) per benchmark program]

  15. Summary • LDRC can reduce execution time by up to 47%. • Suitable applications: large impact of DL instructions on execution time; high ratio of replaceable DL instructions; short re-computation time for RC codes. • Future work: implementation issues remain to be considered.

  16. The end. Thank you for your attention!
