Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads
Hideki Miwa†, Yasuhiro Dougo‡, Victor M. Goulart Ferreira†, Koji Inoue†, and Kazuaki J. Murakami†
† Dept. of Informatics, Kyushu Univ., Japan
‡ Dept. of Electronics Eng., Fukuoka Univ., Japan
ICSeng 2005

Outline
• Background
  • Memory wall problem
  • Delinquent loads
• Load data re-computation
• Preliminary evaluations and results
• Summary

What is the Memory Wall Problem?
• The performance gap between the microprocessor and main memory (DRAM) has been growing.
  • Intel Pentium 4: 3.8 GHz
  • DDR2 SDRAM: 533 MHz
• It takes many clock cycles to fetch the necessary data from off-chip memory!
• DRAM limits microprocessor performance: the "Memory Wall Problem"

Who is responsible for cache misses?
• A small number of static load instructions frequently cause data cache misses: Delinquent Load (DL) instructions.
• We expect a large performance improvement from defeating DLs.
[Figure: fraction of cache misses caused by 16 loads (%)]

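The idea of picking out the few static loads that dominate cache misses can be sketched as follows. This is only an illustration, not the authors' tooling: the function name `top_delinquent_loads`, the trace format (one program-counter value per data cache miss), and the sample addresses are all assumptions.

```python
from collections import Counter

def top_delinquent_loads(miss_pcs, n=16):
    """Rank static loads by how many cache misses they caused.

    miss_pcs: iterable with one entry (the load's PC) per data cache miss.
    This trace format is hypothetical.
    Returns (PCs of the n worst loads, fraction of all misses they cover).
    """
    counts = Counter(miss_pcs)
    top = counts.most_common(n)          # sorted by miss count, descending
    total = sum(counts.values())
    covered = sum(count for _, count in top)
    fraction = covered / total if total else 0.0
    return [pc for pc, _ in top], fraction

# Made-up miss trace: three static loads with 6, 3, and 1 misses.
trace = [0x400100] * 6 + [0x400200] * 3 + [0x400300]
pcs, fraction = top_delinquent_loads(trace, n=2)
print([hex(pc) for pc in pcs], fraction)
```

With this made-up trace, the two worst loads cover 9 of the 10 misses. The evaluation later in the slides uses n = 16.
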
How to beat DL?
• Data prefetching
  • Speculative Data-Driven Multithreading [Roth et al., 2001]
  • Speculative Pre-Computation [Collins et al., 2001]
  • (many other approaches, including compile-time methods)
• Data re-computation
  • Load Data Re-Computation (LDRC)

Concept of the Load Data Re-Computation (LDRC)
[Figure: data referred to by a DL. Conventional microprocessor: long reference time! Our method: re-compute!]

Re-computation code generation procedure
• Source code: c = a + b;
  • Object code: Load a; Load b; Add c, a, b; Store c
  • LDRC-applied object code (static/dynamic): Load a; Load b; Add c, a, b; Store c
  • Data 'c' will be written back to main memory.
  • RC (Re-Computation) code: Load a; Load b; Add c, a, b
• Source code: z = x + c;
  • Object code: Load c (DL); Load x; Add z, x, c; Store z
  • Data 'c' is not in the cache, so 'Load c' causes a cache miss!
  • LDRC-applied object code: [RC code]; Load x; Add z, x, c; Store z
  • 'Load c' is replaced by the RC code.
• We can reduce off-chip accesses if the other data are in the cache.

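The trade-off in the procedure above can be illustrated with a rough cost sketch. It is a simplification under stated assumptions: instruction costs are simply added (no out-of-order overlap), the 250-cycle miss latency is taken from the evaluation setup, and the 1-cycle hit and ALU latencies are assumed values that do not come from the slides.

```python
# Latencies in clock cycles. MISS_LATENCY comes from the evaluation setup in
# the slides; HIT_LATENCY and ALU_LATENCY are assumptions for illustration.
MISS_LATENCY = 250
HIT_LATENCY = 1
ALU_LATENCY = 1

def load_cost(name, cached):
    """Cost of loading one variable: a hit if it is cached, else a miss."""
    return HIT_LATENCY if name in cached else MISS_LATENCY

def original_cost(cached):
    """z = x + c as 'Load c; Load x; Add z, x, c' (the store is ignored)."""
    return load_cost("c", cached) + load_cost("x", cached) + ALU_LATENCY

def ldrc_cost(cached):
    """Same statement with 'Load c' replaced by the RC code
    'Load a; Load b; Add c, a, b'."""
    rc_code = load_cost("a", cached) + load_cost("b", cached) + ALU_LATENCY
    return rc_code + load_cost("x", cached) + ALU_LATENCY

favourable = {"a", "b", "x"}   # inputs of the RC code are cached
unfavourable = {"x"}           # 'a' and 'b' would miss as well
print(original_cost(favourable), ldrc_cost(favourable))
print(original_cost(unfavourable), ldrc_cost(unfavourable))
```

Under these assumptions the first case costs 252 cycles originally and 5 with the RC code, while the second costs 252 against 503. Re-computation only pays off when the inputs of the RC code are themselves in the cache.
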
Preliminary Evaluation
• Objective
  • Performance impact of LDRC
• Assumptions
  • Processor configuration
    • 4-way out-of-order 32-bit microprocessor
    • iL1/dL1/uL2 caches = 32 KB/32 KB/2 MB
    • Memory access latency = 250 clock cycles
  • DL instructions
    • Top 16 static load instructions that cause cache misses most frequently
• Environment
  • Simulator: SimpleScalar 3.0d
  • Benchmark set: SPEC CPU 2000

Execution time reduction via LDRC
[Figure: execution time reduction rate (%) for each benchmark program]

Requirements to obtain a large performance improvement
(1) Large effect of DL on total execution time
(2) High replaceability of DL by re-computation
(3) Short re-computation time
[Figure: execution time, original vs. LDRC applied, split into re-computable DLs (replaced by re-computation) and other instructions]

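How the three requirements combine can be sketched with a first-order estimate. This model is an assumption made for illustration and is not taken from the slides: it multiplies the fraction of execution time spent on DL misses, the fraction of those DLs that can be replaced, and the relative saving per replaced load. The function and parameter names are hypothetical.

```python
def estimated_reduction(dl_time_fraction, replaceable_fraction,
                        rc_cycles, miss_cycles=250):
    """First-order estimate of the execution time reduction (0.0 to 1.0).

    dl_time_fraction:     share of execution time caused by DL misses  (1)
    replaceable_fraction: share of DLs that RC code can replace        (2)
    rc_cycles:            average execution time of the RC code        (3)
    miss_cycles:          memory access latency (250 in the evaluation)
    All parameter names and the formula itself are illustrative assumptions.
    """
    if miss_cycles <= 0:
        raise ValueError("miss_cycles must be positive")
    # No saving is counted when re-computation is slower than the miss,
    # assuming such loads would simply be left unreplaced.
    saving_per_load = max(0.0, 1.0 - rc_cycles / miss_cycles)
    return dl_time_fraction * replaceable_fraction * saving_per_load

# Made-up inputs: 60% of the time in DL misses, half of them replaceable,
# RC code taking 25 cycles on average.
print(estimated_reduction(0.6, 0.5, 25))
```

In this sketch a program benefits only when all three factors are favourable at once; a value near zero for any one of them drives the estimate toward zero.
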
(1) Effect of DL on total execution time
[Figure: execution time reduction (%) for each benchmark program]

(2) Replaceability of DL with re-computation
[Figure: DL instructions replaced by RC codes (%) for each benchmark program]

(3) Re-computation time
[Figure: average execution time of RC codes (clock cycles) for each benchmark program]

Summary
• LDRC can reduce execution time by up to 47%.
• Suitable applications:
  • Large impact of DL instructions on execution time
  • High ratio of replaceable DL instructions
  • Short re-computation time for RC codes
• Future work
  • We have to consider the implementation issues.

The end
Thank you for your attention!