National Sun Yat-sen University Embedded System Laboratory

The Multi-threaded Optimization of Dynamic Binary Translation
Presenter: Ming-Shiun Yang
2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD)
Jinxian Cui, Jianmin Pan, Zheng Shan, Xianan Liu

Abstract
Dynamic binary translation, which enables an executable compiled for one platform to run on another platform, automatically resolves the problem of code migration. Current single-threaded dynamic binary translation systems leave little room for further performance improvement, so multi-core processors and multi-threaded programs are fully exploited to achieve high performance. Based on the performance analysis and experimental results of a single-threaded dynamic binary translation system, this paper presents the framework of the MDT (Multi-Threaded Dynamic Binary Translation) system and the optimizations implemented on it.

Abstract (cont.)
To achieve speculative translation, the servant thread builds a T-Tree (Translation Tree), which gives the direction for getting the next PC. In addition, a new T-Cache management scheme is presented that integrates the merits of the multi-level cache scheme, the LRU scheme, and the full-flush scheme; it is shown to manage translated blocks efficiently. The framework and optimizations of MDT are evaluated both as a whole and individually on SPEC2006 in an Alpha multi-core environment. The results, compared with QEMU, demonstrate that speculative translation and the reduced T-Cache miss rate are effective.

What's the problem?
• Single-threaded dynamic translation (QEMU) leaves little room to improve performance.
• Table 1 and Figure 2 indicate that both the translation time and the execution time of TBs matter:
  • For a small / value, the translation time of TBs is more important.
  • For a large / value, the execution time of TBs is more important.
  • The / value is approximately 40/1; it means is the key performance factor.

What's the problem? (cont.)
• QEMU's T-Cache flush mechanism: when the T-Cache is full, the whole T-Cache is flushed.
• High cost of executing TBs:
  • Translate each TB into micro-operations
  • Link the micro-operations together
  • Map executable sections into host memory
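To make the full-flush baseline concrete, here is a minimal Python model of the policy described above, assuming a T-Cache measured in bytes and keyed by guest PC. The class name `FullFlushTCache` and its fields are hypothetical illustrations, not QEMU's actual data structures.

```python
from __future__ import annotations


class FullFlushTCache:
    """Hypothetical model: discards every translated block when it runs out of space."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity            # total bytes available for translated code
        self.used = 0                       # bytes currently occupied
        self.blocks: dict[int, int] = {}    # guest PC -> size of its translated block
        self.flushes = 0                    # how many whole-cache flushes happened

    def lookup(self, pc: int) -> bool:
        return pc in self.blocks

    def insert(self, pc: int, size: int) -> None:
        if size > self.capacity:
            raise ValueError("block larger than the whole cache")
        if self.used + size > self.capacity:
            # Cache full: throw away every translated block, hot or cold.
            self.blocks.clear()
            self.used = 0
            self.flushes += 1
        self.blocks[pc] = size
        self.used += size


cache = FullFlushTCache(capacity=100)
for i in range(12):                         # twelve 10-byte blocks into a 100-byte cache
    cache.insert(0x1000 + 0x10 * i, 10)
print(cache.flushes, len(cache.blocks))     # 1 2
```

The eleventh insert triggers the flush, so even blocks that are still hot (such as a loop body) must be retranslated afterwards; this is the cost the proposed scheme targets.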

Related Work
• [1] Binary translation introduction
• [7] ISS introduction, used in [6] QEMU – a dynamic translator (single-threaded dynamic binary translation)
• [3,4,5] Dynamic binary translation systems (T-Cache management mechanism)
• [11] Multi-threaded optimization for a dynamic translator: uses many servant threads to perform translation, which incurs a high cost for managing the threads
• This paper: The Multi-threaded Optimization of Dynamic Binary Translation

Proposed: MDT Framework
• MDT (Multi-threaded Dynamic Translation)
• Parallelizes the execution and the translation of TBs
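A minimal sketch of the main-thread/servant-thread split, assuming a toy guest control-flow graph `CFG` (hypothetical) that lists the possible successors of each TB. `MDTSketch`, `translate`, and the string "host code" are illustrative stand-ins, not MDT's implementation. The main thread translates only on a miss; otherwise it relies on the servant thread having translated the block ahead of time.

```python
from __future__ import annotations

import queue
import threading

# Hypothetical toy guest CFG: TB start PC -> possible next PCs.
CFG = {0x100: [0x120, 0x200], 0x120: [0x100], 0x200: []}


class MDTSketch:
    def __init__(self) -> None:
        self.tcache: dict[int, str] = {}     # guest PC -> translated "host code"
        self.lock = threading.Lock()         # protects tcache and the counter
        self.requests: queue.Queue = queue.Queue()
        self.inline_translations = 0         # TBs the main thread had to translate itself

    @staticmethod
    def translate(pc: int) -> str:
        return f"host_code_for_{pc:#x}"      # stand-in for real code generation

    def servant(self) -> None:
        # Servant thread: translate requested PCs until it receives None.
        while (pc := self.requests.get()) is not None:
            with self.lock:
                if pc in self.tcache:
                    continue                 # already translated, skip
            code = self.translate(pc)        # translate outside the lock
            with self.lock:
                self.tcache.setdefault(pc, code)

    def execute(self, trace: list[int]) -> None:
        worker = threading.Thread(target=self.servant)
        worker.start()
        for pc in trace:
            with self.lock:
                code = self.tcache.get(pc)
                if code is None:             # miss: translate on the critical path
                    code = self.tcache[pc] = self.translate(pc)
                    self.inline_translations += 1
            for nxt in CFG.get(pc, []):      # hand likely successors to the servant
                self.requests.put(nxt)
            # ... the translated code for `pc` would be executed here ...
        self.requests.put(None)              # tell the servant to stop
        worker.join()


m = MDTSketch()
m.execute([0x100, 0x120, 0x100, 0x200])
print(sorted(hex(pc) for pc in m.tcache))
```

How many TBs the main thread translates inline depends on thread timing; the first TB is always a miss because nothing has been requested yet.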

Proposed: Translator
• Speculative translation
  • Translate the TB that may be executed next
  • Reduces the translation delay
• To achieve speculative translation
  • Use a T-Tree (Translation Tree) to get the next PC address
Figure: execute the current TB while translating the next TB → how to get the next PC address? → execute the next TB …
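The slides do not explain how the T-Tree is generated, so the following is only a sketch under a stated assumption: each node's children are the possible successor TBs of its PC (for example, a branch target and a fall-through), taken from a hypothetical `successors` table. `TNode`, `build_t_tree`, and `speculation_order` are illustrative names, not the paper's.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field


@dataclass
class TNode:
    pc: int
    # Assumption: at most two children, e.g. branch target and fall-through.
    children: list[TNode] = field(default_factory=list)


def build_t_tree(pc: int, successors: dict[int, tuple[int, ...]], depth: int) -> TNode:
    """Expand the possible successors of `pc` up to `depth` levels."""
    node = TNode(pc)
    if depth > 0:
        for nxt in successors.get(pc, ()):
            node.children.append(build_t_tree(nxt, successors, depth - 1))
    return node


def speculation_order(root: TNode) -> list[int]:
    """Breadth-first list of PCs to pre-translate, nearest candidates first."""
    order: list[int] = []
    seen = {root.pc}                  # the root TB is already being executed
    pending = deque(root.children)
    while pending:
        node = pending.popleft()
        if node.pc not in seen:
            seen.add(node.pc)
            order.append(node.pc)
        pending.extend(node.children)
    return order


succ = {0x100: (0x140, 0x180), 0x140: (0x1C0,), 0x180: (0x100, 0x1C0), 0x1C0: ()}
tree = build_t_tree(0x100, succ, depth=2)
print([hex(pc) for pc in speculation_order(tree)])  # ['0x140', '0x180', '0x1c0']
```

Breadth-first order is one reasonable choice: the servant thread translates the nearest possible successors first, since those are needed soonest.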

The Flow of Getting Next PC
• tsl_q: holds the TBs that have already been translated
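A minimal sketch of how the main thread might consult tsl_q once the actual next PC is known, assuming tsl_q is a bounded queue of `(pc, code)` pairs. `TslQueue`, `fetch_tb`, and the bound of 16 entries are hypothetical. The sketch is single-threaded for clarity; a real tsl_q shared with the servant thread would need synchronization.

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Optional


class TslQueue:
    """Hypothetical model of tsl_q: TBs the servant thread has already translated."""

    def __init__(self, maxlen: int = 16) -> None:
        self.entries: deque[tuple[int, str]] = deque(maxlen=maxlen)  # oldest entries drop out
        self.lookups = 0
        self.hits = 0

    def push(self, pc: int, code: str) -> None:
        self.entries.append((pc, code))      # called by the servant thread

    def take(self, pc: int) -> Optional[str]:
        self.lookups += 1
        for i, (tb_pc, code) in enumerate(self.entries):
            if tb_pc == pc:
                del self.entries[i]          # safe: we return right after mutating
                self.hits += 1
                return code
        return None

    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0


def fetch_tb(pc: int, tsl_q: TslQueue, translate: Callable[[int], str]) -> str:
    """Main thread: use a speculatively translated TB if present, else translate now."""
    code = tsl_q.take(pc)
    if code is None:                         # speculation missed
        code = translate(pc)
    return code


q = TslQueue()
q.push(0x140, "tb_0x140")
q.push(0x180, "tb_0x180")
fetch_tb(0x180, q, lambda pc: f"new_{pc:#x}")   # hit: speculative TB reused
fetch_tb(0x200, q, lambda pc: f"new_{pc:#x}")   # miss: translated on demand
print(q.hit_rate())  # 0.5
```

The ratio tracked by `hit_rate` corresponds to the speculative translation rate reported in the experiments.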

Proposed: T-Cache Management Scheme
• Multi-level cache scheme
  • Advantage: reduces the cache miss rate
• LRU scheme
  • Advantage: easy, fast
• Full-flush scheme
  • Advantage: simple, low cost
• The proposed T-Cache management scheme integrates the above schemes.

Proposed: T-Cache Management Scheme (cont.)
• T-Cache (4M): originally the unit of full flush
• Divided into Buff[1], Buff[2], …, Buff[64] of 64K each (64 × 64K = 4M); each buffer is now the unit of full flush
• TBs inside a buffer are linked as a TB chain
• When all buffers are full, an LRU / LRC replacement scheme selects a buffer to reuse
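A minimal sketch of the buffered scheme above, assuming LRU at buffer granularity (the LRC variant is not described in the slides, so it is not modeled) and a buffer's recency updated whenever one of its TBs is looked up or inserted. `BufferedTCache` and its fields are hypothetical; TB chaining is omitted.

```python
from __future__ import annotations

from collections import OrderedDict


class BufferedTCache:
    """T-Cache split into fixed-size buffers; one buffer is the unit of full flush."""

    def __init__(self, n_buffers: int = 64, buffer_size: int = 64 * 1024) -> None:
        # Defaults mirror the slide: 64 buffers x 64K = 4M.
        self.buffer_size = buffer_size
        self.buffers: list[dict[int, int]] = [{} for _ in range(n_buffers)]  # pc -> TB size
        self.used = [0] * n_buffers                        # bytes occupied per buffer
        self.where: dict[int, int] = {}                    # pc -> buffer holding its TB
        self.lru: OrderedDict[int, None] = OrderedDict()   # buffer indices, least recent first
        self.current = 0                                   # buffer receiving new TBs
        self.buffer_flushes = 0

    def _touch(self, b: int) -> None:
        self.lru.pop(b, None)                # mark buffer b as most recently used
        self.lru[b] = None

    def lookup(self, pc: int) -> bool:
        b = self.where.get(pc)
        if b is None:
            return False
        self._touch(b)
        return True

    def _flush(self, b: int) -> None:
        for pc in self.buffers[b]:           # full flush of one buffer only
            del self.where[pc]
        self.buffers[b].clear()
        self.used[b] = 0
        self.buffer_flushes += 1

    def _next_buffer(self) -> int:
        for b in range(len(self.buffers)):
            if b != self.current and self.used[b] == 0:
                return b                     # an empty buffer is still available
        victim = next(iter(self.lru))        # all full: evict the least recently used buffer
        self._flush(victim)
        return victim

    def insert(self, pc: int, size: int) -> None:
        if size > self.buffer_size:
            raise ValueError("TB larger than one buffer")
        if pc in self.where:
            return                           # already translated
        if self.used[self.current] + size > self.buffer_size:
            self.current = self._next_buffer()
        self.buffers[self.current][pc] = size
        self.used[self.current] += size
        self.where[pc] = self.current
        self._touch(self.current)


c = BufferedTCache(n_buffers=3, buffer_size=100)
for pc in range(1, 7):                       # two 50-byte TBs per buffer
    c.insert(pc, 50)
c.lookup(1)                                  # keeps buffer 0 hot
c.insert(7, 50)                              # evicts buffer 1 (TBs 3 and 4) only
print(c.buffer_flushes, c.lookup(1), c.lookup(3))  # 1 True False
```

With a single buffer this degenerates to the original whole-cache flush; with many buffers only the coldest 64K region is discarded, so hot TBs such as TB 1 survive.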

The Flow Chart of T-Cache Management
• Using LRU to get a new buffer

Before Experiment
• How much does MDT improve performance?
  • Compared with a single-threaded dynamic binary translation system (QEMU)
• How much T-Cache flushing cost does the proposed T-Cache management mechanism save?

Experimental Results
• Host: Alpha / Red Hat Linux
• Target: i386 / Red Hat Linux
• Test cases: SPEC 2006
• Figures: T-Cache miss rate; speculative translation rate (hit rate); average translation times (ATT)
• Because the proposed T-Cache management mechanism reduces the T-Cache miss rate, the ATT of MDT is lower than that of QEMU.
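To illustrate why a lower miss rate implies a lower ATT, here is a small replay sketch. It assumes ATT means the average number of times each distinct TB is translated, which is an interpretation the slides do not spell out. The traces and the toy caches `NoEvictCache` and `FlushWhenFull` are hypothetical and are not the paper's SPEC 2006 data.

```python
from __future__ import annotations

from collections import Counter
from typing import Protocol


class TCache(Protocol):
    def lookup(self, pc: int) -> bool: ...
    def insert(self, pc: int, size: int) -> None: ...


def replay(trace: list[int], cache: TCache, tb_size: int = 1) -> tuple[float, float]:
    """Return (miss rate, average translation times) for a TB execution trace."""
    if not trace:
        return 0.0, 0.0
    translations: Counter[int] = Counter()
    for pc in trace:
        if not cache.lookup(pc):
            cache.insert(pc, tb_size)        # a miss forces the TB to be (re)translated
            translations[pc] += 1
    misses = sum(translations.values())
    att = misses / len(translations) if translations else 0.0
    return misses / len(trace), att


class NoEvictCache:
    """Unbounded cache: every TB is translated exactly once."""

    def __init__(self) -> None:
        self.tbs: set[int] = set()

    def lookup(self, pc: int) -> bool:
        return pc in self.tbs

    def insert(self, pc: int, size: int) -> None:
        self.tbs.add(pc)


class FlushWhenFull(NoEvictCache):
    """Holds at most `max_tbs` TBs and flushes all of them when full."""

    def __init__(self, max_tbs: int) -> None:
        super().__init__()
        self.max_tbs = max_tbs

    def insert(self, pc: int, size: int) -> None:
        if len(self.tbs) >= self.max_tbs:
            self.tbs.clear()
        self.tbs.add(pc)


trace = [1, 2, 1, 3, 1, 2]
print(replay(trace, NoEvictCache()))    # (0.5, 1.0)
print(replay(trace, FlushWhenFull(2)))  # 5 misses out of 6; 5 translations over 3 TBs
```

Every extra miss caused by a flush is a retranslation of a TB already seen, so each point of miss rate removed by better T-Cache management translates directly into a lower ATT.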

Conclusion
• The proposed MDT parallelizes the execution and translation of TBs to improve performance.
  • Main thread: executes TBs
  • Servant thread: translates TBs (speculative translation)
• A new mechanism for managing the T-Cache

My Comment
• Modifying QEMU to support multi-threaded dynamic binary translation would be a major engineering effort.
  • The whole workflow would need to be modified.
• Some techniques are not explained:
  • The LRC scheme
  • The description of T-Tree generation is insufficient, e.g., how the right child is determined.
