National Sun Yat-sen University, Embedded System Laboratory PQEMU: A Parallel System Emulator Based on QEMU Presenter: Zong-Ze Huang Jiun-Hung Ding, Po-Chun Chang, Wei-Chung Hsu, Yeh-Ching Chung Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on
Abstract • A full system emulator, such as QEMU, can provide a versatile virtual platform for software development. However, most current system simulators do not have sufficient support for multi-processor emulations to effectively utilize the underlying parallelism presented by today’s multi-core processors. In this paper, we focus on parallelizing a system emulator and implement a prototype parallel emulator based on the widely used QEMU. Using this parallel QEMU, emulating an ARM11MPCore platform on a quad-core Intel i7 machine with the SPLASH-2 benchmarks, we have achieved 3.8x speedup over the original QEMU design.
What is the Problem • The current design of QEMU is only suitable for single-core processor emulation. • When executing a multi-threaded application on a multi-core machine, QEMU emulates the execution of the application serially and cannot take advantage of the parallelism available in the application and the underlying hardware.
Related work • Simulation approaches: functional simulation and micro-architectural simulation. • Simulators cited: RSIM[14], SimOS[11], QEMU[6], Simics[12], Mambo[2], Coremu[16], SimpleScalar[5], Wattch[4]. • Trends: increased simulation efficiency through dynamic binary translation[3] and full system simulation. • This paper (PQEMU: A Parallel System Emulator Based on QEMU): implements a prototype parallel emulator based on the widely used QEMU.
Proposed method • Propose a novel design of a multi-threaded QEMU, called PQEMU, with two code cache designs: • Unified code cache design • Separate code cache design
QEMU vs PQEMU • QEMU with a multi-core guest: a single host thread (T0) emulates the guest processors G0-G3 in round-robin fashion, so only one physical core (P0) is busy at a time. • PQEMU with a multi-core guest: each guest processor G0-G3 is mapped to its own host thread T0-T3, and the host OS scheduler spreads these threads across the physical cores P0-P3 (see the sketch below).
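To make the two scheduling models concrete, here is a minimal C sketch; names such as GuestCPU, emulate_some_blocks and per_core_loop are hypothetical and this is not the actual QEMU/PQEMU source. The original QEMU round-robins every guest core inside one host thread, while PQEMU creates one pthread per guest core and leaves their placement on physical cores to the host OS scheduler.

```c
#include <pthread.h>

#define NUM_GUEST_CORES 4

/* Hypothetical per-guest-core state; field names are illustrative only. */
typedef struct GuestCPU {
    int index;
    /* ... guest registers, program counter, MMU state ... */
} GuestCPU;

static GuestCPU guest_cpus[NUM_GUEST_CORES];

/* Placeholder for "run a bounded slice of DBT emulation for this core". */
static void emulate_some_blocks(GuestCPU *cpu) { (void)cpu; }

/* Original QEMU: one host thread round-robins over all guest cores. */
static void *single_threaded_loop(void *arg)
{
    (void)arg;
    for (;;)
        for (int i = 0; i < NUM_GUEST_CORES; i++)
            emulate_some_blocks(&guest_cpus[i]);   /* guest cores run serially */
    return NULL;
}

/* PQEMU: one emulation thread per guest core, placed on physical
 * cores by the host OS scheduler. */
static void *per_core_loop(void *arg)
{
    GuestCPU *cpu = arg;
    for (;;)
        emulate_some_blocks(cpu);                  /* guest cores run in parallel */
    return NULL;
}

static void start_pqemu_threads(void)
{
    pthread_t tid[NUM_GUEST_CORES];
    for (int i = 0; i < NUM_GUEST_CORES; i++) {
        guest_cpus[i].index = i;
        pthread_create(&tid[i], NULL, per_core_loop, &guest_cpus[i]);
    }
}
```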
Computer System in QEMU • CPU emulation: the DBT states (CPU Idle, Find, Build, Chain, Execute, Restore, Invalidate, Flush, Unchain) operate on a single code cache, supported by helper functions and a soft MMU. • Memory: RAM blocks such as SDRAM and FLASH. • I/O device model: keystroke receive, screen update, alarm signal and I/O interrupt notification; interrupts reach the CPU through the exception/interrupt check, which unchains TBs. [Figure: QEMU computer-system model with the I/O thread, guest CPUs 0-1 and memory]
Computer System in PQEMU • An emulation-thread group: emulation thread #0 and emulation thread #1 emulate guest CPU 0 and CPU 1 respectively and share a Unified Code Cache. • The I/O thread and the memory/IO model sit outside the emulation-thread group. [Figure: PQEMU computer-system model]
Typical flow of dynamic binary translation • Translated Block (TB): the unit of translation, corresponding to a guest basic block. • Two kinds of states: • the emulator executing in the code cache (dark grey boxes) • the emulation manager (white boxes).
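A hedged sketch of the emulation-manager loop described above, using hypothetical helper names (find_fast, find_slow, build_tb, execute_tb, chain_tbs, pending_interrupt) rather than QEMU's real API; the white-box states run in the manager, and execute_tb jumps into the dark-grey code cache.

```c
/* Hypothetical DBT emulation-manager loop; names do not match QEMU's API. */
typedef struct TB TB;

/* These helpers stand for the states on the slide and are assumed to be
 * provided elsewhere in the emulator. */
extern TB  *find_fast(unsigned long guest_pc);  /* per-CPU recently-used TB lookup */
extern TB  *find_slow(unsigned long guest_pc);  /* TB hash table lookup */
extern TB  *build_tb(unsigned long guest_pc);   /* translate one basic block */
extern unsigned long execute_tb(TB *tb);        /* run code from the code cache */
extern void chain_tbs(TB *from, TB *to);        /* patch a direct jump between TBs */
extern int  pending_interrupt(void);            /* exception/interrupt check */

void emulation_manager(unsigned long guest_pc)
{
    TB *prev = NULL;
    for (;;) {
        TB *tb = find_fast(guest_pc);            /* fast path miss ... */
        if (!tb) tb = find_slow(guest_pc);       /* ... slow path miss ... */
        if (!tb) tb = build_tb(guest_pc);        /* ... translate a new TB */
        if (prev) chain_tbs(prev, tb);           /* keep execution inside the cache */
        guest_pc = execute_tb(tb);               /* dark grey box: code cache */
        prev = pending_interrupt() ? NULL : tb;  /* interrupts break the chain */
    }
}
```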
Shared DBT components • TCG translation engine (TCG): the binary translation engine in the system emulator. • Code Cache (CC): the storage space for the TB output produced by Build. • TB Descriptor (TBD): holds the meta-information of a TB in the code cache. • TB Descriptor Array (TBDA): simplifies the management of TB descriptors. • TB Hash Table (TBHT): the central hash table, keyed by guest PC value, that Find Slow searches after Find Fast fails; every in-use TBD is indexed in this hash table.
Shared DBT components (cont.) • TB Descriptor Pointer (TBDP): a field private to each guest core that holds the index of a recently-used TBD (duplicated from the hash table above). • Memory Page Descriptor (MPD): accelerates the detection of guest SMC (self-modifying code) activity.
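The following C sketch shows one plausible layout of these components; all struct and field names, and the sizes, are illustrative assumptions rather than QEMU's actual definitions.

```c
#include <stdint.h>

#define CODE_CACHE_SIZE (16u * 1024 * 1024)   /* illustrative size */
#define TB_HASH_BUCKETS (1u << 15)
#define MAX_TBS         (1u << 16)
#define GUEST_PAGES     (1u << 20)            /* e.g. 4 GB guest space / 4 KB pages */
#define TBDP_ENTRIES    256u

/* TB Descriptor (TBD): meta-information for one translated block. */
typedef struct TBDesc {
    uint32_t        guest_pc;    /* guest PC the TB starts at */
    uint8_t        *host_code;   /* pointer into the Code Cache (CC) */
    struct TBDesc  *hash_next;   /* bucket chaining inside the TBHT */
} TBDesc;

/* Components shared by all emulation threads in the UCC design. */
typedef struct DBTShared {
    uint8_t  code_cache[CODE_CACHE_SIZE];  /* CC: storage for Build output */
    TBDesc   tbd_array[MAX_TBS];           /* TBDA: pool of TB descriptors */
    TBDesc  *tb_hash[TB_HASH_BUCKETS];     /* TBHT: keyed by guest PC, used by Find Slow */
    uint8_t  mpd[GUEST_PAGES];             /* MPD: per-page flags for SMC detection */
} DBTShared;

/* Components private to each guest core (emulation thread). */
typedef struct DBTPrivate {
    TBDesc *tbdp[TBDP_ENTRIES];            /* TBDP: recently-used TBDs for Find Fast */
} DBTPrivate;
```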
Unified Code Cache (UCC) Design • Independent: the two states never use the same shared component, e.g. Find Slow and Restore. • Synchronous: a component is shared among all emulation threads and needs synchronization, e.g. Restore and Build. • Dependent: a component is nominally shared, but no simultaneous access happens in practice, e.g. Build with Chain/Unchain/Execute.
Four independent sets and two rules • Four independent sets: • Construct = { Find Fast, Find Slow, Build, Restore } • Link = { Chain, Unchain } • Use = { Execute } • Destruct = { Flush, Invalidate } • Two rules: • Any two states that live in the same set must run sequentially, except pure read operations such as Find Fast, Find Slow and Execute; otherwise they may run in parallel. • Destruct requires exclusive access for efficiency reasons, since its states modify most of the shared components all at once.
Emulation flow for PQEMU using UCC • Deploy locks only at state combinations classified as Synchronous. • Three locks: Exclusive_rwlock, Build_lock and Chain_lock.
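A hedged sketch of how the three locks could guard the Synchronous combinations: readers (Construct, Link, Use) take the read side of Exclusive_rwlock, Destruct takes the write side, and Build_lock/Chain_lock serialize the states inside Construct and Link. The lock names follow the slide; everything else is illustrative.

```c
#include <pthread.h>

/* Locks named on the slide; their use below is an illustrative sketch. */
static pthread_rwlock_t exclusive_rwlock = PTHREAD_RWLOCK_INITIALIZER; /* Destruct vs. everyone */
static pthread_mutex_t  build_lock = PTHREAD_MUTEX_INITIALIZER;        /* serialize Build/Restore */
static pthread_mutex_t  chain_lock = PTHREAD_MUTEX_INITIALIZER;        /* serialize Chain/Unchain */

void build_state(void)                           /* Restore would take build_lock too */
{
    pthread_rwlock_rdlock(&exclusive_rwlock);    /* Construct/Link/Use take the read side */
    pthread_mutex_lock(&build_lock);
    /* ... TCG translation, append the TB to the unified code cache ... */
    pthread_mutex_unlock(&build_lock);
    pthread_rwlock_unlock(&exclusive_rwlock);
}

void chain_state(void)
{
    pthread_rwlock_rdlock(&exclusive_rwlock);
    pthread_mutex_lock(&chain_lock);
    /* ... patch a direct jump between two TBs ... */
    pthread_mutex_unlock(&chain_lock);
    pthread_rwlock_unlock(&exclusive_rwlock);
}

void flush_state(void)                           /* Destruct: Flush or Invalidate */
{
    pthread_rwlock_wrlock(&exclusive_rwlock);    /* exclusive access, as rule 2 requires */
    /* ... reset code cache, TB hash table and TB descriptors ... */
    pthread_rwlock_unlock(&exclusive_rwlock);
}
```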
Find Slow Optimization for UCC Design • Make Find Slow independent of Build by re-partitioning the sets: • Construct = { Build, Restore } • Search = { Find Fast, Find Slow } • Revised rule 1: any two states that live in the same set must run sequentially, except those in Search; otherwise they may run in parallel. • The optimization introduces a redundancy problem: duplicate translations waste memory but do not affect correctness.
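A hedged sketch of the optimized path, reusing the hypothetical TBHT/TBDesc names from the earlier sketches: Find Slow searches the hash table without taking Build_lock, so two threads that miss on the same guest PC may occasionally translate it twice, which only wastes code-cache space.

```c
#include <stdint.h>
#include <pthread.h>

typedef struct TBDesc TBDesc;

/* Hypothetical helpers operating on the TBHT and code cache sketched earlier. */
extern TBDesc *tbht_find(uint32_t guest_pc);               /* Find Slow: hash lookup */
extern TBDesc *tcg_translate(uint32_t guest_pc);           /* Build: emit code, fill a TBD */
extern void    tbht_insert(uint32_t guest_pc, TBDesc *tb);
extern pthread_mutex_t build_lock;

/* Search runs without taking Build_lock; two threads that miss on the same
 * guest PC may both translate it, wasting code-cache space (the redundancy
 * the slide mentions) but never producing an incorrect result. */
TBDesc *lookup_or_build(uint32_t guest_pc)
{
    TBDesc *tb = tbht_find(guest_pc);       /* Find Slow: lock-free read */
    if (tb)
        return tb;

    pthread_mutex_lock(&build_lock);        /* Construct is still serialized */
    tb = tcg_translate(guest_pc);           /* may duplicate a TB built meanwhile */
    tbht_insert(guest_pc, tb);
    pthread_mutex_unlock(&build_lock);
    return tb;
}
```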
Separate Code Cache (SCC) Design • Duplicates all shared components for every emulation thread, except the MPD.
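A hedged sketch of the SCC layout, reusing the illustrative constants and the TBDesc type from the earlier sketch: every emulation thread owns private copies of the DBT structures, and only the MPD stays shared.

```c
#include <stdint.h>

/* Reuses CODE_CACHE_SIZE, MAX_TBS, TB_HASH_BUCKETS, TBDP_ENTRIES, GUEST_PAGES
 * and TBDesc from the illustrative sketch above. */

/* Separate Code Cache: every emulation thread owns private copies of the
 * DBT structures, so Find, Build and Chain need no locking at all. */
typedef struct SCCThread {
    uint8_t  code_cache[CODE_CACHE_SIZE];  /* private CC */
    TBDesc   tbd_array[MAX_TBS];           /* private TBDA */
    TBDesc  *tb_hash[TB_HASH_BUCKETS];     /* private TBHT */
    TBDesc  *tbdp[TBDP_ENTRIES];           /* private TBDP */
} SCCThread;

/* Only the Memory Page Descriptors stay shared: an SMC-triggered Invalidate
 * must then be propagated to every thread's private copies, which is why
 * Invalidate is more expensive under SCC. */
extern uint8_t shared_mpd[GUEST_PAGES];
```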
I/O system in the real world • On real hardware, I/O devices service requests concurrently with the CPU(s): a request issued by a CPU completes on the device while execution continues. • With multiple cores, requests from different cores overlap even further. [Figure: timelines of I/O requests 1-5 on a single-core and a multi-core system]
I/O system in QEMU's world • Original QEMU: I/O handling is interleaved with CPU emulation, so I/O requests serialize with guest execution and stall it. • Parallel I/O system (PQEMU): a dedicated I/O thread services requests concurrently with the emulation threads, restoring the overlap found on real hardware (see the sketch below). [Figure: timelines of I/O requests 1-5 under the original QEMU vs the parallel I/O system]
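A hedged sketch of a dedicated I/O thread; handle_pending_io and io_event_fd are hypothetical placeholders, not QEMU functions. Device emulation runs concurrently with the guest-CPU emulation threads instead of being interleaved into a single loop.

```c
#include <pthread.h>
#include <poll.h>

/* Hypothetical device-model entry point, assumed to be provided elsewhere:
 * timers, keystroke receive, screen update, interrupt notification. */
extern void handle_pending_io(void);
extern int  io_event_fd;   /* illustrative: fd signalled when a device needs service */

/* Dedicated I/O thread: device emulation proceeds in parallel with the
 * guest-CPU emulation threads instead of being interleaved into one
 * emulation loop as in the original QEMU design. */
static void *io_thread(void *arg)
{
    (void)arg;
    struct pollfd pfd = { .fd = io_event_fd, .events = POLLIN };
    for (;;) {
        poll(&pfd, 1, 10 /* ms */);   /* wait for a device event or a timeout */
        handle_pending_io();          /* runs concurrently with guest execution */
    }
    return NULL;
}
```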
How to prove the proposal • Run benchmarks on various emulation designs and compare with the baseline QEMU: • P-UCC • P-UCC+IO • P-UCC+IO+FS • P-SCC • P-SCC+IO • Coremu
Experimental Results • One working thread: on average a 5~10% slowdown compared with the baseline QEMU. • Four working threads: P-UCC+IO+FS achieves a 3.72x speedup compared with the baseline QEMU.
Tradeoffs between UCC and SCC design • SCC needs more memory space and translation time, but it eliminates most synchronization. • Invalidate in SCC incurs more overhead, because the update has to be applied to every thread's duplicated copies of the shared components. • The latency of guest interrupts under UCC is slightly worse than under SCC, because of contention for TB chaining and unchaining. • SCC may be too costly in memory overhead when emulating a many-core guest, while UCC works best for parallel applications with massive code sharing.