480 likes | 627 Views
Codesigned Virtual Machines. 김 덕 환 Kim, Dukhwan (kyom@SSL.snu.ac.kr) 2005.10.31. Contents. What is the Codesigned VM? Memory & Register State Mapping Self-Modifying Code Support for Code Caching Implementing Precise Traps Input/Output Applying Codesigned VM Case Study : next class.
E N D
Codesigned Virtual Machines 김 덕 환 Kim, Dukhwan (kyom@SSL.snu.ac.kr) 2005.10.31
Contents • What is the Codesigned VM? • Memory & Register State Mapping • Self-Modifying Code • Support for Code Caching • Implementing Precise Traps • Input/Output • Applying Codesigned VM • Case Study : next class
What is the Codesigned VM? (1) Application Binary Application Binary Application Binary Native ISA Virtual ISA Virtual ISA Hardware Virtual Machine VM Software Native ISA Implementation ISA VM Hardware Hardware Conventional HW/SW interface Conventional Virtual Machine interface HW/SW Codesigned Virtual Machine
What is the Codesigned VM? (2) • Host architecture (target ISA) is designed concurrently with the VM SW. • SW becomes part of the HW platform We can divide the implementation of HW & SW optimally.
What is the Codesigned VM? (3) • Codesigned VM vs. System VM • Support an entire system (OS + App) Codesigned VM is a form of system VM. • But in codesigned VM, • Not intended to virtualize HW resources • Not intended to support multiple VM environment. • Goals include performance, power efficiency, design simplicity.
What is the Codesigned VM? (4) • Codesigned VM as a System VM • We refer to the VM SW as a VM Monitor (VMM) Application OS Source ISA (IA32) VMM Code Cache Translator Target ISA (Crusoe) Hardware
What is the Codesigned VM? (5) • Codesigned VM vs. Process VM • Similarity : emulate the source ISA, dynamic translation, code cache • But in codesigned VM, • Intrinsic compatibility at the ISA level (not ABI level) Both user-level & system-level instructions must be emulated. • Improved performance, power efficiency, design simplicity. Compatibility is just a requirement, not a motivation.
What is the Codesigned VM? (6) • Codesigned VM vs. Superscalar processor • Similarity : perform translation • source ISA (IA32) target ISA (P4, Crusoe) • But in codesigned VM, • The translation is done in SW. less cost, small size, design simplicity, much more optimization opportunities • Inter-instruction optimization is possible.
Code Translation Methods source target source target micro-op a micro-op b micro-op c micro-op d micro-op e . . . micro-op pmicro-op qmicro-op r instr. A instr. B instr. C instr. D . . . instr. M instr. 1 instr. 2 instr. 3 . . . instr. n instr. 1 instr. 2 instr. 3 . . . instr. n Code translation by SW Code translation by HW
Register State Mapping • Register state mapping is easy. • Host register files can be made larger than the guest’s. PowerPC Daisy host ScratchSpeculative ResultsConstantsPointers
Memory State Mapping • Concealed Memory • A reserved region for VMM, code cache, other data used by VMM. • Never visible to guest SW. • This is possible because VMM takes control from the boot process. • Fixed size, normally diskless (to simplify the system design) • VMM may be stored in ROM.
Concealed Memory (1) • Memory system in Codesigned VM • I-cache only holds target ISA instructions Code Cache I-Cache Processor Core Concealed Memory VMM Code VMM Data D-Cache Source ISA Code Conventional Memory Source ISA Data
concealed memory mapping guest memory mapping Concealed Memory (2) • Memory mapping for Concealed memory 1. Concealed logical memory shares address space with the guest • Host address space must be enlarged Concealed real memory Concealed Logical Address Conventional real memory Conventional Logical Address
Concealed Memory (3) • Memory mapping for Concealed memory 2. Two separate logical address spaces. • Load/Store must select the mapping. • This can be controlled by the VMM. Concealed Logical Address Concealed real memory concealed memory mapping Conventional real memory Conventional Logical Address guest memory mapping
Concealed Memory (4) • Memory mapping for Concealed memory 3. Use real addressing for concealed memory. • Special case of option 2. • Separate set of Load/Store, or a mode bit. Concealed Real Address space Concealed real memory Conventional real memory Conventional Logical Address guest memory mapping
Self-Modifying Code (1) • Basically use same technique as in Ch 3. • Keep guest OS’s virtual-to-real page mapping • All load/store addresses are mapped to source memory region. • Write-protect guest code region Any attempt to write into that region will cause a trap. Then VM can handle this. • But, • We cannot use a system call to write-protect, because we should not modify guest’s page state.
Self-Modifying Code (2) • Solution : Use TLB for write-protect • TLB is managed by VMM, has an additional bit indicating “write-protect”. • The VMM sets write-protect bit whenever an entry for a code page is loaded into TLB. • VMM should maintain a table of all the guest virtual pages for translated code. • Moreover, because we are discussing about codesigned VM, we can get …
Self-Modifying Code (3) • Special hardware support. • In the Transmeta Crusoe, a special hardware structure is added to speedup fine-grained write-protection checking. • Goal : Find out whether this is really write to translated code region • Virtual address (TLB) Real address (Filtered by write-protect table) write fault or not
Self-Modifying Code (4) source addr WP. bits TLB Page level write-protect fault Write-Protect Table virt. page No. hit/miss source code write fault phys. page No. Comparison Logic wp bit mask Page Offset Bits
Self-Modifying Code (5) • I/O writes to guest code memory must be caught. • For translated code in the code cache, keep track of all the real guest pages. • Maintain a hardware table for I/O writes – entries for all the real pages that hold guest code page. • A store to any of these pages cause an interrupt to the VMM. Then VMM flushes the translated code.
Support for Code Caching (1) • Code cache performance is the most important. • SPC (Hash) TPC (if hit) access code cache • Involves multiple mem access + indirect jump • Superblock chaining may help – eliminate table lookup, direct jumps and branches • But how about indirect jumps?
Support for Code Caching (2) • To reduce table lookup overhead – use SW-based jump target prediction • But, • If SW prediction is incorrect time is wasted • Many indirect jumps are difficult to predict (ex. returns) • Again, for codesigned VM, we can get… if (Rx == #addr_1) goto #target_1else if (Rx == #addr_2) goto #target_2else map_lookup(Rx)
Support for Code Caching (3) • Hardware support for code caching. • JTLB (Jump Translation Lookaside Buffers) • D-RAS (Dual-address Return Address Stack)
JTLB (1) • “a specially designed HW cache of map table entries” TPC MUX Hash SPC Tag Compare tag select tag hit or miss JTLB
JTLB (2) • JTLB_Lookup instruction • Lookup_Jump instruction and prediction • Predict using BTB and fetch • JTLB hit and prediction correct Happy • JTLB hit but misprediction Redirect fetch to jump target TPC from JTLB • JTLB miss Redirect fetch to fall-through addr. TPC hit/miss SPC JTLB_Lookup Ri, Rj, RkJump Ri, Rj == 0Jump map_lookup
D-RAS (1) • The RAS (Return address stack) helps solving return-jump problem. • In codesigned VM, • We need TPC (not SPC) • If the procedure call is at the end of a translated superblock, the return address may not be correct. Translation Block A Call ??? Translation Block X Return
D-RAS (2) • A specialized dual-address RAS is used. Push_DRASinstruction Predicted SPC Opcode SPC TPC Predicted TPC Returninstruction Opcode SPC push pop Dual-Address Return Address Stack
Implementing Precise Traps • Similar techniques in Chapter 3, 4 • Maintain SW checkpoints • Code motion with extending register live range • Trap occurs Interpretation beginning at the checkpoint to establish correct state • In codesigned VM, • Enough registers live ranges can be extended with less register pressure • Restriction of code motion is relaxed. Why?
HW Support for Checkpoints (1) • Use HW to set a checkpoint when each translation block is entered. set checkpoint Translation Block A set checkpoint Translation Block B set checkpoint Translation Block C set checkpoint Translation Block N
HW Support for Checkpoints (2) • If a trap occurs, • HW restores the state at the beginning of the block. • Then interpretation is used to provide the precise exception state. Source code interpret Translation Block A restore checkpoint Translation Block B trap !
HW Support for Checkpoints (3) • When a new translation block is entered, • The state from the previous block is “committed” • And a new checkpoint is set. • Setting register checkpoint – Using shadow copy of registers • When checkpoint is set – registers are copied to shadow registers. • When a trap occurs – copy back from shadow registers to working registers • These copying are done very fast.
HW Support for Checkpoints (4) • Checkpointing memory – store operations are buffered • Until the current translation block is exited (committed) • If an exception occurs, the buffered stores are flushed • This is done by HW • Restrictions on code motion are relaxed. • Fixed size of store buffer constrain the translation block size.
Page Fault Compatibility (1) • Guest OS must observe exactly the same page fault as on a native platform. • If guest OS manages conventional memory, (still, mem mapping by host) • Page fault for data region will be detected naturally • During interpretation, page fault for code region will also be detected. • But executing translation code does not fetch any code from the guest memory
Page Fault Compatibility (2) • When a translated instruction is fetched from the code cache, we trigger a page fault, if • the corresponding guest instruction would have caused a page fault on a native platform. • How to do this? • Active approach • Lazy approach
Active Page Fault Detection (1) • Monitor potential page replacement by the guest OS. • Assuming architected page table, VMM can identify the mem region of page table. • VMM monitors the guest OS’s modification to the architected page table. • By write-protecting the page table, VMM can monitor any change of a virtual page mapping. • VMM keeps a table for : in which virtual pages each source instructions is contained
Active Page Fault Detection (2) • If the page table is modified, • VMM flushes all the translations in the code cache derived from that (modified) page. • Table 1 - Each source page : all the translation block (must-be-flushed blocks) • Table 2 – keep track of any link backpointers • links (for removed pages) are changed to point VMM emulation manager. • emulation process will detect the instruction page fault.
Lazy Page Fault Detection (1) • Code cache flushing is postponed until actual use of the replaced code. • Every time the translated code crosses a source page boundary, check the page table. • At the time crossing the boundary, Verify_Translation instruction is inserted. • It checks the page mapping • page mapped correctly proceed • page not mapped page fault
A B C A A B B C C D D D E F G E E F F G G H I J H H I I J J K L K K L L Lazy Page Fault Detection (2) Code Cache Guest Pages Page correctly mapped? No Jump to VMM Probe page table continue execution Yes Verify_Translation instruction
Input/Output (1) • If the VMM does not use any I/O, • All the guest device drivers can be run as is. • Any I/O instructions or memory mapped I/O is simply passed through. • Volatile memory inhibit optimization. So we need to identify access to the volatile memory. • Use access-protect bit : load/store to that page trap deoptimize for correct sequence. • Special volatile version of load/store
Input/Output (2) • Using disk in VMM • for disk-based code cache approach – large, persistent code cache • requires relaxed transparency • “concealed secondary storage” • VMM-aware special disk driver Guest OS Special Disk Driver VMM Concealed Disk region
Applying Codesigned VM (1) • Advantage at the macro level : • New ISAs can be implemented • Efficiency in instruction-level parallelism (Crusoe) • Simplify instruction issue logic • High-level object-oriented source ISA (IBM AS/400)
Applying Codesigned VM (2) • Advantage at the micro level : • Codesigned VM permit the implementation of specific performance enhancement. • Implementation-dependent profiling HW can be built in for use by dynamic translating/optimizing SW.
References • http://www.research.ibm.com/vliw/pubs.html • Smith, J.E. et al., “Achieving High Performance via Co-Designed Virtual Machine”, IWIA 1999 • Dehnert, J.C. et al, “The Transmeta Code Morphing Software : Using Speculation, Recovery, and Adaptive Retranslation to Address Real-Life Challenges”, CGO 2003