Indirect Branching in the Transmeta Efficeon Processor Naveen Kumar and Naveen Neelakantam Intel Corporation
Introduction • Transmeta Efficeon processor • HW/SW co-designed processor marketed in 2003 • Binary translation of x86 to underlying VLIW hardware • Focus on how Efficeon handles indirect branches • Indirect branches are particularly difficult for binary translation • Efficeon provided a number of unique solutions • Many interesting HW/SW solutions to improve efficiency • Our hope is that we can use and build upon these ideas
Disclaimer and Acknowledgement • A review of past work, not original research by authors • Efficeon was implemented by Transmeta, but details rarely published • Acknowledgement and thanks to the original Transmeta team • We continue further advancement of these ideas* * Intel purchased Transmeta IP
Transmeta Efficeon Processor • 6-issue VLIW, in-order, 10-stage pipeline • Provides x86 compatibility • Co-designed with a software system • The Code Morphing Software (CMS)
[Figure: system stack, top to bottom: x86 application and x86 OS; x86 ISA; CMS (dynamic binary translation); RISC ISA; VLIW processor]
Dynamic Binary Translation • Intercept executing app • Interpret and profile • Dynamically compile “hot” code to host ISA • Cache and execute • Compiled code fragments are “chained” together • Difficult to chain across an indirect branch • Branch target unknown until runtime
[Figure labels: x86 Code, Interpret, Translate, Translation Cache, Host Processor]
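The interpret, profile, translate, and cache steps listed on this slide can be sketched as a toy loop. This is only an illustration: the class name, the `HOT_THRESHOLD` value, and the string stand-ins for translated code are assumptions, not details of CMS.

```python
from typing import Dict, List, Tuple

HOT_THRESHOLD = 3  # assumed value; the slides do not give CMS's real threshold


class ToyTranslator:
    """Toy model: interpret cold blocks, translate blocks that become hot."""

    def __init__(self) -> None:
        self.exec_counts: Dict[int, int] = {}        # profile data per x86 address
        self.translation_cache: Dict[int, str] = {}  # x86 address -> "native code"
        self.log: List[Tuple[str, int]] = []         # how each execution was served

    def run_block(self, pc: int) -> None:
        if pc in self.translation_cache:
            # Hot path: a translation exists, so run it from the cache.
            self.log.append(("native", pc))
            return
        # Cold path: interpret and profile.
        self.log.append(("interpret", pc))
        self.exec_counts[pc] = self.exec_counts.get(pc, 0) + 1
        if self.exec_counts[pc] >= HOT_THRESHOLD:
            # The block is now "hot": compile it and cache the result.
            self.translation_cache[pc] = f"translated@{pc:#x}"


t = ToyTranslator()
for _ in range(5):
    t.run_block(0x1000)
modes = [mode for mode, _ in t.log]
```

In this model the block is interpreted until its count reaches the threshold and is served from the translation cache afterwards. Chaining is not modeled.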
Indirect Branch Translation • Several proposals to improve translation efficiency
Indirect Branch Translation • System level translators • Branch target can change by a page-table/segment update • Page permission changes • Page table entry changes (LPN → PPN mappings) • Segment limit and permissions • Sharing translations across processes is possible, but additional checks are needed • Bottom line: Indirect branch translation is expensive in traditional BT systems
Indirect Branch Prediction • Traditional processors often use a BTB • Insufficient in a BT system: the indirect branch is translated to a conditional direct branch • Multiple conditional branches in an indirect branch translation • Data-dependent on indirect branch target • These branches also become difficult to predict in hardware • Bottom line: Indirect branches lead to poor branch prediction in traditional BT systems
Indirect Branching in Efficeon • Efficeon uses HW/SW co-design to address: • Efficient translation of indirect branches • Better branch prediction than in other BT systems • Next, we discuss how Efficeon handles: • x86 return emulation • x86 indirect branch emulation • Native indirect branches and returns
x86 Return Example • Conventional hardware has near-perfect return target prediction • Front-end typically implements a return address stack
  foo:  call bar
        …
        call bar
        …
  bar:  …
        ret
[Figure: Return Address Stack, shown with entries baz, foo+2, foo+8]
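A return address stack of the kind this slide mentions can be modeled in a few lines. This is a generic sketch, not any specific processor's design; the depth of 16 and the drop-oldest overflow policy are assumptions.

```python
class ReturnAddressStack:
    """Generic model of a front-end return address stack (RAS)."""

    def __init__(self, depth: int = 16) -> None:  # depth is an assumed value
        self.depth = depth
        self.entries = []

    def on_call(self, return_addr: str) -> None:
        # A call pushes the address of the instruction after the call.
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # assumed policy: drop the oldest entry
        self.entries.append(return_addr)

    def predict_return(self):
        # A return pops the most recent entry as its predicted target.
        return self.entries.pop() if self.entries else None


ras = ReturnAddressStack()
ras.on_call("foo+2")         # first "call bar" in foo
p1 = ras.predict_return()    # "ret" in bar
ras.on_call("foo+8")         # second "call bar" in foo
p2 = ras.predict_return()    # "ret" in bar again
```

Both returns come from the same `ret` instruction in `bar` but go to different targets. The stack predicts each one correctly because calls and returns nest.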
x86 Return Translation
x86 code:
  foo:  call bar
        …
        call bar
        …
  bar:  …
        ret
Translation:
  foo’:    mov [esp], foo+2
           sub esp, esp, 4
           br bar’
  foo+2’:  …
           mov [esp], foo+8
           sub esp, esp, 4
           br bar’
  foo+8’:  …
  bar’:    …
           add esp, esp, 4
           br lookup_ibtc(esp)
• Return is emulated using an indirect branch, which is difficult to predict • Inlining doesn’t help
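The lookup that ends the translated return can be pictured as a table from x86 addresses to translated addresses, with a slow path on a miss. The sketch below is a guess at the general shape of such a lookup, not Efficeon's implementation; `translation_map`, the dictionary-based table, and the `slow_path_calls` list are all hypothetical.

```python
# Hypothetical stand-in for the translator's full address map.
translation_map = {"foo+2": "foo+2'", "foo+8": "foo+8'"}


def lookup_ibtc(x86_target, ibtc, slow_path_calls):
    """Map an x86 target to its translated address.

    ibtc is a small software cache (a dict here); slow_path_calls records
    each time the expensive fallback has to run.
    """
    hit = ibtc.get(x86_target)
    if hit is not None:
        return hit                       # fast path: target seen before
    slow_path_calls.append(x86_target)   # slow path: ask the translator
    native = translation_map[x86_target]
    ibtc[x86_target] = native            # remember it for next time
    return native


ibtc, slow = {}, []
first = lookup_ibtc("foo+2", ibtc, slow)
second = lookup_ibtc("foo+2", ibtc, slow)
```

Even on the fast path the native branch target depends on loaded data, which is why the slide calls this branch difficult to predict.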
Hardware support: Flook Stack • 16-entry flook stack is explicitly managed by CMS • Intended for emulating call/return in a translation • Flook stack enables RAS-like target prediction • Includes “tag” validation of an entry before consumption
Translation using Flook Stack
  foo’:    mov rtemp, <foo+2>
           mov flook_x86_eip, rtemp
           st rtemp, [esp-4]
           sub esp, esp, 4
           precall <foo+2’>
           br <bar’>
  foo+2’:  …
  bar’:    …
           ld rtemp, [esp]
           mov flook_x86_eip, rtemp
           add esp, esp, 4
           ret
[Figure labels: flook stack entry with x86 EIP foo+2 and target foo+2’]
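One way to read the translation above is that `precall` pushes the translated return target tagged with the current `flook_x86_eip`, and `ret` pops an entry and uses it only if the tag matches the x86 return address just loaded. The toy model below follows that reading. The exact hardware semantics are not given on the slides, so the overflow policy, the miss behavior, and the `slow_lookup` callback are assumptions.

```python
class FlookStack:
    """Toy model of a tagged 16-entry return stack managed by software."""

    DEPTH = 16  # entry count stated on the slide

    def __init__(self) -> None:
        self.entries = []          # (x86 EIP tag, translated return target)
        self.flook_x86_eip = None  # models the register written by "mov"

    def precall(self, native_return: str) -> None:
        if len(self.entries) == self.DEPTH:
            self.entries.pop(0)    # assumed policy: drop the oldest entry
        self.entries.append((self.flook_x86_eip, native_return))

    def ret(self, slow_lookup):
        if self.entries:
            tag, native = self.entries.pop()
            if tag == self.flook_x86_eip:
                return native      # tag validated: use the predicted target
        # Assumed fallback: resolve the x86 address through a slower lookup.
        return slow_lookup(self.flook_x86_eip)


fs = FlookStack()

# Normal call and return: the tag matches the loaded return address.
fs.flook_x86_eip = "foo+2"
fs.precall("foo+2'")
fs.flook_x86_eip = "foo+2"         # value loaded from [esp] at the return
normal = fs.ret(lambda eip: eip + "' (slow)")

# The callee overwrote its return address: the tag no longer matches.
fs.flook_x86_eip = "foo+8"
fs.precall("foo+8'")
fs.flook_x86_eip = "baz"
mismatch = fs.ret(lambda eip: eip + "' (slow)")
```

The tag check is what keeps the prediction safe when x86 code does not return to its call site.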
x86 Indirect Branch Emulation • Translation similar to the one shown before • Additional architectural registers significantly reduce translation size • Multiple “inlined” comparisons with known targets • Monitoring and update of predicted targets in SW • Compare translation “context” with runtime “context” • Enhance branch prediction by co-design • Software inserts target address in a “link” register • Perform “other” computation • Pipeline front-end fetches instructions at predicted target • Actual branching happens later via a “brl” instruction
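The "inlined comparisons" and "monitoring and update of predicted targets" bullets can be sketched as a branch site with a few compare slots that software refills from profile data. The class name, slot count, promotion threshold, and eviction rule below are hypothetical; the slides do not describe the actual CMS policy.

```python
from collections import Counter


class InlinedTargetSite:
    """Toy model of one translated indirect branch with inlined compares."""

    def __init__(self, slots: int = 2, threshold: int = 2) -> None:
        self.slots = slots              # assumed number of inlined compares
        self.threshold = threshold      # assumed misses before promotion
        self.inlined = []               # (x86 target, native target) pairs
        self.miss_counts = Counter()    # software profile of missed targets

    def dispatch(self, x86_target, lookup):
        # Inlined compare chain: each test is a conditional direct branch.
        for known, native in self.inlined:
            if x86_target == known:
                return native, "inline-hit"
        # No inlined target matched: fall back to the general lookup.
        native = lookup(x86_target)
        self.miss_counts[x86_target] += 1
        if self.miss_counts[x86_target] >= self.threshold:
            if len(self.inlined) == self.slots:
                self.inlined.pop()      # evict the last compare slot
            self.inlined.insert(0, (x86_target, native))
            del self.miss_counts[x86_target]
        return native, "lookup"


site = InlinedTargetSite()
outcomes = [site.dispatch("A", lambda t: t + "'")[1] for _ in range(3)]
```

After two misses on the same target the site promotes it, so the third dispatch is served by an inlined compare.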
Native Indirect Branches • Translation dispatch and interpreter • Both are frequent users of indirect branches • Lousy branch prediction • Software can aid in branch prediction • Link pipe • Push target addresses onto a hardware structure • Do “other” computation • Frontend can fetch the branch target in the meantime • Branch to the top of link pipe using “brlp” • Native subroutines • Link stack • Analogous to a traditional call stack
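The benefit of pushing a target early and branching later can be shown with a simple timing model. This is a sketch under stated assumptions: the fetch latency of 4 cycles is invented, and the first-in, first-out order is a guess, since the slides do not specify how the link pipe orders its entries.

```python
from collections import deque

FETCH_LATENCY = 4  # assumed cycles for the front end to fetch a target


class LinkPipe:
    """Toy timing model: targets are announced before the branch executes."""

    def __init__(self) -> None:
        self._entries = deque()  # (target, cycle at which it was pushed)

    def push(self, target: str, cycle: int) -> None:
        # Software announces a future branch target; fetching can start now.
        self._entries.append((target, cycle))

    def brlp(self, cycle: int):
        # Branch to the oldest announced target (FIFO order is an assumption).
        target, pushed = self._entries.popleft()
        # Stall only for the part of the fetch that has not finished yet.
        stall = max(0, FETCH_LATENCY - (cycle - pushed))
        return target, stall


pipe = LinkPipe()
pipe.push("handler_a", cycle=0)
pipe.push("handler_b", cycle=1)
early = pipe.brlp(cycle=2)  # little "other" work done: partial stall
late = pipe.brlp(cycle=6)   # enough "other" work done: fetch fully hidden
```

The more computation software schedules between the push and the `brlp`, the more of the fetch latency is hidden.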
Summary and Future Work • Indirect branches are particularly expensive • Several techniques to speed up indirect branches • Flook stack • Link register and brl • Link pipe and brlp • Link stack • Future Work: Since Efficeon, other proposals to enhance indirect branch handling in BT systems • Hiser et al, Kim et al • Would be interesting to combine some of these ideas
References • Bala et al, “Transparent Dynamic Optimization: The Design and Implementation of Dynamo”, 1999. • Banning et al, “Link pipe system for storage and retrieval of sequences of branch addresses”, 2003. • Banning et al, “Fast look-up of indirect branch destination in a dynamic translation system”, 2006. • Hiser et al, “Evaluating indirect branch handling mechanisms in software dynamic translation systems”, 2007. • Kim et al, “Hardware Support for Control Transfers in Code Caches”, 2003. • Kevin Krewell, “Transmeta gets more Efficeon”, Microprocessor Report, 2003.