Closely-Coupled Timing-Directed Partitioning in HAsim
Michael Pellauer† (pellauer@csail.mit.edu), Murali Vijayaraghavan†, Michael Adler‡, Arvind†, Joel Emer†‡
† MIT CS and AI Lab, Computation Structures Group
‡ Intel Corporation, VSSAD Group
To Appear In: ISPASS 2008
Motivation
• We want to simulate target platforms quickly
• We also want to construct simulators quickly
• Partitioned simulators are a known technique from traditional performance models:
  • Timing Partition: micro-architecture, resource contention, dependencies
  • Functional Partition: ISA, off-chip communication
  • The two partitions interact during simulation
• Simplifies the timing model
• Amortizes functional model design effort over many models
• The Functional Partition can be extremely FPGA-optimized
Different Partitioning Schemes
• As categorized by Mauer, Hill and Wood:
  • Source: [MAUER 2002], ACM SIGMETRICS
• We believe that a timing-directed solution will ultimately lead to the best performance
  • Both partitions reside on the FPGA
Functional Partition in Software Asim
• Get Instruction (at a given Address)
• Get Dependencies
• Get Instruction Results
• Read Memory*
• Speculatively Write Memory* (locally visible)
• Commit or Abort instruction
• Write Memory* (globally visible)
* Optional depending on instruction type
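The operation list above can be sketched as a small software functional partition driven by an unpipelined timing model. This is only an illustration under stated assumptions, not the Asim or HAsim implementation: the three-instruction toy ISA, every class and method name, and the one-model-cycle-per-operation cost are hypothetical. Register results are written at commit, so forwarding between in-flight instructions is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Inst:
    op: str        # "addi", "load" or "store" (toy ISA, hypothetical)
    dst: int = 0   # destination register (unused by store)
    src: int = 0   # source register (unused by load)
    imm: int = 0   # immediate for addi, absolute address for load/store


@dataclass
class InFlight:
    inst: Inst
    result: Optional[int] = None  # register value or store data
    addr: Optional[int] = None    # address for load/store


class FunctionalPartition:
    def __init__(self, program, memory=None):
        self.program = program           # address -> Inst
        self.memory = dict(memory or {})  # globally visible memory
        self.store_buffer = {}           # token -> (addr, value), locally visible
        self.regs = [0] * 8
        self.in_flight = {}              # token -> InFlight
        self.next_token = 0

    def get_instruction(self, address):
        token = self.next_token
        self.next_token += 1
        self.in_flight[token] = InFlight(self.program[address])
        return token

    def get_dependencies(self, token):
        inst = self.in_flight[token].inst
        reads = [] if inst.op == "load" else [inst.src]
        writes = [] if inst.op == "store" else [inst.dst]
        return reads, writes

    def get_results(self, token):
        rec = self.in_flight[token]
        inst = rec.inst
        if inst.op == "addi":
            rec.result = self.regs[inst.src] + inst.imm
        elif inst.op == "store":
            rec.addr, rec.result = inst.imm, self.regs[inst.src]
        else:  # load: only the address is known until memory is read
            rec.addr = inst.imm

    def read_memory(self, token):
        rec = self.in_flight[token]
        # Prefer the youngest older speculative store to the same address.
        older = [t for t, (a, _) in self.store_buffer.items()
                 if t < token and a == rec.addr]
        if older:
            rec.result = self.store_buffer[max(older)][1]
        else:
            rec.result = self.memory.get(rec.addr, 0)

    def speculative_write(self, token):
        rec = self.in_flight[token]
        self.store_buffer[token] = (rec.addr, rec.result)

    def commit(self, token):
        rec = self.in_flight.pop(token)
        if rec.inst.op != "store":
            self.regs[rec.inst.dst] = rec.result

    def abort(self, token):
        self.in_flight.pop(token)
        self.store_buffer.pop(token, None)

    def global_write(self, token):
        addr, value = self.store_buffer.pop(token)
        self.memory[addr] = value


def run_unpipelined(fp, addresses):
    """Toy timing model: charges one model cycle per operation (assumed cost)."""
    cycles = 0
    for address in addresses:
        token = fp.get_instruction(address)
        op = fp.in_flight[token].inst.op
        fp.get_dependencies(token)
        fp.get_results(token)
        cycles += 3
        if op == "load":
            fp.read_memory(token)
            cycles += 1
        if op == "store":
            fp.speculative_write(token)
            cycles += 1
        fp.commit(token)
        cycles += 1
        if op == "store":
            fp.global_write(token)
            cycles += 1
    return cycles
```

The timing model decides when each operation is invoked, and the functional partition only answers. A different timing model would reuse the same class with a different call schedule.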
Execution in Phases
[Figure: per-instruction phase sequences: F D X C; F D X R C; F D X W C W; F D X R A; F D X X C W]
• The Emer Assertion: All data dependencies can be represented via these phases
Detailed Example: 3 Different Timing Models
• Executing the same instruction sequence:
Functional Partition in Hardware?
• Requirements
  • Support these operations in hardware
  • Allow for out-of-order execution, speculation, rollback
• Challenges
  • Minimize operation execution times
  • Pipeline wherever possible
  • Tradeoff between BRAM/multiport RAMs
  • Race conditions due to extreme parallelism
Functional Partition As Pipeline
• Conveys concept well, but poor performance
[Figure: the Timing Model drives a Functional Partition drawn as a pipeline (Token Gen, Fet, Dec, Exe, Mem, LCom, GCom) backed by Memory State and Register State (RegFile)]
Implementation: Large Scoreboards in BRAM
• Series of tables in BRAM
  • Store information about each in-flight instruction
• Tables are indexed by “token”
  • Also used by the timing partition to refer to each instruction
• New operation “getToken” to allocate a space in the tables
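The token-indexed tables can be pictured with a short software sketch. This is an assumption-laden illustration, not the HAsim design: the class name, the table fields, the table size, and the in-order circular allocation policy are all hypothetical.

```python
class TokenScoreboard:
    """Parallel fixed-size tables, all indexed by the same token."""

    def __init__(self, num_tokens=8):
        self.num_tokens = num_tokens
        # One list per table, standing in for one BRAM each (hypothetical fields).
        self.busy = [False] * num_tokens
        self.inst_addr = [0] * num_tokens
        self.dest_reg = [None] * num_tokens
        self.next_alloc = 0

    def get_token(self, address):
        """Allocate the next slot in circular order, or None if it is still in use."""
        token = self.next_alloc
        if self.busy[token]:
            return None  # the caller (timing partition) has to retry later
        self.busy[token] = True
        self.inst_addr[token] = address
        self.dest_reg[token] = None
        self.next_alloc = (token + 1) % self.num_tokens
        return token

    def set_dest(self, token, reg):
        """Record per-instruction information filled in by a later operation."""
        self.dest_reg[token] = reg

    def free_token(self, token):
        """Release the slot once the instruction has committed or aborted."""
        self.busy[token] = False
```

Because the timing partition holds only the token, both partitions can refer to the same in-flight instruction without copying its state. Circular allocation keeps token order equal to program order, at the cost of stalling when the oldest slot is still busy.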
Implementing the Operations
• See paper for details (also extra slides)
Assessment: Three Timing Models
• Unpipelined Target
• 5-Stage Pipeline
• MIPS R10K-like out-of-order superscalar
Assessment: Target Performance
• Targets have idealized memory hierarchy
Assessment: Simulator Performance
• Some correspondence between target and functional partition is very helpful
Assessment: Reuse and Physical Stats
• Where is functionality implemented:
• FPGA usage: Virtex IIPro 70 using ISE 8.1i
Future Work: Simulating Multicores
• Scheme 1: Duplicate both partitions
[Figure: Timing Models A through D, each paired with its own Func Reg + Datapath, all sharing one Functional Memory State; annotated "Interaction occurs here"]
• Scheme 2: Cluster Timing Partitions
  • Use a context ID to reference all state lookups
[Figure: Timing Models A through D sharing one Functional Reg State + Datapath and one Functional Memory State; annotated "Interaction still occurs here"]
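One way to read "use a context ID to reference all state lookups" is that a single shared storage array serves every simulated core, with the context ID forming the upper part of the index. The sketch below illustrates that reading only; the class name, the flat layout, and the sizes are assumptions rather than anything stated on the slides.

```python
class SharedRegisterState:
    """One flat storage array shared by all contexts (simulated cores)."""

    def __init__(self, num_contexts, num_regs=32):
        self.num_contexts = num_contexts
        self.num_regs = num_regs
        # A single array stands in for one shared RAM (hypothetical layout).
        self.storage = [0] * (num_contexts * num_regs)

    def _index(self, context_id, reg):
        if not 0 <= context_id < self.num_contexts:
            raise IndexError("unknown context ID")
        if not 0 <= reg < self.num_regs:
            raise IndexError("unknown register")
        return context_id * self.num_regs + reg

    def read(self, context_id, reg):
        return self.storage[self._index(context_id, reg)]

    def write(self, context_id, reg, value):
        self.storage[self._index(context_id, reg)] = value
```

Each timing model passes its own context ID with every request, so one datapath can be time-multiplexed across cores without their register state colliding.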
Future Work: Simulating Multicores
• Scheme 3: Perform multiplexing of timing models themselves
  • Leverage HAsim A-Ports in Timing Model
  • Out of scope of today’s talk
  • Use a context ID to reference all state lookups
[Figure: Timing Models A through D multiplexed onto one Functional Reg State + Datapath and one Functional Memory State; annotated "Interaction still occurs here"]
Future Work: Unifying with the UT-FAST model
• UT-FAST is Functional-First
• This can be unified into Timing-Directed
  • Just do “execute-at-fetch”
[Figure: a functional emulator running in software (Func Partition) sends an execution stream to the Timing Partition on the FPGA, which sends resteers back to the emulator]
Summary
• Described a scheme for closely-coupled timing-directed partitioning
  • Both partitions are suitable for on-FPGA implementation
• Demonstrated such a scheme’s benefits:
  • Very Good Reuse, Very Good Area/Clock Speed
  • Good FPGA-to-Model Cycle Ratio:
    • Caveat: Assuming some correspondence between timing model and functional partitions (recall the unpipelined target)
• We plan to extend this using contexts for hardware multiplexing [Chung 07]
• Future: rare complex operations (such as syscalls) could be done in software using virtual channels
Questions? pellauer@csail.mit.edu
Extra Slides pellauer@csail.mit.edu