200 likes | 293 Views
RAMP Infrastructure. Krste Asanovic UC Berkeley RAMP Tutorial, ISCA/FCRC, San Diego June 10, 2007. RAMP: An infrastructure to build simulators using FPGAs. CPU. CPU. CPU. CPU. Target Model. Interconnect Network. DRAM. Host Platform. Run Target Model on Host Platform. Hard Work.
E N D
RAMP Infrastructure Krste Asanovic UC Berkeley RAMP Tutorial, ISCA/FCRC, San Diego June 10, 2007
CPU CPU CPU CPU Target Model Interconnect Network DRAM Host Platform Run Target Model on Host Platform Hard Work
Reduce, Reuse, Recycle • Reduce effort to build target models • Users just build components, infrastructure handles connections (The RDL Compiler) • Reuse components by having good abstractions • Across different target models • Across different host platforms • XUP, Calinx, BEE2, BEE3, also Altera (see Greg) • Recycle existing IP for use as simulation models • Commercial processor RTL is its own model
Unit A Unit B Unit C Pipeline Channel FIFO Channel RAMP Target Models Units • Relatively large chunks of functionality • e.g., processor + L1 cache • User-written in some HDL or software Channels • Point-point, undirectional, two kinds: • FIFO channel: Flow-controlled interface • Pipeline channel: Simple shift register, bits drop off end • Generated by RAMP infrastructure
Target FIFO Channel Parameters • Need buffering of at least (Forward+Reverse) latency to get full bandwidth over link • RAMP infrastructure instantiates channel with desired parameters D D Datawidth RDY ENQ RDY Buffering DEQ Forward Latency Reverse Latency
Target Pipeline Channel Parameters • Only recommended for expert use in target models • (Should use FIFO channels and latency-insensitive protocols in target design) D D Datawidth Forward Latency
Unit A Unit B Unit C RAMP Description Language (RDL) Target: [ Greg Gibeling, UCB ] • User describes target model topology, channel parameters, and (manual) mapping to host platform FPGAs using RDL • RDL Compiler (RDLC) generates configurations Generated links carry channels RDLC Host: Unit B Generated Unit Wrappers Unit A Unit C FPGA2 FPGA1
Virtualized RTL Improves FPGA Resource Usage • RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance • Example 1: Multiported register file • Example, Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage • If RTL mapped directly, requires 48K flip-flops • Slow cycle time, large area • If mapping into block RAMs (one read+one write per cycle), takes 3 host cycles and 3x2KB block RAMs • Faster cycle time (~3X) and far less resources • Example 2: Large L2/L3 caches • Current FPGAs only have ~1MB of on-chip SRAM • Use on-chip SRAM to build cache of active piece of L2/L3 cache, stall target cycle if access misses and fetch data from off-chip DRAM
Start/Done Timing Interface Wrapper • Wrapper generated by RDL asserts “Start” on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle • Unit asserts “Done” when it finishes the target cycle and its outputs are ready • Unit can take variable amount of time • Unvirtualized RTL unit can connect “Done” to “Start” (but must not clock until “Start”) Start In1 Unit Out In2 Done
Pipeline target channel implemented as distributed FIFO with at least L buffers Host: RDYs Start Start RDY Unit A Unit B D D ENQ DEQ Done Done DEQs Distributed Timing Example Unit B Unit A D Target: Latency L
Timing Target FIFO Channel • Can build timed credit-based flow control (CBFC) FIFO inside Target model, using pipeline channels for communicating data forwards and credits backwards • But this puts two CBFCs in series (one in target unit, one hidden in host implementation of pipeline channels) • RDL can generate a unified FIFO that merges both of these behind the FIFO interface Target: Latency L D D D D Credit control RDY RDY ENQ DEQ Credits
Other Automatically Generated Networks • Control network has workstation as master and every unit as slave device • Memory-mapped interface with block transfers • Used for initialization, stats gathering, debugging, and monitoring • Units can connect to DRAM resources outside of timed target channels • Used to support emulation and virtualization state • Units can communicate with each other outside of timed target channels • Support arbitrary communication. E.g., for distributed stats gathering
Simulator Design Choices • Structural Analog versus Highly Virtualized • Functional-only versus Functional+Timing • Timing via (virtual) RTL design versus separate functional and timing models • Hybrid software/hardware simulators We’re trying to build layers of abstractions that are useful to all types of simulator Also, trying to make modules in different styles inter-operate