HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing
Michael Adler, Elliott Fleming, Michael Pellauer, Joel Emer
Simulating Multicores
• Simulating an N-core multicore target is fundamentally N times the work
• Plus the on-chip network
[Diagram: N CPUs connected by an on-chip network]
• Duplicating cores will quickly fill the FPGA
• Multi-FPGA will slow simulation
Trading Time for Space
• Can leverage the separation of model clock and FPGA clock to save space
• Two techniques: serialization and time-multiplexing
• But doesn't this just slow down our simulator?
• The tradeoff is a good idea if we can:
  • Save a lot of space
  • Improve the FPGA critical path
  • Improve utilization
  • Slow down rare events, keep common events fast
• The LI approach enables a wide range of tradeoff options
Example Tradeoff: Multi-Port Register File
• 2 read ports, 2 write ports
• 5-bit index, 32-bit data
• Reads take zero clock cycles
• Virtex-2 Pro FPGA: 9242 slices (>25%), 104 MHz
[Diagram: 2R/2W register file with two read address/value port pairs and two write address/value port pairs]
Trading Time for Space
• Simulate the circuit sequentially using a 1R/1W BlockRAM and an FSM
• 94 slices (<1%), 1 BlockRAM, 224 MHz (2.2x)
• Simulation rate is 224 / 3 = 75 MHz: an FPGA-cycle to Model-cycle Ratio (FMR) of 3
[Diagram: FSM sequencing the four logical register-file ports onto a single 1R/1W BlockRAM]
• Each module may have a different FMR
• A-Ports allow us to connect many such modules together
• Maintain a consistent notion of model time
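The serialization above can be sketched in software. This is a minimal sketch, not HAsim's implementation: the class name and the particular 3-FPGA-cycle schedule (read+write pairs on cycles 1 and 2, FSM bookkeeping on cycle 3) are assumptions for illustration; only the FMR of 3 and the 224 MHz clock come from the slide.

```python
class SerializedRegFile:
    """Sketch: a 2R/2W register file simulated over a 1R/1W memory.

    A 1R/1W BlockRAM services one read and one write per FPGA cycle,
    so one model cycle of the 2R/2W file takes 3 FPGA cycles here
    (assumed schedule), giving FMR = 3.
    """
    FMR = 3  # FPGA-cycle to Model-cycle Ratio

    def __init__(self, size=32):
        self.mem = [0] * size      # 5-bit index -> 32 entries
        self.fpga_cycles = 0

    def model_cycle(self, rd1, rd2, wr1, wr2):
        # FPGA cycle 1: service read port 1 and write port 1
        v1 = self.mem[rd1]
        a, d = wr1
        self.mem[a] = d
        # FPGA cycle 2: service read port 2 and write port 2
        v2 = self.mem[rd2]
        a, d = wr2
        self.mem[a] = d
        # FPGA cycle 3: FSM returns results / advances model time
        self.fpga_cycles += self.FMR
        return v1, v2

rf = SerializedRegFile()
rf.model_cycle(0, 1, (0, 42), (1, 7))        # write r0=42, r1=7
v1, v2 = rf.model_cycle(0, 1, (2, 0), (3, 0))
print(v1, v2)                                 # 42 7
print(round(224 / SerializedRegFile.FMR))     # simulation rate in MHz -> 75
```

The same FMR arithmetic explains the slide's 75 MHz figure: the circuit runs at 224 MHz, but only every third FPGA cycle completes a model cycle.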
Example: In-Order Front End
[Diagram: front-end pipeline — Line Pred, Branch Pred, ITLB, IMEM (I$), PC Resolve, Inst Q — connected by A-Ports of latency 0, 1, or 2 (redirect, training, fault, vaddr/paddr, pred/mispred, enq/deq); legend: ready to simulate? yes/no]
• Modules may simulate at any wall-clock rate
• Corollary: adjacent modules may not be simulating the same model cycle
Simulator "Slip"
• Adjacent modules simulating different cycles!
• In paper: distributed resynchronization scheme
• This can speed up simulation
• Case study: achieved 17% better performance than a centralized controller
• Can get performance = dynamic average
Let's see how...
Traditional Software Simulation
[Animation: modules advance one at a time; legend: = model cycle]
Global Controller "Barrier" Synchronization
[Animation: all modules advance in lockstep under a central controller; legend: = model cycle]
A-Ports Distributed Synchronization
• Run ahead in time until buffering fills
• Long-running operations can overlap, even if on different model cycles
Takeaway: LI makes serialization tradeoffs more appealing
Leveraging Latency-Insensitivity [With Parashar, Adler]
• Modeling large caches: BRAM (KBs, 1 CC), LEAP SRAM (MBs, 10s of CCs), system memory (GBs, 100s of CCs), managed by a LEAP Scratchpad cache controller
• Expensive instructions: LEAP FPU and instruction emulator (M5) on the host CPU, reached from the EXE stage over RRR
[Diagram: 256 KB modeled L2$ backed by the LEAP memory hierarchy; FPGA-side EXE stage invoking the CPU-side emulator]
Time-Multiplexing: A Tradeoff to Scale Multicores
Multicores Revisited
What if we duplicate the cores?
[Diagram: CORE 0, CORE 1, CORE 2, each with its own state]
• Benefits: simple to describe; maximum parallelism
• Drawbacks: probably won't fit; low utilization of functional units
Module Utilization
• A module is unutilized on an FPGA cycle if:
  • Waiting for all input ports to be non-empty, or
  • Waiting for all output ports to be non-full
• Case study: in-order functional units were utilized 13% of FPGA cycles on average
[Diagram: FET and DEC modules connected by latency-1 A-Ports]
Time-Multiplexing: First Approach
• Duplicate state, sequentially share logic
[Diagram: one physical pipeline shared by several virtual instances, each with its own state]
• Benefits: better unit utilization
• Drawbacks: more expensive than duplication(!)
Round-Robin Time-Multiplexing
• Fix the ordering, remove the multiplexors
[Diagram: physical pipeline with per-instance state, serviced round-robin]
• Benefits: much better area; good unit utilization
• Drawbacks: head-of-line blocking may limit performance
• Need to limit the impact of slow events
• Pipeline at a fine granularity
• Need a distributed, controller-free mechanism to coordinate...
Port-Based Time-Multiplexing
• Duplicate local state in each module
• Change the port implementation:
  • Minimum buffering: N * latency + 1
  • Initialize each FIFO with: # of tokens = N * latency
• Result: adjacent modules can simultaneously simulate different virtual instances
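The buffering rule above can be sketched as a FIFO. This is a simplified sketch, not HAsim's port code: the class name and the NoMessage placeholder value are assumptions; the capacity (N * latency + 1) and the initial token count (N * latency) are taken from the slide.

```python
from collections import deque

NO_MESSAGE = None  # placeholder token; an assumed stand-in for NoMessage

class MultiplexedPort:
    """Sketch of a time-multiplexed A-Port for N virtual instances.

    Pre-filling the FIFO with N * latency tokens lets the consumer
    start immediately and lets the producer run ahead by `latency`
    model cycles for every virtual instance.
    """
    def __init__(self, n_instances, latency):
        self.capacity = n_instances * latency + 1
        self.fifo = deque([NO_MESSAGE] * (n_instances * latency))

    def send(self, msg):
        assert len(self.fifo) < self.capacity, "port full"
        self.fifo.append(msg)

    def receive(self):
        assert self.fifo, "port empty"
        return self.fifo.popleft()

# With N=4 instances on a latency-1 port, 4 initial tokens are queued:
p = MultiplexedPort(n_instances=4, latency=1)
print(len(p.fifo))   # 4
p.send("cpu0-msg")   # producer simulates instance 0's next cycle
print(p.receive())   # None (an initial NoMessage token drains first)
```

Because the initial tokens absorb the pipeline skew, the module downstream of this port can be simulating virtual instance 0 while the upstream module is already on instance 3.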
The Front End, Multiplexed
[Diagram: the in-order front end from before, now time-multiplexed across CPU 1 and CPU 2; adjacent modules simulate different virtual instances; legend: ready to simulate? yes/no]
Problem: The On-Chip Network
[Diagram: CPUs 0-2 with L1/L2 caches, each attached to a router via msg and credit ports, plus memory control]
• Problem: routing wires to/from each router
• Similar to the "global controller" scheme
• Also, utilization is low
Multiplexing On-Chip Network Routers
• Simulate the network without a network
• Collapse Routers 0-3 into one multiplexed Router 0..3
• Replace each physical link with a reordering (permutation) port:
  • σ(x) = (x + 1) mod 4
  • σ(x) = (x + 2) mod 4
  • σ(x) = (x + 3) mod 4
[Diagram: reorder blocks permuting the multiplexed streams between Router 0..3's output and input]
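The reordering can be sketched as follows. This is a minimal sketch under one assumption: one "round" is the sequence of outputs produced while simulating router instances 0..N-1 in order, and the permutation places each message at its destination router's slot for the next round.

```python
N = 4

def sigma(x):
    """The slide's sigma(x) = (x + 1) mod 4 -- the 'next router' link."""
    return (x + 1) % N

def permute(outputs):
    """Reorder one round of multiplexed outputs so the message produced
    while simulating router x is consumed while simulating router
    sigma(x) -- no physical network between instances is needed."""
    inputs = [None] * N
    for x, msg in enumerate(outputs):
        inputs[sigma(x)] = msg
    return inputs

# One simulated round: router x emits a message tagged with its id.
round_out = ["from-r0", "from-r1", "from-r2", "from-r3"]
print(permute(round_out))  # ['from-r3', 'from-r0', 'from-r1', 'from-r2']
```

Each of the three σ functions on the slide is just a different rotation amount, one per physical link being replaced.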
Ring/Double-Ring Topology, Multiplexed
[Diagram: Routers 0-3 in a ring ("from prev" / "to next") collapse into Router 0..3 with the "to next" link as σ(x) = (x + 1) mod 4]
• Opposite direction: flip to/from
Implementing Permutations on FPGAs Efficiently
• Side buffer
  • Fits networks like ring/torus (e.g. σ(x) = (x + 1) mod N)
  • Move first to Nth, move Nth to first, move every K to N-K
• Indirection table (permutation table RAM + reorder buffer, driven by an FSM)
  • More general, but more expensive
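The two implementations can be contrasted in a sketch. The function names are assumptions; the point from the slide is that a rotation-style permutation needs only a small side buffer, while the indirection table handles any σ at the cost of a full reorder RAM.

```python
def apply_permutation(stream, sigma):
    """Indirection table: fully general. Buffer the whole round in a
    reorder RAM, then emit item x at position sigma(x)."""
    out = [None] * len(stream)
    for x, item in enumerate(stream):
        out[sigma(x)] = item
    return out

def rotate_side_buffer(stream, k):
    """Side buffer: specialized to sigma(x) = (x + k) mod n. Only k
    items ever need to be held aside, so the storage is O(k), not
    O(n) -- a good fit for ring/torus links."""
    return stream[-k:] + stream[:-k]

# For rotation permutations the two produce identical orderings:
stream = list(range(8))
print(rotate_side_buffer(stream, 1))
print(apply_permutation(stream, lambda x: (x + 1) % 8))
```

Both print `[7, 0, 1, 2, 3, 4, 5, 6]`; the indirection table is only worth its extra BRAM when the topology's permutation is not a simple shift.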
Torus/Mesh Topology, Multiplexed
• Mesh: don't transmit on non-existent links
Dealing with Heterogeneous Networks
• Compose "mux ports" with permutation ports
• In paper: generalize to any topology
Typical HAsim Model Leveraging These Techniques
Target:
• 16-core chip multiprocessor
• 10-stage pipeline (speculative, bypassed): F, BP1, BP2, PCC, IQ, D, X, DM, CQ, C, plus ITLB/I$, DTLB/D$, L/S Q, L2$, Route
• 64-bit Alpha ISA, floating point
• 8 KB lockup-free L1 caches
• 256 KB 4-way set-associative L2 cache
• Network: 2 virtual channels, 4 slots, x-y wormhole routing
Implementation:
• Single detailed pipeline, 16-way time-multiplexed
• 64-bit Alpha functional partition, floating point
• Caches modeled with the multi-level memory hierarchy
• Single router, multiplexed, 4 permutations
Takeaways
• The latency-insensitive approach provides a unified framework for interesting tradeoffs
• Serialization: leverage FPGA-efficient circuits at the cost of FMR
  • A-Port-based synchronization can amortize the cost by giving the dynamic average
  • Especially if long events are rare
• Time-multiplexing: reuse datapaths and duplicate only state
  • The A-Port-based approach means not all modules are fully utilized
  • Increased utilization means that performance degradation is sublinear
  • Time-multiplexing the on-chip network requires permutations
Next Steps
• Here we were able to push one FPGA to its limits
• What if we want to scale farther?
• Next, we'll explore how latency-insensitivity can help us scale to multiple FPGAs with better performance than traditional techniques
• Also, how we can increase designer productivity by abstracting the platform
Resynchronizing Ports
• Modules follow a modified scheme:
  • If any incoming port is heavy, or any outgoing port is light, simulate the next cycle (when ready)
• Result: balanced without centralized coordination
• Argument:
  • The modules farthest ahead in time will never proceed: ports into that set will be light, and ports out of it will be heavy, so those modules may try to proceed but will not be able to
  • There's also a set farthest behind in time, which is always able to proceed
  • Since the graph is connected, simulating only enables more modules, making progress toward quiescence
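The local decision rule can be sketched as a predicate. The function name and the specific heavy/light thresholds are assumptions for illustration; the rule itself (simulate when any input is heavy or any output is light) is from the slide.

```python
def should_simulate(in_ports, out_ports, heavy, light):
    """Distributed resynchronization rule, evaluated locally per module.

    in_ports / out_ports: current buffer occupancies of this module's
    incoming and outgoing A-Ports. No central controller is consulted:
    a backlogged input or a draining output is enough to run.
    """
    return (any(occ >= heavy for occ in in_ports)
            or any(occ <= light for occ in out_ports))

# A module whose input is backlogged should simulate its next cycle:
print(should_simulate([5, 1], [2], heavy=4, light=0))  # True
# A balanced module can yield the FPGA to its neighbors:
print(should_simulate([1, 1], [2], heavy=4, light=0))  # False
```

Because every module evaluates this predicate independently, the modules farthest ahead see light inputs and heavy outputs and stall, while the farthest-behind set always qualifies, which is exactly the balancing argument above.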
Other Topologies • Tree • Butterfly
Generalizing OCN Permutations
• Represent the model as a directed graph G = (M, P)
• Label modules M with a simulation order: 0..(N-1)
• Partition the ports into sets P0..Pm where:
  • No two ports in a set Pm share a source
  • No two ports in a set Pm share a destination
• Transform each Pm into a permutation σm
  • For all {s, d} in Pm, σm(s) = d
  • Holes in the range represent "don't cares"
  • Always send NoMessage on those steps
• Time-multiplex the module as usual
• Associate each σm with a physical port
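The partitioning step can be sketched greedily. This is an illustrative sketch, not the paper's algorithm: the greedy grouping is an assumption (it is not guaranteed to produce the minimum number of sets), and `None` stands in for NoMessage on the don't-care steps.

```python
def partition_ports(ports):
    """Greedily split (src, dst) ports into groups where no two ports
    share a source or a destination -- each group is then a (partial)
    permutation."""
    groups = []
    for s, d in ports:
        for g in groups:
            if all(s != gs and d != gd for gs, gd in g):
                g.append((s, d))
                break
        else:
            groups.append([(s, d)])
    return groups

def to_permutation(group, n):
    """Turn one group into a sigma table over modules 0..n-1.
    Holes (sources with no port in this group) are don't-cares:
    None here marks the steps that carry NoMessage."""
    sigma = {s: d for s, d in group}
    return [sigma.get(x) for x in range(n)]

# A 4-node ring's "next" links form a single full permutation:
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(partition_ports(ring))                         # one group of 4
print(to_permutation(partition_ports(ring)[0], 4))   # [1, 2, 3, 0]
```

Two ports leaving the same source, e.g. `(0, 1)` and `(0, 2)`, land in different groups, so each group still maps to one physical permutation port per the scheme above.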
Results: Multicore Simulation Rate
• Must simulate multiple cores to get the full benefit of time-multiplexed pipelines
• Functional cache pressure is the rate-limiting factor