TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS
Aaron Severance, University of British Columbia
Advised by Guy Lemieux
Our Problem
• We use overlays for data processing
  • Partially/fully fixed processing elements
  • Virtual CGRAs, soft vector processors
• Memory:
  • Large register files/scratchpad in the overlay: low-latency, local data
  • Trivial case (large DMA): burst to/from DDR
  • Non-trivial access patterns?
Scatter/Gather
• Data-dependent store/load
  • vscatter adr_ptr, idx_vect, data_vect
  • for i in 1..N: adr_ptr[idx_vect[i]] <= data_vect[i]
• Random narrow (32-bit) accesses
  • Waste bandwidth on DDR interfaces
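For reference, a minimal C sketch of what the scatter primitive above does in software; the function name and types are illustrative, not the soft vector processor's API.

    #include <stddef.h>
    #include <stdint.h>

    /* Software-equivalent sketch of a vector scatter: each element of
     * data_vect is written to adr_ptr at the word index given by
     * idx_vect.  The access pattern is data-dependent, so successive
     * writes touch unrelated 32-bit words rather than forming a burst,
     * which wastes bandwidth on a wide DDR interface. */
    static void scatter32(uint32_t *adr_ptr,
                          const uint32_t *idx_vect,
                          const uint32_t *data_vect,
                          size_t n)
    {
        for (size_t i = 0; i < n; i++)
            adr_ptr[idx_vect[i]] = data_vect[i];
    }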
If Data Fits on the FPGA…
• BRAMs with an interconnect network
  • General network: not customized per application
  • Shared: all masters <-> all slaves
• Memory-mapped BRAM
  • Double-pump (2x clk) if possible
  • Banking/LVT/etc. for further ports
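As an illustration of the banking option, a hedged sketch of word-interleaved bank selection; the bank count and address split are assumptions, not taken from the slides.

    #include <stdint.h>

    #define NUM_BANKS 4u   /* assumed power of two */

    /* Word-interleaved banking for a memory-mapped BRAM: low-order
     * word-address bits select the bank, so consecutive 32-bit words
     * land in different banks and can be served in parallel. */
    static inline unsigned bram_bank(uint32_t byte_addr)
    {
        return (byte_addr >> 2) & (NUM_BANKS - 1u);   /* bank index */
    }

    static inline uint32_t bram_bank_word(uint32_t byte_addr)
    {
        return (byte_addr >> 2) / NUM_BANKS;          /* word offset within the bank */
    }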
So Let's Use a Cache
• But a throughput-focused cache
  • Low-latency data held in local memories
  • Amortize latency over multiple accesses
  • Focus on bandwidth
Replace On-Chip Memory or Augment the Memory Controller?
• Data fits on-chip:
  • Want BRAM-like speed and bandwidth
  • Low overhead compared to shared BRAM
• Data doesn't fit on-chip:
  • Use 'leftover' BRAMs for performance
TputCache Design Goals
• Fmax near BRAM Fmax
• Fully pipelined
• Support multiple outstanding misses
• Write coalescing
• Associativity
TputCache Architecture
• Replay-based architecture
  • Reinsert misses back into the pipeline
  • Separate line fill/evict logic runs in the background
  • Token FIFO completes requests in order
• No MSHRs for tracking misses
  • Fewer muxes (only a single replay request mux)
  • 6-stage pipeline -> 6 outstanding misses
• Good performance with a high hit rate
  • Common case fast
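To make the replay idea concrete, a toy software model, not the RTL: the geometry and helper names are illustrative, and only the random replacement policy (mentioned on the FAQ slide) is carried over from the design.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* A request that misses is not tracked in an MSHR: the line fill is
     * started in the background and the request is simply re-tried
     * (replayed) until it hits.  Responses retire in order through a
     * token FIFO, which is not modeled here. */
    #define NUM_SETS   64u
    #define NUM_WAYS    4u
    #define LINE_BYTES 32u

    static uint32_t tag_ram[NUM_SETS][NUM_WAYS];
    static bool     valid[NUM_SETS][NUM_WAYS];

    static bool lookup(uint32_t addr)
    {
        unsigned set = (addr / LINE_BYTES) % NUM_SETS;
        uint32_t tag = addr / (LINE_BYTES * NUM_SETS);
        for (unsigned w = 0; w < NUM_WAYS; w++)
            if (valid[set][w] && tag_ram[set][w] == tag)
                return true;
        return false;
    }

    static void start_fill(uint32_t addr)               /* background evict + fill */
    {
        unsigned set = (addr / LINE_BYTES) % NUM_SETS;
        unsigned victim = (unsigned)rand() % NUM_WAYS;  /* random replacement */
        tag_ram[set][victim] = addr / (LINE_BYTES * NUM_SETS);
        valid[set][victim]   = true;
    }

    /* Returns true on a hit; false means "miss: reinsert the request
     * into the pipeline and try again later". */
    bool try_request(uint32_t addr)
    {
        if (lookup(addr))
            return true;
        start_fill(addr);
        return false;
    }

Because a miss simply circulates until its line arrives, throughput stays high only when the hit rate is high, which is exactly the common case the design targets.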
Area & Fmax Results
• Reaches 253 MHz vs. 270 MHz BRAM Fmax on Cyclone IV
• 423 MHz vs. 490 MHz BRAM Fmax on Stratix IV
• Minor degradation with increasing size and associativity
• 13% to 35% extra BRAM usage for tags and queues
Benchmark Setup
• TputCache: 128 kB, 4-way, 32-byte lines
• MXP soft vector processor: 16 lanes, 128 kB scratchpad memory
• Scatter/Gather memory unit: indexed loads/stores per lane
• Double-pumping port adapters: TputCache runs at 2x the MXP frequency
Histogram
• Instantiate a number of Virtual Processors (VPs) mapped across lanes
• Each VP histograms part of the image
• Final pass sums the VP partial histograms
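A scalar C sketch of the structure described above; the VP count and bin count are illustrative, and the real benchmark runs the per-VP loops as vector operations across lanes.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define NUM_VPS 64u    /* number of virtual processors (illustrative) */
    #define BINS   256u    /* 8-bit pixels */

    /* Each virtual processor (VP) builds a private partial histogram
     * over its slice of the image, then a final pass sums the partials.
     * Private partials avoid write conflicts between VPs on the same bin. */
    void histogram(const uint8_t *image, size_t num_pixels, uint32_t hist[BINS])
    {
        static uint32_t partial[NUM_VPS][BINS];
        memset(partial, 0, sizeof(partial));

        for (unsigned vp = 0; vp < NUM_VPS; vp++) {
            size_t begin = (size_t)vp * num_pixels / NUM_VPS;
            size_t end   = (size_t)(vp + 1) * num_pixels / NUM_VPS;
            for (size_t i = begin; i < end; i++)
                partial[vp][image[i]]++;        /* indexed (scatter/gather) update */
        }

        memset(hist, 0, BINS * sizeof(uint32_t));
        for (unsigned b = 0; b < BINS; b++)     /* final pass: sum VP partials */
            for (unsigned vp = 0; vp < NUM_VPS; vp++)
                hist[b] += partial[vp][b];
    }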
Hough Transform
• Convert an image to 2D Hough space (angle, radius)
• Each vector element calculates the radius for a given angle
• Adds the pixel value to the corresponding counter
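A scalar C sketch of that kernel; the angle resolution, radius bound, and accumulator layout are assumptions rather than the benchmark's actual parameters.

    #include <math.h>
    #include <stdint.h>

    #define NUM_ANGLES 180       /* angle resolution (illustrative) */
    #define MAX_R      512       /* assumed bound on |radius| */

    /* For each pixel, every angle computes a radius
     * r = x*cos(theta) + y*sin(theta) and the pixel value is added to
     * the (angle, radius) accumulator.  In the vectorized version one
     * vector element handles one angle, so the accumulator update is an
     * indexed (scatter) access. */
    void hough(const uint8_t *img, int width, int height,
               uint32_t accum[NUM_ANGLES][2 * MAX_R])
    {
        const double pi = 3.14159265358979323846;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                uint8_t p = img[y * width + x];
                if (p == 0)
                    continue;                           /* zero adds nothing */
                for (int a = 0; a < NUM_ANGLES; a++) {
                    double theta = a * pi / NUM_ANGLES;
                    int r = (int)lround(x * cos(theta) + y * sin(theta));
                    accum[a][r + MAX_R] += p;           /* add pixel value to counter */
                }
            }
        }
    }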
Motion Compensation
• Load a block from the reference image and interpolate
• The block is offset by a small amount from its location in the current image
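A scalar C sketch of one way such a kernel can look; the block size, half-pel averaging scheme, and rounding are assumptions, not details taken from the benchmark.

    #include <stdint.h>

    /* An 8x8 block in the current frame is predicted from the reference
     * frame at a small (x, y) offset, with half-pel positions formed by
     * averaging the four neighbouring reference pixels. */
    void motion_comp_halfpel(const uint8_t *ref, int stride,
                             int x, int y,        /* block origin in the current frame */
                             int mvx2, int mvy2,  /* motion vector in half-pel units */
                             uint8_t block[8][8])
    {
        int rx = x + (mvx2 >> 1), ry = y + (mvy2 >> 1);
        int fx = mvx2 & 1,        fy = mvy2 & 1;   /* half-pel fraction flags */

        for (int j = 0; j < 8; j++) {
            for (int i = 0; i < 8; i++) {
                const uint8_t *p = ref + (ry + j) * stride + (rx + i);
                int a = p[0];                      /* integer-pel sample */
                int b = p[fx];                     /* right neighbour if fx == 1 */
                int c = p[fy * stride];            /* lower neighbour if fy == 1 */
                int d = p[fy * stride + fx];       /* diagonal neighbour */
                block[j][i] = (uint8_t)((a + b + c + d + 2) >> 2);  /* rounded average */
            }
        }
    }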
Future Work
• More ports needed for scalability
  • Share the evict/fill BRAM port with a 2nd request
  • Banking (sharing the same evict/fill logic)
  • Multiported BRAM designs
• Write cache
  • Currently allocate-on-write
  • Track dirty state of bytes in the BRAMs' 9th bits
• Non-blocking behavior
  • Multiple token FIFOs (one per requestor)?
FAQ
• Coherency?
  • Envisioned as the only/last-level cache; coherency is future work
• Replay loops/problems?
  • Mitigated by random replacement + associativity
• Power expected to be not great…
Conclusions
• TputCache: an alternative to shared BRAM
  • Low overhead (13%-35% extra BRAM)
  • Nearly as high Fmax (253 MHz vs. 270 MHz)
• More flexible than shared BRAM
  • Performance degrades gradually
  • Cache behavior instead of manual filling
Questions? • Thank you