
TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS


Presentation Transcript


  1. TPUTCACHE: HIGH-FREQUENCY, MULTI-WAY CACHE FOR HIGH-THROUGHPUT FPGA APPLICATIONS • Aaron Severance, University of British Columbia • Advised by Guy Lemieux

  2. Our Problem • We use overlays for data processing • Partially/fully fixed processing elements • Virtual CGRAs, soft vector processors • Memory: • Large register files/scratchpad in overlay • Low latency, local data • Trivial (large DMA): burst to/from DDR • Non-trivial?

  3. Scatter/Gather • Data-dependent store/load • vscatter adr_ptr, idx_vect, data_vect • for i in 1..N • adr_ptr[idx_vect[i]] <= data_vect[i] • Random narrow (32-bit) accesses • Waste bandwidth on DDR interfaces
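
For reference, a minimal scalar C sketch of what the vscatter pseudocode above does (the names adr_ptr, idx_vect, and data_vect come from the slide; the element count n is an assumption):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar equivalent of the slide's vscatter pseudocode: each element of
 * data_vect is stored to adr_ptr at the word offset given by the matching
 * element of idx_vect. Every iteration is an independent, data-dependent
 * 32-bit access. */
static void scatter(uint32_t *adr_ptr,
                    const uint32_t *idx_vect,
                    const uint32_t *data_vect,
                    size_t n)
{
    for (size_t i = 0; i < n; i++)
        adr_ptr[idx_vect[i]] = data_vect[i];
}
```

Because each store lands at an unpredictable 32-bit word, a DDR interface built for wide sequential bursts spends most of its bandwidth on data it never uses.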

  4. If Data Fits on the FPGA… • BRAMs with interconnect network • General network… • Not customized per application • Shared: all masters <-> all slaves • Memory mapped BRAM • Double-pump (2x clk) if possible • Banking/LVT/etc. for further ports
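
As a rough illustration of the banking idea on this slide (a software model, not the hardware from the talk; bank count and depth are assumptions), low address bits can steer each access to a different BRAM bank so that independent masters hit different banks in parallel:

```c
#include <stdint.h>

#define NUM_BANKS  4      /* assumed; power of two */
#define BANK_WORDS 1024   /* assumed bank depth in 32-bit words */

static uint32_t bank_mem[NUM_BANKS][BANK_WORDS];

/* Low address bits select the bank, the remaining bits select the word
 * inside the bank, so consecutive words land in different banks. */
static uint32_t banked_read(uint32_t word_addr)
{
    uint32_t bank   = word_addr & (NUM_BANKS - 1);
    uint32_t offset = (word_addr >> 2) & (BANK_WORDS - 1);  /* >> log2(NUM_BANKS) */
    return bank_mem[bank][offset];
}
```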

  5. Example BRAM system

  6. But if data doesn’t fit… (oversimplified)

  7. So Let’s Use a Cache • But a throughput-focused cache • Low-latency data held in local memories • Amortize latency over multiple accesses • Focus on bandwidth

  8. Replace on-chip memory or augment memory controller? • Data fits on-chip • Want BRAM-like speed, bandwidth • Low overhead compared to shared BRAM • Data doesn’t fit on-chip • Use ‘leftover’ BRAMs for performance

  9. TputCache Design Goals • Fmax near BRAM Fmax • Fully pipelined • Support multiple outstanding misses • Write coalescing • Associativity

  10. TputCache Architecture • Replay-based architecture • Reinsert misses back into the pipeline • Separate line fill/evict logic in background • Token FIFO for completing requests in order • No MSHRs for tracking misses • Fewer muxes (only a single replay request mux) • 6-stage pipeline -> 6 outstanding misses • Good performance with high hit rate • Common case fast
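
A highly simplified software model of the replay idea described above (the function names are hypothetical stubs standing in for hardware blocks, shown only to make the control flow concrete):

```c
#include <stdbool.h>
#include <stdint.h>

/* One in-flight request; the token records its place in the token FIFO so
 * completions come back in the order the requests were accepted. */
struct request {
    uint32_t addr;
    uint32_t token;
};

/* Hypothetical model stubs: the real design does all of this in hardware. */
bool cache_lookup(uint32_t addr);              /* tag check: hit or miss      */
void start_line_fill(uint32_t addr);           /* background evict/fill logic */
void reinsert_into_pipeline(struct request r); /* replay path                 */
void retire_in_order(struct request r);        /* pop token FIFO, return data */

/* Last pipeline stage: a hit retires through the token FIFO; a miss starts
 * the background evict/fill and is simply sent around the pipeline again,
 * so no MSHRs are needed to track it. */
void pipeline_tail(struct request r)
{
    if (cache_lookup(r.addr)) {
        retire_in_order(r);
    } else {
        start_line_fill(r.addr);
        reinsert_into_pipeline(r);
    }
}
```

Because a miss simply travels around the 6-stage pipeline again, up to six misses can be outstanding with no miss-tracking structures, at the cost of wasted replay slots when the hit rate is low.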

  11. TputCache Architecture

  12. Cache Hit

  13. Cache Miss

  14. Evict/Fill Logic

  15. Area & Fmax Results • Reaches 253MHz compared to 270MHz BRAM Fmax on Cyclone IV • 423MHz compared to 490MHz BRAM Fmax on Stratix IV • Minor degradation with increasing size, associativity • 13% to 35% extra BRAM usage for tags, queues

  16. Benchmark Setup • TputCache • 128kB, 4-way, 32-byte lines • MXP soft vector processor • 16 lanes, 128kB scratchpad memory • Scatter/Gather memory unit • Indexed loads/stores per lane • Double-pumping port adapters • TputCache runs at 2x the frequency of MXP

  17. MXP Soft Vector Processor

  18. Histogram • Instantiate a number of Virtual Processors (VPs) mapped across lanes • Each VP histograms part of the image • Final pass to sum VP partial histograms
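
A scalar C sketch of the benchmark's structure (the VP count and the round-robin pixel-to-VP mapping are assumptions for illustration):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_VPS 64    /* assumed number of virtual processors */
#define BINS    256   /* one bin per 8-bit pixel value */

/* Each VP histograms its share of the image into a private partial
 * histogram, then a final pass sums the partials; the private copies
 * avoid conflicting read-modify-write updates to shared bins. */
void histogram(const uint8_t *image, size_t pixels, uint32_t hist[BINS])
{
    static uint32_t partial[NUM_VPS][BINS];
    memset(partial, 0, sizeof(partial));

    for (size_t i = 0; i < pixels; i++)
        partial[i % NUM_VPS][image[i]]++;   /* assumed round-robin VP mapping */

    memset(hist, 0, BINS * sizeof(uint32_t));
    for (int vp = 0; vp < NUM_VPS; vp++)
        for (int b = 0; b < BINS; b++)
            hist[b] += partial[vp][b];
}
```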

  19. Hough Transform • Convert an image to 2D Hough Space (angle, radius) • Each vector element calculates the radius for a given angle • Adds pixel value to counter
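
A scalar C sketch of the kernel (the angle and radius resolutions are assumptions; in the vectorized version each vector element handles one angle):

```c
#include <math.h>
#include <stdint.h>

#define ANGLES 180   /* assumed: 1-degree steps */
#define MAX_R  512   /* assumed radius range of the accumulator */

/* For every pixel, compute r = x*cos(theta) + y*sin(theta) for each angle
 * and add the pixel value to the (angle, radius) accumulator cell. The
 * accumulator updates are indexed (scatter/gather style) accesses. */
void hough(const uint8_t *img, int width, int height,
           uint32_t acc[ANGLES][MAX_R])
{
    const double pi = 3.14159265358979323846;

    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            uint8_t pix = img[y * width + x];
            if (pix == 0)
                continue;   /* adding zero would not change the accumulator */
            for (int a = 0; a < ANGLES; a++) {
                double theta = a * pi / ANGLES;
                int r = (int)lround(x * cos(theta) + y * sin(theta));
                if (r >= 0 && r < MAX_R)
                    acc[a][r] += pix;
            }
        }
}
```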

  20. Motion Compensation • Load block from reference image, interpolate • Offset by small amount from location in current image
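
A scalar C sketch of the access pattern (the 8x8 block size and the half-pel horizontal averaging are assumptions for illustration, not the exact kernel from the talk):

```c
#include <stdint.h>

/* Load a block from the reference frame, offset from the block's position
 * in the current frame by a small motion vector, and interpolate between
 * neighbouring reference pixels. The reads land at offset-dependent
 * addresses inside the reference frame. */
void motion_compensate(const uint8_t *ref, int stride,
                       int bx, int by,      /* block origin in current frame */
                       int mvx, int mvy,    /* integer motion vector */
                       uint8_t out[8][8])
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++) {
            int rx = bx + x + mvx;
            int ry = by + y + mvy;
            /* half-pel style horizontal average of two neighbours */
            out[y][x] = (uint8_t)((ref[ry * stride + rx] +
                                   ref[ry * stride + rx + 1] + 1) >> 1);
        }
}
```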

  21. Future Work • More ports needed for scalability • Share evict/fill BRAM port with a 2nd request • Banking (sharing the same evict/fill logic) • Multi-ported BRAM designs • Write cache • Currently allocate-on-write • Track dirty state of bytes in the BRAMs’ 9th bit • Non-blocking behavior • Multiple token FIFOs (one per requestor)?

  22. FAQ • Coherency • Envisioned as the only/last-level cache • Future work • Replay loops/problems • Mitigated by random replacement + associativity • Power efficiency expected to be poor…

  23. Conclusions • TputCache: alternative to shared BRAM • Low overhead (13%-35% extra BRAM) • Nearly as high Fmax (253MHz vs 270MHz) • More flexible than shared BRAM • Performance degrades gradually • Cache behavior instead of manual filling

  24. Questions? • Thank you
