
CDSC CHP Prototyping



  1. CDSC CHP Prototyping Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat, Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou

  2. Accelerator-Rich Architectures: ARC, CHARM, BiN

  3. Goals
  • Implement the architecture features and support into the prototype system
  • Architecture proposals:
    • Accelerator-rich CMPs (ARC)
    • CHARM
    • Hybrid cache
    • Buffer-in-NUCA (BiN), etc.
  • Bridge different thrusts in CDSC

  4. Server-Class Platform: HC-1ex Architecture
  • HC-1ex: Xilinx XC6VLX760 FPGAs, 80 GB/s off-chip bandwidth, 90 W design power
  • Xeon Quad-Core LV5408: 40 W TDP
  • Tesla C1060: 100 GB/s off-chip bandwidth, 200 W TDP

  5. Drawbacks of the Commodity Systems
  • Limited ability to customize from the architecture point of view
  • Board-level integration rather than chip-level integration
  • Commodity systems reach only a certain level of integration; further innovation is needed

  6. CHP Prototyping Plan
  • Create the working hardware and software
  • Use an FPGA Extensible Processing Platform (EPP) as the platform
  • Reuse existing FPGA IPs as much as possible
  • Work in multiple phases

  7. Target Platforms: Xilinx ML605 and Zynq
  • Zynq: dual-core ARM Cortex-A9 with programmable logic
  • ML605: Virtex-6-based board

  8. CHP Prototyping Phases
  • ARC implementation
    • Phase 1: basic platform (accelerators and a software GAM)
    • Phase 2: adding modularity using available IP (e.g., the Xilinx DMAC IP)
    • Phase 3: first step toward BiN (shared buffer; customized modules such as the DMA controller and plug-n-play accelerators)
    • Phase 4: system enhancement (crossbar; AXI implementation)
  • CHARM implementation

  9. ARC Phase 1 Goals
  • Set up a basic environment: multi-core + simple accelerators + OS
  • Understand the system interactions in more detail
  • Simple controller as the GAM (global accelerator manager)
  • Support system-level sharing of multiple accelerators of the same type

  10. ARC Phase 1 Example System Diagram
  [System diagram: MicroBlaze-0 runs Linux (with MMU); MicroBlaze-1 runs the bare-metal GAM (no MMU); the cores exchange requests through FSL-attached mailboxes; an AXI4 crossbar connects duplicated vecadd/vecsub accelerators and DDR3; an AXI4-Lite bus carries the timer, mutex, and UART peripherals.]
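  As a rough illustration of the request path in slide 10, here is a minimal C sketch of the bare-metal GAM loop on MicroBlaze-1. The three-word message layout and the mbox_read/mbox_write helpers are assumptions standing in for the actual FSL mailbox driver calls (cf. XMbox_ReadBlocking/XMbox_WriteBlocking in the Xilinx driver); this is a sketch of the sharing idea, not the prototype's code.

    #include <stdint.h>

    enum acc_type { ACC_VECADD = 0, ACC_VECSUB = 1 };

    struct gam_request {
        uint32_t type;      /* requested accelerator class (enum acc_type) */
        uint32_t src_addr;  /* physical address of the input buffer        */
        uint32_t len;       /* number of elements to process               */
    };

    /* Stand-ins for the mailbox driver calls; signatures are illustrative. */
    extern void mbox_read(struct gam_request *req);
    extern void mbox_write(uint32_t reply);

    #define GAM_BUSY 0xFFFFFFFFu   /* no free instance; caller retries */

    static uint32_t busy[2][2];    /* [acc_type][instance], 2 copies each */

    /* GAM main loop: grant a free instance of the requested type so
     * multiple cores can share duplicated accelerators of the same type. */
    void gam_loop(void)
    {
        for (;;) {
            struct gam_request req;
            uint32_t reply = GAM_BUSY;

            mbox_read(&req);
            for (uint32_t i = 0; i < 2; i++) {
                if (!busy[req.type][i]) {
                    busy[req.type][i] = 1;  /* cleared again on completion */
                    reply = i;              /* grant: instance id          */
                    break;
                }
            }
            mbox_write(reply);
            /* ... program DMA, start the accelerator, clear busy[] on done ... */
        }
    }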

  11. ARC Phase-2 Goals
  • Implement a system similar to the original ARC design: GAM, accelerators, DMA controller, SPM
  • Add modularity using available IP (e.g., the Xilinx DMAC IP; see the sketch below)
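  The sketch below shows one way a core can drive the Xilinx AXI DMAC IP through the standard XAxiDma polled-mode driver. The device-ID macro name depends on the actual hardware design and is a placeholder here; this illustrates the reuse-available-IP approach, not the Phase-2 code itself.

    #include "xparameters.h"
    #include "xaxidma.h"

    static XAxiDma AxiDma;

    /* Push nbytes from main memory into the accelerator's stream port
     * in polled mode. XPAR_AXI_DMA_0_DEVICE_ID is a placeholder name. */
    int dma_to_accelerator(UINTPTR src, u32 nbytes)
    {
        XAxiDma_Config *cfg = XAxiDma_LookupConfig(XPAR_AXI_DMA_0_DEVICE_ID);
        if (cfg == NULL || XAxiDma_CfgInitialize(&AxiDma, cfg) != XST_SUCCESS)
            return XST_FAILURE;

        if (XAxiDma_SimpleTransfer(&AxiDma, src, nbytes,
                                   XAXIDMA_DMA_TO_DEVICE) != XST_SUCCESS)
            return XST_FAILURE;

        while (XAxiDma_Busy(&AxiDma, XAXIDMA_DMA_TO_DEVICE))
            ;   /* busy-wait until the MM2S channel drains */

        return XST_SUCCESS;
    }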

  12. ARC Phase-2 Architecture

  13. ARC Phase-2 Performance and Power Results
  [Benchmarking kernel and results shown in tables not captured in this transcript.]

  14. ARC Phase-2 Runtime Breakdown

  15. ARC Phase-2 Area Breakdown
  • Slice logic utilization:
    • Slice registers: 45,283 of 301,440 (15%)
    • Slice LUTs: 40,749 of 150,720 (27%)
    • Used as logic: 32,505 of 150,720 (21%)
    • Used as memory: 5,248 of 58,400 (8%)
  • Slice logic distribution:
    • Occupied slices: 17,621 of 37,680 (46%)
    • LUT-FF pairs used: 54,323
    • With an unused flip-flop: 14,617 of 54,323 (26%)
    • With an unused LUT: 13,574 of 54,323 (24%)
    • Fully used LUT-FF pairs: 26,132 of 54,323 (48%)

  16. ARC Phase-3 Goals
  • First step toward BiN: shared buffer
  • Design our customized modules:
    • Customized DMA controller that handles batched TLB misses
    • Plug-n-play accelerator design: make the interface general enough for at least a class of accelerators (see the interface sketch below)
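  One way to picture the plug-n-play goal is a register map shared by a whole class of accelerators; the map below is a hypothetical example, not the interface actually used in the prototype.

    #include <stdint.h>

    /* A generic LCA register interface: any accelerator exposing this
     * map can be started the same way, which is what lets the GAM treat
     * instances of a class interchangeably. */
    struct lca_regs {
        volatile uint32_t ctrl;      /* bit 0: start, bit 1: reset       */
        volatile uint32_t status;    /* bit 0: done,  bit 1: error       */
        volatile uint32_t src_vaddr; /* virtual input address (via DMAC) */
        volatile uint32_t dst_vaddr; /* virtual output address           */
        volatile uint32_t len;       /* element count                    */
        volatile uint32_t args[8];   /* kernel-specific scalar arguments */
    };

    static inline void lca_start(struct lca_regs *r,
                                 uint32_t src, uint32_t dst, uint32_t n)
    {
        r->src_vaddr = src;
        r->dst_vaddr = dst;
        r->len = n;
        r->ctrl = 1u;                 /* kick off the kernel  */
        while (!(r->status & 1u))     /* poll for completion  */
            ;
    }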

  17. ARC Phase-3 Architecture
  • A partial realization of the proposed accelerator-rich CMP on the Xilinx ML605 (Virtex-6)
  • Global accelerator manager (GAM) for accelerator sharing
  • Shared on-chip buffers: many more accelerators than buffer-bank resources
  • Virtual addressing in the accelerators; accelerator virtualization
  • Virtual-addressing DMA, with on-demand TLB filling from the core (see the sketch below)
  • Not yet included: network-on-chip, buffer sharing with the cache, customized instructions in the core
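  A minimal sketch of the on-demand, batched TLB fill: the DMAC gathers the missing virtual page numbers for a transfer, interrupts the core once, and the core fills all translations in one pass. The MMIO layout and helper names are assumptions for illustration; the real register interface is not given in the slides.

    #include <stdint.h>

    #define TLB_BATCH_MAX 16

    struct tlb_miss_batch {
        uint32_t count;               /* misses gathered by the DMAC */
        uint32_t vpn[TLB_BATCH_MAX];  /* virtual page numbers        */
        uint32_t ppn[TLB_BATCH_MAX];  /* filled in by the core       */
    };

    extern volatile struct tlb_miss_batch *dmac_miss_regs; /* MMIO window */
    extern uint32_t lookup_page_table(uint32_t vpn);  /* core-side walk   */
    extern void dmac_resume(void);

    /* Interrupt handler on the core: one page-table walk per miss, then
     * a single resume, so a transfer pays the core round-trip once per
     * batch instead of once per missing page. */
    void dmac_tlb_miss_isr(void)
    {
        uint32_t n = dmac_miss_regs->count;
        for (uint32_t i = 0; i < n; i++)
            dmac_miss_regs->ppn[i] = lookup_page_table(dmac_miss_regs->vpn[i]);
        dmac_resume();
    }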

  18. Performance and Power Results
  [Benchmarking kernel and results shown in tables not captured in this transcript.]

  19. Impact of Communication & Computation Overlapping
  • Pipelining communication with computation improves performance by 19% over the non-pipelined version
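  The overlap measured above is classic double buffering: the DMA fills one buffer bank while the accelerator computes on the other. A minimal sketch, assuming illustrative dma_fetch/dma_wait/acc_run helpers that are not part of the prototype:

    #include <stdint.h>

    extern void dma_fetch(uint32_t buf, uint32_t tile);  /* async prefetch  */
    extern void dma_wait(uint32_t buf);
    extern void acc_run(uint32_t buf);                   /* blocking kernel */

    void run_pipelined(uint32_t ntiles)
    {
        dma_fetch(0, 0);                    /* prime the pipeline        */
        for (uint32_t t = 0; t < ntiles; t++) {
            uint32_t cur = t & 1, nxt = cur ^ 1;
            if (t + 1 < ntiles)
                dma_fetch(nxt, t + 1);      /* overlap next tile's fetch */
            dma_wait(cur);
            acc_run(cur);                   /* compute on the ready tile */
        }
    }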

  20. Overhead of Buffer Sharing: Bank Access Contention (1)
  • Baseline: the 4 logical buffers are allocated to 4 separate buffer banks
  • Packing all 4 logical buffers into 1 buffer bank costs only 3.2%
  • Reason: the AXI bus lets masters issue transactions simultaneously, and the AXI transaction time dominates the buffer access time

  21. Overhead of Buffer Sharing: Bank Access Contention (2)
  • Baseline: the 4 logical buffers are allocated to 4 separate buffer banks
  • Packing all 4 logical buffers into 1 buffer bank costs only 2.7%

  22. Area Breakdown
  • Slice logic utilization:
    • Slice registers: 105,969 of 301,440 (35%)
    • Slice LUTs: 93,755 of 150,720 (62%)
    • Used as logic: 80,410 of 150,720 (53%)
    • Used as memory: 7,406 of 58,400 (12%)
  • Slice logic distribution:
    • Occupied slices: 32,779 of 37,680 (86%)
    • LUT-FF pairs used: 112,772
    • With an unused flip-flop: 25,037 of 112,772 (22%)
    • With an unused LUT: 19,017 of 112,772 (16%)
    • Fully used LUT-FF pairs: 68,718 of 112,772 (60%)

  23. Phase-4 ARC Goals
  • Find bottlenecks and enhance the system
  • Communication bottleneck:
    • Crossbar design instead of the AXI bus
    • Speed up the non-burst AXI implementation

  24. Accelerator Memory System Design
  • Crossbar
    • In addition to the previously proposed design, it now supports partial reconfiguration that does not affect working LCAs
    • Passed on-board test
  • Hierarchical DMACs
    • Handle data transfer between main memory and the shared buffer banks
    • The number of buffer banks can be large; to keep the main AXI bus size fixed, the DMACs and buses are arranged hierarchically (see the routing sketch below)
  [Block diagram: the core, IOMMU, and GAM sit on the main AXI bus to DDR; second-level DMACs (DMAC1-DMAC3) each drive a local AXI bus connecting a subset of the buffer banks (bank 1 through bank 9) and their LCAs (LCA1-LCA4), with a select-bit receiver steering transfers.]
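  A sketch of how a transfer could be routed to the second-level DMAC that owns the target bank, so the main AXI bus sees a fixed number of masters as banks scale. The three-banks-per-DMAC mapping is an assumption inferred from the nine banks and three DMACs in the diagram, and dmac_issue is an illustrative helper:

    #include <stdint.h>

    #define BANKS_PER_DMAC 3   /* assumed: 9 banks behind 3 DMACs */

    extern void dmac_issue(uint32_t dmac_id, uint32_t local_bank,
                           uint32_t mem_addr, uint32_t nbytes);

    /* Route by bank index: only the selected second-level DMAC drives
     * its local bus, and only the DMACs appear on the main AXI bus. */
    void xfer_to_bank(uint32_t bank, uint32_t mem_addr, uint32_t nbytes)
    {
        uint32_t dmac_id = bank / BANKS_PER_DMAC;
        dmac_issue(dmac_id, bank % BANKS_PER_DMAC, mem_addr, nbytes);
    }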

  25. Crossbar Results
