330 likes | 492 Views
A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. Yakun Sophia Shao, Brandon Reagen , Gu-Yeon Wei, David Brooks Harvard University. Beyond Homogeneous Parallelism. General-Purpose Cores (CPU). Programmable
E N D
A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David BrooksHarvard University
Beyond Homogeneous Parallelism General-Purpose Cores (CPU) Programmable Accelerators (DSP, GPU) Application-Specific Accelerator (ASIP, ASIC) Energy Efficiency Flexibility Programmability Design Cost
Today’s SoC OMAP 4 SoC
Today’s SoC ARM Cores Face Audio GPU DSP Imaging DSP Video DMA USB SD System Bus USB DMA Secondary Bus Secondary Bus Tertiary Bus OMAP 4 SoC
Today’s SoC Apple A7 Harvard VLSI-ARCH Group SoCTapeout
Today’s SoC GPU/DSP CPU CPU Mem Inter- face Buses Acc Acc Acc Acc Acc Acc Acc Acc Acc
Future Accelerator-Centric Architectures GPU/DSP Big Cores Small Cores Memory Interface Shared Resources Sea of Fine-Grained Accelerators Flexibility Design Cost Programmability How to decompose an application to accelerators? How to rapidly design lots of accelerators? How to design and manage the shared resources?
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator Shared Memory/Interconnect Models Aladdin Unmodified C-Code Power/Area Accelerator Specific Datapath Private L1/ Scratchpad Accelerator Design Parameters (e.g., # FU, mem. BW) Performance “Accelerator Simulator” Design Accelerator-Rich SoC Fabrics and Memory Systems “Design Assistant” Understand Algorithmic-HW Design Space before RTL Flexibility Programmability Design Cost
Future Accelerator-Centric Architecture GPU/DSP Big Cores Small Cores Memory Interface Shared Resources Sea of Fine-Grained Accelerators
Future Accelerator-Centric Architecture GPU/DSP Big Cores Small Cores Memory Interface Shared Resources Sea of Fine-Grained Accelerators Aladdin can rapidly evaluate large design space of accelerator-centric architectures.
Aladdin Overview Optimization Phase Optimistic IR Idealistic DDDG Initial DDDG Power/Area Models Dynamic Data Dependence Graph (DDDG) C Code Performance Activity Resource Constrained DDDG Program Constrained DDDG Acc Design Parameters Power/Area Realization Phase
Aladdin Overview Optimization Phase Optimistic IR Idealistic DDDG Initial DDDG Power/Area Models C Code Performance Activity Resource Constrained DDDG Program Constrained DDDG Acc Design Parameters Power/Area Realization Phase
From C to Design Space C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
From C to Design SpaceIR Dynamic Trace 0. r0=0 //i = 0 r4=load (r0 + r1) //load a[i] r5=load (r0 + r2) //load b[i] r6=r4 + r5 store(r0 + r3, r6) //store c[i] r0=r0 + 1 //++i r4=load(r0 + r1) //load a[i] r5=load(r0 + r2) //load b[i] r6=r4 + r5 store(r0 + r3, r6) //store c[i] r0 = r0 + 1 //++i … C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i];
From C to Design SpaceInitial DDDG IR Trace: 0. r0=0 //i = 0 1. r4=load (r0 + r1) //load a[i] 2. r5=load (r0 + r2) //load b[i] 3. r6=r4 + r5 4. store(r0 + r3, r6) //store c[i] 5. r0=r0 + 1 //++i 6. r4=load(r0 + r1) //load a[i] 7. r5=load(r0 + r2) //load b[i] 8. r6=r4 + r5 9. store(r0 + r3, r6) //store c[i] 10.r0 = r0 + 1 //++i … 0. i=0 5. i++ 1. ld a 2. ld b C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; 10. i++ 6. ld a 7. ld b 3. + 11. ld a 12. ld b 8. + 4. st c 13. + 9. st c 14. st c
From C to Design SpaceIdealistic DDDG IR Trace: 0. r0=0 //i = 0 1. r4=load (r0 + r1) //load a[i] 2. r5=load (r0 + r2) //load b[i] 3. r6=r4 + r5 4. store(r0 + r3, r6) //store c[i] 5. r0=r0 + 1 //++i 6. r4=load(r0 + r1) //load a[i] 7. r5=load(r0 + r2) //load b[i] 8. r6=r4 + r5 9. store(r0 + r3, r6) //store c[i] 10.r0 = r0 + 1 //++i … 0. i=0 0. i=0 5. i++ 10. i++ 6. ld a 7. ld b 2. ld b 1. ld a 11. ld a 12. ld b 5. i++ 2. ld b 1. ld a C Code: for(i=0; i<N; ++i) c[i] = a[i] + b[i]; 10. i++ 6. ld a 7. ld b 3. + 3. + 8. + 13. + 11. ld a 12. ld b 8. + 4. st c 4. st c 14. st c 9. st c 13. + 9. st c 14. st c
From C to Design SpaceOptimization Phase: C->IR->DDDG • Include application-specific customization strategies. • Node-Level: • Bit-width Analysis • Strength Reduction • Tree-height Reduction • Loop-Level: • Remove dependences between loop index variables • Memory Optimization: • Memory-to-Register Conversion • Store-Load Forwarding • Store Buffer • Extensible • e.g. Model CAM accelerator by matching nodes in DDDG
From C to Design SpaceOne Design Resource Activity Idealistic DDDG 0. i=0 0. i=0 5.i++ 15. i++ 10. i++ MEM MEM MEM MEM 1. ld a 2. ld b MEM MEM 1. ld a 6. ld a 16. ld a 17. ld b 7. ld b 11. ld a 2. ld b 12. ld b + 3. + 18. + 13. + 8. + 3. + 4. st c 19. st c 14. st c 4. st c 9. st c + 5.i++ 6. ld a 7. ld b • Acc Design Parameters: • Memory BW <= 2 • 1Adder + 8. + 9. st c Cycle
From C to Design SpaceAnother Design Resource Activity Idealistic DDDG + 0. i=0 5.i++ 15. i++ 0. i=0 10. i++ 5.i++ MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM MEM 1. ld a MEM 6. ld a 16. ld a 1. ld a 17. ld b 6. ld a 7. ld b 11. ld a 2. ld b 12. ld b 7. ld b 2. ld b + + 18. + 13. + 8. + 3. + 3. + 8. + 19. st c 14. st c 4. st c 9. st c 4. st c 9. st c + + 15. i++ 10. i++ • Acc Design Parameters: • Memory BW <= 4 • 2Adders 16. ld a 17. ld b 11. ld a 12. ld b + + 18. + 13. + 19. st c 14. st c Cycle
From C to Design SpaceRealization Phase: DDDG->Estimates • Constrain the DDDG with program and user-defined resource constraints • Program Constraints • Control Dependence • Memory Ambiguation • Resource Constraints • Loop-level Parallelism • Loop Pipelining • Memory Ports • # of FUs (e.g., adders, multipliers)
From C to Design SpacePower-Performance per Design • Acc Design Parameters: • Memory BW <= 4 • 2Adders Power • Acc Design Parameters: • Memory BW <= 2 • 1Adder Cycle
From C to Design SpaceDesign Space of an Algorithm Power Cycle
Aladdin Validation Aladdin C Code Power/Area Performance Design Compiler Verilog Activity ModelSim
Aladdin Validation Aladdin C Code Power/Area Performance Design Compiler RTL Designer Verilog Activity Vivado HLS HLS C Tuning ModelSim
Aladdin enables rapid design space exploration for accelerators. 7 mins Aladdin C Code Power/Area Performance 52 hours Design Compiler RTL Designer Verilog Activity Vivado HLS HLS C Tuning ModelSim
Aladdin enablespre-RTL simulation of accelerators with the rest of the SoC. GPU DRAMSim2 GPGPU-Sim Big Cores MARSx86 ... Small Cores XIOSim… Memory Interface Shared Resources Cacti/Orion2 Sea of Fine-Grained Accelerators
Modeling Accelerators in a SoC-like Environment Core Acc Core Cache Memory Acc Core Cache Memory
Aladdin: A pre-RTL, Power-Performance Accelerator Simulator Architectures with 1000s of accelerators will be radically different; New design tools are needed. Aladdin enables rapid design space exploration of future accelerator-centric platforms. You can find Aladdin at http://vlsiarch.eecs.harvard.edu/aladdin