Quantitative Evaluation of MPSoC with Many Accelerators; Challenges and Potential Solutions
Outline • Heterogeneous MPSoCs • Specialization is a growing trend • Accelerator-rich MPSoC architecture • MPSoCs with many accelerators • Previous work • Quantitative exploration of current accelerator-rich MPSoC - Huge memory demand - Immense communication traffic - Overwhelming synchronization • The proposed accelerator-centric architecture template - Implementation - Evaluation
Heterogeneous MPSoCs • Integrated solutions for a group of evolving markets • ILP (e.g. CPU, DSP, or even GPU) • Flexibility • - Power dissipation • Custom-HW accelerators (ACCs) for compute-intensive kernels • Power efficiency • Cost • Inflexibility • What is the trend?
Specialization as an MPSoC trend • Increasing demand for high-performance, low-power computing • Market examples: • Embedded vision • Software-Defined Radio (SDR) • Cyber-Physical Systems (CPS) • Tens of billions of operations per second • Less than a few watts of power • Trend: domain-specific specialization • Proliferating number of ACCs in systems • ACC-rich MPSoC
Principles of current accelerator-rich MPSoC • ILP+HWACC composition • HW-ACC • Executes compute-intensive kernels/apps • ILP • Executes remaining applications • Orchestrates HW-ACCs / coordinates data movement • On-chip scratchpad memory (SPM) • Keeps data between ILP and ACCs on-chip • Avoids costly off-chip memory access • Event sequence in the figure: 1. Input Done 2. DMA Start 3. DMA Done 4. DMA Start 5. DMA Done 6. ACC1 Start 7. ACC1 Done 8. DMA Start 9. DMA Done 10. DMA Start 11. DMA Done 12. Output Start 13. Output Done
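The 13-step sequence on this slide can be reproduced with a short sketch. It assumes (this is not stated on the slide) that the four DMA transfers are input SPM to shared memory, shared memory to the ACC input SPM, ACC output SPM to shared memory, and shared memory to the output SPM, and that the same pattern repeats for each additional accelerator. The function name `orchestration_events` is hypothetical.

```python
def orchestration_events(n_acc):
    """Events the ILP handles for one job passing through n_acc accelerators in sequence."""
    events = ["Input Done"]
    # Assumed direction: input SPM -> shared memory
    events += ["DMA Start", "DMA Done"]
    for i in range(1, n_acc + 1):
        # Assumed direction: shared memory -> ACC input SPM
        events += ["DMA Start", "DMA Done"]
        events += [f"ACC{i} Start", f"ACC{i} Done"]
        # Assumed direction: ACC output SPM -> shared memory
        events += ["DMA Start", "DMA Done"]
    # Assumed direction: shared memory -> output SPM
    events += ["DMA Start", "DMA Done"]
    events += ["Output Start", "Output Done"]
    return events


if __name__ == "__main__":
    for step, name in enumerate(orchestration_events(1), start=1):
        print(f"{step}. {name}")
```

With one accelerator this yields the same 13 events as the slide. Under the stated assumption, every added accelerator contributes six more events for the ILP to handle per job.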
MPSoC with many accelerators • Control and interrupt lines • - ACC configuration • Centralized vs. dedicated DMA • - Stream data transfer • Scratch Pad Memory (SPM) • 2 per accelerator, 1 per I/O • To hold input job
Challenges with increasing number of interrupts • NEED to quantitatively consider this architecture! 1- Memory requirement - Two SPMs per ACC - One SPM per interface - Shared memory to hold data handed over between the accelerators 2- High volume of traffic over system fabric - No point-to-point connections between ACCs - Required DMA data transfers 3- ILP synchronization - Among accelerators, I/O interfaces, and DMA transfers
Previous work on composing ACCs • Composing bigger applications out of many accelerators, as in Accelerator-Rich CMPs [1] and CHARM [2] • Imposes considerable traffic and considerable on-chip buffering for accelerator data exchange • ILP load to orchestrate the system composed of accelerators • J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman. Architecture support for accelerator-rich CMPs. In Proceedings of the 49th Annual Design Automation Conference (DAC '12), pages 843–849, 2012. • M. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks. The accelerator store framework for high-performance, low-power accelerator-based systems. Computer Architecture Letters, 9(2):53–56, Feb. 2010.
Quantitative exploration of accelerator-rich MPSoC: WHY and HOW • Applicability of quantitative exploration • Quantifying the potential challenges • Exposing the ACC-rich bottlenecks as the # of ACCs increases • Helping system architects size system knobs properly (SPM sizes, # of ACCs, communication BW) • Motivating our proposed architecture-template solution • Approaches to quantitative exploration 1- First-order mathematical analysis 2- Simulation-based analysis of ACC-rich MPSoC
Exploration overview • Assumptions • One HD resolution frame as input • Divided into smaller jobs • Memory on chip • Avoid off-chip memory for now • Exploration steps • Memory requirement as #ACC increases • Sizing SPM to satisfy memory budget limitation • Interrupt rate load on ILP
Memory size analysis (calculation based) • Memory size = SPMs + shared memory • SPM holds one job • Job size determines minimum size of SPM and shared memory • Shared memory holds all jobs exchanged among ACCs • More ACCs require larger memory • Bigger jobs need larger memory • Limiting memory budget • Sizing job size with respect to memory budget
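A minimal first-order sketch of this calculation follows. The counts of 2 SPMs per accelerator and 1 per I/O interface come from the earlier slide. The number of shared-memory slots is an assumption made here: one job-sized slot per hand-over in a linear chain (input, ACC1, ..., ACCn, output), giving `n_acc + 1` slots. All function names are hypothetical.

```python
def buffer_count(n_acc, n_io=2):
    """Number of job-sized buffers on chip."""
    spm = 2 * n_acc + n_io   # 2 SPMs per ACC, 1 per I/O interface (from the slides)
    shared = n_acc + 1       # ASSUMPTION: one shared-memory slot per hand-over
    return spm + shared


def memory_bytes(n_acc, job_bytes, n_io=2):
    """Memory size = SPMs + shared memory, with every buffer holding one job."""
    return buffer_count(n_acc, n_io) * job_bytes


def max_job_bytes(n_acc, budget_bytes, n_io=2):
    """Largest job size that fits a fixed on-chip memory budget."""
    return budget_bytes // buffer_count(n_acc, n_io)


if __name__ == "__main__":
    budget = 1 << 20  # hypothetical 1 MiB on-chip budget
    for n in (1, 2, 4, 8):
        print(n, buffer_count(n), max_job_bytes(n, budget))
```

Under these assumptions the required memory grows linearly with the number of accelerators for a fixed job size, and the job size that fits a fixed budget shrinks as accelerators are added.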
Job sizing (calculation based) • Count the number of interrupts • Measures the ILP's load in responding to interrupts • Smaller job size issues more interrupts to the ILP • - ILP is responsible for synchronizing ACC transactions • The smaller the memory, the smaller the job size • The more accelerators, the smaller the job size
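The relation between job size and interrupt count can be sketched as follows. The 13-step sequence shown earlier for one accelerator contains seven "Done" events; assuming each "Done" raises one interrupt and each extra accelerator adds one ACC completion plus two DMA completions, a job causes `3 * n_acc + 4` interrupts. That formula, the function name, and the example frame size are assumptions for illustration, not figures from the slides.

```python
import math


def interrupts_per_frame(frame_bytes, job_bytes, n_acc):
    """First-order estimate of ILP interrupts needed to process one frame."""
    jobs = math.ceil(frame_bytes / job_bytes)  # frame is divided into jobs
    per_job = 3 * n_acc + 4                    # ASSUMPTION: one interrupt per "Done" event
    return jobs * per_job


if __name__ == "__main__":
    frame = 1920 * 1080 * 2  # hypothetical HD frame at 2 bytes per pixel
    for job in (64 * 1024, 16 * 1024, 4 * 1024):
        print(job, interrupts_per_frame(frame, job, n_acc=4))
```

Halving the job size roughly doubles the number of jobs per frame, so the interrupt count rises both when the memory budget shrinks and when accelerators are added.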
Simulation platform SCE refinement • Using SpecC SLDL to develop a simulation model • Scalable # of ACCs • Different/same data rate • ILPs • DMAs • Memories (SPM, shared memory) • On-chip and off-chip memory • Generating ACC-rich simulation model • BFM AMBA-AHB communication fabric • ARM 9 (ISA v6) for ILP execution • Priority based • Dedicated interrupt line • Centralized DMAs
# of interrupts when scaling # of ACCs (simulation based) • Smaller memory/more ACCs -> smaller job • More interrupts to the ILP with smaller job size • - Significant utilization or even oversaturation of the ILP just from driving accelerators • # of interrupts vs. number of accelerators • For different sizes of on-chip memory
Communication overhead analysis (calculation based) • Communication overhead = data exchanged through the system fabric • More ACCs, heavier traffic on system fabric
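A rough sketch of this estimate is given below. It assumes that, with no point-to-point links, every hand-over goes through shared memory and therefore costs two DMA transfers (one in, one out), and that each DMA transfer moves the data across the fabric once. Both points, and the function name, are assumptions made for illustration.

```python
def fabric_traffic_bytes(frame_bytes, n_acc):
    """First-order estimate of bytes moved over the system fabric for one frame."""
    # ASSUMPTION: n_acc + 1 hand-overs (input -> ACC1 -> ... -> ACCn -> output),
    # each needing one DMA transfer into shared memory and one out of it.
    dma_transfers = 2 * (n_acc + 1)
    return dma_transfers * frame_bytes


if __name__ == "__main__":
    frame = 1920 * 1080 * 2  # hypothetical HD frame at 2 bytes per pixel
    for n in (1, 2, 4, 8):
        print(n, fabric_traffic_bytes(frame, n))
```

In this model the total traffic does not depend on the job size, only on the frame size and the number of accelerators, and it grows linearly with the latter.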
Exploration Summary • Problems associated with the current accelerator-rich architecture • On-chip memory requirements • ILP synchronization load • Heavy communication traffic on system fabric • Demand for an improved ACC-centric design • Tackling the challenges of the current ACC-rich architecture
The goals of the proposed ACC-centric architecture • The proposed solution • An autonomous accelerator chain • Relieving the ILP’s synchronization load • Point-to-point connections between accelerators • No need for a large SPM per accelerator • No frequent DMA data transfers • No heavy traffic on system fabric
Simulation platform SCE refinement • Modifying the developed SpecC model to support an autonomous chain of accelerators • Gateways to manage the chain • Creating another ACC-rich simulation model • BFM AMBA-AHB communication fabric • ARM 9 (ISA v6) for ILP execution • Dedicated interrupt line from gateways to ILP • Centralized DMA
The proposed accelerator-centric architecture template • Point-to-point accelerator connections • Little memory required • Few DMA data transfers • Autonomous ACC chain: • Light ILP synchronization load no matter how many accelerators • Data flow: 1. DMA brings data to the input gateway’s SPM 2. Input gateway receives data and starts passing it through the chain 3. Chain works on data 4. Output gateway gathers data in its SPM 5. DMA brings data to memory • Gateways controlled by ILP to manage the whole chain of accelerators • SPM to receive/send data from/to memory • Control lines from ILP to gateways for configuration • Interrupt lines from gateways to ILP • Point-to-point connections in chain with small buffers in between • Chain works independently of the ILP
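A first-order sketch of the template's memory and DMA cost follows. It assumes one job-sized SPM in each of the two gateways, one small buffer on every point-to-point link (including the links to the gateways), and exactly two DMA transfers per job (steps 1 and 5). The buffer size and all names are hypothetical.

```python
def chain_memory_bytes(n_acc, job_bytes, link_buffer_bytes):
    """On-chip memory for the autonomous chain under the stated assumptions."""
    gateway_spm = 2 * job_bytes                     # input and output gateway SPMs
    link_buffers = (n_acc + 1) * link_buffer_bytes  # ASSUMPTION: one small buffer per link
    return gateway_spm + link_buffers


def chain_dma_transfers_per_job():
    """DMA transfers per job: memory -> input gateway, output gateway -> memory."""
    return 2


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, chain_memory_bytes(n, job_bytes=64 * 1024, link_buffer_bytes=64))
```

Because the link buffers are small relative to a job, the memory in this model is dominated by the two gateway SPMs, and the number of DMA transfers per job does not depend on the chain length.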
Evaluation • More ACCs: • Current arch: smaller job size • Proposed arch: almost the same job size • More ACCs: • Current arch: linear growth in memory requirement • Proposed arch: almost constant memory requirement • More ACCs: • Current arch: heavier traffic • Proposed arch: almost the same data traffic • More ACCs: • Current arch: exponential growth in interrupts • Proposed arch: the same number of interrupts
Summary • Specialization as a growing trend in CMPs • Accelerator-rich architectures • Exploration of the challenges in the current accelerator-rich architecture • Memory requirement • Communication overhead • Synchronization load • The proposed accelerator-centric architecture template • Autonomous accelerator chain • No large memory requirement • No heavy communication traffic • No critical amount of required synchronization
Questions? Again, thanks to Professor Schirner for all his support… Thanks to Hamed for what I’ve been learning from him. Thank you, all ESL members, for your attendance!