Séminaire COSI ’01
Power-Driven Processor Array Partitioning for FPGA SoC
S. Derrien, S. Rajopadhye
Séminaire COSI-Roscoff’01
Content
• Context and motivations
  • Silicon compilation tools
  • Target architectures
  • Power consumption
  • Related work
• Partitioning
• Modeling power
• Experimental results
• Conclusion
Silicon compilation tools
• Parallel processor array architectures
  • Regular and scalable (well suited to FPGAs)
  • Specialized high-performance data-path
• Restricted class of loops
  • SUREs (uniform dependencies)
  • Static polyhedral loop domain
• Compute-intensive nested loops
  • Image processing (motion estimation, stereo vision)
  • Signal processing (QR factorization, DLMS)
Power consumption
• General model and motivations
  • P = Pstat + Vdd².Cd.D.f (gate-level model)
  • Estimate at RTL level (entropy-based models)
• Mainly dictated by:
  • On-chip area cost and activity
  • Off-chip I/O volume
• System-level power model?
  • Estimate from the specification and the target architecture
Target architecture
• Embedded CPU
  • PowerPC
  • Nios
• SoC bus
  • AMBA, CoreConnect
  • Plug 'n' play IP cores
• Shared memory
  • Low latency
  • High bandwidth
[Figure: block diagram with system memory, CPU, FPGA and external world]
Related work
• Compiler transformations to reduce memory accesses [Kandemir]
  • Loop fusion
  • Loop tiling
  • Loop reordering
• Design space exploration for custom memory systems [Imec]
  • Systematic exploration
  • Multi-level memory hierarchy
  • The approach is brute force
Content
• Context and motivations
• Target architectures
• Partitioning
  • Clustering (LSGP)
  • Tiling (LPGS)
  • Co-partitioning
• Modeling power
• Experimental results
• Conclusion
Tiling (LPGS: locally parallel, globally sequential)
• Partition the PE array into tiles
• Tiles are executed sequentially
• Intermediate results are stored in off-chip memory
• Requires unidirectional communications
• Tile shape is rectangular
  • Boundaries parallel to the PE space basis vectors
  • Perfect "tiling" of the processor space
Tiling (LPGS)
• Matrix W
  • Diagonal
  • det|W| = Npe
[Figure: tiling example with w1 = 2, w2 = 3; axis label: domain height]
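As a minimal sketch of the rectangular tiling described here, assume PEs are addressed by non-negative integer coordinates and W = diag(w1, w2). The tile holding a given PE, and the PE's position inside that tile, then follow from integer division. The function name `tile_of` is hypothetical and does not come from the slides.

```python
from math import prod

def tile_of(pe, w):
    """Map a PE coordinate to (tile index, position inside the tile).

    pe : tuple of non-negative integer PE coordinates
    w  : diagonal of the tiling matrix W, e.g. (w1, w2)
    """
    tile = tuple(p // wi for p, wi in zip(pe, w))
    local = tuple(p % wi for p, wi in zip(pe, w))
    return tile, local

w = (2, 3)                 # w1 = 2, w2 = 3 as in the figure
pes_per_tile = prod(w)     # det|W| for a diagonal W
tile, local = tile_of((5, 4), w)
```

With w1 = 2 and w2 = 3, each tile holds 6 PEs, which is the determinant of the diagonal matrix W.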
Clustering (LSGP: locally sequential, globally parallel)
• Groups PEs into clusters
  • Operations within a cluster are executed sequentially
  • I/O accesses are reduced
• Cluster shape is rectangular
  • Boundaries parallel to the PE space basis vectors
  • Perfect "tiling" of the processor space
• Scheduling is axes-major
  • Several possible schedules
  • Sequence of clusterings along each axis
  • Simplifies control logic
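The clustering can be sketched in the same way. The snippet below assumes rectangular clusters of size s = (sx, sy) and reads "axes-major" as a plain nested order over the cluster axes, first axis outermost. That reading is an assumption, as is the function name `cluster_slot`; the slides allow several schedules.

```python
def cluster_slot(pe, s):
    """Map a PE coordinate to (cluster index, sequential slot).

    pe : tuple of non-negative integer PE coordinates
    s  : cluster size along each axis, e.g. (sx, sy)

    The slot is the rank of the PE inside its cluster when the
    cluster is swept axis by axis (assumed axes-major order).
    """
    cluster = tuple(p // si for p, si in zip(pe, s))
    slot = 0
    for p, si in zip(pe, s):
        slot = slot * si + (p % si)   # mixed-radix rank inside the cluster
    return cluster, slot

cluster, slot = cluster_slot((3, 4), (2, 3))
```

Each cluster of sx.sy original PEs is served by one physical PE, which runs through the slots 0 to sx.sy - 1 in sequence.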
Clustering (LSGP)
• Matrix G
  • Diagonal
  • det|G| = Npe
  • Size sy x … x sx
[Equation lost in extraction; its annotations: original space-time mapping, PE index vector, iteration index vector]
[Figure labels: sy=2, sy=3]
Clustering (LSGP)
[Figure: original PE compared with clustered PEs for sx=2 and for sx=2, sy=3; numeric labels as extracted: 3 6 1 1 1 2 1 1 1]
Resource usage estimate: [equation lost in extraction]
Hybrid partitioning
• Step 1: the array is tiled
  • Tunes the I/O volume
• Step 2: each tile is clustered
  • Tunes the resource usage
• Trade-off
  • Off-chip I/O volume
  • Local memory sizes
Content
• Context and motivations
• Target architectures
• Partitioning
• Modeling power
  • IO power model
  • Core power model
  • Putting it all together
• Experimental results
• Conclusion
Dynamic IO energy model
• IO energy depends on
  • IO volume (RAM clock speed)
  • Operation (Rd, Wr)
  • Port toggle rate
• Eio = Krd.Vrd + Kwr.Vwr
  • Krd, Kwr: technological constants
  • Vrd, Vwr: number of read and write I/O operations
• Determine the IO volume
  • For all loop variables
  • Given the tiling parameters
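The IO energy formula translates directly into code. This is only a sketch: the function and argument names are hypothetical, and the constants Krd and Kwr must come from measurements on the actual board, which the slides do not give.

```python
def io_energy(v_rd, v_wr, k_rd, k_wr):
    """Dynamic IO energy Eio = Krd.Vrd + Kwr.Vwr.

    v_rd, v_wr : number of read / write I/O operations
    k_rd, k_wr : energy per read / write (technological constants,
                 to be measured; no values are given in the slides)
    """
    return k_rd * v_rd + k_wr * v_wr

# Placeholder constants, chosen only to exercise the function.
e_io = io_energy(v_rd=100, v_wr=50, k_rd=2.0, k_wr=3.0)
```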
IO volume estimate (1/2)
• The tile IO volume is called the "footprint"
• Estimate for this footprint [Arg95]: [equation lost in extraction]
• Spread vector of dependencies: [equation lost in extraction]
• The footprint is obtained by substituting the ith row with the spread vector
IO volume estimate (2/2)
• Total tile IO volume: [equation lost in extraction; its annotations: number of variables, kth variable byte width, spread vector, tile size parameter]
• Example:
  dA = [1 0 0], aA = [1 0 0], lA = 2, VA = 2.H.w1
  dB = [0 1 0], aB = [1 0 0], lB = 2, VB = 2.H.w2
  dC = [0 0 1], aC = [1 0 0], lC = 4, VC = 4.w1.w2
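The example can be checked numerically. The sketch below hard-codes the three footprints of the example (it does not implement the general footprint formula, which is missing from this transcript) and assumes that the leading factors 2, 2 and 4 are the values lA, lB and lC. The name `tile_io_volume` is hypothetical.

```python
def tile_io_volume(w1, w2, h, l_a=2, l_b=2, l_c=4):
    """Per-tile IO volume for the three-variable example.

    w1, w2        : tile size parameters
    h             : domain height H
    l_a, l_b, l_c : per-variable factors lA, lB, lC from the example
    """
    v_a = l_a * h * w1        # VA = 2.H.w1
    v_b = l_b * h * w2        # VB = 2.H.w2
    v_c = l_c * w1 * w2       # VC = 4.w1.w2
    return {"A": v_a, "B": v_b, "C": v_c, "total": v_a + v_b + v_c}

volumes = tile_io_volume(w1=2, w2=3, h=10)
```

Only VC grows with both tile parameters; VA and VB grow with the domain height and a single tile parameter.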
Core power model (1/4)
• FPGA power dissipation model
  • Pcore = Pstat + Kc.Dlc.nlc.f
  • Kc: technology constant; Dlc: average toggle rate; nlc: number of logic cells; f: design operating frequency
• Not suited to our target FPGA architecture
• Distinction between LCs used as memory and LCs used as logic
  • Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f
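A minimal sketch of the extended core power formula follows. Function and argument names are hypothetical, and the numeric values only exercise the arithmetic; they are not measurements.

```python
def core_power(p_stat, k_c, d_lc, n_lc, k_m, d_m, n_m, f):
    """Pcore = Pstat + Kc.Dlc.nlc.f + Km.Dm.nm.f

    p_stat     : static power
    k_c, k_m   : technology constants for logic and memory LCs
    d_lc, d_m  : average toggle rates of logic and memory LCs
    n_lc, n_m  : number of LCs used as logic and as memory
    f          : design operating frequency
    """
    logic = k_c * d_lc * n_lc * f
    memory = k_m * d_m * n_m * f
    return p_stat + logic + memory

# Placeholder values, not taken from the slides.
p = core_power(p_stat=1.0, k_c=2.0, d_lc=0.5, n_lc=100,
               k_m=3.0, d_m=0.25, n_m=40, f=10.0)
```

Setting `n_m` to zero recovers the simpler single-term formula.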
Core power model (2/4)
• Control logic is not modeled
  • Too complex to estimate
  • No significant contribution to power
• Core power depends on
  • Number of PEs: depends on G and W
  • Area usage for each PE: depends on W
  • Average toggle rate for the PE datapath and local memory (application constant)
Core power model (3/4)
• Memory resource usage
  • LCs used as distributed memory (16x1 bits)
  • Datapath is a design constant (library based)
• Area cost for a PE array: [equation lost in extraction; its annotations: datapath functional cost, number of PEs, register width along processor space k, clustering parameter along processor space j]
Core power model (4/4)
• Energy cost for the whole loop nest
  • We have Ec = Pc.ncycle.Tcycle
  • We consider ncycle = Vcalc/np
• Total core energy cost: [equation lost in extraction; its annotations: average toggle rate, total loop computation volume]
• Energy does not depend on np!
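One way to see why np cancels is sketched below. It is not the slide's own derivation (that formula is missing here). It assumes that static power is neglected, that the dynamic core power is proportional to the number of PEs, and that Tcycle = 1/f; the per-PE cell count $a_{pe}$ is a hypothetical symbol introduced for the illustration.

```latex
\begin{align*}
P_c &\approx K \, D \, (n_p \, a_{pe}) \, f
  && \text{(dynamic power, static term neglected)} \\
E_c &= P_c \, n_{cycle} \, T_{cycle}
     = K \, D \, n_p \, a_{pe} \, f \cdot \frac{V_{calc}}{n_p} \cdot \frac{1}{f} \\
    &= K \, D \, a_{pe} \, V_{calc}
\end{align*}
```

Under these assumptions both np and f cancel, leaving an energy that depends on the toggle rate and the total computation volume, consistent with the two annotations that survived on the slide.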
Content
• Context and motivations
• Target architectures
• Partitioning
• Modeling power
• Experimental results
  • Model validation
  • Extrapolations
• Conclusion
IO power model results
Core power model results
System power model
Content
• Context and motivations
• Target architectures
• Partitioning
• Modeling power
• Experimental results
• Conclusion
  • Solving the optimization problem (Lagrange multipliers)
  • Custom cache for embedded CPUs
  • Extension to SAREs (affine dependencies)
Conclusion
• The models match the experiments
  • Cheap measurement setup
  • Many board components contribute to the measured current (LEDs, PCI, etc.)
• Observations
  • The trade-off evolves with technology
  • More sensitive for ASICs?
Future work (1/2)
• Formulation of the optimization problem
  • Minimize energy per iteration
  • Constraints on performance and area
• Analytical solution?
  • Lagrange multipliers
  • No closed form for n > 3
  • But fast numerical methods exist
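To make the optimization problem concrete, here is a toy version solved by exhaustive search rather than by Lagrange multipliers. Everything in it is an assumption made for illustration: the objective counts only IO volume per iteration, using the three footprints from the IO volume example; a tile is assumed to perform H.w1.w2 iterations; and the area constraint is reduced to w1.w2 <= max_pes. The function names are hypothetical.

```python
def io_per_iteration(w1, w2, h):
    """IO volume per loop iteration for the three-variable example."""
    tile_io = 2 * h * w1 + 2 * h * w2 + 4 * w1 * w2
    return tile_io / (h * w1 * w2)   # assumed: h*w1*w2 iterations per tile

def best_tile(max_pes, h):
    """Exhaustively search integer tile sizes with w1*w2 <= max_pes."""
    best = None
    for w1 in range(1, max_pes + 1):
        for w2 in range(1, max_pes // w1 + 1):
            cost = io_per_iteration(w1, w2, h)
            if best is None or cost < best[0]:
                best = (cost, w1, w2)
    return best[1], best[2]

w1, w2 = best_tile(max_pes=16, h=100)
```

For this toy objective the search favours square tiles, because the per-iteration cost reduces to 2/w1 + 2/w2 + 4/H. A realistic formulation would add the core energy term and the performance constraint listed above.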
Future work (2/2)
• Model for embedded CPUs
  • Trade-off between cache size and memory accesses
  • Determine the optimal cache size and the associated tiling parameters
• Extension to SAREs?
  • Affine dependencies
  • More general loops