The paper discusses techniques to optimize data access patterns in cycle-reconfigurable modules, examining reconfiguration strategies, architectural design, and prototype chip layout. It explores static versus dynamic data access, conditional arithmetic operators, and dynamic data access patterns. The study evaluates the implementation of multiplexers, dynamic FIFOs, and shared caches for improved efficiency. The EURECA architecture and its optimization methods are detailed, enabling better resource utilization in applications like large-scale sorting, Memcached, and SpMV computations.
Connect On the Fly: Enhancing and Prototyping of Cycle-Reconfigurable Modules Hao Zhou∗, Xinyu Niu†, Junqi Yuan∗, Lingli Wang∗, Wayne Luk† †Dept. of Computing, Imperial College London, UK ∗School of Microelectronics, Fudan University, China
Outline • Motivation • EURECA architecture • Architecture optimisation • Results • Summary
Summary of Contributions 1. Reconfiguration strategies for data access patterns • Three categories for dynamic data accesses • Reconfiguration modes: maximise configuration sharing 2. Architecture design space exploration • Multiplexers or permutation network • Optimal EURECA module size 3. Prototype chip layout • Manually developed with SMIC 130-nm technology • Results measured using Cadence tools
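Static vs dynamic data access • conditional arithmetic operators • dynamic data access patterns
Static (easy):
  for (i = 0; i < n; i += N)
    #parallel unroll N
    for (j = 0; j < N; j++) {
      k = i + j;              /* i already advances in steps of N */
      d[k] = a[k+1] * c[k];   /* index into a is fixed by k */
    }
Dynamic (hard):
  for (i = 0; i < n; i += N)
    #parallel unroll N
    for (j = 0; j < N; j++) {
      k = i + j;
      d[k] = a[b[k+1]] * c[k];   /* index into a is read from b at run time */
    }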
Example: dynamic data access
  for (i = 0; i < n; i++) {
    d[i] = a[b[i+1]] * c[i];
  }
Connections for one data-path: easy
Example: dynamic data access
  for (i = 0; i < n; i += 32)
    #parallel unroll N=32
    for (j = 0; j < 32; j++) {
      k = i + j;              /* i already advances in steps of 32 */
      d[k] = a[b[k+1]] * c[k];
    }
Connections for 32 data-paths: hard
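The difference between the two loops can be made concrete in plain C. The sketch below is sequential (the `#parallel unroll` directive is dropped), and the function names and the `a_len` bounds parameter are illustrative assumptions, not part of the slides. In the static version the index into `a` is a fixed function of `k`, so each data-path can be wired to a fixed memory port. In the dynamic version the index is read from `b` at run time, so any data-path may need any port.

```c
#include <assert.h>
#include <stddef.h>

/* Static access: a[k+1] is a fixed function of the loop index.
 * a must hold at least n+1 elements. */
void static_access(const int *a, const int *c, int *d, size_t n) {
    for (size_t k = 0; k < n; k++) {
        d[k] = a[k + 1] * c[k];
    }
}

/* Dynamic access: the index into a is read from b at run time.
 * b must hold at least n+1 elements; a_len is a hypothetical
 * bounds parameter added here only for safety checking. */
void dynamic_access(const int *a, const int *b, const int *c, int *d,
                    size_t n, size_t a_len) {
    for (size_t k = 0; k < n; k++) {
        size_t idx = (size_t)b[k + 1];
        assert(idx < a_len);      /* the target location is only known here */
        d[k] = a[idx] * c[k];
    }
}
```

With 32 such data-paths running in the same cycle, each `idx` may point at a different memory port, which is what makes the connections hard to route.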
Implementation 1: multiplexers • congested routing • 1024-to-1024 bit connections • unroutable in XC6V-SX475T • expensive user multiplexers • 32 output ports, each 32 bits wide
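A rough count shows where the 1024-bit figure comes from. The sketch assumes a word-level crossbar in which every output port selects among all input ports through a tree of 2-to-1 multiplexers per bit. This cost model and the function names are assumptions for illustration, not figures from the slides.

```c
#include <assert.h>

/* Total signal bits on one side of the connection network. */
unsigned total_bits(unsigned ports, unsigned width) {
    return ports * width;
}

/* Assumed cost model: a ports-to-1 selector built from (ports - 1)
 * two-input multiplexers, replicated per bit and per output port. */
unsigned mux2_count(unsigned ports, unsigned width) {
    assert(ports >= 1);
    return ports * width * (ports - 1);
}

/* Smallest b such that 2^b >= n. */
unsigned ceil_log2(unsigned n) {
    unsigned b = 0;
    while ((1u << b) < n) {
        b++;
    }
    return b;
}

/* Select bits needed when every output port picks one input port. */
unsigned select_bits(unsigned ports) {
    return ports * ceil_log2(ports);
}
```

For 32 ports of 32 bits, `total_bits` gives 1024, matching the slide, and under the assumed model `mux2_count` gives 31744 two-input multiplexers.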
EURECA module: execution flow • CDN: Configuration Distribution Network (figure labels: initial configuration)
EURECA module: execution flow • CG: Configuration Generator (figure labels: data-paths, CG, initial connections)
Run-time reconfiguration flow (figure labels: CG, runtime configuration, initial connections)
Run-time reconfiguration flow (figure labels: CG, initial configuration)
Run-time reconfiguration flow (figure labels: data-path data, CG, initial configuration, memory data)
1. Optimising Data Access Patterns • Static (a): accesses with fixed strides • Dynamic size (b): linear accesses with variable vector size • Dynamic offset (c): vector access with dynamic offsets • Random (d): each access with a dynamic offset
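The four categories differ in which part of the per-cycle address set is known only at run time. The sketch below generates the addresses seen by a group of parallel data-paths ("lanes") under each category. The type names, field names, and the choice of base address 0 are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    ACCESS_STATIC,          /* (a) fixed stride */
    ACCESS_DYNAMIC_SIZE,    /* (b) linear, variable vector size */
    ACCESS_DYNAMIC_OFFSET,  /* (c) vector access, dynamic offset */
    ACCESS_RANDOM           /* (d) every lane has its own offset */
} access_pattern;

typedef struct {
    size_t stride;               /* (a): known at compile time */
    size_t size;                 /* (b): valid lanes this cycle */
    size_t offset;               /* (c): shared base this cycle */
    const size_t *lane_offsets;  /* (d): one run-time offset per lane */
} access_params;

/* Fills addr[] for one cycle and returns the number of valid lanes. */
size_t lane_addresses(access_pattern p, const access_params *q,
                      size_t lanes, size_t *addr) {
    size_t valid = lanes;
    assert(q != NULL && addr != NULL);
    if (p == ACCESS_DYNAMIC_SIZE && q->size < lanes) {
        valid = q->size;
    }
    for (size_t j = 0; j < valid; j++) {
        switch (p) {
        case ACCESS_STATIC:         addr[j] = j * q->stride;      break;
        case ACCESS_DYNAMIC_SIZE:   addr[j] = j;                  break;
        case ACCESS_DYNAMIC_OFFSET: addr[j] = q->offset + j;      break;
        case ACCESS_RANDOM:         addr[j] = q->lane_offsets[j]; break;
        }
    }
    return valid;
}
```

Only the random case allows two lanes to produce the same address in one cycle.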
Dynamic FIFOs • Dynamic size: dynamic FIFOs (BRAMs organised as FIFOs) • A single configuration is shared by all runtime reconfigurable connections in a EURECA module • Enable signals to FIFOs are dynamically connected to corresponding FIFO ports
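One way to picture the shared configuration is as a rotation: if the FIFOs are arranged in a ring, a read of `count` values starting at FIFO `head` needs exactly `count` consecutive enable signals, so a single pair (head, count) describes every connection. The ring arrangement, the 32-FIFO limit, and the function names are assumptions for this sketch; the slides do not specify the encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a bit mask with one enable bit per FIFO: `count` consecutive
 * FIFOs starting at `head`, wrapping around a ring of `banks` FIFOs. */
uint32_t fifo_enable_mask(unsigned head, unsigned count, unsigned banks) {
    uint32_t mask = 0;
    assert(banks >= 1 && banks <= 32);
    assert(head < banks && count <= banks);
    for (unsigned i = 0; i < count; i++) {
        mask |= (uint32_t)1 << ((head + i) % banks);
    }
    return mask;
}

/* The starting FIFO for the next cycle moves past the values just read. */
unsigned fifo_next_head(unsigned head, unsigned count, unsigned banks) {
    assert(banks >= 1);
    return (head + count) % banks;
}
```

For example, with 8 FIFOs, reading 3 values from FIFO 6 enables FIFOs 6, 7, and 0, and the next read starts at FIFO 1.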
Dynamic cache • Dynamic offset: dynamic cache (BRAMs as shared cache) • A single configuration is shared by all runtime reconfigurable connections • Input addresses to BRAMs are dynamically connected
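A vector read at a dynamic offset can be sketched as follows, assuming the vector is interleaved word by word across the BRAMs (element `e` lives in bank `e % banks` at row `e / banks`). The interleaving scheme and the function name are assumptions, not taken from the slides. Because every lane is shifted by the same offset, one shared configuration value is enough to steer all lanes.

```c
#include <assert.h>
#include <stddef.h>

/* For a vector read of `banks` consecutive elements starting at `offset`,
 * computes which bank feeds each lane and which row address that bank
 * receives. Both output arrays must hold `banks` entries. */
void vector_read_map(size_t offset, size_t banks,
                     size_t *bank_of_lane, size_t *row_of_lane) {
    assert(banks >= 1);
    for (size_t j = 0; j < banks; j++) {
        size_t e = offset + j;        /* element index requested by lane j */
        bank_of_lane[j] = e % banks;  /* rotation shared by all lanes */
        row_of_lane[j] = e / banks;   /* address presented to that bank */
    }
}
```

Within one read, each bank is used by exactly one lane, so no access conflicts arise in this pattern.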
Dynamic shared cache • Random: dynamic shared cache, with access conflicts • Access conflicts happen when 2 or more data-paths try to access the same port at the same time (the lower two data-paths in the figure) • Each reconfigurable connection has a separate configuration • Conflicted ports (the lowest data-path) are disabled by the scheduler
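The scheduler's job can be sketched as a per-cycle arbitration: each data-path requests a memory port, and at most one requester per port is enabled. The priority rule used here (the lower-numbered data-path wins), the 64-port limit, and the function name are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* port_req[j] is the memory port requested by data-path j this cycle.
 * enable[j] is set to true if the request is granted. Returns the number
 * of granted requests. Assumed policy: first requester in index order wins. */
size_t schedule_ports(const unsigned *port_req, size_t lanes,
                      unsigned ports, bool *enable) {
    uint64_t taken = 0;   /* one bit per port already granted */
    size_t granted = 0;
    assert(ports >= 1 && ports <= 64);
    for (size_t j = 0; j < lanes; j++) {
        assert(port_req[j] < ports);
        uint64_t bit = (uint64_t)1 << port_req[j];
        enable[j] = (taken & bit) == 0;   /* conflict: port already in use */
        if (enable[j]) {
            taken |= bit;
            granted++;
        }
    }
    return granted;
}
```

A disabled data-path would have to retry in a later cycle; how the retry is handled is not described in the slides.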
2. Optimising Connection Network • Multiplexers or permutation network? • Three parameters to evaluate: • Number of pins to support runtime reconfiguration • Silicon area to implement the network • Logic required to generate runtime configurations for the network
Connection Network • Multiplexers or permutation network? • Multiplexers are selected, as their configurations can be shared • Reconfiguring a permutation network within a single cycle would take an unacceptable number of pins and amount of configuration generation logic
Architecture Efficiency • Architecture efficiency is evaluated as the area saved by runtime reconfiguration, multiplied by the overall area overhead • Terms in the formula: original chip area, original application resource usage, runtime reconfigurable application resource usage, EURECA chip area
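The formula itself did not survive in this transcript. One plausible reading of the description, offered here only as an assumption, is the product of two ratios: application area before versus after runtime reconfiguration, and chip area before versus after adding EURECA modules. The function and parameter names are hypothetical.

```c
#include <assert.h>

/* Assumed reading of the efficiency metric:
 *   (original application area / runtime reconfigurable application area)
 * x (original chip area / EURECA chip area)
 * All four arguments must use the same area unit. */
double architecture_efficiency(double orig_app_area, double rr_app_area,
                               double orig_chip_area, double eureca_chip_area) {
    assert(rr_app_area > 0.0 && eureca_chip_area > 0.0);
    double area_saving = orig_app_area / rr_app_area;          /* > 1 if the design shrinks */
    double chip_overhead = orig_chip_area / eureca_chip_area;  /* < 1 if the chip grows */
    return area_saving * chip_overhead;
}
```

Under this reading, a value above 1 means the application area saved outweighs the chip area added by the modules.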
EURECA Optimisation • Connection network: multiplexers or permutation network • Memory group size: the number of BRAM blocks connected to a single EURECA module • Circuit models for application area reduction and module area • Small memory group size: small reductions in design area • Large memory group size: large module area overhead • Efficiency reaches its maximum at 32 BRAMs, supported by circuit models
3. EURECA Prototype • Prototype layout developed with SMIC 130-nm technology • Classic island-style FPGA architecture adapted • Small FPGA size due to tape-out budget • A single EURECA module added to one column of BRAMs (8) • A EURECA module occupies the same area as 2.72 CLB columns • A EURECA module adds 1.15 ns of delay (typical circuit critical path delay on this architecture: 20-40 ns)
Application 1: Large-scale Sorting • Large-scale sorting: dynamic FIFOs • At each cycle, an unknown number of values is committed from each FIFO • The starting address changes from cycle to cycle
Application 2: Memcached • Memcached: dynamic cache • A vector of data pointed to by a dynamic pointer is read each cycle • The address changes from cycle to cycle
Application 3: SpMV • SpMV: dynamic shared cache • Accessed locations in the vector depend on the positions of non-zeros in the sparse matrix • Each vector access is random • A scheduler module enables only one runtime reconfiguration per memory port
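The random access in SpMV is visible in a standard sparse matrix-vector product. The sketch below uses the compressed sparse row (CSR) format, which is an assumption here since the slides do not name a storage format. The read `x[col_idx[k]]` is the dynamic access: its location depends on where the non-zeros sit in the matrix.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a matrix A stored in CSR form:
 *   row_ptr has rows + 1 entries; row i owns entries row_ptr[i] .. row_ptr[i+1]-1
 *   col_idx[k] is the column of the k-th non-zero, val[k] its value. */
void spmv_csr(size_t rows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y) {
    assert(row_ptr != NULL && y != NULL);
    for (size_t i = 0; i < rows; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            sum += val[k] * x[col_idx[k]];   /* data-dependent vector access */
        }
        y[i] = sum;
    }
}
```

When several non-zeros are processed in parallel, two of them may point at the same vector location or memory port in the same cycle, which is the conflict case the scheduler resolves.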
Experimental Setup • Target chip • Prototype EURECA chip • Area and delay measured from Cadence simulation results • Synthesis environment • Design Compiler (DC) for circuit synthesis • ABC for mapping • A graph matching algorithm for packing • Simulated annealing algorithm for placement • Path-finder for routing
Application Performance • Three implementations developed for each application • Static: the design uses LUTs to support all dynamic data accesses • EURECA: supports dynamic accesses with the original architecture • Dynamic: uses the optimised architecture, supporting maximum configuration sharing
Application Performance • Three implementations developed for each application • Up to 11.2 times area-delay reduction compared with the static design • Up to 1.39 times area-delay reduction compared with the original EURECA architecture
Application Performance • Area-delay reduction grows linearly as the number of BRAMs connected to a EURECA module grows • Area reduction increases to 16 for 32 BRAMs
Current and future work • EURECA full-stack compiler • EURECA programming models • operation mapping • data access mapping • many-region communication • EURECA simulator for architecture optimisation • Automatically explore the design space with generated architecture files and Verilog modules
Summary: EURECA • Routing challenge: dynamic data access applications • Three data access patterns • Optimised reconfiguration strategies • Design space exploration and prototype layout • Experimental results: prototype layout + synthesis tool • small area and delay overhead • Application performance • up to 1/11.2x area-delay product compared with static architecture • up to 1/1.39x area-delay product compared with initial architecture