Connect On the Fly: Enhancing and Prototyping of Cycle-Reconfigurable Modules Hao Zhou∗, Xinyu Niu†, Junqi Yuan∗, Lingli Wang∗, Wayne Luk† †Dept. of Computing, Imperial College London, UK ∗School of Microelectronics, Fudan University, China
Outline • Motivation • EURECA architecture • Architecture optimisation • Results • Summary
Summary of Contributions 1. Reconfiguration strategies for data access patterns • Three categories for dynamic data accesses • Reconfiguration modes: maximise configuration sharing 2. Architecture design space exploration • Multiplexers or permutation network • Optimal EURECA module size 3. Prototype chip layout • Manually developed with SMIC 130-nm technology • Results measured using Cadence tools
Static vs dynamic data access • conditional arithmetic operators • dynamic data access patterns
Static access (easy):
    for (i = 0; i < n; i += N)
      #parallel unroll N
      for (j = 0; j < N; j++) {
        k = i + j;
        d[k] = a[k+1] * c[k];
      }
Dynamic access (hard):
    for (i = 0; i < n; i += N)
      #parallel unroll N
      for (j = 0; j < N; j++) {
        k = i + j;
        d[k] = a[b[k+1]] * c[k];
      }
Example: dynamic data access
    for (i = 0; i < n; i++) {
      d[i] = a[b[i+1]] * c[i];
    }
Connections for one data-path: easy
Example: dynamic data access
    for (i = 0; i < n; i += 32)
      #parallel unroll N=32
      for (j = 0; j < 32; j++) {
        k = i + j;
        d[k] = a[b[k+1]] * c[k];
      }
Connections for 32 data-paths: hard
Implementation 1: multiplexers • congested routing • 1024-to-1024 bit connections (32 output ports, each 32 bits wide) • unroutable in XC6V-SX475T • expensive user multiplexers
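The 1024-bit figure on this slide follows directly from the port count and width. The sketch below reproduces that arithmetic, assuming the multiplexer implementation is a full crossbar in which every output port can select any input port. The names `select_bits`, `crossbar_cost` and `full_crossbar_cost` are hypothetical and do not come from the slides.

```c
#include <assert.h>

/* Smallest s with 2^s >= n: number of select lines of an n-to-1 multiplexer. */
unsigned select_bits(unsigned n)
{
    assert(n > 0);
    unsigned s = 0;
    while ((1u << s) < n)
        s++;
    return s;
}

/* Hypothetical cost summary of a full crossbar (assumption, not from the slides). */
typedef struct {
    unsigned input_bits;   /* data wires entering the network */
    unsigned output_bits;  /* data wires leaving the network */
    unsigned mux_inputs;   /* fan-in of the multiplexer behind each output bit */
    unsigned select_wires; /* select lines, shared by all bits of one output port */
} crossbar_cost;

crossbar_cost full_crossbar_cost(unsigned ports, unsigned width)
{
    crossbar_cost c;
    c.input_bits = ports * width;
    c.output_bits = ports * width;
    c.mux_inputs = ports;
    c.select_wires = ports * select_bits(ports);
    return c;
}
```

Calling `full_crossbar_cost(32, 32)` gives 1024 input bits and 1024 output bits, each output bit driven by a 32-to-1 multiplexer, which is consistent with the routing congestion reported above.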
EURECA module: execution flow • CDN: Configuration Distribution Network • CG: Configuration Generator • [Figure labels: initial configuration, data-paths, initial connections]
Run-time reconfiguration flow • [Figure labels: CG, initial configuration, initial connections, run-time configuration, data-path data, memory data]
1. Optimising Data Access Patterns • Static (a): accesses with fixed strides • Dynamic size (b): linear accesses with variable vector size • Dynamic offset (c): vector access with dynamic offsets • Random (d): each access with a dynamic offset
Dynamic FIFOs • Dynamic size: dynamic FIFOs (BRAMs organised as FIFOs) • A single configuration is shared by all runtime-reconfigurable connections in a EURECA module • Enable signals to FIFOs are dynamically connected to the corresponding FIFO ports
Dynamic cache • Dynamic offset: dynamic cache (BRAMs as a shared cache) • A single configuration is shared by all runtime-reconfigurable connections • Input addresses to BRAMs are dynamically connected
Dynamic shared cache • Random: dynamic shared cache, with access conflicts • Access conflicts happen when two or more data-paths try to access the same port at the same time (the lower two data-paths in the figure) • Each reconfigurable connection has a separate configuration • Conflicted ports (the lowest data-path) are disabled by the scheduler
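The conflict rule on this slide can be sketched in software. This is a minimal model, assuming a fixed-priority policy in which the first data-path to request a port keeps it and any later data-path requesting the same port in the same cycle is disabled. The slides do not state the arbitration policy, and `schedule_ports` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* For one cycle, decide which data-paths may access their requested memory port.
 * requested_port[p] is the port wanted by data-path p.
 * Assumed policy (not from the slides): the lowest-indexed requester wins. */
void schedule_ports(const unsigned *requested_port, size_t num_paths,
                    unsigned num_ports, bool *enabled)
{
    for (size_t p = 0; p < num_paths; p++) {
        assert(requested_port[p] < num_ports);
        enabled[p] = true;
        /* Any earlier data-path asking for the same port causes a conflict. */
        for (size_t q = 0; q < p; q++) {
            if (requested_port[q] == requested_port[p]) {
                enabled[p] = false;
                break;
            }
        }
    }
}
```

With requests `{0, 1, 2, 2}` the first three data-paths are enabled and the fourth is disabled, matching the figure's case where the lower two data-paths collide and the lowest one is switched off.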
2. Optimising Connection Network • Multiplexers or permutation network? • Three parameters to evaluate: • Number of pins to support runtime reconfiguration • Silicon area to implement the network • Logic required to generate runtime configurations for the network
Connection Network • Multiplexers or permutation network? • Multiplexers are selected, as configurations can be shared • Reconfiguring a permutation network within a single cycle takes an unacceptable number of pins and amount of configuration generation logic
Architecture Efficiency • Architecture efficiency is evaluated as the area saved by runtime reconfiguration, multiplied by the overall area overhead • Quantities involved: original chip area, original application resource usage, runtime-reconfigurable application resource usage, EURECA chip area
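The equation on this slide did not survive extraction. One plausible reading of the bullet and its four quantities, stated here as an assumption rather than as the authors' formula, is a product of two ratios: the application area saving and the chip area penalty. The symbols are introduced only for this sketch.

```latex
\[
\text{efficiency}
  = \underbrace{\frac{A_{\text{app}}^{\text{orig}}}{A_{\text{app}}^{\text{rr}}}}_{\text{area saved by reconfiguration}}
    \times
    \underbrace{\frac{A_{\text{chip}}^{\text{orig}}}{A_{\text{chip}}^{\text{EURECA}}}}_{\text{area overhead}}
\]
```

Here $A_{\text{app}}^{\text{orig}}$ and $A_{\text{app}}^{\text{rr}}$ are the application resource usage without and with runtime reconfiguration, and $A_{\text{chip}}^{\text{orig}}$ and $A_{\text{chip}}^{\text{EURECA}}$ are the chip areas without and with EURECA modules. Under this reading, a value above 1 means the application saving outweighs the added module area.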
EURECA Optimisation • Connection network: multiplexers or permutation network • Memory group size: the number of BRAM blocks connected to a single EURECA module • Circuit models for application area reduction and module area • Small memory group size: small reductions in design area • Large memory group size: large module area overhead • Efficiency reaches its maximum at 32 BRAMs, supported by circuit models
3. EURECA Prototype • Prototype layout developed with SMIC 130-nm technology • Classic island-style FPGA architecture adapted • Small FPGA size due to tape-out budget • A single EURECA module added to one column of BRAMs (8) • A EURECA module has the same area as 2.72 CLB columns • A EURECA module adds 1.15 ns of delay (typical circuit critical-path delay on this architecture: 20-40 ns)
Application 1: Large-scale Sorting • Large-scale sorting: dynamic FIFOs • In each cycle, an unknown number of values is committed from each FIFO • The starting address changes from cycle to cycle
Application 2: Memcached • Memcached: dynamic cache • A data vector pointed to by a dynamic pointer is read each cycle • The address changes from cycle to cycle
Application 3: SpMV • SpMV: dynamic shared cache • Accessed locations in the vector depend on the positions of non-zeros in the sparse matrix • Each vector access is random • A scheduler module enables only one runtime reconfiguration per memory port
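To make the random access pattern concrete, here is a plain software sketch of sparse matrix-vector multiplication. It assumes the matrix is stored in compressed sparse row (CSR) form, which the slides do not specify, and `spmv_csr` is a hypothetical name. The read `x[col_idx[k]]` is the data-dependent access that the dynamic shared cache has to serve.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a sparse matrix A in CSR form (assumed storage format).
 * row_ptr has rows + 1 entries; col_idx and val hold the non-zeros row by row. */
void spmv_csr(size_t rows, size_t cols, const size_t *row_ptr,
              const size_t *col_idx, const double *val,
              const double *x, double *y)
{
    for (size_t r = 0; r < rows; r++) {
        double sum = 0.0;
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; k++) {
            /* The vector location read here depends on where the non-zero sits. */
            assert(col_idx[k] < cols);
            sum += val[k] * x[col_idx[k]];
        }
        y[r] = sum;
    }
}
```

For the 2x3 matrix with rows (1, 0, 2) and (0, 3, 0) and x = (1, 2, 3), the result is y = (7, 6). When several such reads are unrolled into parallel data-paths, two of them may target the same memory port in one cycle, which is the conflict case the scheduler handles.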
Experimental Setup • Target chip • Prototype EURECA chip • Area and delay measured from Cadence simulation results • Synthesis environment • Design Compiler (DC) for circuit synthesis • ABC for mapping • A graph matching algorithm for packing • Simulated annealing algorithm for placement • PathFinder for routing
Application Performance • Three implementations developed for each application • The Static design uses LUTs to support all dynamic data accesses • EURECA supports dynamic accesses with the original architecture • Dynamic uses the optimised architecture, which supports maximum configuration sharing
Application Performance • Three implementations developed for each application • Up to 11.2 times area-delay reduction compared with the static design • Up to 1.39 times area-delay reduction compared with the original EURECA architecture
Application Performance • Area-delay reduction grows linearly as the number of BRAMs connected to a EURECA module grows • Area reduction increases to 16 for 32 BRAMs
Current and future work • EURECA full-stack compiler • EURECA programming models • operation mapping • data access mapping • many-region communication • EURECA simulator for architecture optimisation • Automatically explore the design space with generated architecture files and Verilog modules
Summary: EURECA • Routing challenge: dynamic data access applications • Three data access patterns • Optimised reconfiguration strategies • Design space exploration and prototype layout • Experimental results: prototype layout + synthesis tool • small area and delay overhead • Application performance • up to 1/11.2x area-delay product compared with the static architecture • up to 1/1.39x area-delay product compared with the initial architecture