Paper Review I: Coarse Grained Reconfigurable Arrays Presented By: Matthew Mayhew I.D.# 0234815 ENG*6530 Tues, June 10, 2008
References • Link 2: Chapter 2: Coarse-Grained Reconfigurable Architectures • Parizi, H.; Niktash, A.; Bagherzadeh, N.; Kurdahi, F.; MorphoSys: A Coarse Grain Reconfigurable Architecture for Multimedia Applications, Euro-Par 2002 Parallel Processing, 8th International Euro-Par Conference, Proceedings (Lecture Notes in Computer Science Vol. 2400), 2002, p 844-8
References Cont. • Sadasivam, M.; Hong, S.; Application Specific Coarse-Grained FPGA for Processing Element in Real-Time Parallel Particle Filters, Proceedings 3rd IEEE International Workshop on System-on-Chip for Real-Time Applications, 2003, p 116-19 • Veredes, F.; Scheppler, M.; Moffat, W.; Mei, B.; Custom Implementation of the Coarse-Grained Reconfigurable ADRES Architecture for Multimedia Purposes, Proceedings 2005 International Conference on Field Programmable Logic and Applications (IEEE Cat. No.05EX1155), 2005, p 106-11
Overview • Introduction • Basic Concepts • Classifications • General Architectures • Research Architectures • MorphoSys • Architecture for Dynamically Reconfigurable Embedded System (ADRES) • Coarse Grained FPGA for parallel particle processing • Project Summary
Problems with Fine Grained FPGAs • Wide datapaths constructed of bit level elements to allow for processing on individual bits. • Requires a high volume of reconfiguration data for the processing elements and routing switches. • Difficulty in mapping from high level languages due to the difference in granularity.
Coarse Grained Architectures • Constructed from multi-bit wide datapaths and complex operators. • Wide datapath allows for the implementation of complex operators, reducing routing overhead. • Connections between CGRA processing elements are multiple bits wide. As such, each connection takes more area, but fewer connections are needed.
Classification of Architectures • Coarse Grained Architectures are classified based on three criteria: • Interconnect Structure • Mesh-based • Linear Array • Crossbar • Datapath Width • Tradeoff between flexibility and area consumption • Reconfiguration Method • Static • Dynamic
Basic Architectures: Mesh-Based • Processing Elements arranged in a rectangular array with horizontal and vertical connections.
Mesh-Based Continued • Structure allows for good parallelism and use of communication resources. • Requires good tools for Place and Route. • Arrangement encourages Nearest Neighbour (NN) links, but generally has lines for longer connections.
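The nearest-neighbour links of a mesh can be illustrated with a short Python sketch. It assumes a plain rectangular grid with only horizontal and vertical NN links; the function name `mesh_neighbours` and the (row, column) addressing are hypothetical, and the longer connection lines mentioned above are not modelled.

```python
def mesh_neighbours(row, col, rows, cols):
    """Return the nearest-neighbour coordinates of PE (row, col) in a rows x cols mesh.

    Assumption: each processing element links only to the PE directly above,
    below, left, and right of it; edge PEs simply have fewer neighbours.
    """
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


print(mesh_neighbours(0, 0, 4, 4))  # corner PE: [(1, 0), (0, 1)]
print(mesh_neighbours(1, 2, 4, 4))  # interior PE: [(0, 2), (2, 2), (1, 1), (1, 3)]
```

An interior PE has four neighbours and a corner PE only two, which is one reason placement quality matters for this structure.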
Basic Architectures: Linear Array • Processing elements arranged in a linear fashion with neighbours generally connected. • Generally designed for the implementation of pipelined processes.
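The pipelined behaviour a linear array is designed for can be sketched as a cycle-by-cycle simulation in Python. This is an illustrative model only: `run_pipeline`, the one-register-per-stage layout, and the use of `None` to mark an empty slot are assumptions, not details taken from any of the reviewed architectures.

```python
def run_pipeline(stages, inputs):
    """Simulate a linear array of processing elements, one function per stage.

    regs[i] holds the value produced by stage i in the previous cycle.
    None marks an empty pipeline slot, so None cannot be used as data.
    """
    regs = [None] * len(stages)
    outputs = []
    # Feed every input, then add empty cycles so the pipeline can drain.
    for item in list(inputs) + [None] * len(stages):
        # The value held by the last stage leaves the array this cycle.
        if regs[-1] is not None:
            outputs.append(regs[-1])
        # Update from the back so each stage reads its neighbour's old value.
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](item) if item is not None else None
    return outputs


stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
print(run_pipeline(stages, [1, 2, 3]))  # [1, 3, 5]
```

Once the pipeline is full, one result leaves the array per cycle even though each item passes through three stages.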
Basic Architectures: Crossbar • All Processing Elements connected by a matrix of switches, allowing for arbitrary connections. • Simple routing task. • Due to implementation restrictions, a reduced crossbar that connects clusters of processing elements is more common.
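A rough switch count suggests why a full crossbar is rarely built. The Python sketch below uses a deliberately simple cost model that is an assumption, not something stated in the reviewed material: a full crossbar needs one switch per input/output pair, and the reduced version uses a small full crossbar inside each cluster plus one crossbar among the clusters, with a single port per cluster.

```python
def full_crossbar_switches(n_pes):
    """One switch per (source, destination) pair of processing elements."""
    return n_pes * n_pes


def clustered_crossbar_switches(n_pes, cluster_size):
    """Switch count for a hypothetical two-level (clustered) crossbar.

    Assumption: a full crossbar inside every cluster, plus one crossbar that
    connects the clusters to each other through a single port per cluster.
    """
    if n_pes % cluster_size != 0:
        raise ValueError("n_pes must be a multiple of cluster_size")
    n_clusters = n_pes // cluster_size
    return n_clusters * cluster_size ** 2 + n_clusters ** 2


print(full_crossbar_switches(64))          # 4096
print(clustered_crossbar_switches(64, 8))  # 576
```

Under this model the clustered version needs far fewer switches, at the cost of no longer allowing every arbitrary connection at once.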
MorphoSys • Designed to handle multimedia applications. • Due to varied tasks and a large amount of input/output data, ASIC solutions are generally expensive to develop and general purpose processors (GPPs) are inefficient. • Currently in version M2, with research ongoing.
System Architecture • The system level architecture of the MorphoSys system is shown below. [Figure source: Parizi, H.; Niktash, A.; Bagherzadeh, N.; Kurdahi, F.]
RC Cell Architecture • The layout of an individual reconfigurable cell is shown below. [Figure source: Parizi, H.; Niktash, A.; Bagherzadeh, N.; Kurdahi, F.]
Benefits of MorphoSys • Combination of both fine and coarse grained reconfigurable elements allows for customization and optimization depending on the application. • Memory structure designed to accommodate the high demand for data movement in multimedia applications.
Evaluation • Tested with several operations common in multimedia and DSP applications. • Tested against dedicated DSP boards. Parizi, H.; Niktash, A.; Bagherzadeh, N.; Kurdahi, F.
ADRES • Designed to achieve specified performance and power consumption targets for portable wireless media applications. • Test application for the architecture was an H.264/AVC decoder. • The ADRES architecture consists of a VLIW processor coupled with an array of coarse grained processing cells for acceleration.
ADRES Architecture • VLIW processor optimized for load/store and control operations. • The accelerator component optimized for data-flow with branching supported. • Each reconfigurable cell contains a local register file, allowing for iterative data processing and data delay. • Each reconfigurable cell can communicate with all cells in its row and column, as well as neighbouring cells within its quadrant.
System Level View • When running in acceleration mode, an 8x8 array can be formed by configuring the VLIW elements. Veredes, F.; Scheppler, M.; Moffat, W.; Mei, B.
ADRES Reconfigurable Cell • While the configuration memory is assumed to be static during execution, dynamic reconfiguration is possible using a pointer. Veredes, F.; Scheppler, M.; Moffat, W.; Mei, B.
Performance and Implementation • ADRES found to be 88% faster overall in a full decoding cycle than a standard VLIW processor. • Layout study performed using 0.13 μm technology standard cells. • Each reconfigurable cell consumes approximately 0.196 mm². • Configuration memory accounts for around 50% of a cell, with 83% of the area in the full implementation used for various storage elements.
Parallel Particle Filter Processor • Particle filters are used in non-linear problems where the goal is to track or detect dynamic signals. • Target application of designed system is the real-time tracking of a ball-bearing, where the goal is to determine the coordinates and velocity of the target using a given input angle. • Need to generate new particles, determine appropriate weights, and resample.
Operations • Both the generation of new particles and determining the weights are performed using processing elements. • This involves the calculation of w(m), which is the weight of a particle, and f(m), which is determined by the application.
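The three steps named above (generate new particles, determine weights, resample) can be sketched in Python as one step of a generic particle filter. This is not the authors' hardware design: `particle_filter_step`, `propagate`, and `weight_fn` are hypothetical names, and `weight_fn` only loosely plays the role of w(m). How f(m) enters the calculation depends on the application and is not modelled here.

```python
import random


def particle_filter_step(particles, propagate, weight_fn, measurement, rng):
    """One generic sampling / weighting / resampling step (illustrative only)."""
    # 1. Generate new particles by passing each one through the assumed model.
    candidates = [propagate(p, rng) for p in particles]
    # 2. Determine a weight for each particle m from the current measurement.
    weights = [weight_fn(measurement, p) for p in candidates]
    if sum(weights) <= 0.0:
        # Degenerate case: no particle explains the measurement, keep them all.
        return candidates
    # 3. Resample: draw particles with probability proportional to their weight.
    return rng.choices(candidates, weights=weights, k=len(candidates))


rng = random.Random(0)
step = particle_filter_step(
    [0, 1, 2],
    lambda p, r: p + 1,                    # hypothetical motion model
    lambda z, p: 1.0 if p == z else 0.0,   # hypothetical weight function
    2,
    rng,
)
print(step)  # [2, 2, 2]
```

The propagation and weighting loops touch each particle independently, which is the part that lends itself to parallel processing elements; the resampling step needs all weights at once.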
System Level Architecture • Consists of both parallel and sequential data flow, with a buffer to synchronize their behaviour. Sadasivam, M.; Hong, S.
Sequential Flow Reconfigurable Slice (SFRS) • Responsible for the calculation of f(m), with direct access to the buffer unit. Sadasivam, M.; Hong, S.
Parallel Flow Reconfigurable Slice (PFRS) • Handles updating, creating, and outputting the particles. Sadasivam, M.; Hong, S.
Reconfiguration • The architecture can be altered by changing: • The way in which particles are generated • The way in which particles update • The output method • The update of particles can be altered by reconfiguring the CORDIC unit used in the calculation of f(m), which also stores needed constants and MUX controls. • The control unit is used to control the interconnects in the SFRS to implement the desired function.
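For readers unfamiliar with CORDIC, the sketch below shows the algorithm in rotation mode computing a sine and cosine in Python. The slides do not say which CORDIC mode or which function the unit is configured for, so this is only a generic illustration; the name `cordic_sin_cos` is hypothetical, and floating point is used for readability where hardware would use fixed-point shifts and adds.

```python
import math


def cordic_sin_cos(angle, iterations=16):
    """Approximate (sin, cos) of an angle in radians, for |angle| <= pi/2.

    Rotation-mode CORDIC: rotate the vector (gain, 0) towards the target angle
    using micro-rotations of atan(2^-i), choosing the direction by the sign of
    the remaining angle z.
    """
    atan_table = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Pre-compensate the constant gain introduced by the micro-rotations.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0
        # In hardware the multiplications by 2^-i are simple right shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atan_table[i]
    return y, x


s, c = cordic_sin_cos(0.5)
print(abs(s - math.sin(0.5)) < 1e-3, abs(c - math.cos(0.5)) < 1e-3)
```

The arctangent table and the gain constant are the kind of stored constants a reconfigurable CORDIC unit would need to hold.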
Performance • Tested against both a DSP processor and a general purpose FPGA. • It should be noted that the authors reported problems in terms of having enough logic elements to map all the required PEs on the general purpose FPGA. • The results are shown in the table below for the calculation times of both f(m) and w(m).
Conclusions • Coarse Grained reconfigurable architectures are generally used in either calculation or I/O heavy applications. • No single best design, with the architecture layout highly dependent on design goals. • Performance generally favourable when compared to dedicated processors and general purpose FPGAs.
Project • Goal: Implementation of the Advanced Encryption Standard (AES) algorithm using VHDL. • Secondary Goal: Implement the algorithm in such a way as to reduce the area consumption and computation time.
Progress • Algorithm examined in terms of where parallelism and alternative implementations can be considered. • While individual rounds must be performed sequentially, “blocks” of data within a given operation can be acted upon in parallel. • Efficient implementation of the S-box and MixColumns operations is crucial to a good design.
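The parallelism within an operation can be seen in a small Python reference model of MixColumns. This is a software sketch for checking results, not the project's VHDL; the function names are hypothetical, and the state is assumed to be stored as four 4-byte columns. Each column is transformed independently, so the four calls could run in parallel in hardware.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def mix_column(col):
    """Apply the AES MixColumns matrix to one 4-byte column."""
    a0, a1, a2, a3 = col
    # Multiplying by 3 is xtime(a) XOR a; multiplying by 2 is xtime(a).
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def mix_columns(state):
    """Transform every column of the state; the columns do not depend on each other."""
    return [mix_column(col) for col in state]


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
# ['0x8e', '0x4d', '0xa1', '0xbc']
```

Because multiplication by 2 reduces to a shift and a conditional XOR, the whole operation maps to XOR networks rather than general multipliers, which matters for the area goal.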
Thank you for your time. Questions?