
Simulation of Fracturable LUTs

Simulation of Fracturable LUTs. Tim Pifer. Presentation overview: Altera ALM from Stratix II; Stratix V architecture; current VPR method for fracturable LUTs; WireMap for technology mapping; AAPack for packing.


Presentation Transcript


  1. Simulation of Fracturable LUTs Tim Pifer

  2. Presentation Overview • Altera ALM from Stratix II • Stratix V architecture • Current VPR method for fracturable LUTs • WireMap for technology mapping • AAPack for packing

  3. Altera Adaptive Logic Module • Traditional 4LUTs provide the best area-delay product • Larger LUTs • Shorter critical path • Absorb more logic • Larger LUT mask • More input muxing • Reduce critical path depth by 20%, improving area. "Improving FPGA Performance and Area Using an Adaptive Logic Module," Mike Hutton, Jay Schleicher, David Lewis, Bruce Pedersen, Richard Yuan, Sinan Kaptanoglu, Gregg Baeckler, Boris Ratchev, Ketan Padalia, Mark Bourgeault, Andy Lee, Henry Kim and Rahul Saini

  4. Motivation from other architectures • BLE5: 15% fewer LUTs, 25% shorter unit delay • BLE6: 22% fewer LUTs, 36% shorter unit delay • BLE7: 28% fewer LUTs, 46% shorter unit delay • K=6 25% 6LUT • Example design • K=4: 100 LUTs • K=6: 78 LUTs: 23 6LUT, 32 5LUT, 17 4LUT, 9 3LUT, 13 2LUT • Stratix, Virtex-II: 30:1 input mux • The relative contribution of routing area and interconnect delay increases with each generation of fabrication

  5. Simple Example: 6LUT from 4 BLEs • Larger area • 19 inputs, 4 registers • For a 6-LUT, the 4LUTs have identical inputs but separate input muxes • 3 of 4 registers and outputs are wasted

  6. Improved Example: 6,2 Fracturable LE • 8 inputs, 2 outputs, 2 registers • 1 6LUT • 2 5LUTs with input sharing • 2 independent 4LUTs • Comparable in area to two BLE4s • Functionally closer to two BLE5 logic elements

  7. Final Version • Composed of 3LUTs • Added d2 muxed output • c1 or GND, c2 or VCC muxed • Removed mux from d1 • Swap muxes controlled by R and T • Two 6LUTs share 4 inputs, identical LUT mask • 4:1 muxes, common data, different select lines • Up to 12% pairs of 6-LUTs • R=0, T=1, S=1 implements 2 muxed 5LUTs with 7 inputs • F1 = fn(a1,a2,b1,b2,d1) • F2 = fn(a1,a2,b2,c2,d1) • Out = mux(F1,F2,c1) • Roughly area-neutral with BLE4, with a 36% decrease in logic depth
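
The R=0, T=1, S=1 configuration on slide 7 can be sketched in Python as two 5-input LUTs feeding a 2:1 mux. This is only an illustration: the helper names (`mask_from`, `lut5`, `extended_mode`), the LUT-mask bit ordering, and the mux polarity (c1 = 1 selects F2) are assumptions, since the slide gives only the input lists.

```python
def mask_from(fn):
    """Build a 32-bit LUT mask from a 5-input Boolean function (hypothetical helper)."""
    mask = 0
    for index in range(32):
        bits = [(index >> i) & 1 for i in range(5)]
        mask |= (fn(*bits) & 1) << index
    return mask

def lut5(mask, inputs):
    """Read one bit of a 5-input LUT; inputs[0] is the least significant index bit."""
    index = sum(bit << i for i, bit in enumerate(inputs))
    return (mask >> index) & 1

def extended_mode(mask1, mask2, a1, a2, b1, b2, c1, c2, d1):
    """Seven-input function: two 5-LUTs sharing a1, a2, b2, d1, selected by c1."""
    f1 = lut5(mask1, (a1, a2, b1, b2, d1))
    f2 = lut5(mask2, (a1, a2, b2, c2, d1))
    return f2 if c1 else f1  # assumed polarity: c1 = 1 selects F2

# Example function: "if c1 then (a2 or c2) else (a1 and b1)"
MASK1 = mask_from(lambda a1, a2, b1, b2, d1: a1 & b1)
MASK2 = mask_from(lambda a1, a2, b2, c2, d1: a2 | c2)
```

The two LUTs share four inputs (a1, a2, b2, d1), so the whole function uses seven distinct signals, which is why an if-else construct fits in a single logic element.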

  8. How do we set RSTU for a 6LUT?

  9. 8:1 mux implementation • 8:1 mux in 2 ALMs (4 ALUTs) using 7-input functions • Second ALM computes the output • F1 = fn(s0,s1,d3,y0,y1) • F2 = fn(s0,s1,d7,y0,y1) • Mux controlled by s2 • 5 BLE4 vs. 2 ALMs, saves one BLE4
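
One way to read slide 9 is sketched below. The slide names only F1, F2 and the s2 mux in the second ALM; how the first ALM produces y0 and y1 is an assumption here (each is taken to be a partial 4:1 mux over three data inputs), and the function names are hypothetical.

```python
def first_alm(s0, s1, d0, d1, d2, d4, d5, d6):
    """Assumed first ALM: two 5-input functions sharing s0 and s1 (8 inputs total)."""
    sel = s0 | (s1 << 1)
    y0 = (d0, d1, d2, 0)[sel]  # value for sel == 3 is a don't-care, 0 chosen here
    y1 = (d4, d5, d6, 0)[sel]
    return y0, y1

def second_alm(s0, s1, s2, d3, d7, y0, y1):
    """Second ALM in 7-input mode: F1 and F2 muxed by s2, as on the slide."""
    last = s0 & s1                # sel == 3 picks d3 or d7 directly
    f1 = d3 if last else y0       # F1 = fn(s0, s1, d3, y0, y1)
    f2 = d7 if last else y1       # F2 = fn(s0, s1, d7, y0, y1)
    return f2 if s2 else f1

def mux8(data, s0, s1, s2):
    """8:1 mux built from the two ALM functions above."""
    d0, d1, d2, d3, d4, d5, d6, d7 = data
    y0, y1 = first_alm(s0, s1, d0, d1, d2, d4, d5, d6)
    return second_alm(s0, s1, s2, d3, d7, y0, y1)
```

Under this reading the second ALM sees exactly seven signals (s0, s1, s2, d3, d7, y0, y1), matching the 7-input function mentioned on the slide.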

  10. Stratix V • ALM can become two 4LUTs • Eight inputs for both ALUTs • Backward-compatible with 4LUT architectures. "Logic Array Blocks and Adaptive Logic Modules in Stratix V Devices"

  11. Stratix V

  12. Normal Modes

  13. Normal LUT mode • Single 6LUT mode, other inputs used for registers

  14. Extended LUT mode • 7-input function • 2-to-1 multiplexer with two 5LUTs sharing 4 inputs • If-else statements

  15. Why 6LUTs: DES Example • DES: 8 S-boxes (substitution tables) • Each S-box has 6 inputs, 4 outputs • Each output: • 1 6LUT • 6 4LUTs • 35-45% less area

  16. How would we alter technology mapping to best support FLUTs?

  17. Technology Mapping • 1 4LUT and 2 6LUTs requiring 3 ALMs • Could use 4 5LUTs requiring 2 ALMs and the same logic depth
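
The ALM counts on slide 17 can be reproduced with a deliberately simplified model. The sketch assumes that every 6LUT occupies an ALM by itself and that any two LUTs of five or fewer inputs can always be paired (that is, they share enough inputs to fit); real packing depends on the actual input sharing, and the function name is hypothetical.

```python
import math

def alms_needed(lut_sizes):
    """Estimate ALMs for a list of LUT input counts under the simplified pairing model."""
    if any(k < 1 or k > 6 for k in lut_sizes):
        raise ValueError("LUT sizes must be between 1 and 6")
    big = sum(1 for k in lut_sizes if k == 6)    # one ALM each (assumption)
    small = sum(1 for k in lut_sizes if k <= 5)  # two per ALM (assumption)
    return big + math.ceil(small / 2)

print(alms_needed([4, 6, 6]))     # mapping with one 4LUT and two 6LUTs
print(alms_needed([5, 5, 5, 5]))  # alternative mapping with four 5LUTs
```

With this model the first mapping needs 3 ALMs and the second needs 2, which is the trade-off the slide illustrates.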

  18. Balancing Technology Mapping • Must maintain optimal critical path depth with a more packable LUT distribution • Avoid 6-LUTs when they do not help delay • 8:1 muxes identified separately and mapped to 7-input functions • 7% of ALMs are 7-input functions

  19. Results: Performance • 80 designs tested • 130nm process • Minimum chip size used • SPICE models for delay

  20. Results: Area

  21. Stratix vs. Stratix II

  22. Conclusions • Benefits of 6LUTs without underutilization • Larger LUT costs: • LUT-mask size • Input and output muxing • FFs • 6-LUT is fracturable into 5LUTs, area comparable to 2 BLE4s • 7-input functions and 6-input pairs • Technology mapping support is needed for best results • 6,2 ALM vs. BLE4: • 15% better performance • 12% smaller area on average

  23. How do we need to alter VPR to support FLUTs?

  24. AAPack and WireMap • ABC with WireMap technology mapping to primitives • AAPack: capable of packing complex logic blocks based on logic primitives

  25. WireMap • Reduces the percentage of 6LUTs • Does not increase: • logic depth • total LUT count. "WireMap: FPGA Technology Mapping for Improved Routability," Stephen Jang, Billy Chan, Kevin Chung, Alan Mishchenko

  26. AAPack Overview • Current tools can't support the complexity of logic blocks • New logic block description language: • Depict complex interconnects • Hierarchy • Modes of operation • Can pack complex blocks • Area driven • Area is compared to the theoretical minimum • Verilog input for large benchmarks. "Architecture Description and Packing for Logic Blocks with Hierarchy, Modes and Complex Interconnect," Jason Luu, Jason Anderson, and Jonathan Rose

  27. Example: Virtex-6 Logic Block • Tools don't support Stratix IV or Virtex-6 • Virtex-6: • complex soft logic blocks • hard memories • multipliers

  28. What AAPack does • Can describe: • complex logic blocks with arbitrary internal routing structures • Variable memory configurations: 4Kx8, or 8Kx4, or 16Kx2 • area-driven packing • inputs: • user design • architectural description

  29. Complex Block Description Language • Expressive: The language should be capable of describing a wide range of complex blocks. • Simple: The language constructs should match closely with an FPGA architect’s existing knowledge and intuition. • Concise: The language should permit complex blocks to be described as concisely as possible.

  30. Physical blocks • Specified in XML • Hierarchy • Other blocks and • existing primitives • Inputs and outputs and clocks with pin numbers

  31. Primitives • Common primitives are handled in the language • LUT inputs can be reordered; memory address inputs cannot

  32. Intra-Block Interconnect • complete: crossbar switch, internally programmable signal • direct: direct wire connection, no programmability • mux: multiplexed connection (single-bit or bus), programmable signal

  33. Modes of Operation • Mutually exclusive functionality • Represent FPGA structures being used in different ways

  34. Packing Algorithm • Input: technology-mapped netlist, XML architecture • Output: packed complex blocks • Greedy algorithm similar to other packing methods • Repeat until all blocks are packed: • Seed block s selected and packed • New complex block B for s • Pack additional blocks into B • Choose a compatible block c • Pack c into B if valid • Add B to packed list
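
The greedy loop on slide 34 can be sketched as follows. This is a minimal illustration, not AAPack itself: the function name is hypothetical, netlist blocks are modeled as sets of net names, the seed is the block with the most nets (as slide 35 states), candidates are ranked by shared nets only, and the legality check is reduced to a simple capacity limit, which is an assumption.

```python
def greedy_pack(blocks, capacity):
    """Pack netlist blocks into clusters.

    blocks: dict mapping block name -> set of net names it touches.
    capacity: maximum blocks per complex block (stand-in for the legality check).
    Returns a list of clusters, each a list of block names.
    """
    unpacked = dict(blocks)
    packed = []
    while unpacked:
        # Seed: the unpacked block with the most nets attached.
        seed = max(sorted(unpacked), key=lambda name: len(unpacked[name]))
        cluster, nets = [seed], set(unpacked.pop(seed))
        # Fill the new complex block with the most attractive candidates.
        while unpacked and len(cluster) < capacity:
            cand = max(sorted(unpacked), key=lambda name: len(unpacked[name] & nets))
            cluster.append(cand)
            nets |= unpacked.pop(cand)
        packed.append(cluster)
    return packed

netlist = {"a": {"n1", "n2", "n3"}, "b": {"n1", "n2"}, "c": {"n4"}, "d": {"n4", "n5"}}
print(greedy_pack(netlist, capacity=2))
```

Sorting the names before taking the maximum only makes tie-breaking deterministic; a real packer would replace the capacity test with the location and routability checks described on the following slides.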

  35. Selecting Netlist and Complex Blocks • Choose the block with the most nets attached • Candidates are selected based on affinity in equation 1 • Affinity = shared nets and connections divided by the number of pins the new block would add • Connections is a measure of how likely the new block is to need external connections • Alpha is set to 0.9
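
Equation 1 is not reproduced in the transcript, so the following is only a guess at its shape: a weighted sum of shared nets and connections, divided by the candidate's pin count. Which term alpha multiplies is an assumption, as are the function and parameter names.

```python
def affinity(shared_nets, connections, new_pins, alpha=0.9):
    """Assumed form: ((1 - alpha) * shared_nets + alpha * connections) / new_pins."""
    if new_pins <= 0:
        raise ValueError("new_pins must be positive")
    return ((1.0 - alpha) * shared_nets + alpha * connections) / new_pins

# Hypothetical candidate: 3 shared nets, connection score 2, 4 pins added.
score = affinity(shared_nets=3, connections=2, new_pins=4)
```

Dividing by the pin count favors candidates that add few new pins relative to how strongly they are tied to the block being filled.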

  36. Legality: Location • attempting to pack: • chooses a location • verifies routing • traversing the complex block as a tree • ordered smallest to largest right to left • traversed right to left to ensure smallest resource consumption • attempts to pack the other nodes in the sub-tree: find a flip flop for a LUT • 30 packs on the sub-tree

  37. Legality: Routability • Initially, check whether packing would exceed the external pin count • Then, generate a routing graph for the complex block • Assume any output can connect to any input of a complex block (switchbox architecture) • Apply PathFinder

  38. Memory • Primitives are technology mapped with a single bit width • A 256 x 8 memory is mapped as eight 256 x 1-bit memories • Primitives are mapped to the same component if their bus signals are identical

  39. Limitations • No support for timing in this implementation • A primitive can map to only one complex logic block • A flip-flop can only be used in a LUT complex block; it cannot also be present in multiplier complex blocks
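
The bit-slicing idea on slide 38 can be sketched as below. The dictionary layout and the names `split_memory` and `group_slices` are hypothetical; the only behavior taken from the slide is that a wide memory becomes one single-bit primitive per data bit, and that slices with identical bus signals may be mapped to the same physical component.

```python
from collections import defaultdict

def split_memory(name, depth, width, address_bus, write_enable):
    """Split a depth x width memory into `width` single-bit slices."""
    return [
        {"name": f"{name}[{bit}]", "depth": depth, "bus": (address_bus, write_enable)}
        for bit in range(width)
    ]

def group_slices(slices):
    """Group slices that have the same depth and identical bus signals."""
    groups = defaultdict(list)
    for s in slices:
        groups[(s["depth"], s["bus"])].append(s["name"])
    return dict(groups)

slices = split_memory("ram0", 256, 8, "addr_a", "we_a") + split_memory("ram1", 256, 4, "addr_b", "we_b")
groups = group_slices(slices)
```

Here the eight slices of `ram0` fall into one group and the four slices of `ram1` into another, because their address and write-enable signals differ.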

  40. What are some faults of this packing method?

  41. Experiments • Verilog benchmarks • soft processors • image processors

  42. Fracturable LUTs • CLBs: • fully connected BLEs • FI X N, no pin sharing • 8 BLEs • BLEs: • 1 FLUT • 2 flip-flops • 2 outputs • FLUTs: • 2 modes • 6LUT • Dual 5LUTs • Variable number of inputs • Dual-mode input sharing depends on the number of inputs

  43. FLUT Evaluation • Compare achieved area with the lower bound • Lower bound: number of complex blocks needed to contain the primitives without routing considerations • Efficiency: ratio of the achieved number of logic blocks and this value

  44. Efficiency Results • 5 indicates all inputs are shared, 10 indicates no inputs are shared • 6 or 7 achieves tolerable efficiency • Number of inputs FI varied 5 – 10 • Geometric average across 5 benchmarks
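
The relationship between FI and input sharing on slide 44 can be written down directly. In this sketch (function names are hypothetical) two full 5LUTs need 10 inputs in total, so a FLUT with FI input pins forces them to share 10 - FI signals; the fit test simply checks that the union of both input sets does not exceed FI.

```python
def shared_inputs_required(fi):
    """Inputs two full 5LUTs must share to fit a FLUT with `fi` input pins."""
    if not 5 <= fi <= 10:
        raise ValueError("FI is varied from 5 to 10 in this experiment")
    return 10 - fi

def fits_dual_mode(inputs_a, inputs_b, fi):
    """True if two LUTs (given by their input names) fit one FLUT in dual 5LUT mode."""
    a, b = set(inputs_a), set(inputs_b)
    return len(a) <= 5 and len(b) <= 5 and len(a | b) <= fi

# Two 5LUTs sharing three signals (a, b, c) need 7 distinct inputs.
print(fits_dual_mode("abcde", "abcfg", fi=7))
print(fits_dual_mode("abcde", "abcfg", fi=6))
```

The example pair fits when FI is 7 but not when FI is 6, which is consistent with packing efficiency improving as FI grows.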

  45. Logic blocks and Channel width with number of inputs • # blocks decreases to 7 • Channel width from # inputs • first increases from more routing to each block • then decreases after 7: full efficiency so easier routing

  46. Memory • Varied # bits and max width • Best utilization: smallest size, maximum width

  47. CLB consumption with memory size • Smaller memories: more logic due to muxes • Best results: multiple memory sizes

  48. Conclusions • New language can describe complex architectures using: • Hierarchy • Modes • Arbitrary interconnects • Packing algorithm for this architecture • Verified on large benchmarks • Needs timing driven packing

  49. How can we get additional improvement from technology mapping?

  50. Academic FLUTs soft logic • 4 architectures: • K = 6 • M = 5, 6, 7, 8 • M5: dual-output 6-LUT of a Xilinx Virtex-5 • M8: Stratix II ALM. "Exploring FPGA Technology Mapping for Fracturable LUT Minimization," David Dickin, Lesley Shannon
