Wei Zhang † , Li Shang ‡ and Niraj K. Jha †

NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture Wei Zhang†, Li Shang‡ and Niraj K. Jha† Dept. of Electrical Engineering, Princeton University†; Dept. of Electrical and Computer Engineering, Queen’s University‡

Outline • Temporal Logic Folding • Background on NRAMs • Overview for hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006) • NanoMap: Design Optimization Flow • Experimental Results • Conclusions

Temporal Logic Folding • Basic idea: Use run-time reconfiguration to realize different functions in the same resource every few cycles • Example functions: i = abc’, l = (i’ + e’ + f’)h’, OUT = d’g’ + l [Figure: logic folding example with LUT 1, LUT 2, LUT 3 and MEM]
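The folding idea can be illustrated with a minimal Python sketch: one physical LUT is reconfigured once per cycle to compute i, then l, then OUT, with intermediate values held in storage between cycles. The names `run_folded` and `configs` are hypothetical, and reading the I’ on the slide as i’ is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# One configuration: (name of the signal produced, function of the stored signals).
Config = Tuple[str, Callable[[Dict[str, int]], int]]

def inv(x: int) -> int:
    """Logical complement of a single bit."""
    return 1 - x

def run_folded(configs: List[Config], inputs: Dict[str, int]) -> Dict[str, int]:
    """Run one physical LUT that is reconfigured once per folding cycle."""
    signals = dict(inputs)            # storage for inputs and intermediate results
    for name, func in configs:        # one reconfiguration and one LUT evaluation per cycle
        signals[name] = func(signals) & 1
    return signals

# The three functions from the slide, executed in sequence on the same LUT.
configs: List[Config] = [
    ("i",   lambda s: s["a"] & s["b"] & inv(s["c"])),
    ("l",   lambda s: (inv(s["i"]) | inv(s["e"]) | inv(s["f"])) & inv(s["h"])),
    ("OUT", lambda s: (inv(s["d"]) & inv(s["g"])) | s["l"]),
]

if __name__ == "__main__":
    result = run_folded(configs, dict(a=1, b=1, c=0, d=0, e=1, f=1, g=0, h=0))
    print(result["i"], result["l"], result["OUT"])
```

Without folding, the same computation would occupy three LUTs for one pass; here it occupies one LUT for three cycles.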

Overview of NATURE • Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits • Fine-grain reconfiguration (even cycle-by-cycle) and logic folding • Area-delay trade-off flexibility • More than an order of magnitude increase in logic density • More than an order of magnitude reduction in area-time product • Comparisons assume NRAMs/CMOS logic implemented in the same technology • Non-volatility: useful in low-power and secure processing [Figure labels: CMOS fabrication compatible; NRAM-based; Run-time reconfiguration; NATURE; Temporal logic folding; Logic density; Design flexibility]

Overview of NATURE (Contd.) • Challenges in nano-circuits/architectures • Many programmable nanofabrics proposed: Nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc. • Lack of a mature fabrication process • Fabrication defects and run-time failures (between 1% and 10%) • Regular, reconfigurable architectures, such as an FPGA, favored • Facilitates fabrication • Fault tolerance through reconfiguration • NATURE: can be fabricated using a CMOS-compatible process

NRAM™ by Nantero • Non-volatile nanotube random-access memory (NRAM) • Mechanically bent or not: determines bistable on/off states • Same/opposite voltage applied to change the state • CMOS-compatible fabrication process • 10 Gbit NRAMs already fabricated: ready to be commercialized in the near future Source: http://www.nantero.com/nram.html

NRAMs • Properties of NRAMs • Non-volatile • Similar speed to SRAM • Similar density to DRAM • Chemically and mechanically stable • NATURE not tied to NRAMs • Phase change RAM • Magnetoresistive RAM • Ferroelectric RAM

Architecture of NATURE • Island-style logic blocks (LBs) connected by various levels of interconnects • An LB contains a super macroblock (SMB) and a local switch matrix

Architecture of a Super Macroblock (SMB) • n1 macroblocks (MBs) comprise an SMB: here n1 = 4

Architecture of a Macroblock (MB) • n2 logic elements (LEs) comprise an MB: here n2 = 4

Logic Element (Basic Configuration) • An LE implements a computation and contains: • An m-input look-up table (LUT) • l flip-flops • Input to flip-flop selected between LUT output and a primary input
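The LB/SMB/MB/LE hierarchy can be summarized in a short sketch. The class `SmbParams` and the function `lut_eval` are hypothetical names; the default values n1 = 4, n2 = 4, m = 4 and l = 2 are the ones quoted elsewhere in the slides, and packing the truth table into an integer is just one convenient encoding.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass(frozen=True)
class SmbParams:
    n1: int = 4  # MBs per SMB
    n2: int = 4  # LEs per MB
    m: int = 4   # inputs per LUT
    l: int = 2   # flip-flops per LE

    @property
    def les_per_smb(self) -> int:
        return self.n1 * self.n2

    @property
    def flip_flops_per_smb(self) -> int:
        return self.les_per_smb * self.l

    @property
    def lut_bits(self) -> int:
        """Truth-table bits needed to configure one m-input LUT."""
        return 2 ** self.m

def lut_eval(truth_table: int, inputs: Sequence[int]) -> int:
    """Evaluate a LUT whose truth-table bits are packed into an integer."""
    index = 0
    for bit in inputs:                # first input is the most significant index bit
        index = (index << 1) | (bit & 1)
    return (truth_table >> index) & 1

if __name__ == "__main__":
    p = SmbParams()
    print(p.les_per_smb, p.flip_flops_per_smb, p.lut_bits)
    and4 = 1 << 15                    # 4-input AND: only entry 0b1111 is 1
    print(lut_eval(and4, (1, 1, 1, 1)), lut_eval(and4, (1, 1, 1, 0)))
```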

Folding Levels • Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs • Level-p folding: LE reconfiguration after the execution of p LUT computations • Reconfiguration time: 160 ps • A larger folding level typically decreases delay and increases area [Figure: (a) level-1 folding, (b) level-2 folding]
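A rough back-of-the-envelope model shows why a larger folding level tends to reduce delay and increase area. This is only a sketch under stated assumptions: the 160 ps reconfiguration time is from the slide, but the 500 ps LUT delay, the even spreading of LUTs across stages, and the function name `folding_estimate` are all hypothetical.

```python
import math
from typing import Tuple

RECONF_PS = 160  # reconfiguration time quoted on the slide

def folding_estimate(total_luts: int, logic_depth: int, level: int,
                     lut_delay_ps: int = 500) -> Tuple[int, int, int]:
    """Return (folding stages, total delay in ps, LUTs needed per stage).

    Assumes one reconfiguration per folding cycle and LUTs spread evenly
    over the stages; a real mapping would not be this uniform.
    """
    stages = math.ceil(logic_depth / level)
    cycle_ps = level * lut_delay_ps + RECONF_PS
    delay_ps = stages * cycle_ps
    luts_per_stage = math.ceil(total_luts / stages)
    return stages, delay_ps, luts_per_stage

if __name__ == "__main__":
    # Hypothetical circuit: 96 LUTs with logic depth 12.
    for level in (1, 2, 4, 12):
        print(level, folding_estimate(96, 12, level))
```

In this model, each step up in folding level removes some reconfiguration overhead from the critical path but leaves fewer stages over which to share the LEs.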

Design Optimization Flow: NanoMap • Optimize and implement design on NATURE • Integrate temporal logic folding • Choose a proper folding level • Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles • Input design specified in register-transfer level (RTL) and/or gate-level VHDL

Motivational Example • Different planes should have the same number of folding stages to guarantee global synchronization • Key issue: how to achieve the optimization objective • Choose an appropriate folding level • Assign the logic to folding stages [Figure labels: Level 1 register; Logic in plane; Folding stage; Plane cycle; Folding cycle; Plane; Level 2 register]

Motivational Example (Contd.) • Example optimization objective • Minimize circuit delay under an area constraint of 32 LEs • Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops [Figure annotations: 8 LUTs, logic depth: 4; 50 LUTs; 14 flip-flops; plane depth: 9; 38 LUTs, logic depth: 7]

Iterative Design Flow • Start with initial guess for folding level and iteratively refine it • Large folding level -> better circuit delay, but large area cost • Initial #folding stages: • Initial folding levels: • Partition RTL modules into a series of connected LUT clusters • logic depth at most equal to the folding level • Significantly speeds up the mapping procedure

Iterative Design Flow (Contd.) • Cluster size should be smaller than the area constraint [Figure: level-5 folding vs. level-4 folding; annotation: 34 LUTs > 32 LUTs]
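The refinement loop can be sketched as follows. This is a simplified stand-in, not NanoMap's actual procedure: it starts from a given initial folding level and lowers it until every folding stage fits the area constraint. The per-depth LUT profile is made up, chosen only so that level-5 folding produces a 34-LUT stage, and `stage_sizes` and `refine_folding_level` are hypothetical names.

```python
from typing import List, Tuple

def stage_sizes(luts_per_depth: List[int], level: int) -> List[int]:
    """LUT count of each folding stage when `level` logic levels share a stage."""
    return [sum(luts_per_depth[i:i + level])
            for i in range(0, len(luts_per_depth), level)]

def refine_folding_level(luts_per_depth: List[int], area_luts: int,
                         initial_level: int) -> Tuple[int, List[int]]:
    """Lower the folding level until the largest stage fits the area constraint."""
    for level in range(initial_level, 0, -1):
        sizes = stage_sizes(luts_per_depth, level)
        if max(sizes) <= area_luts:
            return level, sizes
    raise ValueError("no folding level satisfies the area constraint")

if __name__ == "__main__":
    # Hypothetical profile: number of LUTs at each logic depth (depth 12 in total).
    profile = [7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6]
    print(stage_sizes(profile, 5))            # level 5: one stage has 34 LUTs > 32
    print(refine_folding_level(profile, 32, 5))
```

Trying larger levels first matches the observation that a large folding level gives better delay, so the first level that fits is kept.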

Solution for the Example • Three folding stages using level-4 folding • 32 LEs required for mapping the RTL circuit; area constraint satisfied • Circuit delay = 3 * folding cycle delay

NanoMap: Flow Diagram [Figure: flowchart with 16 numbered steps. Inputs: input network, optimization objective, user constraint, circuit parameter library. Logic mapping: folding level computation; RTL module partition; schedule each LUT/LUT cluster using FDS. Temporal clustering: map each LUT/LUT cluster to SMBs. Temporal placement: fast placement using modified VPR placer; delay estimation; final placement using modified VPR placer. Routing: final routing using VPR router. Decision points: Perform logic folding? Satisfy area constraints? Refine placement? Satisfy delay constraints? Placement routable? Output: reconfiguration bits]

Force-Directed Scheduling • Perform FDS on RTL modules partitioned into LUTs/LUT clusters • Iteratively schedule LUT/(LUT cluster) to minimize overall resource usage • Model resource usage as a force: F = Kx • K: distribution graphs (DGs) that describe the probability of resource usage • Aim of FDS: minimize force, indicating minimum increase in resource usage • LE usage depends on LUT computations and register storage operations: two DGs needed
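A simplified version of force-directed scheduling can be sketched in Python. It follows the classic formulation in which x is the change in an operation's probability of occupying a cycle, and it makes several simplifying assumptions: only one DG (LUT computations) is modeled rather than the two the slide calls for, predecessor/successor forces are ignored, and each operation's time frame is given directly. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Frames = Dict[str, Tuple[int, int]]  # op -> (earliest, latest) folding cycle, inclusive

def distribution_graph(frames: Frames, n_cycles: int) -> List[float]:
    """DG: expected resource usage per cycle, assuming uniform probabilities."""
    dg = [0.0] * n_cycles
    for lo, hi in frames.values():
        p = 1.0 / (hi - lo + 1)
        for c in range(lo, hi + 1):
            dg[c] += p
    return dg

def self_force(frames: Frames, dg: List[float], op: str, cycle: int) -> float:
    """F = sum over cycles of DG(c) * x(c), x = change in probability."""
    lo, hi = frames[op]
    p = 1.0 / (hi - lo + 1)
    return sum(dg[c] * ((1.0 if c == cycle else 0.0) - p)
               for c in range(lo, hi + 1))

def schedule(frames: Frames, n_cycles: int) -> Dict[str, int]:
    """Repeatedly fix the (op, cycle) pair with the smallest self force."""
    frames = dict(frames)
    while any(lo != hi for lo, hi in frames.values()):
        dg = distribution_graph(frames, n_cycles)
        _, op, cycle = min(
            (self_force(frames, dg, op, c), op, c)
            for op, (lo, hi) in frames.items() if lo != hi
            for c in range(lo, hi + 1)
        )
        frames[op] = (cycle, cycle)
    return {op: lo for op, (lo, _) in frames.items()}

if __name__ == "__main__":
    frames = {"a": (0, 0), "b": (0, 0), "c": (0, 2), "d": (1, 1)}
    print(schedule(frames, 3))
```

In the example, operation c is free to move and is pushed to the least loaded cycle, which is the balancing effect FDS aims for.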

Temporal Clustering • For each folding stage, a constructive algorithm used to assign LUTs to LEs and pack LEs into MBs and SMBs • Unpacked LUT with a maximal number of inputs selected as initial seed • New LUTs with high attractions to the seed selected and assigned to the SMB • Attractions depend on timing criticality and input pin sharing • Considers attractions across all the folding cycles
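The constructive packing step can be sketched as a greedy loop. This is an illustrative simplification for a single folding stage: the attraction function (a weighted sum of a criticality score and the number of shared input nets), the weight `alpha`, and the name `cluster_stage` are assumptions, and attraction across folding cycles is not modeled.

```python
from typing import Dict, List, Set

def cluster_stage(lut_inputs: Dict[str, Set[str]],
                  criticality: Dict[str, float],
                  capacity: int, alpha: float = 0.5) -> List[List[str]]:
    """Greedily pack LUTs into clusters of at most `capacity` LUTs."""
    unpacked = set(lut_inputs)
    clusters: List[List[str]] = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (ties broken by name).
        seed = max(sorted(unpacked), key=lambda n: len(lut_inputs[n]))
        unpacked.remove(seed)
        cluster, nets = [seed], set(lut_inputs[seed])
        while unpacked and len(cluster) < capacity:
            def attraction(n: str) -> float:
                shared = len(lut_inputs[n] & nets)      # input pin sharing
                return alpha * criticality.get(n, 0.0) + (1 - alpha) * shared
            best = max(sorted(unpacked), key=attraction)
            unpacked.remove(best)
            cluster.append(best)
            nets |= lut_inputs[best]
        clusters.append(cluster)
    return clusters

if __name__ == "__main__":
    luts = {"u": {"a", "b", "c", "d"}, "v": {"a", "b", "e"},
            "w": {"x", "y"}, "z": {"x", "y", "q"}}
    print(cluster_stage(luts, {}, capacity=2))
```

LUTs that share inputs end up in the same cluster, which reduces the number of distinct signals that must be routed into it.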

Placement and Routing • VPR (U. Toronto) modified to perform placement and support temporal logic folding • Simulated annealing approach • Cost function computed across the folding stages • Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
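The idea of a placement cost computed across folding stages can be sketched with a small simulated-annealing loop. This is not the modified VPR placer: the half-perimeter wirelength cost, the swap move, the cooling schedule, and the names `folded_cost` and `anneal` are assumptions made for illustration.

```python
import math
import random
from typing import Dict, List, Tuple

Pos = Tuple[int, int]
StageNets = List[List[List[str]]]  # per folding stage: list of nets, each a list of blocks

def folded_cost(place: Dict[str, Pos], stage_nets: StageNets) -> int:
    """Sum of bounding-box half-perimeters over the nets of every folding stage."""
    total = 0
    for nets in stage_nets:
        for net in nets:
            xs = [place[b][0] for b in net]
            ys = [place[b][1] for b in net]
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(place: Dict[str, Pos], stage_nets: StageNets, temp: float = 5.0,
           cooling: float = 0.9, moves: int = 50, seed: int = 0):
    """Swap-based simulated annealing; returns the best placement seen and its cost."""
    rng = random.Random(seed)
    place = dict(place)
    cost = folded_cost(place, stage_nets)
    best, best_cost = dict(place), cost
    blocks = sorted(place)
    while temp > 0.01:
        for _ in range(moves):
            a, b = rng.sample(blocks, 2)
            place[a], place[b] = place[b], place[a]
            delta = folded_cost(place, stage_nets) - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost += delta
                if cost < best_cost:
                    best, best_cost = dict(place), cost
            else:
                place[a], place[b] = place[b], place[a]  # reject: undo the swap
        temp *= cooling
    return best, best_cost

if __name__ == "__main__":
    start = {"A": (0, 0), "B": (1, 1), "C": (0, 1), "D": (1, 0)}
    nets = [[["A", "B"]], [["A", "B"], ["C", "D"]]]  # two folding stages
    print(folded_cost(start, nets), anneal(start, nets)[1])
```

Because the same physical blocks are reused in every folding stage, a single placement has to serve the nets of all stages, which is why the cost sums over them.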

Experimental Setup • Instance of architecture: • 4 MBs in an SMB • 4 LEs in an MB • LEs contain a 4-input LUT and 2 flip-flops • Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs • Results based on 100 nm technology parameters to implement CMOS logic and NRAMs

Experimental Results (Contd.) [Figure: bar chart of #LE × delay advantage under AT optimization, normalized to no-folding, for benchmarks ex1, ex2, FIR, c5315, Paulin, ASPP4 and Biquad; series: no folding, k enough, k = 16]

Experimental Results (Contd.) • Improvement under AT optimization for RTL benchmarks • LE utilization around 100% • 50% reduced need for a deep interconnect hierarchy for level-1 vs. no-folding: indicates that trading interconnect area for NRAM area is advantageous

Experimental Results (Contd.) • Flexibility in choosing the best folding level and performing area-delay trade-offs • Mapping results for typical optimizations using Paulin benchmark as an example [Table: typical optimizations]

Conclusions • NATURE: a new high-performance run-time reconfigurable architecture • NanoMap: an integrated design optimization flow for NATURE • Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages • Can be very useful for cost-conscious embedded systems and improvement of future FPGAs • Non-volatility: helpful in secure and low-power processing
