F4-09: Virtual Architecture and Design Automation for Partial Reconfiguration

Dr. Ann Gordon-Ross, Assistant Professor of ECE, University of Florida
Dr. Alan D. George, Professor of ECE, University of Florida
Research Students: Abelardo Jara, Terence Frederick, Rohit Kumar, Shaon Yousuf, University of Florida
All Hands Meeting

Outline
• Goals, Motivations, and Challenges
• Virtual Architecture for Partially Reconfigurable Embedded Systems (VAPRES)
    • Design methodology
    • Multiple clock domain support
    • Bitstream relocation
    • MACS inter-module communication architecture
• Case Study Application: Embedded Target Tracking System on Virtex-4 FPGA board
    • Preliminary non-PR version using Kalman filters
• Design Automation for Partial Reconfiguration (DAPR)
    • DAPR design flow
    • VHDL annotations
    • Connectivity file and graph
    • Device library file
    • Overlay generation

Goals, Motivations, and Challenges
GOAL – Leverage partial reconfiguration (PR) for application designers
• Architect and implement a Virtual Architecture (VA) for Partially Reconfigurable Embedded Systems
• Ease PR design via design automation
MOTIVATIONS – Increase productivity and reduce design complexity for PR designs
• VA reduces development time
    • Dynamically load and unload hardware processing modules
    • Processing hardware adapts to external environmental conditions
• Automated design flow makes PR more amenable to system designers
    • Current PR design flow requires a very high level of specialization
    • Simplifies design of systems that time-multiplex FPGA resources → smaller devices
CHALLENGES
• Provide sufficient VA flexibility with architectural parameterization
    • Balance application specialization against exploration complexity
• Create new exploration algorithms/heuristics to automate PR design flow steps with respect to available PR tools
[Figure labels: Sensor Coverage Area, External Trigger, Sensor Interface, Central Controlling Agent, Filter repository (Filter A, Filter B), ICAP, PRR, Processed output]

F4-09 Approach
PR System Design
• Expand and prototype an FPGA-based architecture for rapid development of PR embedded systems
    • VAPRES: Virtual Architecture for Partially Reconfigurable Embedded Systems
    • MACS: Minimal Adaptive Circuit Switching mesh inter-module communication architecture for VAPRES
        • Improvement over F4-08 SCORES communication architecture
    • Architectural support for hardware module context save and restore
• Formulate and implement an automated PR design flow
    • DAPR: Design Automation for Partial Reconfiguration tool
• Study Virtex-4 and Virtex-5 bitstreams to leverage additional functionalities
    • Extend bitstream relocation and context save and restore for Virtex-5
Multi-purpose: VAPRES (VAPRES design methodology + VAPRES builder tool, base architecture)
• Flexible and reusable base architecture
• Not optimized for a specific application
• Tools to develop both reconfigurable modules and application software
Special-purpose: DAPR (VHDL language extensions, floorplan generator)
• Highly specialized PR system design
• Reconfiguration behavior known at design time
• Highly optimized system floorplan based on known application

VAPRES: Architecture Design
• Flexible, scalable architecture
• Multiple architectural parameters enable base system specialization
    • N = number of PRRs
    • kr = number of streaming channels going right
    • kl = number of streaming channels going left
    • Some additional parameters presented next
• Base PR embedded system
• Multiple clock domains
    • PRMs can operate at independent clock frequencies
• PRMs use FIFO-based I/O ports
• High-speed inter-module communication architecture (MACS)
[Figure: VAPRES architecture shown with three PRRs, kr = 1, kl = 2. Control region: MicroBlaze, PLB bus, SDRAM, DCR bridge, FSL interface, UART, Flash controller, ICAP. Data processing region: PRR1 to PRR3 in PR Sockets 1 to 3, module interfaces, slice macros, MACS switches, streaming channels, I/O modules to external I/O pins, clocks clk0 to clk2.]

VAPRES: Design Methodology
• Application designers work separately from the system designer
    • Application flow (application designers): application decomposition, PRM design and software design, synthesis, implementation
    • Base system flow (base system designer): base system specifications, base system design, floorplan, synthesis, implementation
• System designer chooses VAPRES parameters
    • Parametric VHDL models for VAPRES and MACS enable customization
    • System definition files: VAPRES VHDL, MHS, MSS, and UCF
• VAPRES API (vapres.h): C/C++ libraries for application software development
• PRM implementation is separate from base system implementation
• System floorplan defines PRR sizes and shapes
• Outputs for the FPGA board: executable file, static bitstream, partial bitstreams

VAPRES: Builder Tool
Overview
• Automates the process of building the VAPRES base system and applications
• Increases designer productivity
Builder Tool Features
• Some additional parameters used
    • PRR height and width
• Automatic creation of VAPRES base system from parameters
    • Base system floorplanning
    • Slice macro instantiation and placement
• Automatic implementation of static and partial bitstreams
• Assisted framework for application designers
    • Generates VAPRES SW libraries
    • Templates for PRMs and software
[Figure labels: architectural parameters; system floorplan (.ucf), top VHDL entity (.vhd), hardware specifications (.mhs), software specifications (.mss); static base system, PR modules (PRMs), application software]

VAPRES Builder – Results
• ≈ 280 more slices when adding an extra PRR (figure labels: +0 slices, +284 slices, +263 slices)
• One set of slice macros for each PRR
• 100 MHz constraint met for all placed-and-routed designs
• N = number of PRRs = number of MACS switches; kr = number of channels between switches going right; kl = number of channels between switches going left
[Figure: floorplans for 1, 2, and 3 PRRs with PRR boundaries marked]

VAPRES – Bitstream Relocation
• In-situ bitstream relocation: alters a partial bitstream (with no external inputs) to run in any PRR
• Only one partial bitstream necessary for each PRM
• Partial bitstreams stored in compact flash
• When a PRM is needed, its partial bitstream is loaded into the MicroBlaze and the relocator is called
• New partial bitstream is loaded into the correct PRR
• Program runs in external memory:
    • Bitstream relocator is stored in non-volatile compact flash
    • System ACE controller loads relocator from flash and stores it in SDRAM
• Advantages:
    • Reduces bitstream storage requirements (only one partial bitstream per module)
    • Saves the step of reading a partial bitstream from external flash memory if a similar partial bitstream was already loaded into memory
    • Enables VAPRES to dynamically place and migrate modules
• Restriction: PRRs must be homogeneous (ensures sufficient resources)
[Figure: system control region (MicroBlaze, PLB bus, SDRAM, FSL interface, UART, SystemACE, Flash, ICAP) and data processing region (includes one or more RSBs, Reconfigurable Streaming Blocks) with PRR1, PRR2, interfaces, SCORES switch, I/O modules to external I/O pins, clocks clk0 and clk1]

Overview – MACS Communication Architecture
• MACS: Minimal Adaptive Circuit Switching mesh communication architecture
• VAPRES requires high-bandwidth, low-latency communication channels inside reconfigurable streaming blocks (RSBs)
• A novel communication architecture named SCORES was implemented in 2008
• MACS extends SCORES from a linear array topology to a mesh topology and adds a few new features
• Features of MACS
    • Minimal-adaptive routing to explore all possible shortest paths
        • Selects the lowest-cost path that best achieves network load distribution
    • Similar interface ports for nodes and neighboring switches
        • Any number (<= 6) of nodes can be attached to a single switch
        • Unused interface ports of switches around the edges of the NoC can be utilized
        • Number of node interface ports available in an MxN NoC is <= 2(M*N + M + N)
    • Reduces area overhead of communication architecture per node
    • Provides low-latency path(s) between frequently communicating node pairs (if attached to the same switch)
[Figure: 3x3 MACS mesh of switches (S) with attached nodes (N)]
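
The port bound above can be checked with a short count. This is a minimal sketch, assuming each switch has two dedicated module interfaces plus four neighbor-facing ports, and that a neighbor-facing port on a mesh edge can host a node instead. The function name and the per-switch port split are assumptions made for illustration, not taken from the slides.

```python
def max_node_ports(m, n, module_ports_per_switch=2):
    """Count node-capable ports in an m x n mesh under the stated assumptions."""
    total = 0
    for row in range(m):
        for col in range(n):
            # A neighbor-facing port is unused when there is no switch
            # in that direction, which happens only on the mesh edges.
            unused = (row == 0) + (row == m - 1) + (col == 0) + (col == n - 1)
            total += module_ports_per_switch + unused
    return total


for m, n in [(1, 1), (2, 2), (3, 3)]:
    # The count matches the closed form 2(M*N + M + N) quoted on the slide.
    assert max_node_ports(m, n) == 2 * (m * n + m + n)
    print(m, n, max_node_ports(m, n))
```

Under these assumptions a single switch offers 6 ports, which agrees with the "<= 6 nodes per switch" limit.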

MACS implementation results (1/2)
• 9 architectural parameters can be varied
• Plotting all combinations is not feasible
• Assuming two values for each parameter requires 2^9 = 512 “area usage” plots and 2^9 = 512 “achievable frequency” plots

MACS implementation results (2/2)
• Comparison of NoCs
    • Difficult due to lack of published implementation results from other authors
• Representative packet-switching NoC [1]
    • Designed and realized by Bartic et al.
    • 8 modules attached in 2D-mesh topology
    • 16-bit wide data
• Similar circuit-switched NoC, i.e., PNoC [2]
    • Programmable Network on Chip, designed and realized by Hilton et al.
    • Single switch with 8 modules attached to it
    • 16-bit wide data
• Comparable configuration of MACS
    • 2x2 mesh of MACS switches
    • W=16, Ku=Kd=Kl=Kr=Kil=Kir=1
• Comparison results
    • 5x faster and 1.5x less area overhead than packet-switching NoC
    • 2x faster (with slight area overhead) than PNoC
[1] Bartic, A., Mignolet, J.Y., Nollet, V., Marescaux, T., Verkest, D., Vernalde, S., and Lauwereins, R. “Highly scalable network on chip for reconfigurable systems”. In Proceedings of International Symposium on System-on-Chip, 2003, pages 79–82.
[2] Hilton, C. and Nelson, B. “PNoC: a flexible circuit-switched NoC for FPGA-based systems”. In Proceedings of Computers and Digital Techniques, 2006, pages 181–188.

Analytical Modeling
• Analytical model of SCORES/MACS
• Streaming network
    • FIFO at both ends:
        • Producer FIFO (of size D), consumer FIFO (of size C)
    • Pipelined channel/medium:
        • n-stage pipeline
    • Control feedback path
        • n-stage
• Phase I
    • Analysis of producer-medium and medium-consumer pairs
• Phase II
    • Analysis of medium-consumer with feedback
[Figure: producer FIFO (size D), n-stage pipelined medium, consumer FIFO (size C), n-stage feedback path; rates λp, λm, µm, µc]

Phase-I: Producer-Medium Pair (1/2)
• Markov-chain model with states 0, 1, ..., D
• Pk = probability of the queue being in state k, i.e., the queue holding k packets
• λp = arrival rate
• μm = service rate
• D = system capacity
• Flow = sum of products of λ’s, μ’s, and P’s
• Solving for steady state gives the state probabilities in terms of λp and μm
[Figure: birth-death chain over states 0, 1, 2, ..., k, k+1, ..., D with probabilities P0, P1, ..., PD, arrival rates λp,k, and service rates μm,k]

Phase-I: Producer-Medium Pair (2/2)
• Total probability of the system should be 1
[Plot: PD versus D (line size); marked values 1 and 1/(D+1)]
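
The normalization step can be illustrated numerically. This is a minimal sketch that assumes constant arrival and service rates, in which case the chain is the textbook M/M/1/D queue with Pk proportional to (λp/μm)^k. The slide's own closed form did not survive extraction, so the function below is an illustration rather than the authors' expression.

```python
def steady_state(arrival_rate, service_rate, capacity):
    """Steady-state probabilities P0..PD of a finite queue with constant rates."""
    rho = arrival_rate / service_rate
    # Flow balance between neighboring states gives P(k+1) = rho * P(k).
    weights = [rho ** k for k in range(capacity + 1)]
    norm = sum(weights)  # enforce that the total probability is 1
    return [w / norm for w in weights]


probs = steady_state(arrival_rate=1.0, service_rate=1.0, capacity=4)
print(probs[-1])  # PD, the probability that the FIFO is full
```

When the two rates are equal, every state is equally likely and PD reduces to 1/(D+1), which matches the marked value on the plot.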

Phase II: Medium-Consumer Pair with control feedback, 2D-Markov Chain Model (1/2)
• Streaming network state
    • Number of packets in queue (k)
    • Recently reached threshold (Q)
• Potential queuing at Q = 0
    • Producer is filling with rate λp
    • Service rate is µm
    • At k = D-1, queue switches to de-queuing state
• Potential de-queuing at Q = 1
    • Producer is filling with reduced rate λp,1
    • Consumer is emptying with µm
    • Total probability of state Q = 1 gives the packet drop probability
    • At k = 1, queue switches to queuing state, i.e., Q = 0
[Figure: two-row Markov chain. Row Q = 0: states 0 to D-1 with probabilities P0 to PD-1, arrival rate λp, service rate µm. Row Q = 1: states 1 to D with probabilities P1,1 to PD,1, arrival rate λp,1, service rate µm.]

Phase II: Medium-Consumer Pair with control feedback, 2D-Markov Chain Model (2/2)
• Probability of FIFO being filled with ‘k’ packets when ρ ≠ 1
• Probability of FIFO being filled with ‘k’ packets when ρ = 1
• Packet drop probability when ρ ≠ 1
• Packet drop probability when ρ = 1
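
The two-row chain can also be solved numerically. This is a minimal sketch under one reading of the state diagram: row Q = 0 holds states 0 to D-1, row Q = 1 holds states 1 to D, an arrival in state (D-1, Q=0) moves to (D, Q=1), and a departure from (1, Q=1) returns to (0, Q=0). That transition structure, the function names, and the power-iteration solver are assumptions made for illustration; they stand in for the closed forms listed above and are not the authors' derivation.

```python
def solve_feedback_queue(lam, lam_reduced, mu, depth, iterations=20000):
    """Return {(k, Q): probability} for the assumed two-row chain."""
    states = [(k, 0) for k in range(depth)] + [(k, 1) for k in range(1, depth + 1)]
    index = {state: i for i, state in enumerate(states)}
    moves = []  # (source index, destination index, rate)

    def add(src, dst, rate):
        moves.append((index[src], index[dst], rate))

    for k in range(depth):  # row Q = 0, states 0 .. D-1
        add((k, 0), (k + 1, 0) if k + 1 < depth else (depth, 1), lam)
        if k >= 1:
            add((k, 0), (k - 1, 0), mu)
    for k in range(1, depth + 1):  # row Q = 1, states 1 .. D
        if k < depth:
            add((k, 1), (k + 1, 1), lam_reduced)
        add((k, 1), (k - 1, 1) if k >= 2 else (0, 0), mu)

    # Uniformized power iteration; the step keeps each state's outflow <= 1/2.
    step = 1.0 / (2.0 * (lam + lam_reduced + mu))
    probs = [1.0 / len(states)] * len(states)
    for _ in range(iterations):
        nxt = probs[:]
        for src, dst, rate in moves:
            flow = probs[src] * rate * step
            nxt[src] -= flow
            nxt[dst] += flow
        probs = nxt
    return dict(zip(states, probs))


def drop_probability(distribution):
    """Total probability of the Q = 1 row, read as the packet drop probability."""
    return sum(p for (k, q), p in distribution.items() if q == 1)


dist = solve_feedback_queue(lam=0.8, lam_reduced=0.2, mu=1.0, depth=6)
print(drop_probability(dist))
```

The rates and depth in the usage line are arbitrary illustrative values, not measurements from the project.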

Real-time Simulation and Profiling of MACS
• Setup for basic experiment
    • One MACS switch with both module interfaces occupied
    • Network frequency = module frequency = 100 MHz
    • Producer and consumer rates follow Poisson processes
        • ROM holds MATLAB-generated Poisson-distributed intervals based on different λ and µ
        • Producer/consumer loads its counter with a value from ROM and generates/reads a unit of data at counter overflow
    • ChipScope ILA core captures all FIFO activity
• System parameters: FIFO sizes = 512 bytes, network BW = 400 MBps, producer rate = 40 MBps, consumer rate = 4 MBps (both operate at Poisson-distributed random intervals), transfer size = 0-128 KB
• Results
    • Link utilization = 1/10.35 before consumer FIFO is full (at transfer size ~46 KB)
    • Link utilization = 1/105.8081 after consumer FIFO is full (at transfer size > 46 KB)
    • Activity of both FIFOs and the probability distribution of the consumer FIFO being ‘almost’ full are also plotted w.r.t. transfer size
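
One way to read the two measured utilizations is as the ratio of the limiting data rate to the network bandwidth. This is a minimal sketch of that reading; the interpretation and the helper name are assumptions, since the slides do not define how link utilization was computed.

```python
from fractions import Fraction

NETWORK_BW_MBPS = 400
PRODUCER_MBPS = 40
CONSUMER_MBPS = 4


def ideal_utilization(rate_mbps, bandwidth_mbps=NETWORK_BW_MBPS):
    """Fraction of link capacity used by a stream at the given rate."""
    return Fraction(rate_mbps, bandwidth_mbps)


# Before the consumer FIFO fills, the producer rate limits the link.
before_full = ideal_utilization(PRODUCER_MBPS)
# Once it is full, the slower consumer rate limits the link.
after_full = ideal_utilization(CONSUMER_MBPS)
print(before_full, after_full)
```

Under this reading the ideal values are 1/10 and 1/100, close to the measured 1/10.35 and 1/105.8081.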

Real-time Simulation and Profiling of MACS
• Setup for advanced experiment
    • 3x3 MACS NoC with both module interfaces occupied for each switch
    • Network frequency = module frequency = 100 MHz
    • Producer and consumer rates are linear
    • ChipScope ILA core captures all activity, such as request establishment, FIFO write enables (used in link utilization calculation), average number of retries for establishing a channel, average channel establishment latency, etc.
    • Observe aforementioned parameters for various network traffic patterns
• Network traffic generation patterns

Overview - Design Automation for Partial Reconfiguration (DAPR)
• Xilinx Early Access (EA) PR flow provides PR system design support
• Existing PR flow is very specialized
    • Requires target device architecture knowledge
    • System designer must manually apply steps
        • Hierarchical coding of HDL design description, synthesis, floorplanning, timing analysis, implementation, and merge
• DAPR design flow will mitigate existing PR design flow intricacies
    • Manual steps
        • Hierarchical HDL design description
        • Modified HDL design description via system designer annotations
        • System designer annotated design constraints (optional)
    • Automated steps
        • DAPR inputs: modified HDL design description and design constraints (parameters include bitstream size, timing, power)
        • DAPR design exploration: iteratively generates a candidate design and compares its performance parameters with the system designer's annotated constraints
        • DAPR output: final bitstreams that meet the system designer's constraints, or otherwise the bitstreams that most closely match them
[Figure: EA PR flow versus DAPR design flow, each split into manual and automated steps. Stages shown: HDL design description, modified HDL design description, design constraints (optional), HDL synthesis, timing/placement analysis, implement base design, implement PR modules, merge, final generated bitstreams. The DAPR tool covers the automated stages.]

Overview - DAPR Tool Phases and Description
• Initial input: modified VHDL top file (DAPR tool starts here)
• Phase 1: Information Extraction
    • Extract static and PR region instantiations and corresponding HDL design description filenames from the top-level HDL design description file
    • Static region identification and PRR identification
• Phase 2: Information Collection
    • Collect and write port connection names and widths within each instantiation to the partial reconfiguration automation information file (*.paif)
• Phase 3: Resource Estimation and Constraint Generation (overlay generation)
    • Synthesize all HDL design description files with the Xilinx XST utility
    • Read and record estimated slice requirements from the generated synthesis log file (.srp) to the .paif
    • Generate connectivity information and PRR floorplan using estimated resources and device information libraries (.dilf)
    • Perform automated floorplanning and write to the User Constraint File (UCF)
• Phase 4: Bitstream Generation
    • Implement static region and PRMs with Xilinx’s ngdbuild, MAP, and PAR utilities
    • Merge top, static, and PRMs with Xilinx’s PR_verifydesign and PR_assemble utilities to generate final full and partial bitstreams

System Designer Annotations and Connectivity Information Examples
System designer annotation examples:
    -------------------------------------------------
    --PRR_start :: prm_up, prm_down
    reconfig : rmodule
    Port Map(
        led_in  => rm_in_int,
        led_out => rm_out_int);
    -------------------------------------------------

    -------------------------------------
    --static_start:: static
    led_registers : base
    Port Map(
        clk       => clk,
        led_unreg => rm_out,
        led_reg   => rm_in);
    -------------------------------------

    --------------------------------------------------------
    --bm_start
    in0 : busmacro_xc4v_l2r_sync_narrow
    Port Map(
        input0 => bml2r(0),
        input1 => bml2r(1),
        input2 => bml2r(2),
    --------------------------------------------------------
Connectivity Information Example
• A simple example design with two PRRs
    • Two 32-bit up and down counter modules map to PRR 1
    • Two 8-bit up and down counter modules map to PRR 2
• Connectivity information is gathered from the .paif file and a connectivity graph is generated for system designer verification
Design Connectivity Information Table
    Module Name/Type    Incoming Connections    Outgoing Connections
    Base/Static         40                      40
    Counter/PR          32                      32
    Counter_sm/PR       8                       8
[Figure: Design Connectivity Graph. Static Region linked to Counter (32) and to Counter_sm (8)]
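
The information extraction phase can be pictured as a scan for annotation comments followed by the next instantiation line. This is a minimal sketch, assuming the names after `::` list the modules that map to the annotated region and that the instantiation follows its annotation directly. The function name, the returned dictionary layout, and the regular expressions are hypothetical and are not the DAPR tool's implementation.

```python
import re

# Matches "--PRR_start :: a, b", "--static_start:: static", or "--bm_start".
ANNOTATION = re.compile(r"^\s*--\s*(PRR_start|static_start|bm_start)\s*(?:::\s*(.*))?$")
# Matches an instantiation header such as "reconfig : rmodule".
INSTANCE = re.compile(r"^\s*(\w+)\s*:\s*(\w+)")


def extract_regions(vhdl_text):
    """Pair each annotation with the instantiation that follows it."""
    regions = []
    pending = None
    for line in vhdl_text.splitlines():
        tag = ANNOTATION.match(line)
        if tag:
            names = [n.strip() for n in (tag.group(2) or "").split(",") if n.strip()]
            pending = (tag.group(1), names)
            continue
        inst = INSTANCE.match(line)
        if pending and inst:
            kind, names = pending
            regions.append({"kind": kind, "modules": names,
                            "instance": inst.group(1), "component": inst.group(2)})
            pending = None
    return regions


sample = """\
-------------------------------------------------
--PRR_start :: prm_up, prm_down
reconfig : rmodule
Port Map( led_in=> rm_in_int, led_out=> rm_out_int);
-------------------------------------------------
--static_start:: static
led_registers : base
Port Map( clk=> clk, led_unreg=> rm_out, led_reg=> rm_in);
"""
for region in extract_regions(sample):
    print(region)
```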

DAPR V4LX25 Device Library
• Device divided into 3 banks
    • Bank 0 (left), Bank 1 (right), Bank 2 (center)
• Resource representation
    • Single letter with a prefix of either 1 or 0
    • Letters are S for slices, D for DSP48s, F for FIFO16s, R for RAMB16s, C for DCMs, G for BUFGs
    • Prefix of 0 means resource occupied, 1 means resource vacant
    • Checking individual values identifies both resource type and resource availability
• Device library file will be shown in demo
[Figure: device layout with Bank 0, Bank 2, and Bank 1 from left to right]
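
The resource encoding can be decoded with a small lookup. This is a minimal sketch, assuming each entry is a two-character token such as `1S` (prefix digit followed by the resource letter). The actual file layout is not shown in the slides, so the token format and the function names are assumptions.

```python
RESOURCE_TYPES = {
    "S": "slice", "D": "DSP48", "F": "FIFO16",
    "R": "RAMB16", "C": "DCM", "G": "BUFG",
}


def decode(token):
    """Return (resource type, is_vacant) for a token such as '1S' or '0D'."""
    if len(token) != 2 or token[0] not in "01" or token[1] not in RESOURCE_TYPES:
        raise ValueError("unrecognized device library token: %r" % token)
    # Prefix 1 means the resource is vacant, prefix 0 means it is occupied.
    return RESOURCE_TYPES[token[1]], token[0] == "1"


def count_vacant(tokens):
    """Count vacant resources per type in a sequence of tokens."""
    counts = {}
    for token in tokens:
        kind, vacant = decode(token)
        if vacant:
            counts[kind] = counts.get(kind, 0) + 1
    return counts


print(count_vacant(["1S", "1S", "0S", "1D", "0R"]))
```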

DAPR Overlay Generation
• Overlay generation uses a cluster growth algorithm
• Cluster growth algorithm works in two steps
    • Linear ordering of modules
        • Choose a seed module from the initial set of modules and move it to a new set of ordered modules (initially an empty set)
        • Compute gain for each remaining module (gain is number of connecting nets)
        • Move the module with the highest gain to the set of ordered modules and repeat from gain computation until no modules remain in the initial set
    • Place ordered modules on floorplan space
• Two types of floorplan growth: vertical and diagonal
• Current overlay generator builds floorplans vertically
    • Advantage: bitstream size will be smaller
    • Disadvantage: routing is difficult and will take longer
[Figure: floorplan growth directions, diagonal (left) and vertical (right); colored blocks represent PRMs; block size 1 CLB wide and 16 CLB tall]
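
The linear ordering step can be sketched in a few lines. This is a minimal sketch, assuming that a module's gain is the number of nets connecting it to the modules already ordered, with ties broken by input order. The net-count dictionary, the module names, and the function name are hypothetical.

```python
def linear_order(modules, nets, seed):
    """Order modules by repeatedly moving the highest-gain module.

    nets maps frozenset({a, b}) to the number of nets between a and b.
    """
    ordered = [seed]
    remaining = [m for m in modules if m != seed]
    while remaining:
        # Gain of a candidate: nets connecting it to the ordered set so far.
        best = max(
            remaining,
            key=lambda m: sum(nets.get(frozenset((m, placed)), 0) for placed in ordered),
        )
        remaining.remove(best)
        ordered.append(best)
    return ordered


# Hypothetical four-module design with illustrative net counts.
nets = {
    frozenset(("A", "B")): 1,
    frozenset(("A", "C")): 3,
    frozenset(("C", "D")): 4,
    frozenset(("B", "D")): 1,
}
print(linear_order(["A", "B", "C", "D"], nets, seed="A"))
```

The resulting order would then feed the placement step, which in the current generator stacks modules vertically.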

Results – Low-Level DAPR Design Flow
• Numerical results: case study implementation results with a 32-bit counter
• More designs are under test
    • CORDIC
    • FFT
    • Matrix multiplier
[Figure: implemented floorplan; block size 1 CLB wide and 16 CLB tall]

Kalman Filter Case Study
• Data format
    • For the X and Y coordinates
        • 16-bit fixed-point representation: 1 sign bit, 8 integer bits, and 7 fractional bits
    • For the 2 FIFOs
        • Implemented using one Virtex-4 BRAM
        • Each is 32 bits wide (16 for X and 16 for Y) and 512 words deep
• The process of the system
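
The coordinate format can be illustrated with a pair of conversion helpers. This is a minimal sketch, assuming two's-complement encoding, saturation on overflow, and X in the upper half of the 32-bit FIFO word. The slides specify only the bit widths, so those three choices and the function names are assumptions.

```python
FRACTION_BITS = 7
SCALE = 1 << FRACTION_BITS  # 128 steps per unit


def to_fixed(value):
    """Encode a real number as a 16-bit word (1 sign, 8 integer, 7 fraction bits)."""
    raw = int(round(value * SCALE))
    raw = max(-(1 << 15), min((1 << 15) - 1, raw))  # saturate to 16 bits
    return raw & 0xFFFF


def from_fixed(word):
    """Decode a 16-bit word back to a real number."""
    raw = word - (1 << 16) if word & 0x8000 else word
    return raw / SCALE


def pack_xy(x, y):
    """Pack X and Y into one 32-bit FIFO word (X in the upper 16 bits)."""
    return (to_fixed(x) << 16) | to_fixed(y)


print(hex(to_fixed(1.5)), from_fixed(to_fixed(-1.0)), hex(pack_xy(1.5, -1.0)))
```

With 7 fractional bits the resolution is 1/128, and the representable range is -256 to just under 256.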

Kalman filter - Introduction
• Application
    • Target tracking in a linear system:
        • Provide accurate, continuously updated information about the position of a target given a sequence of observations of its position
        • Dynamic model and measurement model are linear
        • Noises are Gaussian distributed
• The system model:
    • The dynamic system model:
        • Uniform velocity motion:
    • The measurement model:

Kalman filter algorithm
• Initialization
• Predict
    • Predicted state:
    • Predicted covariance:
• Update
    • Innovation (measurement residual):
    • Innovation covariance:
    • Optimal Kalman gain:
    • Update state estimate:
    • Update estimate covariance:
• The simplified version: fixed-gain Kalman filter
    • Difference
        • The optimal Kalman gain is computed before processing and kept fixed
    • Application
        • If the system is a stationary stochastic process, the Kalman gain does not change
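
For reference, the predict and update steps named above take the following standard textbook form. The symbols are assumptions, since the slide's own notation did not survive: F is the state transition matrix, H the measurement matrix, Q and R the process and measurement noise covariances, z_k the measurement, and no control input is modeled.

```latex
\begin{align*}
\text{Predicted state:}        \quad & \hat{x}_{k|k-1} = F\,\hat{x}_{k-1|k-1} \\
\text{Predicted covariance:}   \quad & P_{k|k-1} = F\,P_{k-1|k-1}\,F^{T} + Q \\
\text{Innovation:}             \quad & \tilde{y}_k = z_k - H\,\hat{x}_{k|k-1} \\
\text{Innovation covariance:}  \quad & S_k = H\,P_{k|k-1}\,H^{T} + R \\
\text{Optimal Kalman gain:}    \quad & K_k = P_{k|k-1}\,H^{T}\,S_k^{-1} \\
\text{Updated state estimate:} \quad & \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\,\tilde{y}_k \\
\text{Updated covariance:}     \quad & P_{k|k} = (I - K_k H)\,P_{k|k-1}
\end{align*}
```

In the fixed-gain variant, K_k is replaced by a constant K computed offline, so the covariance equations drop out of the per-sample work.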

Type 1: Fixed-gain Kalman filter
• 8 multiplications
• Read and write FIFOs for Kalman filter part
    • The process control
        • If the TX FIFO is full, stop writing and stop reading data from the RX FIFO
        • -> stop processing data
    • The time interval guarantee
        • At least 3 clock cycles
• Parameters input
    • Parameters (fixed Kalman gain, initial values) are supplied as inputs instead of being pre-programmed in the system
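
A fixed-gain step for 2D tracking can be sketched as follows. This is a minimal sketch, assuming a state of (x, vx, y, vy), position-only measurements, a unit time step so that prediction needs only additions, and a full 4x2 gain matrix, which is one way to arrive at 8 multiplications per sample. The state layout and the function name are assumptions, not the authors' hardware design.

```python
def fixed_gain_step(state, measurement, gain):
    """One predict/update cycle with a precomputed 4x2 Kalman gain."""
    x, vx, y, vy = state
    zx, zy = measurement
    # Predict with uniform-velocity motion and a unit time step (additions only).
    px, py = x + vx, y + vy
    # Innovation: measured position minus predicted position.
    rx, ry = zx - px, zy - py
    # Update: a 4x2 gain times a 2-element innovation costs 8 multiplications.
    return (
        px + gain[0][0] * rx + gain[0][1] * ry,
        vx + gain[1][0] * rx + gain[1][1] * ry,
        py + gain[2][0] * rx + gain[2][1] * ry,
        vy + gain[3][0] * rx + gain[3][1] * ry,
    )


# Illustrative gain that trusts the measurement fully for position only.
gain = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(fixed_gain_step((0.0, 1.0, 0.0, 1.0), (5.0, 7.0), gain))
```

The gain values in the usage line are illustrative; in the system described above the gain and initial values are supplied as inputs at run time.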

Results & Analysis
• For application flexibility, 8 DSP48s are used to instantiate the multipliers
• Resource consumption (V4LX25)
    • Number of slices: 280 (2%)
    • Number of DSP48s: 8 (16%)
    • Maximum frequency 156.2 MHz, throughput 52 MSPS (3 cycles per sample)
    • Dynamic power consumption (100 MHz CLK): 0.06118 W
• Estimated results comparison
    • Bouncing ball experiment
    • Fixed-gain Kalman filter is suitable
    • Results calculated by FPGA are identical to MATLAB

Type 2: Basic version of Kalman filter
• Assuming all noises are non-coherent, four elements in the Kalman gain matrix are zero
• 4 divisions and 12 multiplications

Results & Analysis
• Reduce number of dividers and multipliers by resource reuse
• Estimated results comparison
    • Bouncing ball experiment
    • Kalman filter gain updates in each iteration
    • Results calculated by FPGA are identical to MATLAB
