F1-07: Simulative Performance Prediction of RC Systems for RC course 08
Presented by: Gongyu Wang, PhD student, F1, CHREC
Goals, Motivations
• Goals
  • Develop the first tool for simulative performance prediction of complex RC systems and apps
  • Explore design tradeoffs of complex, multi-paradigm systems & applications via modeling and simulation
• Motivations
  • Provide an efficient, comprehensive method of evaluating and prototyping RC systems
  • Facilitate fast system design tradeoffs
  • Enable application mapping/decomposition analyses without hardware implementations
RC Simulation Framework
• 6 key components of framework depicted in figure
• Many key tasks can be completed independently and in parallel
• Framework allows arbitrary applications to be simulated on arbitrary systems
• Component models and application scripts can be reused for rapid simulative analyses
[RC Simulation Framework Diagram]
• System models driven by application scripts produce simulative performance prediction results
• Systems modeled in 2007 included socket-connected FPGA platform (XD1000), PCI-based server cluster (Nallatech cluster), and custom supercomputer FPGA platform (SRC-6)
RC Simulation Framework
• Application Scripts
  • A simple, customizable script format provides interface between domains
  • Scripts characterize high-level behavior of application by defining key events
  • Key events include network transactions, processor computation blocks, RC core processing, and data transfers with RC devices
  • Simulation speed enhanced by abstracting away computation performed by the non-RC portion of the system
  • RC Events contain transfer size, core configurations, etc.
  • MPI Events contain data size, destination and source information, transfer type, etc.
• Architecture Modeling
  • Modeling and simulations performed in discrete-event environment called Mission-Level Designer (MLD)
  • Hierarchical, block-based modeling environment
  • Customized models developed via C-style programming
[Sample Application Script]
[Sample View of MLD Models]
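To make the idea of an event-per-line application script concrete, here is a minimal Python sketch of a reader for such a script. It is not the framework's parser: the `ScriptEvent` type and `parse_script` function are hypothetical, the one-event-per-line layout is an assumption, and the fields after each keyword are left uninterpreted because the slides do not define them. The sample text reuses the script shown on the Supported Features slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScriptEvent:
    kind: str        # event keyword, e.g. "rc_core_request_db"
    args: List[str]  # remaining fields, deliberately left uninterpreted


def parse_script(text: str) -> List[ScriptEvent]:
    """Split a script into events, assuming one event per line and // comments."""
    events = []
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        kind, *args = line.split()
        events.append(ScriptEvent(kind, args))
    return events


# Sample taken from the Supported Features slide; the line breaks are assumed.
sample = """//Script for double-buffered FPGA execution
rc_core_request_db 1 FFT 8192x1024 2 0
comp_prechunk 5.0
comp_postchunk 10.0
"""
events = parse_script(sample)
```

A system model would then consume `events` in order, dispatching on `kind` to the matching processor, network, or RC device model.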
Supported Features
• Double buffering
  • A series of double-buffered FPGA core requests can be specified by a single script line
  • Use the optional area of the script line
  • CPU computation blocks, performed per data chunk and in parallel with FPGA processing, defined via pre-chunk and post-chunk lines
• Power modeling
  • Exact determination of FPGA power consumption is a complex task
  • Dependent on results of place-and-route, values of input data that drives signal changes throughout fabric, etc.
  • Quick FPGA power consumption estimates can be obtained via worksheet method
  • Such power worksheets provided by both Altera and Xilinx
  • Power estimates rendered from device technology, resource usage, signal switching rate, clock frequency
[Example of Single/Double Buffering]
//Script for double-buffered FPGA execution
rc_core_request_db 1 FFT 8192x1024 2 0
comp_prechunk 5.0
comp_postchunk 10.0
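The benefit of double buffering can be illustrated with a simple timing estimate. The Python sketch below is a textbook-style approximation, not the model used inside the framework: it assumes every chunk has the same transfer time and the same FPGA compute time, and that a double-buffered design fully overlaps the transfer of one chunk with the computation of another. The function names and the example numbers are hypothetical.

```python
def single_buffered_time(n_chunks: int, t_xfer: float, t_comp: float) -> float:
    """Transfer and compute strictly alternate, so their times add up."""
    return n_chunks * (t_xfer + t_comp)


def double_buffered_time(n_chunks: int, t_xfer: float, t_comp: float) -> float:
    """Assumed ideal overlap: after the first transfer, the slower of the
    two activities sets the pace, and the last compute is not hidden."""
    if n_chunks == 0:
        return 0.0
    return t_xfer + (n_chunks - 1) * max(t_xfer, t_comp) + t_comp


# Illustrative values only (arbitrary time units), not measurements.
t_single = single_buffered_time(4, 2.0, 3.0)
t_double = double_buffered_time(4, 2.0, 3.0)
```

Under these assumptions, double buffering helps most when transfer and compute times are comparable, and gives no gain for a single chunk.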
SRC-6 System Modeling -- Methodologies
• Architecture Components Modeled
  • Microprocessor Node
  • MAP Node (user FPGA inside)
  • Hi-Bar Switch
  • Interfaces included in MAP and SNAP microprocessor models
  • FIFO queue for each output port
  • Delay calculation based on sustainable payload BW and documented latency
• Simulation assumptions
  • Map_allocate and Map_free take constant time to execute
  • Assumption appears accurate based on experimental observations
  • FPGA configuration time is measured from benchmark and serves as a constant value in the model
  • Configured, tunable in parameter file
  • Models for now only account for simple and common MAP functions
  • Constrained by the black box model of FPGAs in our framework
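The combination of a FIFO queue per output port and a delay based on payload bandwidth plus latency can be sketched as follows. This Python fragment is an assumption-laden illustration, not the MLD model: the `OutputPort` class, its parameter names, and the example bandwidth and latency values are all hypothetical, and messages are assumed to be served strictly in arrival order.

```python
class OutputPort:
    """FIFO output port: a message waits until the port is free, occupies it
    for size / bandwidth, then arrives after a fixed latency."""

    def __init__(self, bandwidth_bytes_per_s: float, latency_s: float):
        self.bandwidth = bandwidth_bytes_per_s  # sustainable payload bandwidth
        self.latency = latency_s                # documented fixed latency
        self.free_at = 0.0                      # time the port next becomes idle

    def send(self, arrival_s: float, size_bytes: float) -> float:
        """Return the delivery time of a message that reaches the port at arrival_s."""
        start = max(arrival_s, self.free_at)    # queue behind earlier traffic
        self.free_at = start + size_bytes / self.bandwidth
        return self.free_at + self.latency


# Illustrative numbers only: 100 bytes/s and 0.5 s latency.
port = OutputPort(bandwidth_bytes_per_s=100.0, latency_s=0.5)
first = port.send(arrival_s=0.0, size_bytes=200)   # port idle, sent at once
second = port.send(arrival_s=1.0, size_bytes=100)  # queued behind the first
```

The second message arrives while the port is busy, so its delivery time includes the queueing delay in addition to its own serialization time and the latency.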
List of Simulative Results Compiled
• Validation Results
  • Single-node SAR (Delta, XD1000, SRC-6), two data sets
  • Single-node MD (XD1000)
  • Single-node TD (Delta, SRC-6)
  • Parallel TD w/ two (2) and four (4) nodes (Delta)
  • Single-node HSI (Delta, SRC-6)
    • Includes use of two FPGA cores, ACSM and TD
• Simulative Case Studies
  • SAR performance vs. I/O parameters (Delta)
  • SAR performance vs. FPGA size (Delta, SRC-6)
  • SAR performance vs. enhanced core design (Delta, SRC-6, XD1000)
  • ACSM speedup vs. # of SRAM banks (XD1000)
  • ACSM speedup vs. # of spectral bands (Delta)
  • ACSM speedup vs. system size (XD1000, Delta)
  • HSI vs. system size and data network (Delta, XD1000)
  • MD speedup vs. data set size (XD1000)
  • MD speedup vs. system size (XD1000)
  • MD speedup vs. core design/parallelization strategy (XD1000)
SAR Simulative Studies
[SAR Validation Summary]
• SAR notes
  • Image A = 5616x27990 pixels
  • Image B = 5616x8192 pixels
• SRC-6 contains relatively small FPGA
  • Only one single-buffered FFT fits on device
  • Following chart predicts performance when larger FPGA is available
SAR Simulative Studies
• In two stages, an FFT and IFFT separated by a single vector multiply (VM)
• Currently, VM is performed by host processor
• Enhanced core combining FFT, VM, and IFFT simulated on all three systems for 2 image sizes
• Table below summarizes prediction results using enhanced SAR core
• On Delta, FPGA now produces speedup instead of slowdown, since I/O bottleneck is minimized
• On XD1000 and SRC-6, very little additional speedup is predicted, since FPGA transfers were not a bottleneck in the baseline
[SAR Enhanced Core Summary]
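Why a combined FFT/VM/IFFT core helps an I/O-bound platform far more than a compute-bound one can be seen from a rough per-stage cost estimate. The Python sketch below rests on assumptions that the slides do not state: that the baseline moves data across the host/FPGA link four times per stage (to the FPGA and back for the FFT, then again for the IFFT), that the enhanced core needs only two transfers, and that nothing overlaps. All function names and timing values are hypothetical and are not measurements from Delta, XD1000, or SRC-6.

```python
def baseline_stage_time(t_xfer, t_fft, t_ifft, t_vm_host):
    """FFT on FPGA, VM on host, IFFT on FPGA: four link transfers assumed."""
    return 4 * t_xfer + t_fft + t_vm_host + t_ifft


def enhanced_stage_time(t_xfer, t_fft, t_ifft, t_vm_fpga):
    """FFT, VM, and IFFT all on the FPGA: two link transfers assumed."""
    return 2 * t_xfer + t_fft + t_vm_fpga + t_ifft


# Made-up I/O-bound case: transfers dominate, so halving them nearly halves the stage.
io_bound_gain = baseline_stage_time(10, 1, 1, 2) / enhanced_stage_time(10, 1, 1, 0.5)

# Made-up compute-bound case: transfers are cheap, so the gain is small.
compute_bound_gain = (baseline_stage_time(0.125, 5, 5, 1)
                      / enhanced_stage_time(0.125, 5, 5, 0.5))
```

With these illustrative inputs the I/O-bound gain is close to 2x while the compute-bound gain stays near 1x, which matches the qualitative trend reported on the slide.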
Goals, Motivations, and Challenges
• Goals
  • Research concepts for an RC abstraction layer featured in app formulation stage
  • Allow specification of design/architecture via standardized high-level descriptions
  • Create mapping of abstract descriptions into script format that can be used by system models to drive simulative perf. predictions
  • Demonstrate methods using proof-of-concept case studies
  • Explore methods for enhanced modeling of FPGA core designs
• Motivations
  • Formulation is often neglected/bypassed during development of RC applications
  • Promote use of formulation with new abstraction layer and language
  • Provide user-friendly interface to simulation framework
[Conceptual Flow of RC Formulation Stage]
Introduction
• Build RCML on top of AADL
  • AADL is an SAE standard, recommended by multiple CHREC sponsors
  • Lacks algorithm exploration constructs, thus RCML will need to add this functionality
  • RCML should be designed without consideration of AADL mapping
• Separation of algorithm model from architecture model
  • RCML composed of concepts & structures for RC algorithms and apps
  • Even though we’re building this on top of AADL specs and tools, RCML should be considered separate from AADL
  • Algorithm specification will stand alone, independent of platform details (to a certain degree)
  • Stored as pure SW AADL spec
  • Platform architectures specified independently, based on AADL hardware classes and models
  • Library of common, tunable components to be included
  • A mapping procedure and file defined to map RCML algorithm model to architecture model
  • Mapping files connect otherwise separate alg. and arch. files
  • A tool will parse software, hardware and mapping files into comprehensive AADL HW/SW system specification
[Classes of AADL Components]
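The role of a mapping file that joins an otherwise independent algorithm model and architecture model can be sketched in a few lines of Python. This is only an illustration of the separation of concerns described above: it does not use AADL syntax or any AADL tool, and the `bind` function, the block names, and the component names are hypothetical.

```python
def bind(blocks, components, mapping):
    """Return {block: component} after checking that every algorithm block is
    mapped and that every mapping target exists in the architecture."""
    missing = [b for b in blocks if b not in mapping]
    unknown = [c for c in mapping.values() if c not in components]
    if missing or unknown:
        raise ValueError(
            f"unmapped blocks: {missing}; unknown components: {unknown}")
    return {b: mapping[b] for b in blocks}


# Three separate descriptions, as on the slide: algorithm, architecture, mapping.
algorithm = ["fft", "vector_multiply", "ifft"]   # hypothetical function blocks
architecture = {"host_cpu", "fpga_0"}            # hypothetical platform parts
mapping = {"fft": "fpga_0", "vector_multiply": "host_cpu", "ifft": "fpga_0"}

system = bind(algorithm, architecture, mapping)
```

Because the algorithm list never names a platform component, remapping `vector_multiply` to `fpga_0` changes only the mapping, leaving the algorithm and architecture descriptions untouched.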
RCML Algorithm Constructs
• Not all RC applications easily represented within one modeling domain
  • Environments like Ptolemy and MLD support domains that include data flow, FSM, discrete-event, continuous time, etc.
• Need to support multiple models of computation in formulation
  • Otherwise, usefulness of formulation language is limited
• To address domain issue, multiple classes of function blocks and ports defined
  • Data ports - used to transmit data sets or streams between data- and/or control-driven blocks
  • Control ports - used to transmit control signals between blocks, trigger control-driven blocks
[LabView Supported Programming Domains]
RCML Algorithm Constructs
• Function blocks represent fundamental element of RCML designs
  • Function blocks represent individual portions of algorithm
  • Function blocks defined using pre-conditions, post-conditions, and properties
  • No code is defined within block, just defined properties
  • Ports on function blocks define how block interacts with remainder of algorithm
• Data-driven function block
  • A function block that only contains data inputs
  • Execution triggered by receiving data on input data ports
  • Combination of input events required for triggering defined in block’s pre-condition
  • For support of data flow and discrete-event models
• Control-driven function block
  • A function block that contains control input(s)
  • Execution triggered by changes to received control signals
  • Support FSM defined behavior
• Specialized Controller function block defined for creating application controllers
  • Allow FSMs to be built inside controller
  • Controller should only accept and output control signals which drive external control-driven function blocks
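The two triggering rules can be made concrete with a small Python sketch. It is not RCML: the class names are hypothetical, and the `precondition` callable merely stands in for what RCML would declare as a property of the block rather than as code. The data-driven block fires once its pre-condition over the ports holding data is met; the control-driven block fires when a control signal changes value.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class DataDrivenBlock:
    name: str
    inputs: Set[str]                           # data input port names
    precondition: Callable[[Set[str]], bool]   # evaluated on ports holding data
    pending: Set[str] = field(default_factory=set)

    def receive(self, port: str) -> bool:
        """Record data on a port; return True if the block fires."""
        if port not in self.inputs:
            raise KeyError(port)
        self.pending.add(port)
        if self.precondition(self.pending):
            self.pending.clear()               # inputs consumed by this firing
            return True
        return False


@dataclass
class ControlDrivenBlock:
    name: str
    control: Dict[str, int] = field(default_factory=dict)

    def signal(self, port: str, value: int) -> bool:
        """Store a control value; return True only if it changed."""
        changed = self.control.get(port) != value
        self.control[port] = value
        return changed


# A block that needs data on both inputs before it executes.
vm = DataDrivenBlock("vm", {"a", "b"}, lambda ready: {"a", "b"} <= ready)
fired = [vm.receive("a"), vm.receive("b"), vm.receive("a")]

ctrl = ControlDrivenBlock("core_enable")
edges = [ctrl.signal("start", 1), ctrl.signal("start", 1), ctrl.signal("start", 0)]
```

A pre-condition that accepts any single port (for example `lambda ready: bool(ready)`) would give discrete-event style firing instead of the data-flow style shown here.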
Conclusions
• Developed and demonstrated framework for timely performance prediction of RC systems and applications
  • Three classes of RC systems modeled and presented
    • RC cluster, FPGA co-processor platform (XD1000), and custom supercomputing platform (SRC)
  • Simulative experiments conducted on each platform with multiple applications
    • Synthetic Aperture Radar (SAR), Hyper-Spectral Imaging (HSI), and Molecular Dynamics (MD)
• Proposed RCML to address the formulation stage of RC application development process
  • Built upon AADL
  • Architectural modeling methodology inherited from F1-07
  • Algorithmic modeling methodology constructed