Sesame: Opening new doors to Multi-level Design Space Exploration of Embedded Systems Architectures. Andy D. Pimentel. Computer Systems Architecture group. University of Amsterdam. Informatics Institute.
Outline • Background and problem statement • General overview of modeling methodology • Sesame environment • Application modeling layer • Architecture modeling layer • Mapping layer • Gradual refinement of architecture models • Event refinement using dataflow graphs • Both computational and communication refinement • Current status and future work
Embedded media systems • Modern embedded systems for media and signal processing must • support multiple applications and various standards • often provide real-time performance • These systems increasingly have heterogeneous system architectures, integrating • Dedicated hardware • High performance and low power/cost • Embedded processor cores • High flexibility • Reconfigurable components (e.g. FPGAs) • Good performance/power/flexibility
Rethinking system design • Design complexity forces us to reconsider current design practice • Classical design methods • often start from a single application specification, which is gradually synthesized into a HW/SW implementation • lack generalizability to cope with highly programmable architectures targeting multiple applications • also hamper extensibility to efficiently support future applications
Rethinking system design (cont’d) • Traditionally, designers rely only on detailed simulators for design space exploration • HW/SW co-simulation • This approach becomes infeasible for the early design stages • Effort to build these simulators is too high as systems become too complex • The low speeds of these simulators seriously hamper the architectural exploration • HW/SW co-simulation requires a HW/SW partitioning • A new system model is needed for assessment of each HW/SW partitioning
“Jumping down” the design pyramid [Figure: design pyramid. Vertical axes: abstraction (high to low) and effort (low to high); horizontal axis: alternative realizations. Levels from top to bottom: specification; back-of-the-envelope calculations; abstract executable models; cycle-true simulation models (10000 lines, mins/hours); synthesizable RTL models (10000+ lines, hours/days).]
[Figure: the same design pyramid, now annotated with “Sesame”, “Explore” and “Design by stepwise refinement”. Levels from top to bottom: specification; back-of-the-envelope calculations; abstract executable models (1000 lines, secs/minutes); cycle-true simulation models (10000 lines, mins/hours); synthesizable RTL models (10000+ lines, hours/days).]
Sesame: Simulation of Embedded Systems Architectures for Multi-level Exploration • Provides methods and tools to efficiently evaluate the performance of heterogeneous embedded systems and explore their design space • Different architectures, applications, and mappings • Different HW/SW partitionings • Smooth transition between abstraction levels • Mixed-level simulations • Promotes reuse of models (re-use of IP) • Targets the multimedia application domain • Techniques and tools also applicable to other application domains
Y-chart Design Methodology [Kienhuis] • Use separate models for application and architecture behavior [Figure: Y-chart with the elements Applications, Architecture, Mapping, Performance Analysis and Performance Numbers.]
Modeling and simulation using the Y-Chart methodology [Figure: the application model feeds traces of application events to the architecture model.] • Application model • Description of functional behavior of an application • Independent from architecture, HW/SW partitioning and timing characteristics • Generates application events representing the workload imposed on the architecture • Architecture model • Parameterized timing behavior of architecture components • Models timing consequences of application events • Explicit mapping of application and architecture models • Trace-driven co-simulation [Lieverse] • Easy reuse of both application and architecture models!
Application modeling • Using Kahn Process Networks (KPNs) • Parallel (C/C++) processes communicating with each other via unbounded FIFO channels • expresses parallelism in an application and makes communication explicit • blocking reads, non-blocking writes • Generation of application events: • Code is instrumented with annotations describing computational actions • Reading from/writing to Kahn channels represents communication behavior • Application events can be very coarse grain like “compute a DCT” or “read/write a pixel block”
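The process-network style described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: Sesame's application models are C/C++ processes, whereas the sketch uses Python threads and `queue.Queue` as a stand-in for unbounded FIFO channels. The process names, the `END` marker and the `run_network` helper are hypothetical and do not come from Sesame.

```python
import queue
import threading

END = None  # hypothetical end-of-stream marker, not part of the KPN model


def producer(out_ch, trace, n_blocks):
    """Process A: computes blocks and writes them to its output channel."""
    for block in range(n_blocks):
        trace.append("E")        # annotated computational action
        out_ch.put(block)        # write never blocks: the FIFO is unbounded
        trace.append("W")
    out_ch.put(END)


def transformer(in_ch, out_ch, trace):
    """Process B: read a block, compute on it, write the result."""
    while True:
        block = in_ch.get()      # blocking read, as in a Kahn process
        if block is END:
            out_ch.put(END)
            return
        trace.append("R")
        result = block * block   # stand-in for e.g. "compute a DCT"
        trace.append("E")
        out_ch.put(result)
        trace.append("W")


def consumer(in_ch, trace, results):
    """Process C: reads blocks and consumes them."""
    while True:
        block = in_ch.get()      # blocking read
        if block is END:
            return
        trace.append("R")
        results.append(block)
        trace.append("E")


def run_network(n_blocks):
    """Run A -> B -> C and return the output plus one event trace per process."""
    ab, bc = queue.Queue(), queue.Queue()   # maxsize=0 means unbounded
    traces = {"A": [], "B": [], "C": []}
    results = []
    threads = [
        threading.Thread(target=producer, args=(ab, traces["A"], n_blocks)),
        threading.Thread(target=transformer, args=(ab, bc, traces["B"])),
        threading.Thread(target=consumer, args=(bc, traces["C"], results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, traces
```

Because every process only performs blocking reads on its own channels, the output and each per-process trace are the same on every run, whatever the thread scheduling.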
Application modeling (cont’d) • Why Kahn process networks (KPNs)? • Fit very well to multimedia application domain • KPNs are deterministic • automatically guarantees validity of event traces when application and architecture simulators are executed independently • Application model can also be analyzed in isolation from any architecture model • Investigation of upper performance bounds and early recognition of bottlenecks within application
Architecture modeling • Architecture models react to application trace events to simulate the timing behavior • Accounting for functional behavior is not necessary! • Architecture modeling at varying abstraction levels • Starting at ‘black box’ level • Processing cores can model timing behavior of SW, HW or reconfigurable execution • parameterizable latencies for the application events • SW execution = high latency, HW execution = low latency • Allows for rapid evaluation of different HW/SW partitionings!
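A minimal sketch of this idea, assuming a single processing core and ignoring bus contention: the architecture model only adds up a latency per trace event, so switching a task from SW to HW execution amounts to swapping a latency table. The event names and all latency values below are hypothetical, chosen only for illustration.

```python
# Hypothetical application trace: read a block, compute a DCT, write a block.
TRACE = ["R", "DCT", "W"] * 3

# Hypothetical latency parameters (in cycles) for one processing core.
SW_LATENCY = {"R": 10, "W": 10, "DCT": 500}   # DCT executed in software
HW_LATENCY = {"R": 10, "W": 10, "DCT": 50}    # DCT executed in dedicated hardware


def simulate(trace, latency):
    """Return the virtual time needed to process the trace.

    No functional behavior is modeled: each event only advances virtual time
    by its parameterized latency.
    """
    time = 0
    for event in trace:
        time += latency[event]
    return time


sw_cycles = simulate(TRACE, SW_LATENCY)
hw_cycles = simulate(TRACE, HW_LATENCY)
```

The same trace is reused for both partitionings; only the latency parameters of the core change.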
Architecture modeling (cont’d) • Models implemented in Pearl • Object-based discrete event simulation language • Keeps track of virtual time • Provides simulation primitives • Inter-object communication via message-passing • Keeps track of simulation statistics • “RISC-like” language: keep it simple and make the common case fast • Lacks features not needed for architectural modeling (e.g., no dynamic data structures, dynamic object creation, etc.) • Result: high-performance modeling & simulation • High simulation speed and low modeling effort
Architecture modeling (cont’d) • Models implemented in SystemC • We added a layer on top of SystemC 2.0, called SCPEx (SystemC Pearl Extension) • Provides SystemC with Pearl’s message-passing semantics • Raises abstraction level of SystemC (e.g., no ports, transparent incorporation of synchronization) • Improves transaction-level modeling • SCPEx enables reuse of Pearl models in SystemC context • Makes Pearl → SystemC translation trivial • Provides link towards possible implementation • Facilitates importing SystemC IP models in Sesame
Sesame in layers [Figure: three layers. Application model: Kahn processes, which produce event traces. Mapping layer: virtual processors connected by buffers. Architecture model: Processor 1, Processor 2 and Mem connected by a bus.]
Sesame’s mapping layer • Maps application tasks (event traces) to architecture model components • Guarantees deadlock-free scheduling of application events
Scheduling of communication events • Because Read events are blocking (Kahn), some schedules may yield deadlock [Figure: an application model with elements A, B and C; the trace events Write(A), Read(C), Read(B) and Write(C); an architecture model with two processing cores connected by a bus.]
Sesame’s mapping layer • Maps application tasks (event traces) to architecture model components • Guarantees deadlock-free scheduling of application events • Accounts for synchronization behavior • Mapping layer executes in same time domain as architecture model • Transforms application-level events into primitives (events) for architecture model • More on this later on... • Tool for auto-generation of mapping layer
Y-chart Modeling Language (YML) • Flexible and persistent description (XML) of • The structure of application and architecture models (connecting library components) • SCPEx also supports YML! • The mapping of appl. models onto arch. models (i.e., the mapping layer) • YML embeds a scripting language within XML • Simplifies descriptions of complicated structures • Increases expressive power of components • E.g., a parameterized complex interconnect component modeling a network of arbitrary size • Increases reusability • Re-use of components and structures
An illustrative case study: M-JPEG • Lossy, Motion-JPEG encoder • Accepts both RGB and YUV formats • Includes dynamic quality control by on-the-fly adaptation of quantization and Huffman tables [Figure: video stream (RGB or YUV) → RGB to YUV conversion → video stream (YUV) → JPEG encoding → M-JPEG encoded video stream; additional label: observed bitrate.]
The platform architecture • Bus-based shared memory multiprocessor architecture [Figure: microProcessor (mP), VIP, DSP1, DSP2 and VOP connected to a shared Memory.]
M-JPEG case study (cont’d) • Exploration mapping [Figure: the M-JPEG application (RGB to YUV conversion, JPEG encoding, observed bitrate) shown together with the platform architecture (mP, VIP, DSP1, DSP2, VOP, Memory).]
M-JPEG case study (cont’d) • Kahn Process Network • Functional behavior • Library approach • Timing behavior [Figure: the M-JPEG Kahn process network (Video in, DMUX, RGB2YUV, DCT, Quantizer, VLE, Video out and Control, exchanging RGB, YUV, DCT and Q blocks, Q-tables, H-tables, statistics and bitstream packets) generates event traces for the architecture model (mP, VIP, DSP1, DSP2 and VOP on bus B1, with a memory holding header, table, DCT → Q, Q → VLE, packet, statistics and image buffers).]
M-JPEG design space exploration • Experimented with different • HW/SW partitionings • Application-architecture mappings • Processor speeds • Interconnect structures (bus, crossbar and Ω networks) • This took about 1 person-month (all modeling included) • Simulation performance: for 128x128 frames, a 270 MHz Sun Ultra 5 Sparcstation simulated 2.3 frames/second (≈ 0.43 secs/frame)
Mapping problem: implementation gap [Figure: an application behavioral model (what?) and an architecture model (how?), each expressed in its own primitive operations; labels: Exploration/refinement, Implementation.]
Mapping problem • Application events: Read, Write and Execute • Typical mismatch between application events and architecture primitives, examples: • Architecture primitives operating on different data granularities • Architecture primitives more refined than application events • Trace events from the application layer need to be refined • How? • Refine the application model • A transformation mechanism between the application and architecture models
Communication refinement • Let’s take the mismatch of communication primitives as an example • Assume following architecture communication primitives • Synchronization primitives • Check-Data (CD) • Signal-Room (SR) • Check-Room (CR) • Signal-Data (SD) • Data movement primitives • Load-Data (Ld) • Store-Data (St)
Communication refinement (cont’d) • Transformation rules for refining application-level communication events [Lieverse] • R → CD Ld SR (1) • W → CR St SD (2) • E → E (3) • How to transform traces of application events using (1), (2) and (3)?
[Figure: Processes A, B and C with their code. Process B generates REW event sequences:]
    while (1) { read(block); compute(); write(block); }
[The other two processes:]
    while (1) { read(block); compute(); }
    while (1) { compute(); write(block); }
Communication refinement (cont’d) [Figure: Processes A, B and C, and Processors 1, 2 and 3 sharing Mem over a bus.] • Assumption 1: processor 2 has local (block) memory • Transforming REW event sequences from process B: • R E W → CD Ld SR E CR St SD • Assumption 2: processor 2 has NO local (block) memory • Transforming REW event sequences from process B: • R E W → CD CR Ld E St SR SD
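The two transformations can be written down as a small sketch. It assumes the rules R → CD Ld SR, W → CR St SD and E → E from the previous slide, and the two resulting orderings for an R E W sequence given here. The function names are hypothetical, and the real mapping layer realizes this with dataflow models rather than list rewriting.

```python
# Per-event refinement rules (1), (2) and (3).
RULES = {
    "R": ["CD", "Ld", "SR"],   # check-data, load-data, signal-room
    "W": ["CR", "St", "SD"],   # check-room, store-data, signal-data
    "E": ["E"],                # execute is passed through unchanged
}

# Ordering used for a whole R E W sequence when the processor has no local
# (block) memory: both synchronization checks come before the data movement.
REW_NO_LOCAL_MEMORY = ["CD", "CR", "Ld", "E", "St", "SR", "SD"]


def refine_with_local_memory(trace):
    """Refine each application event independently using rules (1)-(3)."""
    refined = []
    for event in trace:
        refined.extend(RULES[event])
    return refined


def refine_without_local_memory(trace):
    """Refine R E W triples as a unit; fall back to the per-event rules."""
    refined = []
    i = 0
    while i < len(trace):
        if trace[i:i + 3] == ["R", "E", "W"]:
            refined.extend(REW_NO_LOCAL_MEMORY)
            i += 3
        else:
            refined.extend(RULES[trace[i]])
            i += 1
    return refined
```

For example, `refine_with_local_memory(["R", "E", "W"])` yields `CD Ld SR E CR St SD`, while `refine_without_local_memory(["R", "E", "W"])` yields `CD CR Ld E St SR SD`.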
IDF-based trace transformation • Virtual processors in mapping layer are refined to accomplish trace refinement • Integer-controlled DataFlow (IDF) model describes internal behavior of a virtual processor • Application events specify • what a virtual processor executes • with whom it communicates • Internal IDF model specifies • how the computations and communications take place at the architecture layer
IDF-based trace transformation (cont’d) [Figure: three layers. Application model (process network): Processes A, B and C. Mapping layer (dataflow): virtual processors X, Y and Z. Architecture model (discrete event): Processors 1, 2 and 3 and Mem on a bus.]
Communication refinement revisited [Figure: Processes A, B and C; virtual processors X, Y and Z; Processors 1, 2 and 3 sharing Mem over a bus.] • Assumption: processor 2 has NO local (block) memory • Transforming REW event sequences from process B: • R E W → CD CR Ld E St SR SD
Communication refinement revisited (2) [Figure: the event trace of process B (R, E, W) enters virtual processor Y, where a switch dispatches events to dataflow actors CD, CR, Ld, E, St, SR and SD; virtual processors X and Z are its neighbors. An actor X, with X = {Ld, St, E}, decomposes into X-init and X-exit, which communicate from/to the architecture model (processor 2, bus).]
Computational refinement [Figure: Processes A, B and C; virtual processors X and Z; Processors 1, 2 and 3 sharing Mem over a bus; the event sequence shown for the refined processor is R E E E W.]
Putting Sesame to use: An example design flow [Figure: design flow with two parts. Sesame framework: architecture simulation environment for system-level architecture exploration. Compaan/Laura (Leiden University) + Molen (Delft University): reconfigurable architecture framework for experimentation. Labels on the flow: Applications, Motion-JPEG encoder, DCT, Proposed architecture, Code suitable for FPGA execution, Detailed performance estimates.]
A real implementation using Compaan/Laura/Molen • Mapping M-JPEG on the Molen platform architecture [Figure: the MOLEN prototype with a custom computing unit and a reconfigurable microcode unit; the MJPEG code passes through Compaan, a C++ compiler and Laura; the DCT* kernel consists of IN, PreShift, 2D-DCT and OUT stages connected by pixel streams between in_block and out_block.]
The DCT* kernel:
    for k = 1:1:4,
      for j = 1:1:64,
        [Pixel(k,j)] = In(inBlock);
      end
    end
    for k = 1:1:4,
      if k <= 2,
        for j = 1:1:64,
          [Pixel(k,j)] = PreShift(Pixel(k,j));
        end
      end
      [Block] = 2D_dct(Pixel);
    end
    for k = 1:1:4,
      for j = 1:1:64,
        [outBlock] = Out(Pixel(k,j));
      end
    end
System-level simulation experiment • Modeling Molen with DCT mapped onto CCU • Validation against real implementation • Information from Compaan/Laura/Molen used for calibration of architecture model • Apply architecture model refinement • Keep M-JPEG application model untouched • DCT component in architecture model is refined • Operates at pixel level • Abstract pipeline model, deeply pipelined • Other architecture components operate at (pixel-)block level
Sesame’s IDF-based model refinement [Figure: Application model (M-JPEG): Processes A, B and C. Mapping layer: virtual processors X and Z, with the refined event sequence R E E E W; annotation: map DCT on CCU and refine. Architecture model (Molen): Processors 1, 2 and 3 and Mem on a bus.]
[Figure: IDF model of the DCT virtual processor. An event trace and a control trace drive a scheduler and the dataflow actors repeat-begin, cd/cr, cd, cr, ld, case-begin, in, preshift, 2d-dct (arch. delay: 91), out, case-end, st (arch. delay: 1), sr/sd and repeat-end, with token rates of 1, 4 and 64, latency and throughput annotations, and connections to/from the architecture model.]
Simulation results • Full software implementation • Simulation: 85024000 cycles • Real Molen: 84581250 cycles • Error: 0.5% • DCT mapped onto CCU • Simulation: 40107869 cycles • Real Molen: 39369970 cycles • Error: 1.9% • No tuning was done!
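The reported errors can be reproduced directly from the cycle counts on this slide. The sketch below assumes the error is the relative deviation of the simulated count from the real Molen measurement; the helper name is hypothetical.

```python
def relative_error_percent(simulated, measured):
    """Relative deviation of the simulated value from the measurement, in %."""
    return abs(simulated - measured) / measured * 100.0


# (simulated cycles, real Molen cycles), taken from the slide.
RESULTS = {
    "Full software implementation": (85024000, 84581250),
    "DCT mapped onto CCU": (40107869, 39369970),
}

for name, (simulated, measured) in RESULTS.items():
    error = relative_error_percent(simulated, measured)
    print(f"{name}: {error:.1f}%")
```

Rounded to one decimal, this gives 0.5% and 1.9%, matching the figures above.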
Where are we going? Some ongoing and future work
NoC modeling • So far, we mainly modeled bus-based systems • Networks-on-Chip (NoC) will be our (near) future • Standardized interfaces • Scalable (point-to-point) networks • Much more complex protocols (protocol stack?) • QoS aspects • Modeling NoCs • Topologies, switching & routing methods, flow-control, protocols, QoS, etc. • Communication mapping • Modeling at multiple abstraction levels • Gradual refinement • Role of IDF models
Architecture model calibration • Initial derivation of latency parameters: • documentation • educated guess • performance budgeting (what is the required parameter range?) • Next step: calibration with lower-level, external simulation models or prototypes, e.g. • Instruction set simulators (ISSs) • Compaan/Laura framework [Figure: the Sesame layers (Kahn processes, virtual processors with buffers, Processor 1, Processor 2 and Mem on a bus) with an example latency table for a processor: Op x = 50, Op y = 100, Op z = 10.]
Mixed-level system simulation • “Zoom in” on interesting system components in architecture model • Simulate these components at a lower level • Retain high abstraction level for other components • Saves modeling effort • May save simulation overhead • Integration of external simulation models • ISSs, SystemC models, etc. • Also allows calibration of higher-level models • BUT… • Mixed-level simulation can be complex! • multiple time domains and time grain sizes (synchronization) • differences in protocol and data granularity of components
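Mixed-level system simulation (cont’d) • Embedding external models • IDF-based refinement [Figure: processes A, B, C and D mapped via A’, B’, C’ and C’’ and a scheduler onto processors P1, P2, P3 and Mem, with an ISS and a SystemC model embedded as external models.]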
Towards real design space exploration • Sesame supplies basic methods & tools for evaluating application, architecture, and mapping combinations • Simulating entire design space is not an option • More is needed to explore large design spaces • What will be the initial design(s) to evaluate? • How to react when the evaluated architecture does not suffice? • We need steering before and during simulation • Design decisions using analytical modeling • Finding Pareto-optimal candidates using multi-objective optimization • Design evaluation using simulation
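As a small illustration of what “Pareto-optimal candidates” means here, the sketch below filters a set of design points down to the non-dominated ones, assuming two objectives that are both minimized (for example cost and execution time). The design points are hypothetical, and this brute-force filter stands in for whatever multi-objective optimizer is actually used.

```python
def dominates(a, b):
    """True if design a is at least as good as b in every objective
    and strictly better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points)]


# Hypothetical (cost, execution time) pairs for five candidate designs.
DESIGNS = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 2)]

front = pareto_front(DESIGNS)   # only these would be passed on to simulation
```

Only the designs on the front need the more expensive simulation-based evaluation; the others are worse in every respect than some alternative.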