Structured Codesign for Manycore Systems. Jürg Gutknecht & Lisa (Ling) Liu, ETH Zürich. Sofsem Novy Smokovec, January 2011. About Me. 1968 System programming at Swissair 1977 PhD in Mathematics 1981 Joined Niklaus Wirth's Lilith/Modula team 1985 Sabbatical stay at Xerox PARC
Structured Codesign for Manycore Systems Jürg Gutknecht & Lisa (Ling) Liu, ETH Zürich Sofsem Novy Smokovec, January 2011
About Me • 1968 System programming at Swissair • 1977 PhD in Mathematics • 1981 Joined Niklaus Wirth's Lilith/Modula team • 1985 Sabbatical stay at Xerox PARC • 1986 Project Oberon together with Wirth • 2000 Academic languages researcher at MSR
Outline of Talk • Context & Vision • A Structured Approach • Use Cases • Programming Language & Compiler • Power Management Codesign • Hardware Library
Context & Vision Some context of the project and a vision
Microsoft Innovation Cluster • Launched in 2008 by Microsoft (Research) • Volume: 5 years / $5 million • Theme: embedded systems software • Participants • ETH Zürich (3 projects) • EPFL Lausanne (4 projects) • Goals • Research in embedded systems • Technology transfer • Education • „Supercomputer in the pocket“ is one of the projects
Supercomputer in the Pocket • Manycore architecture for embedded systems on the basis of programmable hardware (FPGA) • High-performance computing in the small • Generic technology for a wide range of apps • Sensor-driven medical IT • Data streaming in financial apps • Running robot with limb control • Real-time audio processing • Hardware/software design from the ground up is the focus of this talk
People Involved • Microsoft Research • Chuck Thacker (consultant) • ETH Zürich • Niklaus Wirth (processor design) • Jürg Gutknecht (project leader) • Lisa (Ling) Liu (hardware design) • Felix Friedrich (compiler) • University Hospital Basel • Alexej Morozow (medical IT app)
The Vision • Custom hardware design for embedded systems • Programmers need no hardware knowledge • System design process at high level of abstraction • Fully automated mapping process to FPGA • FPGA resources are used efficiently
Semantic Gap • Program constructs: Object, Thread, Data structure, Statement, Communication, I/O, ... • Map to • FPGA resources: Lookup tables (LUT), Block RAMs (BRAM), DSP slices, …
A Structured Approach Big picture of our structured codesign approach
Options for How to Achieve It • Hardware compilation: Custom mapping of specific algorithm (or hot spots) to hardware circuits. • Uniprocessor: Single universal processor plus on-chip cache memory. Transparently connected to external memory. • SMP: Several universal processors, each with on-chip cache memory, and each transparently connected to external memory. Cache coherence mechanism needed. • Preconfigured: Several universal processors, each with private on-chip memory. Interconnected via on-chip network. One processor connected to external memory.
A Better Approach • Hardware/ software codesign based on a suitable high-level computing model and programming language • Fully automated mapping/ synthesizing to FPGA hardware based on suitable library of highly configurable hardware components
Our Computing Model • Active Cell (Actor) • Object with private state space • Behavior control thread • Communicating with other actors via channels • Actor Graph • Collection of interoperating actors running in parallel • Some actors connected to I/O via serial port
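The computing model above can be pictured with a minimal Python sketch. It assumes that an operating-system thread stands in for an active cell and that `queue.Queue` stands in for a channel; the names `adder_cell` and `run_actor_graph` are hypothetical and not part of the project.

```python
import queue
import threading


def adder_cell(in1, in2, out, count):
    """Active cell: private state, its own control thread, channels only."""
    for _ in range(count):      # a real cell would loop forever
        i = in1.get()           # blocking receive on channel in1
        j = in2.get()           # blocking receive on channel in2
        out.put(i + j)          # send the result on channel out


def run_actor_graph(pairs):
    """Tiny actor graph: the caller plays the I/O actor feeding one cell."""
    in1, in2, out = queue.Queue(), queue.Queue(), queue.Queue()
    cell = threading.Thread(target=adder_cell,
                            args=(in1, in2, out, len(pairs)))
    cell.start()
    for a, b in pairs:
        in1.put(a)
        in2.put(b)
    results = [out.get() for _ in pairs]
    cell.join()
    return results


print(run_actor_graph([(1, 2), (10, 20)]))  # prints [3, 30]
```

The cell shares no variables with its environment; all interaction goes through the three channels, which is the property the model relies on.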
Our Hardware Library • TRM processor (Tiny Register Machine) • Extremely simple • Two level pipelined instruction execution • Several variants • VTRM (vectors via DSP), DTRM (DMA) • Communication FIFO • Ring buffer • Sizes 32, 64, 128, 1024 • I/O controllers • DDR2, CF, LCD, UART
Mapping Actor Graph → FPGA • Actor → TRM processor („core“), instruction memory, data memory • Communication channel → FIFO buffer • I/O → I/O controllers connected to cores
TRM/FIFO Cooperation • [Diagram: channel → FIFO → recv → TRM (with memory M) → send → FIFO → channel] • Fully orchestrated by TRM • No interrupts!
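A minimal Python sketch of a FIFO ring buffer with interrupt-free, polling-style access. It assumes that the core checks a full/empty status before each transfer; the slides do not say how the TRM actually waits, and `RingFifo`, `try_send`, `try_recv` and `recv` are hypothetical names.

```python
class RingFifo:
    """Fixed-capacity ring buffer; the sizes on the slides (32 to 1024)
    are powers of two, so indices can be wrapped with a bit mask."""

    def __init__(self, capacity=32):
        assert capacity > 0 and capacity & (capacity - 1) == 0
        self.buf = [0] * capacity
        self.mask = capacity - 1
        self.head = 0   # running count of values read
        self.tail = 0   # running count of values written

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        return self.tail - self.head == len(self.buf)

    def try_send(self, value):
        """Write one value; return False instead of blocking when full."""
        if self.is_full():
            return False
        self.buf[self.tail & self.mask] = value
        self.tail += 1
        return True

    def try_recv(self):
        """Read one value; return None instead of blocking when empty."""
        if self.is_empty():
            return None
        value = self.buf[self.head & self.mask]
        self.head += 1
        return value


def recv(fifo, max_polls=1000):
    """Busy-wait receive: poll the status rather than take an interrupt."""
    for _ in range(max_polls):
        value = fifo.try_recv()
        if value is not None:
            return value
    raise TimeoutError("no data arrived")
```

Because the core itself decides when to look at the FIFO, no interrupt controller or context switching is needed in this scheme.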
Use Cases Two data driven applications of our system
Realtime Multichannel ECG Monitor • Analyze the activity of the heart, the morphology of the corresponding waves, and the heart rate variability (HRV), with the aim of detecting and classifying potential anomalies • The signal to be analyzed decomposes into 8 physical channels, each of them sampled at 500 Hz
Decomposition into Actor Graph • [Diagram: ECG bitstream → Signal input → Wave proc_1, Wave proc_2, ..., Wave proc_8 → QRS detect → HRV analysis → Disease classifier → out stream]
Actions • Receive ECG signal from UART, compose individual samples, and distribute them to channel processors. • (Per channel): Precondition wave by suppressing noise via linear filtering; Detect the heart beats and contractions. • Detect QRS patterns and make a final decision about heart rate on the basis of standard multichannel logic. • Analyze the current heart rhythm and the heart rate variability (HRV). • Use decision tree logic to detect and classify arrhythmia events such as premature ventricular contractions (PVC), ventricular tachycardia etc. Feed results back to configure wave processing.
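A minimal Python sketch of the first two actions. Two assumptions are made that the slides do not confirm: samples arrive interleaved channel by channel, and a moving average stands in for the unspecified linear filter. The names `distribute` and `moving_average` are hypothetical.

```python
from collections import deque

NUM_CHANNELS = 8       # physical channels, from the slides
SAMPLE_RATE_HZ = 500   # per channel, from the slides


def distribute(interleaved):
    """Signal input actor: split one stream into per-channel sample lists.
    The round-robin interleaving is an assumption."""
    channels = [[] for _ in range(NUM_CHANNELS)]
    for index, sample in enumerate(interleaved):
        channels[index % NUM_CHANNELS].append(sample)
    return channels


def moving_average(samples, width=4):
    """Wave processor preconditioning: a basic linear low-pass filter.
    Averages over the last `width` samples (fewer at the start)."""
    window = deque(maxlen=width)
    filtered = []
    for s in samples:
        window.append(s)
        filtered.append(sum(window) / len(window))
    return filtered


# Aggregate input rate the signal input actor has to sustain.
total_rate = NUM_CHANNELS * SAMPLE_RATE_HZ   # 4000 samples per second
```

Each of the eight wave processors only sees its own channel, so the filters can run in parallel on separate cores.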
Xilinx Virtex-5 FPGA Development board
Resulting FPGA Configuration • [Diagram; surviving labels: TRM1, TRM2, TRM3, TRM4, TRM9, TRM10, TRM11, TRM12; FIFO1, FIFO8, FIFO9, FIFO16, FIFO17, FIFO18, FIFO19, FIFO20, FIFO33, FIFO34; UART Ctrl (RS232, ECG), CF Ctrl (CF), LCD Ctrl (LCD)]
Use of Resources • ECG Monitor • Maximum number of TRMs in communication chain
Comparative Power Usage • Preconfigured FPGA (TRM, IM/ DM, I/O, interconnect) • Fully configurable 86% saving!
Graphics Based Motion Detection • Problem: Detect moving objects in a series of image frames • Approach: Parallelize detection process by domain decomposition (into 4 parts) • Design: A reader process continuously reads frames from external memory and forwards them to (4) part-detection processes running in parallel and reporting detected movements
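A minimal Python sketch of the domain decomposition. It assumes simple frame differencing with a threshold as the detection method, which the slides do not specify, and it uses a thread pool in place of the four cores. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

PARTS = 4  # number of parts, from the slides


def split_rows(height, parts=PARTS):
    """Return (start, stop) row ranges that cover the frame in strips."""
    step = height // parts
    return [(p * step, height if p == parts - 1 else (p + 1) * step)
            for p in range(parts)]


def detect_part(prev, curr, rows, threshold):
    """Part-detection process: count pixels in one strip whose value
    changed by more than `threshold` between two frames."""
    start, stop = rows
    return sum(1
               for y in range(start, stop)
               for a, b in zip(prev[y], curr[y])
               if abs(a - b) > threshold)


def detect_motion(prev, curr, threshold=16):
    """Reader process: forward each strip to a worker, collect the counts."""
    strips = split_rows(len(curr))
    with ThreadPoolExecutor(max_workers=PARTS) as pool:
        return list(pool.map(
            lambda rows: detect_part(prev, curr, rows, threshold), strips))
```

The strips do not overlap, so the workers need no synchronization beyond the final collection of results.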
Performance Results • Data base • 10 frames of resolution 576 x 768 (432 KP) • Estimated performance • Transfer from external DDR2 memory ca. 40 MP/sec • Computation: 4 x 31 MP/sec • Total time used per frame 55 ms • Total throughput 18 frames/ sec
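The frame size and throughput figures can be cross-checked with a few lines of Python. The 55 ms per frame is taken from the slide as given, not derived here; reading "MP" as 10^6 pixels is an assumption.

```python
FRAME_TIME_S = 0.055        # total time per frame, from the slides
DDR2_RATE = 40e6            # pixels per second, assuming MP = 10**6 pixels

pixels = 576 * 768                       # 442368 pixels per frame
kilopixels = pixels / 1024               # 432.0, the "432 KP" on the slide
frames_per_sec = 1 / FRAME_TIME_S        # about 18.2, the "18 frames/sec"
transfer_ms = pixels / DDR2_RATE * 1000  # about 11 ms to fetch one frame

print(pixels, kilopixels, round(frames_per_sec, 1), round(transfer_ms, 1))
```

Under these assumptions the transfer from DDR2 accounts for roughly a fifth of the 55 ms frame time.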
Programming Language & Compiler Programming language & automated mapping
The ActiveCells Language • History & Profile • Evolution of Pascal, Modula, Oberon • Actor based • Compositional • Active cell (Actor) • Object with active behavior, communicating via channels • Assembly • Network of interoperating active cells • Reusable software component with ports interface
Example of Functional Actor
  F = actor (in1, in2: instr; out: outstr);
  var i, j: integer;
  begin
    loop
      recv(in1, i); recv(in2, j);
      send(out, someOp(i, j))
    end
  end
Example of User Interface Actor
  UI = actor (out1, out2: outstr; in: instr);
  var i, j, k: INTEGER;
  begin
    loop
      RS232.RecvInt(i); RS232.RecvInt(j);
      send(out1, i); send(out2, j);
      recv(in, k); RS232.SendInt(k)
    end
  end
Examples of Assemblies • Assembly A without ports: a UI actor (attached to RS232) connected to an actor F • Assembly B with ports: in1 to in4 are delegated to two F actors, whose outputs are connected to an actor G; the output of G is delegated to out
Assembly A Code
  assembly A; (* without ports *)
  import RS232;
  type
    F = actor (in1, in2: instr; out: outstr);
    UI = actor (out1, out2: outstr; in: instr);
  var ifc: UI; f: F;
  begin
    new(ifc); new(f);
    connect(ifc.out1, f.in1);
    connect(ifc.out2, f.in2);
    connect(f.out, ifc.in)
  end A.
Assembly B Code
  assembly B (in1, in2, in3, in4: instr; out: outstr); (* with five ports *)
  type
    F, G = actor (in1, in2: instr; out: outstr);
  var f1, f2: F; g: G;
  begin
    new(f1); new(f2); new(g);
    connect(f1.out, g.in1);
    connect(f2.out, g.in2);
    delegate(in1, f1.in1); delegate(in2, f1.in2);
    delegate(in3, f2.in1); delegate(in4, f2.in2);
    delegate(out, g.out)
  end B.
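The wiring of assembly B can be mimicked in Python as a rough sketch. It assumes threads for actors and `queue.Queue` for channels: a queue shared by two actors plays the role of `connect`, and a queue handed out to the caller plays the role of `delegate`. The names `binary_actor`, `assembly_b` and `demo` are hypothetical.

```python
import queue
import threading


def binary_actor(op, in1, in2, out, count):
    """Stand-in for actors F and G: receive two values, send op(i, j)."""
    for _ in range(count):
        out.put(op(in1.get(), in2.get()))


def assembly_b(count, f_op, g_op):
    """Build assembly B: f1 and f2 feed g. Returns the five outer ports."""
    in1, in2, in3, in4, out = (queue.Queue() for _ in range(5))
    f1_to_g, f2_to_g = queue.Queue(), queue.Queue()   # the two connect calls
    wiring = [
        (f_op, in1, in2, f1_to_g),       # f1, inputs delegated to in1, in2
        (f_op, in3, in4, f2_to_g),       # f2, inputs delegated to in3, in4
        (g_op, f1_to_g, f2_to_g, out),   # g, output delegated to out
    ]
    threads = [threading.Thread(target=binary_actor,
                                args=(op, a, b, c, count))
               for op, a, b, c in wiring]
    for t in threads:
        t.start()
    return (in1, in2, in3, in4, out), threads


def demo():
    """Feed one value into each input port and read the single result."""
    (in1, in2, in3, in4, out), threads = assembly_b(
        1, lambda i, j: i + j, lambda i, j: i * j)
    for port, value in zip((in1, in2, in3, in4), (1, 2, 3, 4)):
        port.put(value)
    result = out.get()                   # (1 + 2) * (3 + 4)
    for t in threads:
        t.join()
    return result


print(demo())  # prints 21
```

From the outside the assembly behaves like a single actor with five ports, which is what makes it reusable as a component.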
Built-In Vector Types and Operators • Runge-Kutta (x, x1, k1, k2, … 3d vectors)
  while t <= tmax do
    k1 := f(t, x);
    k2 := f(t + dt/2, x + dt/2 * k1);
    k3 := f(t + dt/2, x + dt/2 * k2);
    k4 := f(t + dt, x + dt * k3);
    x1 := x + dt/3 * (1/2 * k1 + k2 + k3 + 1/2 * k4);
    Draw(x, x1); x := x1; t := t + dt;
  end
Built-In Matrix Types and Operators • Graphics pipeline (Matrix multiplication)
  M := Graphics.Proj(left, right, bot, top, near, far)
     * Graphics.Trans(0.0, 0.0, -d)
     * Graphics.RotX(elev)
     * Graphics.RotY(-azim)
     * Graphics.Trans(0.0, 0.0, -zm)
Actor Code (compiled to TRM code)
  F = actor (in1, in2: instr; out: outstr);
  var i, j: integer;
  begin
    loop
      recv(in1, i); recv(in2, j);
      send(out, someOp(i, j))
    end
  end
Assembly Code (compiled to Verilog)
  assembly B (in1, in2, in3, in4: instr; out: outstr);
  type
    F, G = actor (in1, in2: instr; out: outstr);
  var f1, f2: F; g: G;
  begin
    new(f1); new(f2); new(g);
    connect(f1.out, g.in1);
    connect(f2.out, g.in2);
    delegate(in1, f1.in1); delegate(in2, f1.in2);
    delegate(in3, f2.in1); delegate(in4, f2.in2);
    delegate(out, g.out)
  end B.
Automated Mapping to FPGA • [Flow diagram: source program + runtime library → hybrid compiler → TRM code, Verilog code, scripts (make.tcl, ram.bmm), memory images (.mem); together with the hardware library → Xilinx synthesizer → bits]
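For readers without the built-in vector operators, the same Runge-Kutta step can be written out in Python as a sketch. It assumes 3d vectors are plain tuples; `add`, `scale`, `rk4_step` and `integrate` are hypothetical helper names, and the loop counts steps instead of testing `t <= tmax` to avoid floating-point drift.

```python
def add(u, v):
    """Component-wise vector sum."""
    return tuple(a + b for a, b in zip(u, v))


def scale(c, u):
    """Scalar times vector."""
    return tuple(c * a for a in u)


def rk4_step(f, t, x, dt):
    """One classical fourth-order Runge-Kutta step for x' = f(t, x)."""
    k1 = f(t, x)
    k2 = f(t + dt / 2, add(x, scale(dt / 2, k1)))
    k3 = f(t + dt / 2, add(x, scale(dt / 2, k2)))
    k4 = f(t + dt, add(x, scale(dt, k3)))
    # Same weighting as on the slide: dt/3 * (k1/2 + k2 + k3 + k4/2)
    incr = add(add(scale(0.5, k1), k2), add(k3, scale(0.5, k4)))
    return add(x, scale(dt / 3, incr))


def integrate(f, x, t, dt, steps):
    """Advance the state by a fixed number of steps."""
    for _ in range(steps):
        x = rk4_step(f, t, x, dt)
        t += dt
    return x


# Example: x' = -x for each component, so x(1) should be close to x(0) / e.
decay = lambda t, x: scale(-1.0, x)
print(integrate(decay, (1.0, 2.0, 3.0), 0.0, 0.1, 10))
```

The helper calls make visible how much notation the built-in vector types save in the original loop.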
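A minimal Python sketch of how such a chain of 4x4 transforms composes. It covers only translation and rotation about the x axis, and it assumes column vectors, right-to-left application and angles in radians; none of these conventions is stated on the slide, and the function names are hypothetical.

```python
import math


def matmul(a, b):
    """4x4 matrix product a * b, with matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def trans(dx, dy, dz):
    """Homogeneous translation matrix."""
    return [[1, 0, 0, dx], [0, 1, 0, dy], [0, 0, 1, dz], [0, 0, 0, 1]]


def rot_x(angle):
    """Homogeneous rotation about the x axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]


def apply(m, p):
    """Multiply matrix m with the homogeneous column vector p."""
    return [sum(m[i][k] * p[k] for k in range(4)) for i in range(4)]


# The rightmost factor acts first: rotate, then translate.
m = matmul(trans(0.0, 0.0, -5.0), rot_x(math.pi / 2))
point = apply(m, [0.0, 1.0, 0.0, 1.0])   # close to [0, 0, -4, 1]
```

With built-in matrix types the whole pipeline collapses into the single product expression shown on the slide.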
Program Model Refinement • Each thread may spawn any number of mutually independent sub-threads • Advantages • Allows (lock-free) fine-grained parallel computing • Requirements • Needs core clustering • Needs runtime scheduling support • Needs barrier mechanism • [Diagram: thread A spawns sub-threads A1 and A2, which meet at a barrier]
Next Step • Use the ActiveCells language for developing embedded software on top of some standard IDE • Including design, programming, debugging, analyzing • Analyzer may need a cycle-accurate simulator • Use a fully automated tool to generate and burn an FPGA image
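The spawn and barrier pattern can be sketched in Python as follows. The sketch assumes operating-system threads in place of clustered cores and uses `threading.Barrier` from the standard library; `parallel_sum` is a hypothetical example task.

```python
import threading


def parallel_sum(data, workers=4):
    """Spawn independent sub-threads, then meet them at a barrier."""
    partial = [0] * workers                    # one private slot per sub-thread
    barrier = threading.Barrier(workers + 1)   # sub-threads plus the spawner

    def sub_thread(k):
        partial[k] = sum(data[k::workers])     # touches only its own slot
        barrier.wait()

    for k in range(workers):
        threading.Thread(target=sub_thread, args=(k,)).start()
    barrier.wait()       # the spawner continues only after all have arrived
    return sum(partial)


print(parallel_sum(list(range(101))))  # prints 5050
```

Since the sub-threads are mutually independent and write to disjoint slots, the barrier is the only synchronization point, which matches the lock-free claim on the slide.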
Power Management Codesign Integrated HW/SW power management system Collaboration with Prof. Shiao-Li Tsao, National Chiao Tung University, Taiwan
Clock Gating Strategy • [Figure: power usage with clock always on vs. with clock gating]
Power Management as Add-On • Clock gating • PM add-on generated automatically on demand • actor{ PM } (...); • Instruction • clockOff() • Control registers • TRM mode, clock rate, voltage • Signals • Data on port • I/O ports • Interop with PM controller • Internal memory • Backup of TRM state/registers • [Diagram: TRM with PM add-on circuitry; ports data, clk, out, in]
Clock Gating Off Procedure • Signal PM controller • Stop clock • [Diagram: TRM with PM add-on circuitry (data, clk, out, in), Clock Manager, PM Controller]
Clock Gating On Procedure • Data arrives • PM controller feeds in clock • TRM processor resumes • [Diagram: TRM with PM add-on circuitry (data, clk, out, in), Clock Manager, PM Controller]
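The off and on procedures can be illustrated with a toy Python simulation. It is only a sketch: it assumes the core requests clock-off whenever its input is empty and that arriving data re-enables the clock at once. The class `GatedCore` and the function `simulate` are hypothetical, and the power figures from the slides are not modeled.

```python
class GatedCore:
    """Toy model of a TRM with PM add-on; counts clock cycles delivered."""

    def __init__(self):
        self.clock_on = True
        self.inbox = []
        self.active_cycles = 0
        self.processed = []

    def data_arrives(self, value):
        """Data on the port signals the PM controller to feed in the clock."""
        self.inbox.append(value)
        self.clock_on = True

    def tick(self):
        """One tick of the external clock source."""
        if not self.clock_on:
            return                      # gated: no cycle reaches the core
        self.active_cycles += 1
        if self.inbox:
            self.processed.append(self.inbox.pop(0))
        else:
            self.clock_on = False       # clockOff(): PM controller stops clock


def simulate(arrivals, total_ticks):
    """`arrivals` maps a tick number to the value arriving at that tick."""
    core = GatedCore()
    for t in range(total_ticks):
        if t in arrivals:
            core.data_arrives(arrivals[t])
        core.tick()
    return core


core = simulate({2: "a", 7: "b"}, 10)
print(core.active_cycles, core.processed)  # prints 5 ['a', 'b']
```

In this run the core receives 5 of the 10 clock ticks; the remaining ticks are gated away while it waits for data.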