Embedded Systems in Silicon (TD5102): Introduction and overview. Henk Corporaal, http://www.ics.ele.tue.nl/~heco/courses/EmbSystems, Technical University Eindhoven. DTI / NUS Singapore 2005/2006.
Contents • Trends • Platforms • Application mapping • Design flow • Summary H.C. TD5102
Observation 1: The 3 Cs • Convergence of the 3 Cs: computers, communications and consumer electronics • The computer enters its 3rd phase: computing power, then networking, then intelligent processing • The world is one network: all information and communication available wherever and whenever • Result: we get a smart environment
Observation 2: Current design practice • Y-chart (Gajski-Kuhn) • A design flow is a path in the Y-chart • Down to RT level the flow is largely manual [Figure: Y-chart with axes Behaviour, Structure and Physical; abstraction levels System, Algorithm, R/T, Logic, Circuit]
Observation 3: Informal system specification [Figure: system people (tasks, paper spec); hardware people (VHDL, Verilog); software people (C, ASM); integration]
Observation 4: Design productivity • Yes, we can fabricate the ICs, but … • Can we design them? • Can we program them? [Figure: complexity (log scale, 10^1 to 10^3) versus year (4 to 16); process technology +58%, HW design productivity +21%, SW productivity +8%; the differences form the HW gap and the SW gap]
Observation 5: More dynamic applications • Video (P. Kuhn, G. Diebel, “Complexity Analysis of the MPEG-4 VM 8.0,” ISO/IEC JTC1/SC29/WG11/MPEG97/m2862, Fribourg, October 1997) • 3D [Figure: load (0% to 100%) per frame (IPPP ...), frames 0 to 300, for the sequence weather, VO1, binary shape, 10 Hz, 112 kbit/s, QCIF; variation of a factor 2] [Figure: relative CPU load for 15 fps, 0% to 1200%; variation of an order of magnitude]
Observation 6: Memory problem • Processor-memory performance gap grows 50% / year [Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000); µProc (CPU, “Moore’s Law”): 55%/year; DRAM: 7%/year] [Patterson]
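The growth rates quoted on this slide can be compounded directly. The sketch below is only arithmetic on the two figures from the slide (55%/year and 7%/year); the function name `gap_after` is made up for illustration. Dividing the two growth factors gives a gap that widens by roughly 45% per year, which the slide rounds to 50%.

```python
# Annual growth factors quoted on the slide.
CPU_GROWTH = 1.55   # µProc: +55% per year
DRAM_GROWTH = 1.07  # DRAM: +7% per year

def gap_after(years, fast=CPU_GROWTH, slow=DRAM_GROWTH):
    """Ratio of fast to slow performance after `years`, both starting at 1."""
    return (fast / slow) ** years

yearly_gap_growth = gap_after(1) - 1.0
print(f"gap grows by about {yearly_gap_growth:.0%} per year")
print(f"gap after 20 years: about {gap_after(20):.0f}x")
```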
What do we learn from these observations? We need: • Short time-to-market (reuse, short design time) • Flexible solution (programmability, reconfigurability) • Scalability • Low power • Low cost • QoS control • All at sufficient performance!
Solution? • Platforms: HW and SW IP reuse; standardization (interfaces); QoS (quality of service) hooks • Advanced design flow for platforms: raise the abstraction level; tool support; modeling of power, cost and performance; predictable design
Lecture 1: Introduction • Trends • Platforms • Application mapping • Design flow • Summary
What is a platform? A platform is a generic but domain-specific information processing (sub-)system. In the future it will be available as a single chip (SoC) or a single package (SiP).
What is a platform? • HW properties: one or more programmable processors; advanced memory organization; programmable communication network; I/O (highly domain dependent) • Possible extra HW features: reconfigurable logic; domain-specific accelerators
What is a platform? • SW components: standardized RTOS; proper tooling for platform system design (compilers, models, exploration, debugging, simulation, …) • Possible extra SW features: middleware layer on top of the OS for features like QoS, domain-specific protocols, domain-specific SW interfaces, control of reconfigurable logic, library components, distributed / active network processing, billing, security
Example platform: Philips Nexperia • Platforms available in the billion-transistor era, e.g. TI OMAP, Sony Cell, Philips Nexperia, TRIPS, Xilinx Virtex-4 Pro, …
Future platforms • Example: smart networked devices [Figure: layered stack with active packets; virtual machine, protocols, multimedia (MPEG 21); network OS, library; accelerator hardware, reconfigurable hardware, programmable hardware; radio]
Future platform: architecture concept [Figure: hierarchy of communication networks (level 0, level 1, ..., level N), each connecting CPUs, accelerators, reconfigurable HW blocks, memory and I/O]
Future platforms: NoC realization • IP isles: 32-bit RISC microprocessor ~ 20 Kgates; MPEG decoding ~ 100 Kgates; wavelet filtering ~ 40 Kgates; SRAM; DRAM; FPGA block [Figure: IP cores attached through network interfaces to an on-chip network]
Lecture 1: Introduction • Trends • Platforms • Application mapping • Design flow • Summary
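Platform and platform design [Figure: layers Applications, Platform and Enabling technologies; SDT (system design technology) sits between applications and platform, PDT (platform design technology) between platform and enabling technologies; SDT and PDT together are the design technology]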
What is the system designer's problem? • Idea, then specification, then implementation • For an application (idea/specification), find an efficient mapping/implementation on a given realization space, under given constraints (cost, power P, energy E, time T, energy-delay product E*D, throughput, #pins, ..)
A (single) processor: what does it look like inside? [Figure: processor datapath with register file (r0, r1, r2, ...), function unit(s), load-store unit and data memory; processor control with instruction memory, instruction register and decode logic]
Mapping: placing operations in space and time • Code: d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y; [Figure: Data Dependence Graph (DDG) of this code, with inputs a, b, 2, y and z, two multiplications, three additions and one subtraction]
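To make the DDG concrete, here is a minimal sketch that encodes the five statements as a graph and computes the earliest level at which each operation could start if resources were unlimited. The name `t` for the intermediate product 2 * b and the helper names are assumptions made for illustration; they do not appear on the slide.

```python
# Each operation: result name -> (operator, operand names).
OPS = {
    "d": ("*", ["a", "b"]),
    "t": ("*", ["2", "b"]),   # hypothetical name for the 2 * b product
    "e": ("+", ["a", "d"]),
    "f": ("+", ["t", "d"]),
    "r": ("-", ["f", "e"]),
    "x": ("+", ["z", "y"]),
}

def predecessors(ops):
    """DDG edges: an operand is a predecessor only if another operation produces it."""
    return {n: [s for s in srcs if s in ops] for n, (_, srcs) in ops.items()}

def asap_levels(ops):
    """Earliest level per operation, assuming unlimited units and unit latency."""
    preds = predecessors(ops)
    level = {}

    def visit(name):
        if name not in level:
            level[name] = 1 + max((visit(p) for p in preds[name]), default=0)
        return level[name]

    for name in ops:
        visit(name)
    return level

levels = asap_levels(OPS)
print(levels)
print("critical path length:", max(levels.values()))
```

The critical path (three levels here) is a lower bound on the schedule length for any architecture with single-cycle operations.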
How to map these operations? • Architecture 1: one function unit; all operations have single-cycle latency • Schedule: cycle 1: *; cycle 2: *; cycle 3: +; cycle 4: +; cycle 5: -; cycle 6: + [Figure: DDG of the example code]
How to map these operations? • Architecture 2: one Add-sub unit and one Mul unit; all operations have single-cycle latency • Schedule (Mul | Add-sub): cycle 1: * | +; cycle 2: * | +; cycle 3: idle | +; cycle 4: idle | - [Figure: DDG of the example code]
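How to map these operations? • Architecture 3: one Add-sub unit and one Mul unit; Add/Sub takes 1 cycle, Mul takes 2 cycles • Schedule (Mul | Add-sub): cycle 1: * | +; cycle 2: (* continues) | idle; cycle 3: * | +; cycle 4: (* continues) | idle; cycle 5: idle | +; cycle 6: idle | - [Figure: DDG of the example code]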
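The schedules for Architectures 1 to 3 can be reproduced with a small greedy list scheduler. This is a sketch under stated assumptions, not the method used to produce the slides: operations are tried in dictionary order (that order is the priority), each unit is non-pipelined, and the names `t`, `list_schedule`, `ONE_FU` and `MUL_ADDSUB` are invented for illustration.

```python
OPS = {
    "d": ("*", ["a", "b"]),
    "t": ("*", ["2", "b"]),   # hypothetical name for the 2 * b product
    "e": ("+", ["a", "d"]),
    "f": ("+", ["t", "d"]),
    "r": ("-", ["f", "e"]),
    "x": ("+", ["z", "y"]),
}

def list_schedule(ops, unit_of, latency):
    """Greedy list scheduling; returns (start cycle per op, schedule length)."""
    preds = {n: [s for s in srcs if s in ops] for n, (_, srcs) in ops.items()}
    start, finish, unit_free = {}, {}, {}
    cycle = 1
    while len(start) < len(ops):
        for name, (op, _) in ops.items():      # dict order acts as priority
            if name in start:
                continue
            unit = unit_of[op]
            # Ready only if every producer finished in an earlier cycle.
            ready = all(p in finish and finish[p] < cycle for p in preds[name])
            if ready and unit_free.get(unit, 1) <= cycle:
                start[name] = cycle
                finish[name] = cycle + latency[op] - 1
                unit_free[unit] = cycle + latency[op]   # non-pipelined unit
        cycle += 1
    return start, max(finish.values())

ONE_FU = {"*": "fu", "+": "fu", "-": "fu"}               # Architecture 1
MUL_ADDSUB = {"*": "mul", "+": "addsub", "-": "addsub"}  # Architectures 2 and 3
UNIT_LATENCY = {"*": 1, "+": 1, "-": 1}
SLOW_MUL = {"*": 2, "+": 1, "-": 1}

for label, units, lat in [("arch 1", ONE_FU, UNIT_LATENCY),
                          ("arch 2", MUL_ADDSUB, UNIT_LATENCY),
                          ("arch 3", MUL_ADDSUB, SLOW_MUL)]:
    starts, length = list_schedule(OPS, units, lat)
    print(label, length, starts)
```

With this priority order the sketch yields schedule lengths of 6, 4 and 6 cycles, matching the three slides. Putting `t` before `d` in `OPS` lengthens the Architecture 2 schedule to 5 cycles, which shows how much the ordering heuristic matters.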
There are many mapping solutions [Figure: scatter of solutions (x), each a specific architecture and code schedule, plotted as execution time T against cost; the Pareto curve bounds the solution space] • Let S be the solution space containing solutions x = (x_i); then x is a Pareto point ⇔ x ∈ S, and ∀ y ∈ S (y ≠ x) ∃ i: x_i < y_i
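The Pareto condition above translates almost literally into code. In this sketch every solution is a tuple of costs to be minimized, for example (cost, execution time); the sample points are invented for illustration and are not taken from the slide.

```python
def pareto_points(solutions):
    """Keep x if, against every other y, x is strictly better in some dimension."""
    return [
        x for x in solutions
        if all(any(xi < yi for xi, yi in zip(x, y))
               for y in solutions if y != x)
    ]

# Hypothetical (cost, execution time) pairs.
candidates = [(1, 9), (2, 6), (3, 7), (4, 3), (5, 3), (6, 1)]
print(pareto_points(candidates))
```

Here (3, 7) is dropped because (2, 6) is better on both axes, and (5, 3) is dropped because (4, 3) is cheaper and equally fast.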
Can we do better? Yes, much better! • Transform the specification • Use a different architecture • Use a different mapping • Use speculative execution • … be creative …
Transforming the specification (1) • Example: tree height reduction • Based on associativity of the + operation: a + (b + c) = (a + b) + c [Figure: addition trees before and after the transformation]
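As a minimal sketch of tree height reduction, the code below builds a sum of four operands once as a chain and once as a balanced tree and compares their heights. The tuple representation and the helper names are assumptions chosen for illustration.

```python
def chain(operands):
    """Left-to-right chain: ((a + b) + c) + d."""
    tree = operands[0]
    for operand in operands[1:]:
        tree = ("+", tree, operand)
    return tree

def balanced(operands):
    """Balanced tree: (a + b) + (c + d), valid because + is associative."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return ("+", balanced(operands[:mid]), balanced(operands[mid:]))

def height(tree):
    """Number of additions on the longest path from a leaf to the root."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(height(tree[1]), height(tree[2]))

def evaluate(tree, env):
    if not isinstance(tree, tuple):
        return env[tree]
    return evaluate(tree[1], env) + evaluate(tree[2], env)

names = ["a", "b", "c", "d"]
values = {"a": 1, "b": 2, "c": 3, "d": 4}
print(height(chain(names)), height(balanced(names)))
print(evaluate(chain(names), values) == evaluate(balanced(names), values))
```

With enough adders the balanced form finishes in two steps instead of three. The rewrite is exact for integers; floating-point addition is not associative, so results may differ slightly there.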
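Transforming the specification (2) • Original: d = a * b; e = a + d; f = 2 * b + d; r = f - e; x = z + y; • Since r = f - e = 2*b + d - (a + d) = 2*b - a, the code reduces to: r = 2*b - a; x = z + y; (assuming d, e and f are not needed elsewhere) [Figure: reduced DDG computing r = (b << 1) - a and x = z + y]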
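A quick way to gain confidence in such an algebraic rewrite is to compare both versions on a range of inputs. The sketch below assumes integer operands and that only r and x are live afterwards; the function names are illustrative.

```python
from itertools import product

def original(a, b, y, z):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x

def transformed(a, b, y, z):
    r = (b << 1) - a   # 2*b - a, with the multiplication replaced by a shift
    x = z + y
    return r, x

# Exhaustive check over a small integer range (a sanity check, not a proof).
same = all(original(a, b, y, z) == transformed(a, b, y, z)
           for a, b, y, z in product(range(-4, 5), repeat=4))
print(same)
```

The transformed version needs no multiplier at all: one shift, one subtraction and one addition, two of which are independent.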
Changing the architecture: adding more complex units • Replace three 2-input additions by one 4-input adder • Why is this faster? [Figure: three + nodes next to a single 4-input adder]
Changing the architecture: adding more complex units • In the extreme case, put everything into one unit! • Spatial mapping: no control flow
More complex control flow • Program part: -a- ; If cond Then -b- Else -c- ; -d- ; [Figure: Control Flow Graph (CFG): -a- ends in the test cond?, which selects -b- or -c-; both continue at -d-]
Mapping the CFG example: 3 options, which is best? • Option 1: -a- br c; -b- jmp d; -c-; -d- • Option 2: -a- br b; -c- jmp d; -b-; -d- • Option 3: -a- br c; -b-; -d-; -c- jmp d
Why not remove the control flow?
If-conversion shortens the schedule • Before: -a- br c; -b- jmp d; -c-; -d- • After: -a-; cond: -b- alongside !cond: -c-; -d- • Uses guarded instructions like: r3: add r1,r2,r5; !r3: mul r4,r5,#3
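The effect of guarded instructions can be modelled with a tiny interpreter: every instruction in the straight-line program is issued, but only those whose guard holds write their result. This is a sketch of the idea only; the instruction format, register names and the two sample operations are assumptions and do not mirror the syntax of any particular processor.

```python
from operator import add

def run_guarded(program, regs):
    """Run branch-free code; an op commits only when its guard is satisfied."""
    regs = dict(regs)
    for guard, dest, fn, srcs in program:
        name = guard.lstrip("!")
        enabled = bool(regs[name]) != guard.startswith("!")
        if enabled:  # models the hardware suppressing the write-back
            regs[dest] = fn(*(regs[s] for s in srcs))
    return regs

# If cond Then out = r1 + r2 Else out = r4 * 3, after if-conversion.
PROGRAM = [
    ("cond",  "out", add,             ("r1", "r2")),  # cond:  -b-
    ("!cond", "out", lambda v: v * 3, ("r4",)),       # !cond: -c-
]

base = {"r1": 2, "r2": 5, "r4": 7, "out": 0}
print(run_guarded(PROGRAM, {**base, "cond": 1})["out"])
print(run_guarded(PROGRAM, {**base, "cond": 0})["out"])
```

Because the program no longer contains a branch, both guarded operations can be scheduled in the same cycle on different units, which is where the shorter schedule comes from.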
Speculative execution makes it even shorter! • Before: -a- br c; -b- jmp d; -c-; -d- • After: -a-; -b- and -c- in parallel; -d- • Why not execute -d- in parallel as well?
However: real life is much more complex • E.g. MPEG-4 (multimedia) • Huge requirements: > 10 GOP/s; > 6 GB/s; > 10 MB storage • Software specification: more than 200 000 lines of C; hundreds of files; written by approx. 80 teams
Can we handle this? • Current implementations: small images; decoding only; not real-time; several W; single task; limited dynamism • Wanted features: large images (HDTV); encoding and decoding; real-time; 100 mW (mobile); multiple tasks; dealing with dynamism
Lecture 1: Introduction • Trends • Platforms • Application mapping • Design flow • Summary
Embedded system design • How to map your application graph A(L,A,D) to a hardware graph (L,N,C)? • L: design level (e.g. architecture, implementation or realization level) • A: application components (e.g. tasks, operations, data structures) • D: dependences between application components • N: hardware components (e.g. processors, ASICs, FPGAs, memories) • C: connections between hardware components
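One possible way to write these two graphs down is sketched below. The class names, the example components and especially the validity rule (every component is placed, and every dependence either stays on one hardware component or crosses an existing connection) are assumptions made for illustration; the slide does not define them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ApplicationGraph:
    level: str              # L: design level
    components: frozenset   # A: tasks, operations, data structures
    dependences: frozenset  # D: (producer, consumer) pairs

@dataclass(frozen=True)
class HardwareGraph:
    level: str              # L: design level
    components: frozenset   # N: processors, ASICs, FPGAs, memories
    connections: frozenset  # C: pairs of connected hardware components

def is_valid_mapping(app, hw, mapping):
    """Check a mapping from application components to hardware components."""
    if app.level != hw.level:
        return False
    if any(mapping.get(c) not in hw.components for c in app.components):
        return False
    for src, dst in app.dependences:
        a, b = mapping[src], mapping[dst]
        if a != b and (a, b) not in hw.connections and (b, a) not in hw.connections:
            return False
    return True

app = ApplicationGraph("architecture",
                       frozenset({"decode", "filter"}),
                       frozenset({("decode", "filter")}))
hw = HardwareGraph("architecture",
                   frozenset({"cpu", "dsp"}),
                   frozenset({("cpu", "dsp")}))
print(is_valid_mapping(app, hw, {"decode": "cpu", "filter": "dsp"}))
```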
Abstraction levels (per level: specification languages, then the transformation to the next level) • Level 0, Requirements (idea): English; is modeled by level 1 • Level 1, Architecture: ES/RT-UML, Esterel, SDL; is implemented by level 2 • Level 2, Implementation: C++, JAVA, C, VHDL, SystemC; compiles into level 3 • Level 3, Realization: machine code, hardware modules [Figure label: exploration search area]
Design space exploration [Figure: a design point at level n-1 is carried by the design transformation LT(n-1,n) into an exploration search area at level n; exploration at level n searches this area by cost; the realization space contains the global optimum]
Design space exploration framework: another Y-chart
Design flow steps and constraints [Figure: from idea (high abstraction level) to realization (low abstraction level) through refinement steps and transformations, subject to architecture / platform constraints]
In which order should we perform the steps? • Decision trees [Figure: decision trees with step n and step n+1 applied in different orders]
Well-known phase ordering examples • Concurrency versus data management (e.g. loop partitioning versus array partitioning for a multiprocessor) • Scheduling versus register allocation • Logic synthesis versus placement and routing
Rule of thumb! • Perform the steps with the biggest impact first • What has the biggest impact depends on your interest (= cost function): minimize E, P, E*D, D, area, Npins, ...
Phase ordering example: Why fix data storage/transfer before concurrency management issues? • Recursive image processing algorithm on local neighborhoods: (for i : 0 .. I-1) :: (for j : 0 .. J-1) :: img[i][j] = f(img[i][j-k], old_img[i][j]); [Figure: image of I rows by J columns]
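The loop nest on this slide can be written out as runnable code to make its dependences visible. This is a sketch: the slide specifies neither the function f nor what happens when j - k falls outside the row, so the `border` value and the sample `f` below are assumptions.

```python
def recursive_filter(old_img, k, f, border=0):
    """img[i][j] = f(img[i][j-k], old_img[i][j]) over all rows i and columns j."""
    rows, cols = len(old_img), len(old_img[0])
    img = [[border] * cols for _ in range(rows)]
    for i in range(rows):          # rows never read from each other
        for j in range(cols):      # each pixel depends on one k columns to its left
            left = img[i][j - k] if j >= k else border  # assumed border handling
            img[i][j] = f(left, old_img[i][j])
    return img

# Hypothetical f: a running sum along each row.
result = recursive_filter([[1, 2, 3], [4, 5, 6]], k=1, f=lambda prev, cur: prev + cur)
print(result)
```

The recurrence runs along j only, so the i loop carries no dependence and is the natural candidate for unrolling or parallel execution.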
Why fix data storage/transfer before concurrency management issues? • Unrolling the outer loop (i) M times needs: M J-word FIFOs (image lines); M data paths [Figure: image of I rows by J columns; 14.4 mm² (0.7 µm)]