Luca Benini Lbenini@deis.unibo.it DEIS Università di Bologna

Optimization techniques for developing embedded applications on single-chip multiprocessor platforms

Embedded Systems
[Figure: microprocessor market shares, general-purpose systems versus embedded systems]

Example Area: Automotive Electronics • What is “automotive electronics”? • Vehicle functions implemented with electronics • Body electronics • System electronics: chassis, engine • Information/entertainment

Automotive Electronics Market Size
• 2006: 25% of the total cost of a car will be electronics
• 90% of future innovations in vehicles will be based on electronic embedded systems
Market size by year ($ billions):
Year    1998  1999  2000  2001  2002  2003  2004  2005
Market   8.9  10.5  13.1  14.1  15.8  17.4  19.3  21.0
[Figure: cost of electronics per car ($), axis from 0 to 1400, plotted over the same years]

Automotive Electronics Platform Example Source: Expanding automotive electronic systems, IEEE Computer, Jan. 2002

Digital Convergence: Mobile Example
• One device, multiple functions: entertainment, communication, broadcasting, computing, imaging, telematics
• Center of the ubiquitous media network
• Smart mobile device: the next driver for the semiconductor industry

4th Gen and Next-Gen Networks Includes: 802.20, WiMAX (802.16), HSDPA, TDD UMTS, UMTS and future versions of UMTS

SoC: Enabler for Digital Convergence
[Figure: future systems (4G/5G, DMB, WiBro, etc.) demand more than 100X today's performance, low power, complexity and storage; the SoC is the enabler]

Application pull [IMEC]
[Figure: applications plotted by year of introduction (2005 to 2015) against required computational efficiency, with reference levels at 5 GOPS/W, 100 GOPS/W and 1 TOPS/W]
Applications shown: 3D gaming, 3D TV, 3D ambient interaction, structured decoding, ubiquitous navigation, 3D projected display, autonomous driving, HMI by motion, gesture detection, structured encoding, expression recognition, Gbit radio, collision avoidance, H264 encoding, adaptive route, language dictation, gesture recognition, emotion recognition, UWB, A/V streaming, sign recognition, image recognition, 802.11n, Si Xray, mobile base-band, H264 decoding, auto personalization, fully recognition (security)

MPSoC Platform Evolution
• Today's SoCs could fit in 1 tile!
• Tile-based design
• Software stack: applications; middleware, RTOS, API, run-time controller; software optimization and mapping
[Figure: one tile at 45 nm (under 4 mm², 30 Mtr, under 1 GHz) containing a bus-based multiprocessor, local memory hierarchy, network interface and router, power and test management (V, Vt, Fclk, IL) and I/O peripherals, with 3D stacked main memory]

Multicores Are Here! [Amarasinghe06]
[Figure: number of cores per chip (log scale, up to 512) versus year (1970 to 20??)]
• Single-core processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium 1, Itanium 2
• Multicore processors shown: Picochip PC102, Ambric AM2045, Cisco CSR-1, Intel Tflops, Raza XLR, Cavium Octeon, Raw, Cell, Niagara, Opteron 4P, Broadcom 1480, Xeon MP, Xbox360, PA-8800, Tanglewood, Opteron, Power4, PExtreme, Power6, Yonah

MPSoC – 2005 ITRS roadmap [Martin06]

SoC → Solution-on-a-Chip
• Requires design of hardware AND software
[Figure: target system stack, from application software, middleware, RTOS and HAL down to chip software, IP and process; system levels: system/service, mobile terminal (system + e-SW), module, chip]

Design as optimization
• Design space: the set of "all" possible design choices
• Constraints: conditions that rule out the solutions we are not willing to accept
• Cost function: a property we are interested in (execution time, power, reliability…)

Hardware synthesis

Behavioral synthesis

Allocation, Assignment, and Scheduling Techniques Well Understood and Mature

Resource constraints
[Figure: schedules drawn against control steps]

Scheduling under resource constraints
• Intractable problem
• Algorithms:
  • Exact:
    • Integer linear program
    • Hu (restrictive assumptions)
  • Approximate:
    • List scheduling
    • Force-directed scheduling
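As an illustration of the approximate family, here is a minimal list-scheduling sketch in Python. It is not taken from the slides: the function name `list_schedule`, the longest-path priority, and the dependency edges of the 11-operation example graph are assumptions chosen for illustration (only the operation count, the two resource types and the unit delays come from the example slide).

```python
from collections import defaultdict

def list_schedule(ops, preds, res_type, capacity, delay=None):
    """Greedy resource-constrained list scheduling.

    ops: list of operation ids; preds: op -> list of predecessors;
    res_type: op -> resource type; capacity: type -> number of units (>= 1).
    Returns op -> start step (steps are numbered from 1).
    """
    delay = delay or {v: 1 for v in ops}
    succs = defaultdict(list)
    for v in ops:
        for p in preds.get(v, []):
            succs[p].append(v)

    # Priority of an operation: longest path (in steps) from it to a sink.
    prio = {}
    def longest(v):
        if v not in prio:
            prio[v] = delay[v] + max((longest(s) for s in succs[v]), default=0)
        return prio[v]
    for v in ops:
        longest(v)

    start, step = {}, 1
    while len(start) < len(ops):
        # Units still occupied by operations started in earlier steps.
        busy = defaultdict(int)
        for v, s in start.items():
            if s <= step < s + delay[v]:
                busy[res_type[v]] += 1
        # Ready operations: unscheduled, all predecessors finished.
        ready = [v for v in ops if v not in start and all(
            p in start and start[p] + delay[p] <= step for p in preds.get(v, []))]
        ready.sort(key=lambda v: (-prio[v], v))
        for v in ready:
            k = res_type[v]
            if busy[k] < capacity[k]:
                start[v] = step
                busy[k] += 1
        step += 1
    return start

# Assumed example graph: 6 multiplications, 5 ALU operations, unit delays.
ops = list(range(1, 12))
preds = {3: [1, 2], 4: [3], 5: [4, 7], 7: [6], 9: [8], 11: [10]}
res_type = {v: "mul" for v in (1, 2, 3, 6, 7, 8)}
res_type.update({v: "alu" for v in (4, 5, 9, 10, 11)})
schedule = list_schedule(ops, preds, res_type, {"mul": 2, "alu": 2})
latency = max(schedule.values())
```

With two multipliers and two ALUs this assumed graph is scheduled in four steps. The heuristic is fast but gives no optimality guarantee, which is why the exact ILP formulation is also of interest.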

ILP formulation
• Binary decision variables: $X = \{x_{il},\ i = 1, 2, \dots, n;\ l = 1, 2, \dots, \lambda + 1\}$
  • $x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
  • $\lambda$ is an upper bound on latency
• Start time of operation $v_i$: $t_i = \sum_l l \cdot x_{il}$

ILP formulation constraints
• Operations start only once: $\sum_l x_{il} = 1$, $i = 1, 2, \dots, n$
• Sequencing relations must be satisfied: $t_i \ge t_j + d_j \;\; \forall (v_j, v_i) \in E$, that is, $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0 \;\; \forall (v_j, v_i) \in E$
• Resource bounds must be satisfied. Simple case (unit delay): $\sum_{i : T(v_i) = k} x_{il} \le a_k$, $k = 1, 2, \dots, n_{res}$, $\forall l$

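ILP Formulation
$\min \sum_l l \cdot x_{nl}$ such that
• $\sum_l x_{il} = 1$, $i = 1, 2, \dots, n$
• $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0$, $i, j = 1, 2, \dots, n$, $(v_j, v_i) \in E$
• $\sum_{i : T(v_i) = k} \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k$, $k = 1, 2, \dots, n_{res}$; $l = 0, 1, \dots, \lambda$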
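To make the three constraint families concrete, the following Python sketch builds the binary variables $x_{il}$ from a candidate schedule and checks each family in turn. It is a feasibility checker, not an ILP solver; the function name `check_ilp_schedule` and the three-operation instance are hypothetical.

```python
def check_ilp_schedule(t, d, edges, res_type, a, lam):
    """Check a candidate schedule against the ILP constraints.

    t: op -> start step; d: op -> delay; edges: list of (j, i) with v_j before v_i;
    res_type: op -> resource type; a: type -> bound a_k; lam: latency bound.
    """
    steps = range(1, lam + 2)  # l = 1 .. lambda + 1
    # x[i][l] = 1 exactly when operation i starts in step l.
    x = {i: {l: int(t[i] == l) for l in steps} for i in t}

    # (1) Each operation starts exactly once.
    once = all(sum(x[i].values()) == 1 for i in x)

    # Start times recovered from x: t_i = sum_l l * x_il.
    start = {i: sum(l * x[i][l] for l in steps) for i in x}

    # (2) Sequencing: t_i - t_j - d_j >= 0 for every edge (v_j, v_i).
    seq = all(start[i] - start[j] - d[j] >= 0 for (j, i) in edges)

    # (3) Resource bounds: operations of type k executing in step l.
    res = all(
        sum(x[i][m]
            for i in x if res_type[i] == k
            for m in range(l - d[i] + 1, l + 1) if m in x[i]) <= a[k]
        for k in a for l in steps)
    return once and seq and res

# Hypothetical instance: two multiplications feeding one ALU operation,
# one multiplier and one ALU available, unit delays.
d = {1: 1, 2: 1, 3: 1}
edges = [(1, 3), (2, 3)]
res_type = {1: "mul", 2: "mul", 3: "alu"}
a = {"mul": 1, "alu": 1}

feasible = check_ilp_schedule({1: 1, 2: 2, 3: 3}, d, edges, res_type, a, lam=3)
resource_clash = check_ilp_schedule({1: 1, 2: 1, 3: 2}, d, edges, res_type, a, lam=3)
order_clash = check_ilp_schedule({1: 1, 2: 2, 3: 2}, d, edges, res_type, a, lam=3)
```

Here `feasible` is true, while the other two schedules break the resource bound and the sequencing relation respectively.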

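Example
• Resource constraints:
  • 2 ALUs; 2 multipliers
  • $a_1 = 2$; $a_2 = 2$
• Single-cycle operations: $d_i = 1 \;\; \forall i$
[Figure: sequencing graph with operations 1 to 11 (six multiplications, two additions, two subtractions and one comparison) between a source NOP (node 0) and a sink NOP (node n)]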

Example
• Operations start only once:
  $x_{11} = 1$
  $x_{61} + x_{62} = 1$
  …
• Sequencing relations must be satisfied:
  $x_{61} + 2x_{62} - 2x_{72} - 3x_{73} + 1 \le 0$
  $2x_{92} + 3x_{93} + 4x_{94} - 5x_{n5} + 1 \le 0$
  …
• Resource bounds must be satisfied:
  $x_{11} + x_{21} + x_{61} + x_{81} \le 2$
  $x_{32} + x_{62} + x_{72} + x_{82} \le 2$
  …

Example
[Figure: the sequencing graph scheduled in four control steps (TIME 1 to TIME 4) between the source and sink NOPs]

Resource-Efficient Application Mapping for MPSoCs
Multimedia applications: given a platform,
• Achieve a specified throughput
• Minimize usage of shared resources

Application design flow: the abstraction gap
• The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
• Programmers must be aware of the simplifying assumptions made by optimization tools.
• New methodology for multi-task application development on MPSoCs.
[Figure: design flow spanning optimization and development, with boxes: platform modelling, starting implementation, optimization, analysis, final implementation, optimal solution, platform execution]

Resource assignment and scheduling: the system
[Figure: nodes 1 to N, each with a processor, a limited-size tightly-coupled memory and a bus interface, connected through the shared system bus to the on-chip memory; tasks A to N, with WCETs Ta to Tn, run on the processors]
• Constraints: max time wheel period T; max bus bandwidth
• The on-chip memory is assumed to be infinite

The application
• Signal processing pipeline (T0, T1, T2, T3, …, T7) with a throughput constraint
• Each task is characterized by:
  • WCET
  • Memory requirements
• Queues for inter-processor communication are kept in TCM for efficiency reasons
• Program data in TCM (if space allows) or in on-chip memory
• Internal state in TCM (if space allows) or in on-chip memory

Communication-aware Allocation and Scheduling for Stream-Oriented MPSoCs
• Simplifying assumptions vs predictability
• Efficient solutions in reasonable time
• Pure ILP formulations suitable for small task sets
• Widespread use of heuristics
[Figure: a signal processing pipeline (T0, T1, T2, …, T7) to be mapped ("?") onto a message-oriented MPSoC architecture: ARM7 cores, each with private memory and local scratchpad memory, connected by a bus]

Master Problem model
• Assignment of tasks and memory slots (master problem)
  • $T_{ij} = 1$ if task $i$ executes on processor $j$, 0 otherwise
  • $Y_{ij} = 1$ if task $i$ allocates program data on processor $j$ memory, 0 otherwise
  • $Z_{ij} = 1$ if task $i$ allocates the internal state on processor $j$ memory, 0 otherwise
  • $X_{ij} = 1$ if task $i$ executes on processor $j$ and task $i+1$ does not, 0 otherwise
• Each task must be allocated to exactly one processor: $\sum_j T_{ij} = 1$ for all $i$
• Link between variables $X$ and $T$: $X_{ij} = |T_{ij} - T_{i+1,j}|$ for all $i$ and $j$ (can be linearized)
• If a task is NOT allocated to a processor, neither are its required memories: $T_{ij} = 0 \Rightarrow Y_{ij} = 0$ and $Z_{ij} = 0$
• Objective function (to be minimized): $\sum_i \sum_j \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i X_{ij} / 2 \right]$

Improvement of the model
• With the proposed model, the allocation solver tends to pack all tasks on a single processor and all required memory in its local memory so as to have ZERO communication cost: a TRIVIAL SOLUTION
• To improve the model, we add a relaxation of the subproblem to the master problem:
  • For each set $S$ of consecutive tasks whose summed durations exceed the real-time requirement, we impose that they cannot all run on the same processor:
    $\sum_{i \in S} WCET_i > RT \Rightarrow \sum_{i \in S} T_{ij} \le |S| - 1$ for every processor $j$
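The sets $S$ can be enumerated before the master problem is solved. The Python sketch below is one possible way to do it, with hypothetical names and WCET values: for each starting task it keeps only the shortest run of consecutive tasks whose summed WCET exceeds RT, since the cut for any longer run with the same start is implied by the shorter one.

```python
def relaxation_cuts(wcet, rt):
    """Return runs of consecutive task indices whose summed WCET exceeds rt.

    For each start index only the shortest violating run is kept.
    """
    cuts = []
    n = len(wcet)
    for i in range(n):
        total = 0
        for k in range(i, n):
            total += wcet[k]
            if total > rt:
                cuts.append(list(range(i, k + 1)))
                break
    return cuts

def violates(cut, assignment):
    """True if every task in the cut is assigned to the same processor."""
    return len({assignment[i] for i in cut}) == 1

# Hypothetical pipeline of four tasks with a real-time requirement of 8.
wcet = [4, 3, 5, 2]
cuts = relaxation_cuts(wcet, rt=8)

all_on_one = violates(cuts[0], [0, 0, 0, 0])   # packed solution is cut off
split = violates(cuts[0], [0, 0, 1, 1])        # split solution is allowed
```

For this instance `cuts` is `[[0, 1, 2], [1, 2, 3]]`, so each cut states that the three tasks it names may not share one processor.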


Sub-Problem model
• Task scheduling with static resource assignment (subproblem)
• We have to schedule the tasks, so we have to decide when they start
• Activity starting time: $Start_i :: [0..Deadline_i]$
• Precedence constraints: $Start_i + Dur_i \le Start_j$
• Real-time constraints: for all activities running on the same processor, $\sum_i (Start_i + Dur_i) \le RT$
• Cumulative constraints on resources:
  • processors are unary resources: cumulative([Start], [Dur], [1], 1)
  • memories are additive resources: cumulative([Start], [Dur], [MR], C)
• What about the bus?
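The meaning of the cumulative constraint can be spelled out with a small checker. The Python sketch below is an assumption-laden illustration, not the propagation algorithm a CP solver uses: it only verifies that, at every instant, the summed requirement of the running activities stays within the capacity. The function name and sample values are hypothetical.

```python
def cumulative_ok(start, dur, req, cap):
    """Check cumulative(start, dur, req, cap) for fixed start times.

    Resource usage can only rise at the start of an activity, so it is
    enough to test the load at each start time.
    """
    for t in set(start):
        load = sum(r for s, d, r in zip(start, dur, req) if s <= t < s + d)
        if load > cap:
            return False
    return True

# Processor as a unary resource: every activity requires 1, capacity is 1.
proc_sequential = cumulative_ok([0, 2, 5], [2, 3, 1], [1, 1, 1], 1)
proc_overlap = cumulative_ok([0, 1], [2, 2], [1, 1], 1)

# Memory as an additive resource: requirements MR against capacity C.
mem_too_small = cumulative_ok([0, 0], [3, 3], [4, 5], 8)
mem_fits = cumulative_ok([0, 0], [3, 3], [4, 5], 9)
```

The unary case is simply the additive case with all requirements and the capacity set to 1, which is why both resource kinds share one constraint.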

Bus model
• Unary resource: granularity of one clock cycle
• An arbitration mechanism decides the bus allocation
[Figure: bandwidth (bit/s) versus time during the execution of task i and task j, showing the state reads and state writes of each task against the max bus bandwidth]

Bus model
• Additive bus model
• Each task is charged an average bandwidth: size of program data / task execution time
• Task0 accesses input data: BW = MaxBW / NoProc
• The model does not hold under heavy bus congestion (more than 65% of total bandwidth)
• Bus traffic has to be minimized
[Figure: bandwidth (bit/s) versus time for task i and task j, with state reads and state writes stacked below the max bus bandwidth]
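A rough way to see whether a schedule stays inside the regime where the additive model holds is to add up the average bandwidths of concurrently running tasks and compare the peak with the 65% threshold. The Python sketch below does this under stated assumptions: each task draws a constant bandwidth equal to bits transferred divided by execution time, and the function name, task tuples and bandwidth figures are hypothetical.

```python
def peak_bus_utilisation(tasks, max_bw):
    """Peak fraction of max_bw requested under the additive model.

    tasks: list of (start, end, bits_transferred); each task is assumed to
    draw a constant bandwidth of bits_transferred / (end - start).
    """
    peak = 0.0
    for t, _, _ in tasks:  # demand can only rise when a task starts
        demand = sum(bits / (end - start)
                     for start, end, bits in tasks if start <= t < end)
        peak = max(peak, demand)
    return peak / max_bw

# Two overlapping tasks on a bus with a maximum bandwidth of 100 units.
tasks = [(0, 10, 300), (5, 15, 400)]
utilisation = peak_bus_utilisation(tasks, max_bw=100)
within_additive_regime = utilisation <= 0.65
```

While both tasks run, they request 30 + 40 = 70 units, i.e. 70% of the bus, so this hypothetical schedule falls outside the range in which the additive model is reported to hold.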

No-good generation
• Assignment of tasks and memory slots (master problem, IP solver)
• Task scheduling with static resource assignment (subproblem, CP solver)
• If no feasible schedule exists for the allocation provided by the master, a no-good is generated.
• We use a simple BUT EFFECTIVE one: identify the CONFLICTING RESOURCES $CR$. For each $R \in CR$, with $S_{TR}$ the set of tasks allocated on $R$: $\sum_{i \in S_{TR}} T_{iR} \le |S_{TR}| - 1$
• Other cuts are also possible [Hooker, Constraints 2005], but these are enough for our case and easy to extract
[Figure: loop in which the master problem (IP solver) passes a solution to the subproblem (CP solver), which returns either a solution or a no-good]
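The interaction between master and subproblem can be sketched as a loop. In the Python illustration below, brute-force enumeration stands in for the IP solver and a simple load check stands in for the CP scheduler; both stand-ins, the cost function and all names are assumptions made for the example, and only the shape of the no-good (not all tasks of $S_{TR}$ on $R$ again) follows the slide.

```python
from itertools import product

def solve_with_nogoods(n_tasks, n_procs, cost, conflicts_of):
    """Iterate master (cheapest allowed assignment) and subproblem check."""
    nogoods = []  # (R, S_TR): the tasks in S_TR may not all be placed on R
    while True:
        candidates = [
            a for a in product(range(n_procs), repeat=n_tasks)
            if all(any(a[i] != r for i in s) for r, s in nogoods)]
        if not candidates:
            return None, nogoods          # master problem infeasible
        a = min(candidates, key=cost)     # stand-in for the IP solver
        conflicting = conflicts_of(a)     # stand-in for the CP solver
        if not conflicting:
            return a, nogoods
        for r in conflicting:
            s = frozenset(i for i in range(n_tasks) if a[i] == r)
            nogoods.append((r, s))

# Hypothetical pipeline: durations, real-time requirement, two processors.
dur, rt = [4, 3, 5, 2], 8

def comm_cost(a):
    # One unit per pair of consecutive tasks placed on different processors.
    return sum(1 for i in range(len(a) - 1) if a[i] != a[i + 1])

def overloaded(a):
    load = {}
    for task, proc in enumerate(a):
        load[proc] = load.get(proc, 0) + dur[task]
    return [p for p, total in load.items() if total > rt]

assignment, nogoods = solve_with_nogoods(4, 2, comm_cost, overloaded)
```

On this instance the loop first tries the two packed assignments and one unbalanced split, records a no-good for each, and then returns `(0, 0, 1, 1)`.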

Computational efficiency
• CP and IP formulations simplified
• The hybrid approach clearly outperforms pure CP and pure IP techniques
• Search time bounded to 15 minutes
• Pure CP and pure IP found a solution in at most 50% of the instances
• The hybrid approach always found a solution

Validation of bus model
• Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail
• Lower threshold (50%) in the presence of communication hotspots
• Benefits of the additive model:
  • Task execution time almost independent of bus utilization
  • Performance predictability greatly enhanced

Validation of optimizer solutions
• MAX error lower than 10%
• AVG error equal to 4.7%, with a standard deviation of 0.08
• The optimizer turns out to be conservative in predicting infeasibility
• The flow was successfully applied to a GSM benchmark

Energy-Efficient Application Mapping for MPSoCs
Multimedia applications: given a platform,
• Achieve a specified throughput
• Minimize power consumption

Application Mapping
• The problem of allocation, scheduling and frequency selection for task graphs on multiprocessors in a distributed real-time system is NP-hard.
• New tool flows for efficient mapping of multi-task applications onto hardware platforms
[Figure: a task graph (T1 to T8) is allocated onto processors 1 to N, each with private memory, connected by an interconnect; scheduling and frequency selection then place the tasks on the resources over time so that they complete before the deadline]

Exploiting Voltage Supply
• Supply voltage impacts power and performance
  • Circuit slowdown: $T = 1/f = K / (V_{dd} - V_t)^a$
  • Cubic power savings: $P = C_{eff} \cdot V_{dd}^2 \cdot f$
• Just-in-time computation
  • Stretch execution time up to the maximum tolerable
[Figure: power versus available time, comparing fixed voltage + shutdown with variable voltage]
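Just-in-time computation can be illustrated with a few lines of Python. Since a task of $N$ cycles runs for $N/f$ seconds, its energy is $P \cdot N / f = C_{eff} \cdot V_{dd}^2 \cdot N$, so the slowest operating point that still meets the deadline costs the least. The sketch below assumes discrete (Vdd, f) pairs; the voltage values, the effective capacitance and the function name are hypothetical.

```python
def pick_operating_point(cycles, deadline_s, points, c_eff=1e-9):
    """Choose the (Vdd, f) pair with the lowest energy that meets the deadline.

    Energy of the task: E = P * T = (c_eff * Vdd^2 * f) * (cycles / f)
                          = c_eff * Vdd^2 * cycles.
    """
    feasible = [(c_eff * v * v * cycles, v, f)
                for v, f in points if cycles / f <= deadline_s]
    if not feasible:
        return None  # even the fastest point misses the deadline
    energy, v, f = min(feasible)
    return v, f, energy

# Hypothetical operating points: 200 MHz clock divided by 1, 2 and 4.
points = [(1.2, 200e6), (1.0, 100e6), (0.8, 50e6)]
choice = pick_operating_point(1_000_000, deadline_s=0.012, points=points)
fastest = pick_operating_point(1_000_000, deadline_s=0.006, points=points)
```

With a 12 ms deadline the task takes 5 ms at 200 MHz and 10 ms at 100 MHz, so the 100 MHz point is chosen and saves about 30% of the energy in this made-up setting; with a 6 ms deadline only the fastest point is feasible.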

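Energy/speed trade-offs: varying the voltages
• Mapping and scheduling: given (fastest frequency)
• Scheduling and voltage scaling: different voltages give different frequencies
[Figure: power versus time for tasks t1, t2, t3 on one CPU. At the fastest frequency the tasks leave slack before the deadline; with voltage scaling (Vdd, Vbs) they run at frequencies f1, f2, f3 and use the time up to the deadline]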

Target architecture - 2
• Homogeneous computation tiles:
  • ARM cores (including instruction and data caches)
  • Tightly coupled software-controlled scratch-pad memories (SPM)
  • AMBA AHB
  • DMA engine
  • RTEMS OS
• Homogeneous technology (0.13 µm), industrial power models (ST)
• Variable voltage/frequency cores with discrete (Vdd, f) pairs
  • Frequency dividers scale down the baseline 200 MHz system clock
• Cores use non-cacheable shared memory to communicate
• Semaphore and interrupt facilities are used for synchronization
• Private on-chip memory to store data

Application model
• A task graph represents:
  • A group of tasks T
  • Task dependencies
  • Execution times expressed in clock cycles: WCN(Ti)
  • Communication times (writes and reads) expressed as WCN(WTiTj) and WCN(RTiTj)
• These values can be back-annotated from functional simulation
[Figure: task graph with six tasks and arcs Task1→Task2, Task1→Task3, Task2→Task4, Task3→Task5, Task4→Task6, Task5→Task6; each task Ti is annotated with WCN(Ti) and each arc with WCN(WTiTj) and WCN(RTiTj)]
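One possible in-memory form of this task graph is a pair of dictionaries, one for task cycle counts and one for the write/read cycle counts on each arc. The Python sketch below uses the six-task topology of the figure; the cycle counts are invented, the function name is hypothetical, and the longest path it computes ignores processor and bus contention and assumes every task runs at the same frequency.

```python
def critical_path_cycles(wcn, comm):
    """Longest path through the task graph, in clock cycles.

    wcn: task -> WCN(Ti); comm: (src, dst) -> (WCN(W src dst), WCN(R src dst)).
    """
    memo = {}
    def finish(task):
        if task not in memo:
            arrivals = [finish(src) + w + r
                        for (src, dst), (w, r) in comm.items() if dst == task]
            memo[task] = max(arrivals, default=0) + wcn[task]
        return memo[task]
    return max(finish(task) for task in wcn)

# Topology from the figure; all cycle counts are made up for illustration.
wcn = {"T1": 100, "T2": 200, "T3": 150, "T4": 300, "T5": 100, "T6": 50}
comm = {("T1", "T2"): (10, 10), ("T1", "T3"): (10, 10),
        ("T2", "T4"): (10, 10), ("T3", "T5"): (10, 10),
        ("T4", "T6"): (10, 10), ("T5", "T6"): (10, 10)}

cycles = critical_path_cycles(wcn, comm)
seconds_at_200mhz = cycles / 200e6
```

Expressing costs in cycles rather than seconds keeps the graph independent of the frequency later chosen for each task; dividing by the selected frequency, as in the last line, converts back to time.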

Efficient Application Development Support
• Optimization tools generally rely on many simplifying assumptions
• Neglecting these assumptions in the software implementation can:
  • generate unpredictable and undesired system-level interactions
  • make the overall system error-prone
• We propose a complete framework to help programmers with software implementation:
  • a generic customizable application template → OFFLINE SUPPORT
  • a set of high-level APIs → ONLINE SUPPORT
• The main goals of our development framework are:
  • exact and reliable execution of the application after the optimization step
  • guarantees of high performance and constraint satisfaction

Customizable Application Template
• Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
• Programmers can intuitively translate the high-level representation into C code using our facilities and library.
• Users can specify:
  • the number of tasks included in the target application
  • their nature (e.g. branch, fork, or-node, and-node)
  • their precedence constraints (e.g. due to data communication)
  …thus quickly drawing the application's CTG schema.
• Programmers can focus on the functionality of the tasks:
  • the main effort goes to the more specific and critical sections of the application.

OS-level and Task-level APIs
• Users can easily reproduce optimizer solutions, thus:
  • Indirectly neglecting the optimizer's abstractions:
    • Task model
    • Communication model
    • OS overheads
  • Obtaining the needed application constraint satisfaction
• Programmers can allocate to the right hardware resources:
  • Tasks
  • Program data
  • Queues
• Scheduling support APIs
  • Frequency and voltage selection
• Communication issues
  • Shared queues
  • Semaphores
  • Interrupts

Example
• Number of nodes: 12
• Graph of activities
• Node type
  • Normal, Branch, Conditional, Terminator
• Node behaviour
  • Or, And, Fork, Branch
• Number of CPUs: 2
• Task allocation
• Task scheduling
• Arc priorities
• Frequency and voltage
[Figure: conditional task graph with 12 nodes (such as N1, B2, B3, C4 to C7, N8 to N11, T12), arcs a1 to a14, and fork, branch, or and and nodes; alongside it, a schedule of the nodes on the two resources over time, ending before the deadline]
Template excerpt (elisions as on the slide):
#define TASK_NUMBER 12
#define N_CPU 2
// Node type: 0 NORMAL; 1 BRANCH; 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};
// Node behaviour: 0 AND; 1 OR; 2 FORK; 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};
uint queue_consumer [..] [..] = { {0,1,1,0,..}, {0,0,0,1,1,.}, {0,0,0,0,0,1,1..}, {0,0,0,0,..}..};
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8}..};
