Predictable performances for embedded systems
Dagstuhl seminar, September 5th, 2007
Marc Duranton
Introducing NXP Semiconductors • Established: October 2006 (formerly a division of Philips: Philips Semiconductors) • “A start-up with more than 50 years of experience in semiconductors” • Provides engineers and designers with semiconductors and software that deliver better sensory experiences • Top-10 supplier with Net sales: € 4.96 billion ($ 6.58 billion) in 2006 • Sales: 35% Greater China, 31% Rest of Asia, 25% Europe, 9% North America • Headquarters: Eindhoven, The Netherlands • Business Units: • Mobile & Personal • Home • Automotive & Identification • Multimarket Semiconductors • Emerging Businesses; including an independent software business, NXP Software • Employees: Approximately 37,000 people in more than 20 countries • R&D: € 965.9 million in 2005 • More than 25 000 patents • More than 24 R&D centers world-wide
Embedded systems: our scope • Embedded systems are omnipresent • Mobile phones, PDAs, music/video players… • Car radio, DVD, navigation control, driving control, engine control… • TVs, DVDs, game stations, PVRs, … • And this is only the part related to vibrant media, not to control/command • Their complexity is always increasing • Specification at the limit of Moore's law feasibility • Multiple applications on the same device • More complex applications • Shorter time to market => more and more programmable devices
Mobile phones • More and more a "universal digital tool": • Phone (still) • Still/moving camera • MP3 player • Video player • GPS • Agenda/calendar • Email • … • Just under 1 billion mobile phones were made in 2006, making a chip market worth $41 billion • 11% of mobile phones were video capable in 2006 • Cell phones dwarfed PCs in shipments in 2005
Complexity evolution for de-interlacing • From: simple motion models, pel-accurate ME/MC, 1 frame and 1 field input, SD@60 fields/s, 8-bit YUV 4:2:0 • To: complex motion models, ¼-pel accurate ME/MC, 4 fields input, HD@60 fields/s, 10-bit YUV 4:2:2 • 2DGST-based de-interlacing: 4-field algorithm • Complexity of about 400 GOPS for HD • Halo Reduced Picture-rate Up-conversion: • Complexity of more than 1000 GOPS for HD • 7 GB/s bandwidth to external memory without a clever memory hierarchy • Reduced to 1 GB/s with a hierarchy of software-controlled memories
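A rough back-of-envelope estimate (my own illustration, not from the slide) of why the external-memory traffic lands in the GB/s range for HD de-interlacing:

\[
1920 \times 1080 \times 60\,\mathrm{Hz} \approx 124\ \mathrm{Mpixel/s}\ \text{(progressive output)}
\]
\[
\text{10-bit YUV 4:2:2} \approx 2\ \text{samples} \times 10\ \text{bits} = 2.5\ \mathrm{bytes/pixel} \;\Rightarrow\; \approx 310\ \mathrm{MB/s}\ \text{per full pass over the stream}
\]
\[
4\ \text{input fields} + \text{intermediate results read/written several times} \;\Rightarrow\; \text{several GB/s of traffic}
\]

Keeping the heavily reused fields and lines in on-chip, software-controlled memories is what brings the external traffic from about 7 GB/s down to about 1 GB/s.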
Typical CE device (block diagram) • Streaming is the most demanding part in processing power • Application processing (control): application software, middleware + OS; engines: ARM, MIPS, … • Modem processing (streaming): WLAN, PAN, cellular, broadcast, cable & satellite; engines: DSPs, vector units, HW • Media processing (streaming): AV encode/decode, AV post-processing, AV analysis, fixed-function HW; engines: DSPs, mediaprocessors • Interfaces (I/O): RF front end, AV interfaces, PC interfaces • Control: 80% of the software, 20% of the computation • Streaming: 20% of the software, 80% of the computation
Challenge for hardware: customer requirements for 2010+ (for embedded processor IP, with the ratio to a current general-purpose processor) • 500 Gops to 1 Tera operations/s: 10 to 100x • 100 mW to 5 W: 10 to 50x • 10 €: 10x • Predictable (SDR, audio/video, security, …): mostly not supported • 10 to 100 "cores" (VLIW, in order, scratchpads; heterogeneous, ASIP) vs. 1 to 10x (superscalar, predictors, OoO; homogeneous)
Embedded systems have used "multicore" in products for a long time • TM-2700 (1997): 2 heterogeneous VLIW cores • Viper-2: NXP set-top box and digital TV SoC • 0.13 µm (in 2002) • 50 M transistors • 250 MHz • 100 clock domains • > 60 IP blocks (including MBS, VMPG, TM3260 (x2), TDCS, VIP, MIPS3960, QVCP5L, MSP, MDCS, QVCP2L) • 250 RAMs • > 100 Gops • 4 W
Trends: the Basics (die photos: VIPER, TM2) • Moore's Law is driving us • ~58% yearly growth in the number of transistors
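The ~58% figure is simply the familiar "doubling every 18 months" reading of Moore's Law rewritten as an annual rate (my arithmetic, under that assumption):

\[
2^{12/18} = 2^{2/3} \approx 1.587 \;\Rightarrow\; \approx 58.7\%\ \text{growth per year}
\]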
Specific challenges of streaming systems • Streaming systems require sustained, guaranteed performance, not only best effort • E.g. 60 fps for video processing • "Zero stock" strategy = minimum buffer size • On average, producer rate = consumer rate (on-time processing, not best effort) • Only the "required" clock frequency, hence lower power • Predictability is important (e.g. automotive) • (Hard?) real-time performance • Missed event => missed data (e.g. SDR) • Synchronization (e.g. lipsync) => Major difference from PC and workstation requirements • Importance of "non-functional" requirements, e.g. time • In the past: mainly implemented with dedicated hardware => more and more programmable solutions (HW + SW)
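A minimal sketch (my illustration, not from the slides) of the difference between best-effort and on-time processing: a 60 fps loop that releases each frame on an absolute deadline, the way an RTOS or streaming framework would, here using POSIX clock_nanosleep. process_frame() is a hypothetical placeholder for the real kernel.

```c
/* Illustrative sketch only: a periodic 60 fps loop driven by absolute
 * deadlines ("on time") instead of running best-effort. */
#define _POSIX_C_SOURCE 200112L
#include <time.h>

#define PERIOD_NS (1000000000L / 60)   /* ~16.67 ms per frame at 60 fps */

static void process_frame(long n) { (void)n; /* placeholder for the real work */ }

int main(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (long frame = 0; ; ++frame) {
        process_frame(frame);              /* must fit within one period */

        /* advance the absolute deadline by exactly one period */
        next.tv_nsec += PERIOD_NS;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec  += 1;
        }
        /* sleep until the next release time; a missed deadline means
         * missed data, it cannot be "caught up" later */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
    }
    return 0;
}
```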
Example of software functions in a 2002 DTV (block diagram) • Audio: streaming audio in (I2S input), audio system (AC3 + Prologic + mixer), audio out to speakers & headphones • Video: two TS/656 inputs (TS input / video in 1 and 2), MPEG driver, ATSC TP demux, VBI, MPEG MP@HL video decode, Natural Motion, window manager, content OSD, HTML, NTSC CC, ATSC EPG/PSI, ATSC CC; HD video out to the TV monitor and SD video out to the VCR • Control: system control, I2C output, SSI set control, modem (POTS), user input (interrupt), RC5/6 remote control input, serial/parallel interfaces, PCI/XIO bus
Streaming system programming is not like computer programming • The notion of "time" is key in this field • "On time" more than "best effort" • And using all levels of parallelism: DLP (vectorization), ILP (VLIW), pipelines of functions, TLP • Most of the computer innovations of the 20th century rely on a best-effort model, or assume time is irrelevant: programming languages, caches, speculative execution, … (see Edward Lee's presentations on the topic, e.g. "The Future of Embedded Software") • Only the order of execution remains (sequential execution) • Neither "when" (10h18) nor "how long" (53 s) • Concurrency is not well supported either (reactive functions)
Space/time travel… • The hardware domain and its tools work in time, but the various software programming abstractions discard it • Hardware can work in: • Time: successions of operations (the typical "clock race") • Space: several operations in parallel => Performance = Space x Time • Current programming models don't express parallelism well • But hardware description languages do: Verilog, VHDL, Esterel, … • Hardware also has tools for timing closure, etc. => software (at all levels) should also support the notion of time • Today: an afterthought delegated to a real-time OS
Current programming flow • Application description: sequential "C/C++" code, reference code, or an algorithm in Matlab, … • Fine-grain parallelism and optimisation: mostly by hand • Custom ops, code rewriting, … • Coarse-grain partitioning, communication: threads, libraries (in NXP: YAPI, TSSA, TTL) • Pseudo Kahn process networks (fixed-size FIFOs, "select" operation), see the sketch below • Synchronisation, time requirements, global integration: RTOS • But how to guarantee performance in a complex system? • Validation by simulation and by trial and error • No longer possible due to the increase in complexity and the explosion of use cases • A lot of manual rewriting of applications • Tools and methodology should automate the process more
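A minimal sketch of what such a coarse-grain partitioning looks like in practice (my illustration in plain C with POSIX threads, not NXP's YAPI/TSSA/TTL APIs): two processes connected by a fixed-size FIFO with blocking reads and writes, the essence of a pseudo Kahn process network.

```c
/* Illustrative pseudo-KPN sketch: producer and consumer threads connected by
 * a fixed-size FIFO with blocking put/get (not the actual YAPI/TSSA/TTL API). */
#include <pthread.h>
#include <stdio.h>

#define FIFO_DEPTH 8            /* fixed FIFO size: must be dimensioned! */
#define N_TOKENS   32

typedef struct {
    int buf[FIFO_DEPTH];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} fifo_t;

static fifo_t ch = { .lock = PTHREAD_MUTEX_INITIALIZER,
                     .not_full = PTHREAD_COND_INITIALIZER,
                     .not_empty = PTHREAD_COND_INITIALIZER };

static void fifo_put(fifo_t *f, int v)      /* blocks while the FIFO is full */
{
    pthread_mutex_lock(&f->lock);
    while (f->count == FIFO_DEPTH) pthread_cond_wait(&f->not_full, &f->lock);
    f->buf[f->tail] = v; f->tail = (f->tail + 1) % FIFO_DEPTH; f->count++;
    pthread_cond_signal(&f->not_empty);
    pthread_mutex_unlock(&f->lock);
}

static int fifo_get(fifo_t *f)              /* blocks while the FIFO is empty */
{
    pthread_mutex_lock(&f->lock);
    while (f->count == 0) pthread_cond_wait(&f->not_empty, &f->lock);
    int v = f->buf[f->head]; f->head = (f->head + 1) % FIFO_DEPTH; f->count--;
    pthread_cond_signal(&f->not_full);
    pthread_mutex_unlock(&f->lock);
    return v;
}

static void *producer(void *arg)
{
    (void)arg;
    for (int i = 0; i < N_TOKENS; i++) fifo_put(&ch, i * i);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    long sum = 0;
    for (int i = 0; i < N_TOKENS; i++) sum += fifo_get(&ch);
    printf("sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

Blocking reads keep the behaviour deterministic (Kahn semantics), and choosing FIFO_DEPTH is exactly the buffer-dimensioning problem the following slides address.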
An attempted solution (2001): Hierarchical Process Networks (HPN) • Hierarchy/nesting of (Kahn-like) processes (the « H » in HPN) • Communication of structured data • Modelling of periodically resynchronised streams • Relaxed periodicity, or « burstiness » • Instantaneously: producer rate ≠ consumer rate • On average: producer rate = consumer rate • Key uses of HPNs • Application consistency checking • Detection of deadlocks, help in dimensioning buffers (see the sketch below) • Bandwidth and resource usage estimation • Prediction of performance • Model of LAGS (Locally Asynchronous, Globally Synchronous) • Devices are pseudo-asynchronous at low level (buffers) • Hierarchical synchronization at higher level • Static schedule • Predictable performance and behaviour
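One way to see the buffer-dimensioning question (my illustration, not the actual HPN tooling): if P(t) and C(t) are the cumulative numbers of tokens produced and consumed by tick t, the channel never underflows iff C(t) <= P(t) for all t, and a FIFO of depth max_t (P(t) - C(t)) is enough. A tiny sketch that checks this on a pair of hypothetical bursty schedules:

```c
/* Illustrative buffer-dimensioning sketch: given per-tick production and
 * consumption counts (equal on average, bursty instantaneously), compute
 * the minimum FIFO depth and detect underflow (a would-be deadlock).
 * Assumption: within a tick, production happens before consumption. */
#include <stdio.h>

int main(void)
{
    /* hypothetical token counts per tick: both sum to 12 over 8 ticks */
    int produced[8] = { 3, 0, 3, 0, 3, 0, 3, 0 };
    int consumed[8] = { 1, 2, 1, 2, 1, 2, 1, 2 };

    int backlog = 0, max_backlog = 0;
    for (int t = 0; t < 8; t++) {
        backlog += produced[t] - consumed[t];   /* tokens waiting in the FIFO */
        if (backlog < 0) {
            printf("underflow at tick %d: consumer would block/deadlock\n", t);
            return 1;
        }
        if (backlog > max_backlog) max_backlog = backlog;
    }
    printf("minimum FIFO depth: %d tokens\n", max_backlog);
    return 0;
}
```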
Example of Sally code: a producer-consumer pair, activated at 91 kHz (diagram: node Main contains prod -> buffer -> cons -> avg, with parameter N and a 91 kHz clock)

    node Prod (param int N, out pixel line[N]) {
      index i [N]
      line[i] = some_expression(N)
    }

    node Cons (param int N, in pixel line[N], out pixel average) {
      average = SUM (i[0..N-1], line[i]) / N
    }

    node Main (param int N) {
      pixel buffer[N]        // buffer between producer and consumer
      pixel avg              // average value
      clock hSync 91 kHz
      {
        VOID -> prod(N) -> buffer
        buffer -> cons(N) -> avg
      } every hSync          // new values produced every tick!
    }
Next step: the N-synchronous model and a relaxed clock calculus • Taking the benefits of both KPN and synchronous languages • Bridging the KPN and synchronous approaches • See our paper published in POPL 2006 (a sketch of the idea follows below)
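Roughly, and in my own simplified notation rather than the paper's: each stream carries a clock, an infinite binary word where a 1 means "a value is present at this tick". Strict synchrony requires equal clocks; the relaxed calculus accepts a producer clock w and a consumer clock w' whenever every value is produced no later than it is consumed and the backlog stays bounded, which is exactly what a finite FIFO can absorb:

\[
w = (1\,1\,0\,0)^{\omega}, \qquad w' = (1\,0\,1\,0)^{\omega}
\]
\[
w <: w' \;\iff\; \forall t:\ O_w(t) \ge O_{w'}(t) \quad\text{and}\quad \sup_t \big( O_w(t) - O_{w'}(t) \big) < \infty
\]

where $O_w(t)$ counts the 1s of $w$ up to tick $t$. The bound on the difference gives the size of the buffer to insert between the two processes; in this example it is 1, so a one-place FIFO lets the $(1100)$ producer feed the $(1010)$ consumer even though they are not strictly synchronous.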
Current research: an event/time-stamps model (more details in Albert Cohen's talk!) Main goals: • Determinism (as predictable as possible) • Deadlock freedom • Composition, modularity • Resource analysis: • Boundedness, buffer size inference • Delay and bandwidth inference • Real-time constraints and support for (time) variability • Express middle-end optimizations • Intraprocedural: loop/scalar transformations, undisturbed by concurrency constructs • Interprocedural: inlining, specialization • Interprocess: data-flow analysis/optimization across communications • Concurrency: fusion, partitioning (data parallelism), pipelining (control parallelism) • Target for multiple high-level programming models • Well-behaved C with semantically neutral hints/annotations • Synchronous languages • Relaxed synchronous high-level languages • ACOTES SPM (OpenMP extensions)
Related work • Kahn process networks (G. Kahn, 1974) • COMPAAN (B. Kienhuis et al., Leiden/Berkeley) • Unbounded FIFO buffers, blocking reads • Synchronous version (Caspi/Pouzet, LIP6, 1996), but no real time • Data flow graphs • Ptolemy (E. A. Lee et al., Berkeley, 90s) • Many practical applications, but strictly asynchronous • Petri nets • Diverse implementations • Hierarchical, colored • Timed Petri nets (M. Diaz, LAAS Toulouse, 80s/90s) • Synchronous languages • Esterel, Lustre (SCADE), Lucid Synchrone, … • Etc.
HW/SW approach for TLP extraction on tiled HW (results for 16 tiles with macroblock-level parallelism: ~15x speedup on compute cycles, 10 to 12x on all cycles, 16x ideal) • Approach 1: HW and compiler • Evaluated on 30+ optimized multimedia benchmarks (MediaBench) • ZERO additional programming complexity relative to programming a uniprocessor • Discovered high speedups for regular kernels/applications, but low performance potential for H.264 • Approach 2: programmer intervention • Enhanced for H.264 Super HD resolution (2K x 4K @ 30p) • Two weeks of effort to parallelize the reference code • Validated and evaluated on a simulator
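The slides do not detail how the macroblock-level parallelism is exposed; a common way to do it for H.264 (shown here only to illustrate the idea, not as the approach actually used on the tiled HW) is the 2D wavefront: macroblock (x, y) can start once its left neighbour (x-1, y) and its upper-right neighbour (x+1, y-1) are done, so all macroblocks on the same "knight's move" diagonal are independent.

```c
/* Illustrative 2D-wave sketch for H.264 macroblock-level parallelism
 * (one common technique; not necessarily the one used on the tiled HW).
 * With left and top/top-right dependencies, all macroblocks with the same
 * d = x + 2*y are independent and could be dispatched to different tiles. */
#include <stdio.h>

#define MB_W 8   /* macroblocks per row (toy size) */
#define MB_H 6   /* macroblock rows (toy size)     */

static void decode_mb(int x, int y) {      /* placeholder for the real work */
    printf("MB(%d,%d) ", x, y);
}

int main(void)
{
    /* diagonals d = x + 2*y range from 0 to (MB_W-1) + 2*(MB_H-1) */
    for (int d = 0; d <= (MB_W - 1) + 2 * (MB_H - 1); d++) {
        /* every macroblock visited in this inner loop is independent of the
         * others, so each one could run on a different tile */
        for (int y = 0; y < MB_H; y++) {
            int x = d - 2 * y;
            if (x >= 0 && x < MB_W)
                decode_mb(x, y);
        }
        printf("| barrier\n");             /* wait for the wavefront */
    }
    return 0;
}
```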
Take-home message: challenges for streaming • Express operations in space and not only in time, but also introduce the notion of time into the programming paradigm! • Extract (regular?) parallelism from irregular applications • Parallelizing compilers… the Holy Grail! • Good results from autovectorization • Don't make programmers explicitly write parallel programs • But provide a suitable input representation so they can express their knowledge of the application • And let the programmer and the compiler interact • Approach of the ACOTES project with IBM, ST, INRIA, UPC, (Nokia) • New application description languages • Domain specific? Link to hardware modeling languages? • Graphical programming (1 drawing = 1000 words!) • A mindset to create in programmers/developers • Don't teach only C/C++!
Post-PC era: ambient intelligence • A large number of "invisible" processors (SoC, RFID, PAN, …) cooperating and interacting with the "real" world • From Hugo de Man [De Man]
"Link" between computers and embedded systems… • "I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone." • Prof. Bjarne Stroustrup (creator of C++)
Questions?
Acknowledgements: with thanks to Theo A.C.M. Claasen, and to all my colleagues and others from whom I have "borrowed" slides…