General Presentation on IMEC’s Thematic Design Activities
Ivo Bolsens, Hugo De Man + 125 researchers
bolsens@imec.be
IMEC organization
• CEO: Gilbert Declerck
• Divisions
  • DESICS: design technology (Ivo Bolsens)
  • SPT: process technology (Luc van den Hove)
  • STDI: silicon techn. & device integration (Herman Maes)
  • MCP: microsystems & packaging (Robert Mertens)
  • INVOMEC: training (Etienne Bourdeaud’hui)
Mission • Design of • Architectures, Methods and Tools • for the Implementation of • Multimedia • Internet Terminals
How
• Study requirements of embedded IT systems
• Identify and solve RELEVANT design challenges
  • build application demonstrators
• Work out systematic design methods and supporting tools
  • build tools for real-life design support
• Develop re-usable, parameterized, white-box IP
• Train and educate
  • industry
  • university
Measures of success
• scientific impact
• cooperation with universities in complementary fields
• international network of cooperation with most of the important industrial performers in our field
• portfolio of protected intellectual property
• transfer of technologies to existing companies
• creation of new spin-off companies
• attracting foreign investments in the field of microelectronics and ICT
• turnover of well-trained researchers to industry
Bridge the gap between systems and silicon
• Systems Heaven: JAVA, CORBA, JINI
• 5 million lines of VHDL
• 0.1 µm = 1/300 of a hair
• 200 M transistors
• Power/cm**2 ~ V**3/l**3
• T_intercon ~ r * l**2/l**2
• Physics Hell
155Mb/s WLAN 5GHz 10Gop/s <1 Watt Reconfigurable Access Terminal Intelligent Home W W W MPEG 4 >100 Gop/s 5 Gtr/s 5 Watt
DESICS organization
• DIMA: design of integrated multimedia applications
• MICS: multimedia image compression systems
• EMSYS: embedded systems design
• SEMP: system exploration for memory & power
• DISTA: design of integrated systems for telecom applications
• MIRA: mixed signal & RF applications
• WISE: wireless systems
• DBATE: digital broadband terminals
Configurable Home Terminal Head End Homegateway router storage basestation IPv6 HFC IP Home network ServiceServer InternethomeAppliance modem CSL User premises
RPC Middleware Middleware packets OS OS signals Hardware Hardware 250 W,700 MHz,128 MByte 2 W,30 MHz,256 Kbyte PLUG&PLAY VCR Embedded connectivity Distributed Application
Challenges Reconfigurable Software Agent Agent Agent VM TCP/IP RTOS user data Digital CTL MON software agents Front End Tx/Rx DSP CFIL/CB To I/O parameters & synchro Hardware Analog
Challenges in Dynamic Reconfiguration
• Run-time FPGA management
  • dynamic creation and deletion of HW processes
  • dynamic creation of the related HW/SW interfaces
  • dynamic extension of the instruction set
  • downloading of FPGA configuration for additional instructions
• Fast HW compilation
• Novel FPGA architectures optimized for partial run-time configuration
• Performance estimation (dynamic, configuration time)
Networked re-configurable computing Application Layer (Java applet +FPGA bitstream) FPGA Middleware Layer FPGA API FPGA Controller Real-TimeOperating System Java Native Interface Virtual Bus Hardware Platform Native Device Driver Local Bus Software Hardware
The first demonstrator: FPGA based NetCam
[Diagram labels: Internet client (Netscape), HTTP request, image, 10 Base T network, TCP/IP layers, IP layers, FPGA, GIF engine, reconfiguration, Ibis CMOS sensor, ATMEL EEPROM]
• first FPGA-based “thin-server” Internet appliance (vs. dedicated, Linux or uC based)
• low power (FPGA ~ 0.7 W)
• throughput scales up to 80 Mb/s
World’s first 80 MB/sec WLAN technology
[Diagram labels: base station, 155 Mb/s, multi-user, rx antenna diversity, wired backbone, multi-path fading]
• Orthogonal Frequency Division Multiplexing (OFDM)
• Turbo-coding
• Spatial Division Multiple Access (SDMA)
• Hiperlan-2 / IEEE 802.11 compatible
Single-package transceiver
• CMOS: IF and digital circuitry
• BiCMOS: RF circuitry
• MEMS: switches, varactor, resonators
• MCM: interconnect, inductors, capacitors, resistors, filters, baluns
• Antenna
Multimedia : MPEG-4 member SCtee - Diversity : 3D, Facial and Body Animation, Video - Scalability : time, space, SNR - Interactivity : behaviour = f (input bits, user)
Focus
• Graceful degradation, QoS
• Encode once / decode everywhere
• Reduces the terminal cost (“soft” conformance with pathological cases)
• Man-machine interface: facial animation
• Real-time SOFTWARE video coding of CIF images
• Application-specific processor for wavelet coding
Demo
Challenges: Multimedia (MPEG-4, JPEG2000)
• Today's implementations: small images (QCIF: 176x144); decoding only; not real-time; several W
• Wanted features: large images (TV); encoding and decoding; real-time; 100 mW (mobile)
• Software specification: more than 200 000 lines of C; hundreds of files; written by approx. 80 teams
• Huge requirements: > 2 GOP/s; > 6 GB/s; > 10 MB storage
• Several orders of magnitude in performance and power dissipation need to be gained
• Drastic reduction of design complexity required
World’s first MPEG-4 compliant silicon Max 30 fps CIF (352x288) Scalable architecture
Algorithms + Data Structures Architecture ARM RAM ROM IP1 IP2 Processor architecture ROM custom logic micro processor ROM MMU RAM DSP RAM C/C++ system refinement + exploration Data mngnt Concurrency mngnt Platform constraint Platform integration
Deeply embedded system
• µP core
• Dedicated logic
• accelerator synthesis
• multi-DSP core
• retargetable ASIP compiler
• Memory/MMU
• Interfaces
• system integration
• Analog
[Diagram labels: interfaces, dedicated logic, phone book, keypad intfc, phonebook RAM & ROM, DMA, S/P, control, protocol, Frontier, Coware, Target, demod and sync, Viterbi equal., voice recognition, speech quality enhancement, A/D, de-intl & decoder, RPE-LTP speech decoder, digital down conv, multi-DSP core]
All of this fits in one cheap package
SoC++: Deeply embedded system
• µP core
• system layer compiler
• Dedicated logic
• multi-DSP core
• memory/MMU
• dynamic + static mem mgmt + addr expr.
• Interfaces
• Analog
• A/D + RF
[Diagram labels: µP core, memory/MMU, system protocol, data, phone book, keypad intfc, phonebook RAM & ROM, DMA, S/P, control, protocol, demod and sync, Viterbi equal., voice recognition, speech quality enhancement, mixed signal, A/D, de-intl & decoder, RPE-LTP speech decoder, digital down conv, analog]
All of this fits in one cheap package
Current challenges and solutions
• System specification and system-level refinement with exploration support (algorithm design level, concurrent task level, system timing simulation)
• Data transfer and storage exploration for massive real-time data manipulation (dynamic memory mgmt, static transfer and storage, address generation)
• Co-design for heterogeneous implementation paradigms (refinement from unified HW/SW model, RTOS modeling, complete system simulation)
• RF front-end exploration (fast mixed-signal co-simulation, chip-package co-design, noise coupling)
Tools named on the slide: Tipsy, Matisse-TCM, Matisse-DMM, Atomium/Acropolis, Adopt, Ocapi-2, SoCos, Fast
SoC or …---… (S.O.S.)
[Chart: chip complexity in 10^6 transistors per chip (axis 50 to 250) versus year, 1995 to 2007; source: SIA roadmap]
• Design productivity gap grows!
• Complexity increase: 40% per year
• Design productivity increase: 15% per year
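The two growth rates on this slide can be turned into a rough compound-growth estimate of the gap. The sketch below is only an illustration: it assumes both rates stay constant and normalises the gap to 1.0 in the starting year. The function names are hypothetical and not part of any IMEC tool.

```cpp
#include <cassert>
#include <cmath>

// Ratio of chip complexity to design capability after `years`,
// normalised to 1.0 at year 0. Rates are fractions per year
// (0.40 and 0.15 are the slide's figures; constant rates are an assumption).
double productivity_gap(double complexity_growth, double productivity_growth, int years) {
    assert(years >= 0);
    return std::pow((1.0 + complexity_growth) / (1.0 + productivity_growth), years);
}

// First whole year in which the gap reaches `factor` (2.0 means "doubled").
int years_until_gap(double complexity_growth, double productivity_growth, double factor) {
    assert(complexity_growth > productivity_growth && factor >= 1.0);
    int years = 0;
    while (productivity_gap(complexity_growth, productivity_growth, years) < factor) {
        ++years;
    }
    return years;
}
```

Under these assumptions the gap widens by about 22% per year (1.40 / 1.15), so `years_until_gap(0.40, 0.15, 2.0)` returns 4.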
System-level design
• Solution
• Paradigm shift
• Higher abstraction level
• Executable specs
• Object-oriented design
• Multi-paradigm modeling
• Behavioral IP re-use
• Incremental refinement to RT-HDL (HW) and C/C++ (SW)
E.g.: System design issues in IT-application domain (embedded system)
• Application examples
  • Network layer protocols (ATM, IP, …)
  • Multi-media algorithms with dynamic character (MPEG4, MPEG7)
  • Wireless and wired terminals (Internet, WLAN, ADSL, …)
• Processes
  • Dynamic and concurrent processes
  • Global/local control
  • Little arithmetic/logic processing
• Complex data sets
  • Large and irregular dynamically allocated data
  • Huge memory accesses
• Stringent real-time constraints
[Diagram labels: IN 622 Mb/s, OUT 622 Mb/s, ISR, time, data in, data out, packet record, FIFO, routing record, routing reply, 200 accesses, 53 cycles]
Concurrent OO spec Task2 Task1 Task concurrency mgmt SW/HW co-design Task3 Memory organ. Unified model HW DSP uProc Partition Refine/compile HW-Ctrl uCtrl System control SW design flow HW design flow Transform Task schedule Allocate/assign Global concurrency management design flow for dynamic concurrent tasks with data-dominated behaviour Dynamic memory mgmt Physical memory mgmt Address optimization
1 2 3 Task conc. Extraction/trafo Task/thread scheduling 1 2 2 Proc Array-processor allocation Task to processor assignment Virtual Virtual Inter-task interface refinement Proc1 Proc2 TCM steps aim at removing the bottlenecks for better performance Optimized system specification Task1 Task2 Inter-task DTSE Task concurrency mngnt Task3 Task-level system architecture
The gray box approach focuses on the most relevant TCM issues High Level Specification Black-box TCG 1% Improved Gray-box <10% task concurrency extraction & improvement Initial gray-box TCG 10% Reduce complexity Create freedom Initial TCG 50% Simplify the model White-box TCG 100% C++ Specification
concurrency extraction/ improvement Task level DTSE static scheduling (partial ordering) dynamic scheduling grey-box model specification Task Level DTSE and TCM
Results on IM1 player Cost x x Time-Budget (MA cycle budget)
ARM Processor 2 ARM Processor 1 The 2-processor approach (scheduling + assignment) Task1 Task2 Taskn Vdd=1V Vdd=3.3V
Comparison of scheduling the original and transformed graphs original Transformed
Combination of static and dynamic scheduler
[Diagram: thread frames A and B containing nodes 1, 2, 3; static scheduling inside each frame, dynamic scheduling between frames]
• Static scheduling: done at compile time, exploring all optimization possibilities
• Dynamic scheduling: done at run time, providing flexibility and dynamic control at low cost
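The split between the two schedulers can be sketched as follows. This is a minimal illustration, not the TCM implementation. It assumes that each thread frame carries a node order fixed at compile time, and that the run-time scheduler simply picks the ready frame with the earliest deadline. The `ThreadFrame` type, its `deadline` field and the earliest-deadline policy are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A thread frame whose internal node order was fixed at compile time
// (the static schedule). The deadline field is a hypothetical priority.
struct ThreadFrame {
    std::string name;
    std::vector<std::string> static_order;  // pre-computed node order
    int deadline;                           // smaller means more urgent
};

// Dynamic scheduler: at run time, repeatedly pick the ready frame with the
// earliest deadline and replay its static schedule unchanged.
// Returns the resulting trace of executed nodes.
std::vector<std::string> run(std::vector<ThreadFrame> ready) {
    std::vector<std::string> trace;
    while (!ready.empty()) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < ready.size(); ++i) {
            if (ready[i].deadline < ready[best].deadline) best = i;
        }
        assert(!ready[best].static_order.empty());  // a frame must contain work
        for (const std::string& node : ready[best].static_order) {
            trace.push_back(node);
        }
        ready.erase(ready.begin() + static_cast<std::ptrdiff_t>(best));
    }
    return trace;
}
```

The run-time decision is a single linear scan over the ready frames, which is one way to keep the dynamic part cheap. All ordering work inside a frame has already been done offline.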
Dynamic scheduling result
[Chart: total energy versus node number in timer threads; series: two processors (Vlow = 1 V, Vhigh = 5 V) and one processor (V = 5 V); data labels: 20%, 24%, 32%, 32%, 39%]
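Why a second, low-voltage processor can reduce total energy can be illustrated with a back-of-the-envelope model. The sketch assumes the usual first-order CMOS approximation that dynamic energy per cycle scales with Vdd squared, and that both processors switch the same capacitance per cycle. Neither assumption is stated on the slide, and the function is hypothetical rather than the cost model behind the plotted results.

```cpp
#include <cassert>

// Energy of a two-supply split relative to running every cycle at v_high.
// low_fraction is the share of cycles moved to the low-voltage processor.
// Assumes energy per cycle ~ Vdd^2 with equal switched capacitance.
double relative_energy(double low_fraction, double v_low, double v_high) {
    assert(low_fraction >= 0.0 && low_fraction <= 1.0);
    assert(v_low > 0.0 && v_high >= v_low);
    const double ratio = v_low / v_high;
    return low_fraction * ratio * ratio + (1.0 - low_fraction);
}
```

With the slide's supplies, `relative_energy(0.25, 1.0, 5.0)` evaluates to about 0.76. The model ignores that a processor at lower Vdd also runs slower, which is why the time budget limits how many nodes the scheduler can move to the low-voltage processor.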
System requirements Abstract functionality Real-time constraints Target platform constr. Implementation Final hardware Appl. software OS services optimized for application SoC refinement and exploration REQUIREMENT SoC appl. + timing Application implementation (HW/SW) Memory mgmt constr. Process mgmt constr. REAL Memory mgmt impl. (HW/SW) Process mgmt impl. (HW/SW) Final platform (Silicon) Target platform
Refinement and exploration
• Memory mgmt
  • Dynamic memory: alloc / free (C), new / delete (C++); abstract data type refinement; virtual memory mgmt
  • Static memory: platform-independent code transformations; real-time cost-optimal physical memory organisation
  • Address optimisation
• Process mgmt
  • Task-level concurrency mgmt: (platform-indep.) transformations; static/dynamic scheduling; resource allocation
  • Instruction-level concurrency mgmt: refinement from unified HW/SW model; RTOS modeling/simulation including timing; traditional HW/SW co-design and compilers
Virtual prototype Soft implementation using host OS and host hardware Implementation Target hardware OS services optimized for application Refinement - OCAPI / MATISSE SoC appl. + arch. Application implementation (HW/SW) VIRTUAL Memory mgmt Process mgmt REAL Memory mgmt impl. (HW/SW) Process mgmt impl. (HW/SW) OSAPI Target HW (Silicon) Host HW (HP/PC)
Unified Modeling and Refinement of HW and SW
OCAPI-xl C++ class library
• Flexible primitives express the high-level system model
  • Concurrency
  • Communication
  • Interface design/reuse
• Unified HW/SW model
• Built-in code generators create the refined model
  • VHDL/Verilog/C
  • Testbenches
Code Generation HDL/SystemC C System Link & Interface Synthesis SoC++ design flow C++ System Model C++ HW SW OSAPI FSMD
Tipsy Ocapi Code Generation HDL/SystemC C Coware System Link & Interface Synthesis System Model C++ HW SW OSAPI FSMD
Binary Tree (BT) key data Concurrent OO spec Free Blocks key key data data Sub-pool per size Task concurrency mgmt SW/HW co-design processor Abstract Memory mem mem mem Data Allocation Types Assignment ASU ASU controller Virtual SW design flow HW design flow Memory Memory Mgmt Mgmt Unit Global data management design flow for dynamic concurrent tasks with data-dominated behaviour Dynamic memory mgmt Physical memory mgmt Address optimization
Data Management Flow
• Dynamic memory mgmt
  • Abstract data type (ADT) refinement: ADT to concrete data types
  • Virtual memory mgmt (VMM) refinement: to virtual memory segments
• Physical memory mgmt
  • Physical memory mgmt (PMM) refinement: to physical memories
Matisse: ADT refinement
    ATM_cell * Data_In;
    Association_Table * Routing_Table;
    Routing_Table = new Association_Table();
    Data_In = new ATM_cell();
    if ( Routing_Table->Lookup(Data_In) ) ...
Impl. alternatives
• Array (AR): Array * Routing_Table; Array ();
• Linked List (LL): Linked_List * Routing_Table; Linked_List ();
• Binary Tree (BT): Binary_Tree * Routing_Table; Binary_Tree ();
[Diagram: key/data record layouts for the three alternatives, each evaluated with a power function and an area function]
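The idea behind ADT refinement, writing the application once against `Association_Table` and swapping the concrete data type underneath, can be sketched in plain C++. This is not Matisse code. The class layouts, the `Insert` method, the integer key in `ATM_cell` and the access counter (used here as a crude stand-in for a power cost function) are all assumptions for illustration. Only two of the three alternatives are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <memory>
#include <utility>

// Hypothetical stand-in for the slide's ATM_cell: only the lookup key is modelled.
struct ATM_cell { int key; };

// The abstract data type: application code only sees Insert and Lookup.
class Association_Table {
public:
    virtual ~Association_Table() = default;
    virtual void Insert(int key, int data) = 0;
    virtual bool Lookup(const ATM_cell* cell) = 0;
    // Records touched by lookups so far (assumed proxy for memory power).
    std::size_t accesses() const { return accesses_; }
protected:
    std::size_t accesses_ = 0;
};

// Alternative 1: linked list with linear search.
class Linked_List : public Association_Table {
public:
    void Insert(int key, int data) override { records_.emplace_back(key, data); }
    bool Lookup(const ATM_cell* cell) override {
        assert(cell != nullptr);
        for (const auto& record : records_) {
            ++accesses_;
            if (record.first == cell->key) return true;
        }
        return false;
    }
private:
    std::list<std::pair<int, int>> records_;
};

// Alternative 2: binary tree, here an unbalanced binary search tree.
class Binary_Tree : public Association_Table {
    struct Node {
        int key;
        int data;
        std::unique_ptr<Node> left;
        std::unique_ptr<Node> right;
    };
public:
    void Insert(int key, int data) override {
        std::unique_ptr<Node>* slot = &root_;
        while (*slot) slot = key < (*slot)->key ? &(*slot)->left : &(*slot)->right;
        slot->reset(new Node{key, data, nullptr, nullptr});
    }
    bool Lookup(const ATM_cell* cell) override {
        assert(cell != nullptr);
        const Node* node = root_.get();
        while (node) {
            ++accesses_;
            if (cell->key == node->key) return true;
            node = cell->key < node->key ? node->left.get() : node->right.get();
        }
        return false;
    }
private:
    std::unique_ptr<Node> root_;
};

// Application code in the style of the slide, written once against the ADT.
bool route(Association_Table* Routing_Table, const ATM_cell* Data_In) {
    return Routing_Table->Lookup(Data_In);
}
```

Selecting an implementation then only changes the allocation, for example `Association_Table* Routing_Table = new Binary_Tree();`. Comparing `accesses()` after the same lookup sequence gives a first impression of how the alternatives differ in cost.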
ADT refinement results
• Select best DT impl. for each ADT
[Chart: power cost function (log scale, 10^0 to 10^4) for 32 different data type combinations; labelled combinations include LL(A), LL(B), PA(A), PA(B), BT(A), AR(B)]
VM size for ATM MUX in network
[Diagram: assignment of data types PA(9), PA(5) and AR(4) to virtual memory segments (VMS) of size 32 and 256, network 1]
• 1 VMS: Size = 133 mm2, Power = 110 mW
• 2 VMS: Size = 137 mm2, Power = 68 mW
• 2 VMS: Size = 137 mm2, Power = 49 mW
• 3 VMS: Size = 137 mm2, Power = 37 mW
DRAM memory = CPU performance bottleneck
[Chart: performance (log scale, 1 to 1000) versus time, 1980 to 2000; source: Patterson]
• µProc (CPU): 60%/year (“Moore’s Law”)
• DRAM: 7%/year
• Processor-memory performance gap grows 50%/year
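The 50%/year figure follows from the two growth rates on the slide. As a worked check, assuming both rates stay constant over the charted period:

```latex
\[
\frac{1 + 0.60}{1 + 0.07} \approx 1.495
\quad\Rightarrow\quad \text{gap growth} \approx 50\%\ \text{per year}
\]
\[
\text{gap after } n \text{ years} \approx 1.495^{\,n},
\qquad 1.495^{20} \approx 3 \times 10^{3} \quad (1980 \text{ to } 2000)
\]
```

Under this constant-rate assumption the gap opens by roughly three orders of magnitude over the twenty years shown.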
Data-transfer and data-storage bottlenecks: SDRAM access
[Diagram labels: client, main memory, local latch, local select, bank 1 … bank N, 128 - 1024 bit bus, cache and bank comb., global bank select/control, ctrl, data, addr]
• Wide word
• Burst mode
Data-transfer and data-storage bottlenecks: cache misses
[Diagram labels: client, main memory, data-paths, regf, 16 kB N-port SRAM, 1 MB 1/2-port SRAM, 256 MB (S)DRAM, L1 cache, L2 cache, processors]
• Many cache misses
• Page loading