Low Power System Level Design Methodologies Young-Chul Kim Chonnam National Univ. Dept. of ECE, IT SoC Lab. http://soc.chonnam.ac.kr/~yckim
Contents • Introduction to System Level Design • Hardware and Software Co-design • Re-configurable Processors • Other Low Power System Level Designs
Introduction to SOC • SOC will bridge the gap between software and its implementation in novel, energy-efficient silicon architectures. • In SOC design, chips are assembled at the level of reusable IP blocks and IP interfaces rather than at the gate level. • SOC specifications come from ICT system engineers rather than from RTL descriptions.
Common Fabric for IP Blocks • Soft IP blocks are portable, but not as predictable as hard IP. • Hard IP blocks are very predictable since a specific physical implementation can be characterized, but they are hard to port since they are often tied to a specific process. • A common fabric is required for both portability and predictability. • Widely available example: Cell Based Array, a metal-programmable architecture that provides the performance of a standard cell and is optimized for synthesis.
Four main applications • Set-top box: mobile multimedia system, base station for the home local-area network • Digital PCTV: concurrent use of TV, 3D graphics, and Internet services • Set-top box LAN service: wireless home networks, multi-user wireless LAN • Navigation system: steering and controlling traffic and/or goods transportation
Silicon in 2010 • Die area: 2.5 x 2.5 cm • Voltage: 0.6 V • Technology: 0.07 µm
Why Lower Power • Portable systems: long battery life, light weight, small form factor • IC priority list: power dissipation, cost, performance • Technology direction: reduced voltage/power designs based on mature high performance IC technology, high integration to optimize size, cost, power, and speed
Microprocessor Power Dissipation • [Figure: power dissipation (W, 0 to 50) versus year (1980 to 2000) for i286, i386 DX 16, i486 DX25, i486 DX 50, i486 DX2 66, i486 DX4 100, P5 66, P6 166, P II 300, P III 500, P-PC601 50, P-PC604 133, P-PC750 400, Alpha 21064 200, Alpha 21164, and Alpha 21264]
Power-hungry Applications • Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management • Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders
New Computing Platforms • SOC power efficiency of more than 10 GOPS/W • Higher on-chip system integration: COTS: 100 W, SOAC: 10 W (fewer inter-chip capacitive loads and I/O buffers) • Speed and performance: shorter interconnections, fewer drivers, faster devices, more efficient processing architectures • Mixed-signal systems • Reuse of IP blocks • Multiprocessor, configurable computing • Domain-specific, combined memory-logic
Physical gap • Timing closure problem: layout-driven logic and RT-level synthesis • Energy efficiency requires locality of computation and storage: a good match for stream-based data processing of speech, images, and multimedia-system packets. • Next generation SOC designers must bridge the architectural gap between system specification and energy-efficient IP-based architectures, while CAE vendors and IP providers will bridge the physical gap.
Three Factors Affecting Energy • Reducing waste by hardware simplification: redundant hardware extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations, distributed processing • All-in-one approach (SOC): I/O pin and buffer reduction • Voltage-reducible hardware: 2-D pipelining (systolic arrays); SIMD parallel processing, useful for data with parallel structure; VLIW, a flexible approach
Example 2: IBM's PowerPC Low Power Architecture • Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution • The 603e executes five instructions in parallel (IU, FPU, BPU, LSU, SRU) • The FPU is pipelined so a multiply-add instruction can be issued every clock cycle • Low power 3.3-volt design • Use fewer complex instructions and a smaller instruction length: IBM's PowerPC 603e is RISC • Superscalar: CPI < 1 • The 603e issues as many as three instructions per cycle • Low power management • The 603e provides four software-controllable power-saving modes. • Copper processor with SOI • IBM's Blue Logic ASIC: new design reduces power by a factor of 10
Power-Down Techniques • Lowering the voltage along with the clock actually alters the energy-per-operation of the microprocessor, reducing the energy required to perform a fixed amount of work
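A minimal sketch of why this holds, assuming the usual first-order CMOS switching model P = C·V²·f. The capacitance, cycle count, and operating points below are illustrative assumptions, not figures from the slides. Halving only the clock halves power but doubles run time, so energy for the task is unchanged; halving the voltage as well cuts the energy to a quarter.

```python
def dynamic_power(c_eff, vdd, freq):
    """First-order CMOS switching power P = C * V^2 * f, in watts."""
    return c_eff * vdd ** 2 * freq


def energy_for_task(c_eff, vdd, freq, cycles):
    """Energy = power * time, where time = cycles / freq."""
    return dynamic_power(c_eff, vdd, freq) * (cycles / freq)


# Assumed values, chosen only for illustration.
C_EFF = 1e-9        # effective switched capacitance (farads)
CYCLES = 1_000_000  # fixed amount of work

full = energy_for_task(C_EFF, 3.3, 100e6, CYCLES)
slow_clock = energy_for_task(C_EFF, 3.3, 50e6, CYCLES)         # clock halved only
slow_clock_low_v = energy_for_task(C_EFF, 1.65, 50e6, CYCLES)  # clock and voltage halved

print(f"full speed:          {full:.6e} J")
print(f"half clock:          {slow_clock:.6e} J")
print(f"half clock, half V:  {slow_clock_low_v:.6e} J")
```

Because frequency cancels out of the energy expression, only the supply voltage changes the energy per operation in this model.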
Voltage vs Delay • Use variable voltage scaling or scheduling for real-time processing • Use architecture optimization to compensate for slower operation, e.g., parallel processing and pipelining to increase concurrency and shorten the critical path.
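The trade-off can be sketched numerically, assuming the common first-order delay model t ∝ V / (V − Vt)². The threshold voltage of 0.7 V, the 3.3 V nominal supply, and the neglect of any overhead from duplicated hardware are assumptions made for illustration. With N parallel units, each unit may run N times slower at the same throughput, so the supply can drop until the gate delay has grown by a factor of N.

```python
def gate_delay(vdd, vt=0.7, k=1.0):
    """First-order gate delay model: delay is proportional to V / (V - Vt)^2."""
    return k * vdd / (vdd - vt) ** 2


def voltage_for_slowdown(slowdown, v_nom=3.3, vt=0.7):
    """Lowest supply voltage whose delay is `slowdown` times the nominal delay.

    Delay decreases monotonically with voltage above Vt, so bisection works.
    """
    target = slowdown * gate_delay(v_nom, vt)
    lo, hi = vt + 1e-6, v_nom
    for _ in range(100):
        mid = (lo + hi) / 2
        if gate_delay(mid, vt) > target:
            lo = mid  # too slow at this voltage: move up
        else:
            hi = mid  # fast enough: try a lower voltage
    return hi


def relative_power(n_units, v_nom=3.3, vt=0.7):
    """Power relative to one unit at v_nom, at equal throughput.

    N units each clocked at f/N switch N * C * V^2 * (f/N) = C * V^2 * f,
    so the ratio reduces to (V / v_nom)^2 in this overhead-free model.
    """
    v = voltage_for_slowdown(n_units, v_nom, vt)
    return (v / v_nom) ** 2


for n in (1, 2, 4):
    print(n, round(voltage_for_slowdown(n), 3), round(relative_power(n), 3))
```

In practice the extra routing and multiplexing of the parallel datapath adds capacitance, so the real saving is smaller than this idealized ratio.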
Why Copper Processor? • Motivation: aluminum increasingly resists the flow of electricity as wires are made thinner and narrower. • Performance: 40% speed-up • Cost: 30% less expensive • Power: draws less power from batteries • Chip size: 60% smaller than an aluminum chip
Silicon-on-Insulator • How does SOI reduce capacitance? • Junction capacitance is eliminated because an insulator layer (similar to glass) is placed between the impurity regions and the silicon substrate • High performance, low power, low soft-error rate
SOC Co-Design Challenges • Current systems are complex and heterogeneous: they contain many different types of components • Half of the chip can be filled with 200 low-power, RISC-like processors (ASIP) interconnected by field-programmable buses and embedded in 20 Mbytes of distributed DRAM and flash memory; the other half is ASIC • Computational power will not result from multi-GHz clocking but from parallelism, with clock rates below 200 MHz. This will greatly simplify the design for correct timing, testability, and signal integrity.
Configurability • One million gates of reconfigurable logic, one million gates of hardwired logic • 50 GIPS for programmable components or 500 GIPS for dedicated hardware • Reduces design risk, for which NRE costs will become dominant • 1 V supply with power in the watt range
Bridging the architectural gap • Product reliability: design at a level far above the RT level, with reuse factors in excess of 100 • Trade-off: 100 MOPS/W (microprocessor) versus 100 GOPS/W (hardwired) • Reconfigurable computing with a large number of computing nodes and a very restricted instruction set (Pleiades)
Three Co-Design Approaches • Source: N.S. Voros et al., “Hardware-software co-design of embedded systems using multiple formalisms for application development”, IFIP International Conference FORTE/PSTV’98, Nov. ’98 • ASIP co-design: starts with an application, builds a specific programmable processor, and translates the application into software code. H/w and s/w partitioning includes the instruction set design. • H/w s/w synchronous system co-design: a s/w processor acts as master controller, with a set of h/w accelerators as co-processors. Tools: Vulcan, Codes, Tosca, Cosyma • H/w s/w co-design for distributed systems: mapping of a set of communicating processes onto a set of interconnected processors, involving behavioral decomposition, process allocation, and communication transformation. Tools: Coware (powerful), Siera (reuse), Ptolemy (DSP)
Mixing H/W and S/W • Argument: mixed hardware/software systems represent the best of both worlds: high performance, flexibility, design reuse, etc. • Counterpoint: from a design standpoint, it is the worst of both worlds • Simulation: problems of verification and test become harder • Interface: too many tools, too many interactions, too much heterogeneity • Hardware/software partitioning is “AI-complete”!
Partitioning • Performance requirements • Some functions are easier to implement in hardware • Blocks that are used repeatedly • Blocks that are structured in parallel • Modifiability • Blocks implemented in software are easy to modify • Implementation cost • Blocks implemented in hardware can be shared • Scheduling • Schedule the blocks partitioned into HW and SW so that they meet the given constraints • SW operations must be scheduled sequentially • SW and HW can be scheduled concurrently as long as there are no data or control dependencies
Low power partitioning approach • Different HW resources are invoked according to the instruction executed at a specific point in time • During the execution of an add operation, the ALU and registers are used, but the multiplier is idle. • Non-active resources still consume energy since the corresponding circuits continue to switch • Calculate the wasted energy • Add application-specific cores and run them selectively: whenever one core is active, all the other cores are shut down
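The "calculate the wasted energy" step can be sketched as a walk over an instruction trace, charging each idle resource for every cycle in which it is not needed. The resource names, per-cycle idle energies, and the operation-to-resource map below are hypothetical values chosen for illustration and do not come from the slides.

```python
# Hypothetical per-cycle energy (arbitrary units) burned by a resource that is
# powered but not used by the current instruction.
IDLE_ENERGY = {"alu": 2.0, "multiplier": 5.0, "register_file": 1.0}

# Hypothetical map from operation to the resources it actually needs.
USES = {
    "add": {"alu", "register_file"},
    "mul": {"multiplier", "register_file"},
    "load": {"register_file"},
}


def wasted_energy(trace):
    """Sum the idle energy of every resource not used by each instruction."""
    total = 0.0
    for op in trace:
        idle_resources = IDLE_ENERGY.keys() - USES[op]
        total += sum(IDLE_ENERGY[r] for r in idle_resources)
    return total


trace = ["add", "add", "mul", "load"]
print(wasted_energy(trace))  # 5 + 5 + 2 + 7 = 19.0
```

A cluster of operations that leaves expensive resources idle most of the time is a candidate for its own application-specific core, since the general-purpose resources can then be shut down while it runs.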
Partitioning Process - Derive a graph G of operations and connections - Decompose G into a set of clusters - cluster: a set of operations - Calculate bus-traffic energy - Pre-select clusters with constraints - Set the number of resources - List scheduling - Test the utilization rate (ASIC or µP) - the utilization rate of the µP is provided by a SW estimation tool
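The scheduling and utilization steps can be sketched as follows, assuming unit-latency operations, identical resources, and alphabetical order as the priority function. Real list schedulers typically use a priority such as critical-path length; the graph and names here are hypothetical.

```python
def list_schedule(deps, num_units):
    """Schedule unit-latency operations on `num_units` identical resources.

    deps maps each operation to the set of operations it depends on.
    Returns a list of control steps, each a list of operations.
    """
    remaining = {op: set(preds) for op, preds in deps.items()}
    done = set()
    steps = []
    while remaining:
        # Ready operations: all predecessors finished in earlier steps.
        ready = sorted(op for op, preds in remaining.items() if preds <= done)
        if not ready:
            raise ValueError("dependency cycle in the operation graph")
        step = ready[:num_units]  # take as many as the resources allow
        steps.append(step)
        done.update(step)
        for op in step:
            del remaining[op]
    return steps


def utilization(steps, num_units):
    """Fraction of resource slots that actually execute an operation."""
    return sum(len(s) for s in steps) / (num_units * len(steps))


# Hypothetical cluster: c needs a and b, d needs c, e is independent.
cluster = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": set()}
schedule = list_schedule(cluster, num_units=2)
print(schedule)                  # [['a', 'b'], ['c', 'e'], ['d']]
print(utilization(schedule, 2))  # 5 operations in 6 slots
```

Comparing this rate for a candidate ASIC core against the rate estimated for the microprocessor indicates which implementation would leave fewer resources idle.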
Design Flow - Max 94% energy saving and, in most cases, even reduced execution time - 16k cell overhead
Interface • Need for an interface block • Transfers data between hardware and software blocks • Only an efficiently designed interface block can reduce the overhead between HW and SW blocks • Interface methods • Shared memory • FIFO • Handshaking protocol
Logical Bus Architecture • System bus signals • Address, data, and control signals • The address space consists of the memory space and the I/O space • Memory space: memory of the SW component • I/O space: ports within SW and registers in other HW • Port signals • Specialized signals capable of directly interfacing between SW and HW components • Interrupt signals • Raised when SW and HW components have completed an operation, or when an error condition is detected
Co-Simulation • Need for co-simulation • Simulating the HW part and the SW part together makes it possible to predict the behavior of the assembled system • By predicting system performance, the system can be redesigned to meet the given specification before synthesis • Predicts the characteristics of each sub-block for HW/SW partitioning • Co-simulation tools • Ptolemy • COSSAP • POLIS
Approach - vada Lab. SKKU - Software-oriented design - Dark block: hardware - Interface: control signal generation - Partitioned in terms of speed and cost - Change from SW to HW: 1. Implementation speed 2. Parallel architecture
Low Power CDMA Searcher Project at SKKU • Project title: Low-power design of an IS-95-based DS/CDMA system using co-design techniques • Development period: 1999.3.1 - 2000.2.28 (12 months) • Objective and method: RTL-level low-power design and implementation of the searcher engine of an MSM (Mobile Station Modem) chip for use in CDMA handsets • Operating frequency: 12.5 MHz • Low-power design using a data flow graph for rescheduling, pre-computation, and strength reduction, together with a synchronous accumulator; area and power reduced by up to 67.68% and 41.35%, respectively • H/W and S/W co-design techniques applied • San Kim and Jun-Dong Cho, “Low Power CDMA Searcher”, CAD and VLSI Workshop, May 1999. • Inki Hwang, San Kim and Jun-Dong Cho, “CDMA Searcher Co-Design”, ASIC Workshop, Sep. 1999.
Application-Specific Instruction Processor • Processor architecture tailored not just for an application domain (e.g., DSP, microcontrollers), but for specific sets of applications (e.g., audio, engine control) • ASIP characteristics • Greater design cost (processor + compiler) • Higher performance and lower power than commercial cores, more flexibility than ASIC
ASIP Design • Given a set of applications, determine the microarchitecture of the ASIP (i.e., configuration of functional units in datapaths, instruction set) • To accurately evaluate the performance of the processor on a given application, one needs to compile the application program onto the processor datapath and simulate the object code. • The microarchitecture of the processor is a design parameter!
Compiler Optimizations • Machine-independent optimizations • Parallelizing transformations, common sub-expression elimination, constant propagation, strength reduction, loop-invariant code motion • Machine-dependent optimizations • Loop unrolling and software pipelining • Static allocation (non-recursive procedure calls) • Storage layout (arrays, scalars) • Optimization of mode-setting instructions • Instruction selection, scheduling, and register allocation
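Two of the machine-independent optimizations listed above, loop-invariant code motion and strength reduction, can be illustrated by hand on a small address-generation loop. This is only a sketch of what a compiler would do automatically; the function names and the element size of 4 bytes are assumptions for the example.

```python
def addresses_naive(base, stride, n):
    """Compute n element addresses, multiplying in every iteration."""
    out = []
    for i in range(n):
        # stride * 4 is recomputed each time, and i * (...) is a multiply.
        out.append(base + i * stride * 4)
    return out


def addresses_optimized(base, stride, n):
    """Same result after loop-invariant code motion and strength reduction."""
    step = stride * 4  # loop-invariant code motion: hoisted out of the loop
    addr = base
    out = []
    for _ in range(n):
        out.append(addr)
        addr += step   # strength reduction: multiply replaced by an add
    return out


print(addresses_naive(1000, 3, 5))
print(addresses_optimized(1000, 3, 5))
```

On hardware where a multiplier costs more energy per operation than an adder, replacing the per-iteration multiply with an add reduces both cycle count and switching activity.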
Cross-Disciplinary nature • Software for low power: loop transformation leads to much higher temporal and spatial locality of data. • Code size becomes an important objective • Software will eventually become a part of the chip • Behavior-platform-compiler codesign: systems are codesigned in C++ or Java, describing both their h/w and s/w implementation. • Multidisciplinary system thinking is required for future designs (e.g., Eindhoven Embedded Systems Institute, http://www.eesi.tue.nl/english)
VLSI Signal Processing Design Methodology • pipelining, parallel processing, retiming, folding, unfolding, look-ahead, relaxed look-ahead, and approximate filtering • bit-serial, bit-parallel and digit-serial architectures, carry save architecture • redundant and residue systems • Viterbi decoder, motion compensation, 2D-filtering, and data transmission systems
Low Power DSP • DO-LOOP dominant • VSELP vocoder: 83.4% • 2D 8x8 DCT: 98.3% • LPC computation: 98.0% • DO-LOOP power minimization ==> DSP power minimization • VSELP: Vector Sum Excited Linear Prediction • LPC: Linear Prediction Coding
Loop unrolling • The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism, and improves register, data cache, or TLB locality. With an unrolling factor of 2, loop overhead is cut in half because two original iterations are performed in each pass of the unrolled loop. If array elements are assigned to registers, register locality is improved because A(i) and A(i+1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
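A minimal sketch of unrolling by a factor of 2, assuming a loop body of the form A(i) = A(i) + A(i-1) * A(i+1). That body is an assumption chosen because it matches the remark that A(i) and A(i+1) each appear twice after unrolling; the slide's own example is not preserved in this transcript. Python uses 0-based indexing, so the loop runs over i = 1 .. n-2.

```python
def rolled(a):
    """Original loop: one element updated per iteration."""
    a = list(a)
    n = len(a)
    for i in range(1, n - 1):
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


def unrolled_by_2(a):
    """Same computation with the body replicated twice and a step of 2."""
    a = list(a)
    n = len(a)
    i = 1
    while i + 1 < n - 1:  # both i and i+1 are valid original iterations
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]  # reuses a[i] and a[i+1]
        i += 2
    if i < n - 1:         # one leftover iteration when the trip count is odd
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


data = [1, 2, 3, 4, 5, 6]
print(rolled(data))
print(unrolled_by_2(data))
```

The cleanup branch after the main loop is what keeps the transformation correct when the trip count is not a multiple of the unrolling factor; it also shows the cost of unrolling, since code size grows with u.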