Explore the comprehensive elements essential for implementing high performance systems, focusing on high speed, reduced swing logic, low power consumption, and advanced technologies like deep submicron and low voltage channel engineering. Learn effective design methodologies to minimize power consumption in digital circuits, including voltage regulation, optimal clocking strategies, and logic design considerations. Discover techniques such as reducing switching activity, optimizing transistor usage, and maximizing energy efficiency while enhancing system performance.
Low Power Multimedia Reconfigurable Platforms Young-Chul Kim Chonnam National Univ. Dept. of ECE, IT SoC Lab. http://soc.chonnam.ac.kr
Elements Required to Implement a High Performance System [Figure labels] High Performance System High Speed High Density Reduced Swing Logic Deep Submicron Technology Low Power per Gate Low Voltage Channel Engineering Low Capacitance Low VT Advanced Technology
A Look at Power Consumption • Components of power consumption in digital circuits • Dynamic power accounts for the largest share of power consumption. • With a given library, the designer can control four factors: activity, VDD, frequency, and routing capacitance.
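Design Methods for Reducing Power Consumption • Adjusting the supply voltage • Use a high voltage only in the parts of the IC that need high speed. • Put unused blocks into sleep mode to reduce power consumption. • Lowering the operating frequency • Use parallel processing to obtain the same throughput at a lower operating frequency. The resulting increase in area is unavoidable. • Avoid large clock buffers. • Use a Phase Locked Loop (PLL) to raise the frequency only where it is needed.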
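The parallel-processing bullet can be illustrated with a small calculation. This is a minimal sketch, assuming the usual dynamic-power model (activity × capacitance × VDD² × frequency); the function names, the 15% multiplexing overhead, and all numeric values are illustrative assumptions, not figures from the slides. At an unchanged supply voltage, two units at half the clock draw the same dynamic power as one unit at full clock. The saving appears only when the lower clock rate also allows a lower VDD.

```python
def dynamic_power(activity, cap_farads, vdd_volts, freq_hz):
    """Dynamic switching power: activity * C * VDD^2 * f (watts)."""
    return activity * cap_farads * vdd_volts ** 2 * freq_hz


def parallel_power(n_units, activity, cap_farads, vdd_volts, freq_hz, overhead=0.0):
    """Power of n_units copies, each clocked at freq_hz / n_units.

    `overhead` is an assumed fractional capacitance increase for the
    extra multiplexing and routing that parallelism needs.
    """
    total_cap = n_units * cap_farads * (1.0 + overhead)
    return dynamic_power(activity, total_cap, vdd_volts, freq_hz / n_units)


# Illustrative numbers only: 1 nF switched capacitance, 100 MHz, 2.5 V.
base = dynamic_power(0.2, 1e-9, 2.5, 100e6)
# Two units at 50 MHz each, same 2.5 V supply: no dynamic-power saving.
same_v = parallel_power(2, 0.2, 1e-9, 2.5, 100e6)
# Two units at 50 MHz, supply lowered to 1.5 V, 15% capacitance overhead.
low_v = parallel_power(2, 0.2, 1e-9, 1.5, 100e6, overhead=0.15)

print(f"baseline: {base:.4f} W, parallel same VDD: {same_v:.4f} W, "
      f"parallel low VDD: {low_v:.4f} W")
```

The area cost mentioned on the slide shows up here as the doubled (plus overhead) capacitance.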
Design Methods for Reducing Power Consumption • Reducing parasitic capacitance • Use short wires on critical nodes. • Avoid fan-outs greater than 3. • Reduce the wire width when using a low voltage. • Use transistors that are as small as possible. • Reducing switching activity • Reduce the number of bits. • Use static circuits rather than dynamic circuits. • Reduce the total number of transistors. • Make the most active nodes internal nodes.
Design Methods for Reducing Power Consumption • Reducing switching activity • Design the logic so that the sum over all nodes of frequency × capacitance is minimized; that is, so that switching activity is statistically minimal. • When building a logic tree, place inputs with higher activity farther from VDD or ground. • Design high-activity cells as dynamic and low-activity cells as static. • Turn off the clock of flip-flops whose data does not change. • Make it possible to disable the clock of cells that are not always in use.
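The "sum of frequency × capacitance" criterion can be estimated statistically before layout. The sketch below compares a chain and a balanced tree implementation of a 4-input AND. It assumes independent inputs with a known probability of being 1, no glitching, and unit capacitance on every internal node; the helper names are hypothetical. Under these assumptions a node with 1-probability p toggles with probability 2p(1 − p) per cycle.

```python
def transition_activity(p_one):
    """Probability that a node toggles between two independent cycles."""
    return 2.0 * p_one * (1.0 - p_one)


def and_prob(*input_probs):
    """Probability that an AND of independent inputs evaluates to 1."""
    prob = 1.0
    for p in input_probs:
        prob *= p
    return prob


def switched_capacitance(node_probs, node_caps):
    """Sum of (toggle probability * capacitance) over all gate outputs."""
    return sum(transition_activity(p) * c for p, c in zip(node_probs, node_caps))


p = 0.5  # assumed probability that each primary input is 1
# Chain: ((a & b) & c) & d  -> gate outputs ab, abc, abcd
chain_nodes = [and_prob(p, p), and_prob(p, p, p), and_prob(p, p, p, p)]
# Tree: (a & b) & (c & d)   -> gate outputs ab, cd, abcd
tree_nodes = [and_prob(p, p), and_prob(p, p), and_prob(p, p, p, p)]
unit_caps = [1.0, 1.0, 1.0]

chain_cost = switched_capacitance(chain_nodes, unit_caps)
tree_cost = switched_capacitance(tree_nodes, unit_caps)
print(f"chain: {chain_cost}, tree: {tree_cost}")
```

With these assumptions the chain switches less capacitance than the tree. A real comparison would also account for glitches and delay, which this model ignores.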
ERE Framework • ERE illustrates the performance-energy tradeoffs by concurrently considering the performance improvement, energy savings, and resource efficiency of a system. • i = base configuration with 1 resource • j = new configuration with N resources • ERE = (fraction of the energy saved) · (normalized efficiency) • Fraction of the energy saved = $\{E(1,i) - E(N,j)\} / E(1,i)$ • Normalized efficiency = S(N,j)/j•N, where the speedup is $S(N,j) = T(1,i)/T(N,j)$ • ERE suggests 4 DSPs, whereas EDP suggests 12 DSPs without considering the efficiency
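NoC (network on chip) U.C. Berkeley • Implants a communication network structure on a single semiconductor chip • The transfer protocol is defined according to the OSI model • DSP, microprocessor, memory, and other blocks are connected within a single chip using H/W-S/W co-design • Code optimization and construction of a low-power software IP library • Bus structure for connecting modules • Components • Region: an area that allows a special topology or network structure • Backbone • Wrapper: converts transmitted messages into the appropriate form; complex • Suited to complex, large systems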
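Switch Network: CLICHE • Uses the OSI model as the data transfer protocol • A network integrated on the chip (Network on Chip) • Packet data transfer • The components are large systems • Well suited to chip-level integration of heterogeneous components.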
NoC Figures of Merit [Figure: taxonomy diagram; labels listed in extraction order] Scalability Efficiency Utilisation Fault tolerance Result quality (accuracy) Responsiveness Materials Structural Licencing Functional Production Control Effort Time Risk Applicability Coupling Cohesion Configurability Modularity Computation Energy consumption Storage Communication Functionality Capacity Performance System Quality Implementation Complexity Cost Variability Development Volume Flexibility Modifiability Lifetime Usability Manufacturability Programmability
NoC-Based Application Areas Low Power communication systems High-performance communication systems Baseband platform High-capacity communication systems Personal assistant Database platform Data collection systems BACKBONE Multimedia platform Entertainment devices PLATFORMS Virtual reality games SYSTEMS
NoC design flow R. Marculescu
Structural layers of NOC • Product: System control, product behaviour • Configuration: Network management, allocation, operation modes • Applications: Resource management, diagnostics, applications • Functions: Execution control, functions • Executables: RTOS, code, HW configurations • Hardware units: Processors, memories, configurable HW, logic • Resources: Resource types, buses, IO • Regions: Region types, switches, network interfaces • Communication: Channels and protocols
Network protocol [Figure: layer stack: Application, System/Session, Transport, Network, Data link, Physical] • Physical • Signal voltage, timing, bus width, signal synchronization • Data link • Error detection and correction • Arbitration of physical medium • Network • IP protocol • Data routing • Transport • TCP protocol • End-to-end connection
NOC Platform development • Scaling problem • How big a NOC is needed? What are the application area requirements? • Region definition problem • What kind of regions are needed? What kind of interfaces between regions? What are the capacity requirements for the regions? • Resource design problem • What is needed inside resources? Internal computation type and internal communication? • Application mapping flow problem • What kind of languages, models and tools must be supported? How to validate and test the final products?
NOC Application Development • Mapping problem • How to partition applications for NOC resources? How to allocate functionality effectively? Is the performance adequate? Is the resource usage in balance? • Optimisation problem • How to perform global optimisation of heterogeneous applications? How to define the right optimisation targets? How to utilise application/resource type specific tools? • Validation problem • Are the constraints met? Are there communication bottlenecks or power consumption hot spots? How to simulate a 10000 GIPS system? How to test all applications?
Network on Chip alternatives NOC = Network of computation and storage resources NOC parameters: Number of resources Types of resources GPU DSP Memory Configurable HW Coprocessors Any combination Communication capability
Switch Network Srikanteswara • Stallion processor • Crossbar: similar to circuit switching • Packet data transfer • Layered transfer structure [Figure: Stallion device from Virginia Tech; labels: IFU mesh, Smart Crossbar, IFU mesh]
Advantages in the Layered Architecture • Defines the methodology to design multimode radios using hardware paging • Provides the framework for building a flexible soft radio at the expense of the overhead for packetizing data. • Excellent hardware reusability • Build libraries of hardware functions much like software’s • Good data flow properties and simple interface between the processing layer modules.
Stream-based design [Figure: stream packets passing through Processing Element 1 and Processing Element 2; labels: Application Layer, Software, I/O Layer, Configuration Layer, Processing Layer, Configuration Pipeline, Packet Pipeline, Bypass Pipeline, Interpret, Re-Constr.]
Bus-Based vs. P2P Communication R. Marculescu • Buses • Interconnections become dominant in DSM • Huge bandwidth requirements (tens of Gb/s for some applications) (buses are not scalable!) • Expanding market of mobile and other low-power applications • Increasing cooling costs (buses consume too much power!) • P2P Communication • Faster; no bus contention, no bus arbitration • Low-power solution • Can be independently optimized • May need more wiring resources
System Inputs R. Marculescu A set of IPs: Hard IP (Width*length, provided by different IP providers) Soft IP (Size provided by synthesis or estimation) Communication Task Graph (CTG)
Target Platform R. Marculescu
MPEG-2 Video Encoder R. Marculescu
Energy Comparison R. Marculescu
Packet-Based On-Chip Communication: Regular Architecture R. Marculescu
Energy-Aware Mapping for Tile-based Architectures R. Marculescu Objective: minimize the total communication energy consumption Constraint: meet the communication performance constraints (specified by designer) For a 4X4 tile architecture, 16! mappings
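The size of the search space and the shape of the objective can be sketched in a few lines. The code below is a toy illustration, not the mapping algorithm from the slides: it exhaustively searches a 2×2 mesh (24 mappings) and uses traffic volume × Manhattan hop count as a stand-in for communication energy. The IP names, traffic volumes, and the hop-count energy model are assumptions, and the performance constraints are not modelled. For the 4×4 case on the slide the same exhaustive search would have to visit 16! mappings, which is why heuristic or branch-and-bound methods are needed.

```python
import itertools
import math


def manhattan(a, b):
    """Hop count between two tiles on a mesh with XY routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mapping_energy(mapping, traffic, tiles):
    """Assumed energy proxy: sum of volume * hops over all communicating pairs."""
    return sum(
        volume * manhattan(tiles[mapping[src]], tiles[mapping[dst]])
        for (src, dst), volume in traffic.items()
    )


def best_mapping(ips, traffic, rows, cols):
    """Exhaustively try every placement of IPs onto tiles (small meshes only)."""
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    best = None
    for perm in itertools.permutations(range(len(tiles)), len(ips)):
        mapping = dict(zip(ips, perm))
        energy = mapping_energy(mapping, traffic, tiles)
        if best is None or energy < best[0]:
            best = (energy, mapping)
    return best


# Hypothetical communication task graph: (source IP, destination IP) -> volume.
ips = ["A", "B", "C", "D"]
traffic = {("A", "B"): 70, ("B", "C"): 40, ("C", "D"): 20, ("A", "D"): 5}

energy, mapping = best_mapping(ips, traffic, rows=2, cols=2)
print("best energy proxy:", energy, "mapping:", mapping)
print("mappings for a 4x4 mesh:", math.factorial(16))
```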
Tile-based Architecture Platform R. Marculescu
Network-centric Power Management R. Marculescu • Ability to make better predictions about the future workloads • Network power management adds very few overhead packets to the overall communication stream between cores • Amount of energy wasted while the core is idle is reduced, as the local PM knows ahead of time that no requests are arriving in near future
NoC protocols must be tolerant to common faults R. Marculescu • Data upsets: Crosstalk, EMI • Buffer overflows • Node/link failures • Synchronization errors
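Data upsets are commonly handled by attaching a checksum to each packet so that the receiver can detect corruption and request a retransmission. The following is a minimal sketch of the detection half, assuming a CRC-32 trailer computed with Python's standard `zlib.crc32`; the packet layout and function names are hypothetical and not taken from any particular NoC protocol.

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 trailer to the payload."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def check_packet(packet: bytes):
    """Return (is_valid, payload) after recomputing the CRC-32."""
    payload, trailer = packet[:-4], packet[-4:]
    is_valid = zlib.crc32(payload) == int.from_bytes(trailer, "big")
    return is_valid, payload


def flip_bit(packet: bytes, bit_index: int) -> bytes:
    """Model a single-bit upset (for example from crosstalk) on the link."""
    data = bytearray(packet)
    data[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(data)


packet = make_packet(b"flit-0042")
ok_clean, payload = check_packet(packet)
ok_upset, _ = check_packet(flip_bit(packet, 5))
print("clean packet valid:", ok_clean, "| upset packet valid:", ok_upset)
```

An on-chip implementation would more likely use a short parity or Hamming code in hardware; CRC-32 is used here only because it is available in the standard library.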
Wires-Centric Design • Exploits logic structure to reduce wire loads • Enables use of advanced circuits • wire properties and crosstalk known early and well characterized • Gives a stable design • key wire loads don’t change with small logic changes
Wires dominate - power, area, delay • Problem - Contemporary tools leave wires as an afterthought • result is lack of structure, visibility, and control • Solution 1 - wires first design • route key wires, then place gates • Solution 2 - route packets, not wires • on-chip networks • global wires fixed before the design starts
On-Chip Interconnection Networks • Replace dedicated global wiring with a shared network [Figure: dedicated wiring vs. network]
Most Wires are Idle Most of the Time • Don’t dedicate wires to signals, share wires across multiple signals • Route packets not wires • Organize global wiring as an on-chip interconnection network • allows the wiring resource to be shared keeping wires busy most of the time • allows a single global interconnect to be re-used on multiple designs • makes global wiring regular and highly optimized
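Power consumption of CMOS circuits $P = \alpha \cdot C_L \cdot f \cdot V_{dd}^2 + \alpha \cdot I_{SC} \cdot t_{sc} \cdot f \cdot V_{dd} + I_{DC} \cdot V_{dd} + I_{LEAK} \cdot V_{dd}$, where $\alpha$ is the switching activity • Term 1: charging and discharging • Term 2: crowbar current • Term 3: static current • Term 4: subthreshold leakage current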
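The four terms of the power equation can be evaluated separately to see which one dominates. This sketch assumes that the leading factor on the first two terms is the switching activity α; all component values are illustrative assumptions rather than measured data.

```python
def cmos_power(alpha, c_load, freq, vdd, i_sc, t_sc, i_dc, i_leak):
    """Return the four CMOS power components (watts) and their total."""
    parts = {
        # Charging and discharging of the load capacitance
        "switching": alpha * c_load * freq * vdd ** 2,
        # Crowbar (short-circuit) current during transitions
        "short_circuit": alpha * i_sc * t_sc * freq * vdd,
        # Static (DC) current
        "static": i_dc * vdd,
        # Subthreshold leakage current
        "leakage": i_leak * vdd,
    }
    parts["total"] = sum(parts.values())
    return parts


# Illustrative values: 1 nF switched capacitance, 100 MHz, 2.5 V supply,
# 1 mA crowbar current for 0.1 ns per transition, 1 uA leakage, no DC path.
power = cmos_power(
    alpha=0.2, c_load=1e-9, freq=100e6, vdd=2.5,
    i_sc=1e-3, t_sc=1e-10, i_dc=0.0, i_leak=1e-6,
)
for name, watts in power.items():
    print(f"{name:>13}: {watts:.3e} W")
```

With these numbers the switching term dominates, which matches the earlier statement that dynamic power is the largest component.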
Vdd, power, and current trend [Figure: voltage (V), power per chip (W), and VDD current (A) versus year, 1998 to 2014] International Technology Roadmap for Semiconductors 1998 update
New Computing Platforms • SOC power efficiency more than 10 GOPS/W • Higher On Chip System Integration: COTS: 100 W, SOAC: 10 W (inter-chip capacitive loads, I/O buffers) • Speed & Performance: shorter interconnection, fewer drivers, faster devices, more efficient processing architectures • Mixed signal systems • Reuse of IP blocks • Multiprocessor, configurable computing • Domain-specific, combined memory-logic
Power-distribution in integrated PicoRadio (total: 100 mW) Jan M. Rabaey
Web browsing is slow with 802.11 PSM • Users complain about performance degradation [Cartoon dialogue] "Son! Haven’t I told you to turn on power-saving mode? Batteries don’t grow on trees, you know!" "But dad! Performance SUCKS when I turn on power-saving mode!" "So what! When I was your age, I walked 2 miles through the snow to fetch my Web pages!"
Levels for Low Power Design • Techniques by level • System: Hardware-software partitioning, power down • Algorithm: Complexity, concurrency, locality, regularity, data representation • Architecture: Parallelism, pipelining, signal correlations, instruction set selection, data rep. • Circuit/Logic: Sizing, logic style, logic design • Technology: Threshold reduction, scaling, advanced packaging, SOI • Expected saving by level of abstraction • Algorithm: 10 - 100 times • Architecture: 10 - 90% • Logic level: 20 - 40% • Layout level: 10 - 30% • Device level: 10 - 30%
System Level Power Optimization • Algorithm selection / algorithm transformation • Identification of hot spots • Low Power data encoding • Quality of Service vs. Power • Low Power Memory mapping • Resource Sharing / Allocation
Flow • C/C++ Compilation • Program Execution • Building design representation • Loading profiling data • Setting constraints • Power estimation • Identification of Hot Spots
Power-hungry Applications • Signal Compression: HDTV Standard, ADPCM, Vector Quantization, H.263, 2-D motion estimation, MPEG-2 storage management • Digital Communications: Shaping Filters, Equalizers, Viterbi decoders, Reed-Solomon decoders
Clock Network Power Management • 50% of the total power • FIR (massively pipelined circuit): video processing: edge detection; voice processing (data transmission like xDSL) • Telephony: 50% (70%/30%) idle, because the two parties do not talk at the same time • With every clock cycle, data are loaded into the working register banks, even if there are no data changes.
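The last point, that registers are reloaded every cycle even when the data is unchanged, is what clock gating addresses. The sketch below counts how often a register bank is clocked with and without gating for a short input stream. It is a behavioural illustration with a hypothetical function name and made-up sample data; it does not model the gating logic's own power.

```python
def count_register_loads(samples, gated):
    """Count the cycles in which the register bank is actually clocked.

    Without gating the bank loads on every cycle. With gating the clock
    is enabled only when the incoming value differs from the held value.
    """
    loads = 0
    held = None
    for value in samples:
        if not gated or value != held:
            loads += 1
            held = value
    return loads


# Made-up input stream with long runs of unchanged data.
samples = [3, 3, 3, 7, 7, 2, 2, 2, 2, 5]
ungated_loads = count_register_loads(samples, gated=False)
gated_loads = count_register_loads(samples, gated=True)
print(f"ungated: {ungated_loads} loads, gated: {gated_loads} loads")
```

For streams with idle periods, such as the telephony case above, the fraction of suppressed clock events grows with the idle time.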