Multiprocessor System-on-Chip(MPSoC) Technology Wayne Wolf, Ahmed Amine Jerraya and Grant Martin. Presented by Santosh Ponnala. Brief Overview. Introduction Multiprocessors and the Evolution of MPSoCs How Applications Influence Architecture Architectures for Real-Time Low-Power Systems
Multiprocessor System-on-Chip (MPSoC) Technology. Wayne Wolf, Ahmed Amine Jerraya and Grant Martin. Presented by Santosh Ponnala.
Brief Overview • Introduction • Multiprocessors and the Evolution of MPSoCs • How Applications Influence Architecture • Architectures for Real-Time Low-Power Systems • CAD Challenges in MPSoCs • Conclusion
Introduction • What is an MPSoC? • Where are they used? • System Requirements? • Why MPSoC? • What is a Multiprocessor? • How is an MPSoC different from a Multiprocessor?
What is a Parallel Architecture? • A large collection of processing elements that communicate and cooperate to solve large problems fast. [Almasi and Gottlieb] • “collection of processing elements” • How many? How powerful is each? Scalability? • “that communicate” • How do PEs communicate? (shared memory vs. message passing) • Interconnection networks (bus, crossbar, ...) [Figure: Serial Computing vs. Parallel Computing]
Why Use Parallel Computing? • Main Reasons: • Save time and/or money • Solve larger Problems • Provide Concurrency • Limits to serial computing
Vector vs. Array Processing • Vector Processing: Let n be the size of each vector. Then the time to compute f(V1, V2) is k + (n - 1), where k is the length of the pipeline f. • Array Processing: Each of the operations f(v1j, v2j) on the components of the two vectors is carried out simultaneously, in one step.
Early Multiprocessors • CU = Control Unit, PE = Processing Element, PEM = PE Memory module. • The ILLIAC IV was not fully operational until 1975. Between then and 1981 it was the world's fastest computer. • It performed vector and array operations in parallel. • Speed of integration tracks Moore’s law: doubling every 18-24 months. • Generic model of multiprocessors: • A collection of computers (CPU + memory) communicating over an interconnect network. [Culler et al.] [Figure: Architecture of the ILLIAC IV]
Why did uniprocessor performance grow so fast? • ~Half from circuit improvement (smaller transistors, faster clock, etc.) • ~Half from architecture/organization: • Instruction Level Parallelism (ILP) • Pipelining: RISC, CISC with RISC backend • Superscalar • Out-of-order execution • Memory hierarchy (caches) • Exploiting spatial and temporal locality • Multiple cache levels
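The two timing models on this slide can be compared with a small sketch. The function names and the `num_pes` parameter are illustrative assumptions, not part of the original slides; the slide's "one step" case corresponds to having one processing element per vector component.

```python
def pipelined_vector_time(n: int, k: int) -> int:
    """Time steps for a k-stage pipeline to process n element pairs.

    The first result appears after k steps; each of the remaining
    n - 1 results follows one step later, giving k + (n - 1).
    """
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive")
    return k + (n - 1)


def array_time(n: int, num_pes: int) -> int:
    """Steps for an array of num_pes processing elements (assumed parameter).

    Each PE handles one component per step, so the work takes
    ceil(n / num_pes) steps; with num_pes == n this is a single step.
    """
    if n < 1 or num_pes < 1:
        raise ValueError("n and num_pes must be positive")
    return -(-n // num_pes)  # ceiling division


if __name__ == "__main__":
    print(pipelined_vector_time(n=64, k=4))  # 4 + 63 steps
    print(array_time(n=64, num_pes=64))      # all components at once
```

Under these assumptions the pipelined time grows linearly with n, while the array time stays constant as long as there are at least n processing elements.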
History of Multiprocessors • 80s – early 90s: prime time for parallel architecture research • A microprocessor could not fit on a single chip, so multiple chips (and processors) were naturally needed • 90s: at the low end, uniprocessor system speed grows much faster than parallel system speed • A microprocessor fits on a chip. So do a branch predictor, multiple functional units, large caches, etc.! • Microprocessors also exploit parallelism (pipelining, multiple issue, VLIW), forms of parallelism originally invented for multiprocessors • 90s: emergence of distributed (vs. parallel) machines, driven by progress in network technologies: • Network bandwidth grows faster than Moore’s law • Fast interconnection networks get cheap • Cheap uniprocessor systems are connected into a large distributed machine • Network of Workstations, Clusters, GRID. • 00s: parallel architectures are back • Transistors per chip >> transistors per microprocessor • Harder to get more performance from a uniprocessor • E.g. Intel Pentium D, Core Duo, AMD Dual Core, IBM Power5, Sun Niagara, etc.
History of MPSoCs 1. Lucent Daytona MPSoC • Designed for wireless base stations, in which identical signal processing is performed on a number of data channels. • Split-transaction bus. • Processing element is based on SPARC V8. • Reconfigurable L1 cache. • SIMD architecture.
2. C-5 Network Processor • Application: Packet Processing in Networks. • Packets are handled by channel processors. • Each cluster has 4 processors. • Packet processors intercept individual IP data packets and process them using application software. • Executive Processor: RISC CPU • Operating Freq: 166MHz - 233MHz
3. Philips Viper Nexperia • Application: Multimedia Processing. • Has two CPUs. • Master: MIPS PR3940 • Slave: TriMedia TM32 • Has three buses. • Memory controller for external DRAM interface and DMA units for each CPU. • Can run several operating systems, including Windows CE, Linux, and VxWorks. • The CPUs share the same resources and use semaphores to negotiate ownership of shared resources.
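The semaphore-based negotiation mentioned on this slide can be illustrated with a minimal sketch. It uses Python threads as stand-ins for the two CPUs; the names `master` and `slave`, the bookkeeping dictionary, and the function name are assumptions made for illustration and do not describe the actual Viper hardware semaphores.

```python
import threading


def run_shared_resource_demo(iterations: int = 1000) -> dict:
    """Two 'CPUs' (threads) take turns owning one shared resource."""
    sem = threading.Semaphore(1)  # binary semaphore: at most one owner
    state = {"holders": 0, "max_holders": 0,
             "uses": {"master": 0, "slave": 0}}

    def cpu(name: str) -> None:
        for _ in range(iterations):
            with sem:  # acquire ownership; released on block exit
                state["holders"] += 1
                state["max_holders"] = max(state["max_holders"],
                                           state["holders"])
                state["uses"][name] += 1  # "use" the shared resource
                state["holders"] -= 1

    threads = [threading.Thread(target=cpu, args=(name,))
               for name in ("master", "slave")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


if __name__ == "__main__":
    result = run_shared_resource_demo()
    print(result["max_holders"], result["uses"])
```

Because every access happens while the semaphore is held, the recorded maximum number of simultaneous owners stays at one regardless of how the two threads interleave.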
4. TI OMAP 5912 • Application: Cell phone Processor. • Designed to support 2.5G and 3G wireless applications. • In addition to basic voice services, it is intended for speech processing, location-based services, security, gaming, and multimedia. • Has two CPUs: an ARM9 and a TMS320C55x digital signal processor (DSP) • C55x DSP performs signal processing as slave. • ARM runs operating system, dispatches tasks to DSP. • SRAM capacity: 192 KB
5. STMicro Nomadik • Designed for mobile multimedia. • Accelerators built around MMDSP+ core: • One instruction per cycle. • 16- and 24-bit fixed-point, 32-bit floating-point. • Host processor: ARM926EJ • Two programmable accelerators on the bus. • The video accelerator is a heterogeneous MP. [Figures: Video accelerator; Audio accelerator]
Moore’s Law • A law of physics • A law of process technology • A law of micro-architecture • A law of psychology • Most of us are familiar with Moore’s Law growth in transistor count • Other characteristics appear to have reached a ceiling
Multiprocessors: Implementation Technology Concerns (billion-transistor CMOS implementation technology) • Design issues. • Transistor gate delay • Interconnect delay* • Exponential increase in processor clock rates • Result of these trends. • Design complexity.
UltraSPARC Niagara • 8 CPU cores • Only a single floating-point unit • 4 DDR2 buses • 4-way L2 cache • Built-in self-test • Operated at 1.4 GHz • Capable of processing up to 32 concurrent threads.
Comparing Alternative Multiprocessor Architectures • Logic, wire and design complexity will increasingly favor CMP over superscalar and SMT implementations. [Figure panels: Superscalar, SMP, CMP] [Figure: Parallel vs. Distributed Computers]
Multi-nonsense • Multi-core was a solution to a performance problem • Hardware works sequentially • Make the hardware simple – thousands of cores • Do in parallel at a slower clock rate to save power • ILP is dead • Examine what is (rather than what can be) • Communication: off-chip hard, on-chip easy • Abstraction is a pure good • Programmers are all dumb and need to be protected • Thinking in parallel is hard
Performance Improvements • Computer engineers improve performance by reducing C/I (cycles per instruction) • I/P (instructions per program) is the domain of CS: writing software • S/C (seconds per cycle) is the domain of EE/VLSI: IC fabrication • CPI, or C/I, is improved by getting more instructions done in each cycle • This means doing work in parallel, distributed across the functional units of the IC
How Applications Influence Architecture • Complex applications • Nature of the computations • E.g., MPEG-2 encoder. • Memory bandwidth requirements of an encoder vary across the block diagram. [Figure: MPEG-2 encoder] • Standards-based design • Many high-volume markets are standards-driven: • wireless • multimedia • networking. • A standard defines the basic I/O requirements. • Real-time operation. • Low power/energy operation. • Standards committees often provide reference implementations (very single-threaded).
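The three ratios on this slide multiply together to give execution time: seconds per program = (I/P) × (C/I) × (S/C). The sketch below works one example; the function name, the use of clock frequency in place of S/C (S/C = 1 / frequency), and the numbers are illustrative assumptions.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Seconds per program = (I/P) * (C/I) * (S/C), with S/C = 1 / clock_hz."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return instructions * cpi / clock_hz


if __name__ == "__main__":
    # Hypothetical workload: one billion instructions on a 1 GHz clock.
    baseline = execution_time(1_000_000_000, cpi=2.0, clock_hz=1e9)
    # Same program and clock, but more work done per cycle (lower C/I).
    improved = execution_time(1_000_000_000, cpi=1.0, clock_hz=1e9)
    print(baseline, improved)
```

With I/P and S/C held fixed, halving C/I halves the run time, which is the lever the slide assigns to computer engineers.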
Platform based design • What is a Platform? A partial design: • for a particular type of system • includes embedded processor(s), may include embedded software • customizable to a customer’s requirements: • software • component changes • Why Platforms? Any given space has a limited number of good solutions to its basic problems. • A platform captures the good solutions to the important design challenges in that space. • A platform reuses architectures. • Standards encourage platform-based design.
Alternatives to platforms • General-purpose architectures. • May require much more area to accomplish the same task. • Often much less energy-efficient. • Reconfigurable systems. • Good for pieces of the system, but tough to compete with software for miscellaneous tasks. [Figure labels: Intel, Xilinx]
Platform vs. full-custom • Platform has many fewer degrees of freedom: • harder to differentiate • can analyze design characteristics. • Full-custom: • extremely long design cycles • may use less aggressive design styles if you can’t reuse some pieces. • Costs of platform-based design • Masks. • design of the platform + customization. • Design verification.
Platform-based Design (reduces cost) • Divide system design into 2 phases • design a platform for a class of applications • adapt the platform for a particular product in that application space • Homogeneous MP vs. Heterogeneous MP • Examples of platforms: • Data rate • Power and energy consumption • Buffering and memory management • Product design: software-driven (customization) • Usefulness of a platform depends largely on the quality and capabilities of its software development environment (SDE).
Architectures for Real-time Low-Power Systems • Performance and power efficiency • Benchmarks: high-performance data networking, voice recognition, video compression/decompression, and other applications [Figure: Power consumption trends for desktop processors, from Austin et al. [Aus04], 2004 IEEE Computer Society]
Architectures for Real-time Low-Power Systems (contd.) • Real-time performance • Homogeneous architecture • Heterogeneous architecture • E.g., shared-memory MP • Software methods to eliminate conflicts. • Application structure • Homogeneous vs. heterogeneous architecture
CAD Challenges in MPSoCs 1. Configurable processors and instruction set synthesis • CPU configuration (tools that generate an HDL description) • Coarse-grained and fine-grained instruction extensions • E.g., MIMOLA, LISA, Tensilica Xtensa. • Instruction set synthesis • 1% rule [Holmer and Despain]
CAD Challenges in MPSoCs (contd.) 2. Encoding • Signal encoding improves area and power consumption. • E.g., code compression [Wolfe and Chanin (Huffman)] and bus encoding. • Data compression (more complex) • E.g., Lempel-Ziv compression (L3 - MM) • Bus-invert coding (Stan and Burleson)
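Bus-invert coding, named on this slide, can be sketched as follows. This is a simplified version that assumes an 8-bit bus starting at all zeros and measures Hamming distance over the data lines only; the function names are illustrative and the code is not taken from Stan and Burleson's paper.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width: int = 8):
    """Return (bus_value, invert_bit) pairs for a sequence of data words.

    If sending a word as-is would toggle more than half of the data
    lines, its complement is sent instead and the invert line is raised.
    """
    mask = (1 << width) - 1
    prev = 0  # assumed initial bus state: all data lines low
    encoded = []
    for word in words:
        word &= mask
        if hamming(prev, word) > width // 2:
            bus, invert = ~word & mask, 1
        else:
            bus, invert = word, 0
        encoded.append((bus, invert))
        prev = bus
    return encoded


def bus_invert_decode(encoded, width: int = 8):
    """Recover the original words by undoing the inversion."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]


def count_transitions(values, start: int = 0) -> int:
    """Total number of line toggles when values are driven in sequence."""
    total, prev = 0, start
    for value in values:
        total += hamming(prev, value)
        prev = value
    return total


if __name__ == "__main__":
    data = [0x00, 0xFF, 0x00, 0xFF]
    enc = bus_invert_encode(data)
    print(count_transitions(data))                  # toggles without coding
    print(count_transitions([b for b, _ in enc]))   # data-line toggles with coding
    print(bus_invert_decode(enc) == data)
```

The trade-off is one extra wire (the invert line) in exchange for a cap on the number of data lines that toggle per transfer, which is what reduces switching power.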
CAD Challenges in MPSoCs (contd.) 3. Interconnect-driven design • Early SoCs were driven by design approach. • Interconnect choices are based on conventional bus concepts. • Bus? (A single set of wires shared among multiple devices) • Best-known SoC buses: ARM AMBA, IBM CoreConnect. • Growth in complexity of SoCs (communication bottleneck) • Network on chip (NoC) • Uses a hierarchical network with routers for data communication • Single shared bus vs. multiple communication channels • E.g., Sonics SiliconBackplane (TDMA-style interconnection network)
CAD Challenges in MPSoCs (contd.) 4. Memory system optimizations • Cache: everything (placement, replacement, allocation and write-back) is managed by hardware. • vs. Scratchpad: everything is managed by software. • Servers and general-purpose systems use caches. • Scratchpad provides predictability of hits/misses. • Important for ensuring real-time properties. • Complexity increases with applications. • Worst-case time is more tightly bounded.
CAD Challenges in MPSoCs (contd.) 5. Hardware/Software Codesign • Used to explore the design space of heterogeneous MPs • Cost estimation (area, power and performance) 6. SDEs • SDEs exist for single processors (commercial and open-source) • No comparable retargeting technology for multiprocessors • MPSoC development environments tend to be a collection of tools (with no substantial connection) • Difficult to determine the true state of the system.
Conclusion • MPSoCs are an important chapter in the history of multiprocessing • System designers like uniprocessors with sufficient computation power. • DSPs (audio processing) • Von Neumann architecture supports traditional software development tools • Computational power (Moore’s Law) vs. low-power, low-cost, real-time requirements.
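A TDMA-style interconnect gives each device a fixed, repeating set of time slots. The sketch below shows the basic idea only; the slot table, device names, and function name are hypothetical, and it does not model how Sonics SiliconBackplane actually arbitrates.

```python
def tdma_schedule(slot_table, requests, cycles):
    """Grant the shared interconnect according to a repeating slot table.

    slot_table: list of device names, one per time slot (repeats forever).
    requests:   dict mapping device name to number of pending transfers.
    Returns one entry per cycle: the device granted, or None if the
    slot's owner had nothing to send (pure TDMA leaves the slot idle).
    """
    pending = dict(requests)  # copy so the caller's dict is untouched
    grants = []
    for cycle in range(cycles):
        owner = slot_table[cycle % len(slot_table)]
        if pending.get(owner, 0) > 0:
            pending[owner] -= 1
            grants.append(owner)
        else:
            grants.append(None)
    return grants


if __name__ == "__main__":
    table = ["cpu", "dsp", "cpu", "dma"]  # cpu owns half of the slots
    print(tdma_schedule(table, {"cpu": 2, "dsp": 1, "dma": 0}, cycles=4))
```

Because slot ownership is fixed in advance, each device's worst-case wait is bounded by the table length, at the cost of idle slots when an owner has no traffic.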
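The predictability argument on this slide can be made concrete with a toy model. The sketch assumes a direct-mapped cache with 4 lines of 16 bytes and a scratchpad occupying a fixed address range; all sizes, latencies, and function names are illustrative assumptions.

```python
def direct_mapped_hits(addresses, num_lines: int = 4, line_size: int = 16):
    """Simulate a direct-mapped cache and return hit (True) / miss (False)
    for each access. The outcome depends on the access history."""
    tags = [None] * num_lines
    outcome = []
    for addr in addresses:
        block = addr // line_size
        index = block % num_lines
        tag = block // num_lines
        outcome.append(tags[index] == tag)
        tags[index] = tag  # hardware fills/replaces the line on a miss
    return outcome


def scratchpad_latency(addr: int, spm_base: int = 0, spm_size: int = 64,
                       spm_cycles: int = 1, dram_cycles: int = 20) -> int:
    """Latency depends only on whether software placed the data in the
    scratchpad address range, not on what was accessed earlier."""
    in_spm = spm_base <= addr < spm_base + spm_size
    return spm_cycles if in_spm else dram_cycles


if __name__ == "__main__":
    # Addresses 0 and 64 map to the same cache line and evict each other.
    print(direct_mapped_hits([0, 64, 0, 64]))
    # The same address hits once nothing conflicts with it.
    print(direct_mapped_hits([0, 4, 0]))
    print(scratchpad_latency(0), scratchpad_latency(64))
```

In the cache model the same address can hit or miss depending on prior accesses, whereas the scratchpad latency is a fixed function of the address, which is why worst-case timing is easier to bound.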