250 likes | 393 Views
PROCESSOR ARCHITECTURES FOR MULTIMEDIA APPLICATIONS. Oguz Karacuka. What Is Multimedia Processing?. Desktop: – 3D graphics (games) – Speech recognition (voice input) – Video/audio decoding (mpeg-mp3 playback) Servers: – Video/audio encoding (video servers, IP telephony)
E N D
PROCESSOR ARCHITECTURES FOR MULTIMEDIA APPLICATIONS Oguz Karacuka
What Is Multimedia Processing? • Desktop: – 3D graphics (games) – Speech recognition (voice input) – Video/audio decoding (mpeg-mp3 playback) • Servers: – Video/audio encoding (video servers, IP telephony) – Digital libraries and media mining (video servers) – Computer animation, 3D modeling & rendering (movies) • Embedded: – 3D graphics (game consoles) – Video/audio decoding&encoding (set top boxes, PVR...) – Image processing (digital cameras) – Signal processing (cellular phones)
Characteristics Of Multimedia Apps. • Requirement for real-time response – “Incorrect” result often preferred to slow result – Unpredictability can be bad (e.g. dynamic execution) • Narrow data-types – Typical width of data in memory: 8 to 16 bits – Typical width of data during computation: 16 to 32 bits – 64-bit data types rarely needed – Fixed-point arithmetic often replaces floating-point • Fine-grain (data) parallelism – Identical operation applied on streams of input data – Branches have high predictability – High instruction locality in small loops or kernels
Characteristics Of Multimedia Apps.cont. • Coarse-grain parallelism – Most apps organized as a pipeline of functions – Multiple threads of execution can be used • Memory requirements – High bandwidth requirements but can tolerate highlatency – High spatial locality (predictable pattern) but lowtemporal locality – Cache bypassing and prefetching can be crucial
Examples of Media Functions • Matrix transpose/multiply(3D graphics) • DCT/FFT(Video, audio, communications) • Motion estimation(Video encoding, deinterlacing) • Gamma correction(3D graphics) • Haar transform(Media mining) • Median filter(Image processing) • Separable convolution(Image processing) • Viterbi decode(Communications, speech) • Bit packing(Communications, cryptography) • …
Approaches to Media Processing VLIW with SIMD extensions (aka mediaprocessors, Adapted Programmable Architectures) Asics/FPGA’s (Dedicated/Function Specific Architectures) Multimedia Processing DSP’s (Flexible Programmable Architectures) Vector Processors General-purpose processors with SIMD extensions coldfire: Dedicated multimedia processors are typically custom designed architectures intended to perform specific multimedia functions. These functions usually include video and audio compression and decompression, and in this case these processors are referred to as video codecs. In addition to support for compression, some advanced multimedia processors provide support for 2D and 3D graphics applications. Designs of dedicated multimedia processors range from fully custom architectures, referred to as function specific architectures, with minimal programmability, to fully programmable architectures. Furthermore, programmable architectures can be classified into flexible programmable architectures, which provide moderate to high flexibility, and adapted programmable architectures, which provide an increased efficiency and less flexibility [1]. The dedicated multimedia processors use a variety of architectural schemes from multiple functional units and a RISC or DSP (digital signal processor) core processors to multiple processor schemes. Furthermore, the latest dedicated processors use single-instruction-multiple-data (SIMD) and very-long-instruction-word (VLIW) architectures, as well as some hybrid schemes. These architectures are presented in Section 3. General-purpose (GP) processors provide support for multimedia by including multimedia instructions into the instruction set. Instead of performing specific multimedia functions (such as compression and 2D/3D graphics), GP processors provide instructions specifically created to support generic operations in video processing. For example, these instructions include support for 8-bit data types (pixels), efficient data addressing and I/O instructions, and even instructions to support motion estimation. The latest processors, such as MMX (Intel), VIS (Sun) and MAX-2 (HP), incorporate some types of SIMD architectures, which perform the same operation in parallel on multiple data elements.
Function Specific Architectures • Limited (if any) programmability • DSP or RISC core processor for main control • Special hardware accelerators for the DCT, quantization, entropy encoding, motionestimation... • High efficiency and speed: typically better compared to programmable architectures. • The siliconarea optimization achieved by function-specific architectures allows lower production cost.
Programmable Dedicated Architectures • Increased flexibility: enables the processing of different tasks under software control. • Higher cost for design andmanufacturing: additional hardware for program control is required. • Require software development for the application: parallelizationstrategies have to be applied
Flexible Programmable Architectures TI’sMultimedia Video Processor (MVP) TMS320C80 coldfire: The MVP combines a RISC master processor and four DSP processors in a crossbar-based SIMD shared-memory architecture, as shown. The master processor can be used for control, floating-point operations, audio processing, or 3D graphics transformations. Each DSP performs all the typical operations of a generalpurpose DSP and can also perform bit-field and multiple-pixel operations. Each DSP has multiple functional elements (multiplier, ALU, local registers, a barrel shifter, address generators, and a program-control flow unit), all controlled by very long 64-bit instruction words (VLIW concept). The RISC processor, DSP processors, and the memory modules are fully interconnected through the global crossbar network that can be switched at an instruction clock rate of 20 ns. A 50 MHz MVP executes more than 2 GOPS.
Adapted Programmable Architectures C-Cube’s VRP – VRP2 coldfire: The VRP2 processor consists of a 32-bit RISC processor and two special functional units for variablelengthcoding and motion estimation, as shown in the block diagram in Figure 7. Speciallydesigned instructions in the RISC processor provide an efficient implementation of the DCTand other video-related operations.
VLIW Advanced Architectures • Reduce the number of cycles per instruction required forexecution of highly complex and parallel algorithms • Multiple independentfunctional units that are directly controlled by long instruction words. • Unefficient use of silicon: requires a giant routing network of buses and crossbar switches. • All functional units share a common large register file • Code compaction is typically done by a special compiler, which can predictbranch outcomes by applying an algorithm known as trace scheduling • Can be combined with SIMD arch.for increased parallelism e.g. : Mitsubishi D30V and Philips Semiconductor’s TriMedia coldfire: The VLIW architectural model is used in the latest dedicated multimedia processors. A typicalVLIW architecture uses long instruction words with more than hundreds of bits in length. Theidea behind VLIW concept is to reduce the number of cycles per instruction required forexecution of highly complex and parallel algorithms by the use of multiple independentfunctional units that are directly controlled by long instruction words. Thisconcept isillustrated in Figure 10, where multiple functional units operate in parallel under control of along instruction. All functional units share a common large register file [11]. Different fields ofthe long instruction word contain opcodes to activate different functional units. Programswritten for conventional 32-bit instruction word computers must be compacted to fit the VLIWinstructions. This code compaction is typically done by a special compiler, which can predictbranch outcomes by applying an algorithm known as trace scheduling.
Philips TriMedia CPU64 Arch. • 5 slot VLIW architecture with a 64-bit word size; • 27 functional units, offering a choice of operation types • in each slot in the instruction any operation can be guarded to provide conditionalexecution without branching; • All functional units provide vector-style subword parallelismon byte, half-word, or word entities. • instruction set and functional units optimized withrespect to media processing; • a single multi-ported register file with bypass network,allowing 1-cycle latency operations; • 32 kB, 8-way instruction cache16 kB, 8-way, quasi-dual ported, data cache; • a variable-length (compressed) instruction set design. coldfire: The TriMedia CPU64 architecture is a 5-slot VLIWmachine, in principle launching a long instruction everyclock cycle. It has a uniform 64-bit wordsize through allfunctional units, the register file, load/store units, on-chiphighway and external memory. The 5 operations in a singleinstruction can in principle each read 2 register argumentsand write one register result every clock cycle. In addition,each operation can be guarded with an optional (4th) registerfor conditional execution without branch penalty.All functional units provide vector-style subword parallelismon byte, half-word, or word entities. This SIMDstyleoperation in each of the 5 slots in parallel allows for avery high media processing throughput. There is almost nosupport for arithmetic on 64-bit integers, 64-bit (doubleprecision) floating point numbers, or 64-bit address ranges,since this was not considered important for the intendedapplication area.With the exception of floating point divide and squareroot, all functional units are pipelined, allowing a restartevery cycle. The latencies vary from 1 (for operations likeadd, compare, bitand, bitshift, byteshuffle) to 4 (word multiplywith round). A register-file bypass allows an operationresult to be used as an argument for a next operation withouthaving to wait for registerfile storage and retrieval.
Multiple-instruction, multiple-data (MIMD) architectures • offer 10 to 100 times morethroughput than existing VLIW and SIMD architectures • Multipleinstructions are executed in parallel on multiple data: a control unit for each datapath. • asynchronous nature increases the complexity of software development.
SIMD Extensions to General Purp. Processors WHY ? • Performance – A 1.2GHz Athlon can do MPEG-4 encoding at 6.4fps – One 384Kbps W-CDMA channel requires 6.9 GOPS • Power consumption – A 1.2GHz Athlon consumes ~60W – Power consumption increases with clock frequency andcomplexity • Cost – A 1.2GHz Athlon costs ~$62 to manufacture and has a listprice of ~$600 (module) (year 2000) – Cost increases with complexity coldfire: The real-time multimedia processing on PCs and workstations is still handled by dedicatedmultimedia processors. However, the advanced GP processors provide an efficient support forcertain multimedia applications. These processors can provide software-only solutions formany multimedia functions, which may significantly reduce the cost of the system. GP processors apply the SIMD approach, described in previous section, by sharing their existing integer or floating-point data paths with a SIMD coprocessor. Many microprocessor instruction sets include instructions for accelerating multimedia applications such as DVD playback, speech recognition and 3D graphics. All leading processorvendors have recently designed GP processors that support multimedia, as shown in Figure 1. The main differences among these processors are in the way they reconfigure the internal register file structure to accommodate SIMD operations, and the multimedia instructions they choose to add.
SIMD Extensions to General Purp. Processors • Motivation – Low media-processing performance of GPPs – Cost and lack of flexibility of specialized ASICs forgraphics/video – Underutilized datapaths and registers • Basic idea: sub-word parallelism – The mismatch between wide data pathsand the relatively short data types found in multimediaapplications – Treat a 64-bit register as a vector of 2 32-bit or 4 16-bitor 8 8-bit values (short vectors) – Partition 64-bit datapaths to handle multiple narrowoperations in parallel • Initial constraints – No additional architecture state (registers) – No additional exceptions – Minimum area overhead
Intel’s MMX Example • targeted to accelerate multimedia and communications applications, especially on the Internet. • MMX system extends the basic integer instructions: add, subtract, multiply, compare, and shift into SIMD versions. • Added DCT / IDCT kernels • MPEG-1 video decompression speed up with MMX is about 80%,while some other applications, such as image filtering speed up to 370%.
Summary of SIMD Instructions • Integer arithmetic – Addition and subtraction with saturation – Fixed-point rounding modes for multiply and shift – Sum of absolute differences – Multiply-add, multiplication with reduction – Min, max • Floating-point arithmetic – Packed floating-point operations – Square root, reciprocal – Exception masks • Data communication – Merge, insert, extract – Pack, unpack (width conversion)
Summary of SIMD Instructions • Comparisons – Integer and FP packed comparison – Compare absolute values – Element masks and bit vectors • Memory – No new load-store instructions for short vector –No support for strides or indexing –Short vectors handled with 64b load and storeinstructions – Pack, unpack, shift, rotate, shuffle to handle alignment ofnarrow data-types within a wider one – Prefetch instructions for utilizing temporal locality
SIMD Ext. for GPP Summary • Narrow vector extensions for GPPs – 64b or 128b registers as vectors of 32b, 16b, and 8belements • Based on sub-word parallelism and partitioneddatapaths • Instructions – Packed fixed- and floating-point, multiply-add, reductions – Pack, unpack, permutations • 2x to 4x performance improvement over basearchitecture – Limited by memory bandwidth • Difficult to use (no compilers) • Overhead of handling alignment and datawidth adjustment • Optimized shared libraries – Written in assembly, distributed by vendor – Need well defined API for data format and use
SUMMARY • Computationally intensive multimedia functions, such as MPEG encoding,HDTV codecs, 3D processing, and virtual reality, will still require dedicated processors • We should expect that new generations of GP processors would devote more and more transistors to multimedia by investing some of the available chip real estate to support multimedia.