A Flexible Multi-Core Platform For Multi-Standard Video Applications
MPSoC 2009, Savannah, Georgia, USA
Soo-Ik Chae, Center for SoC Design Technology, Seoul National University
Content • Motivation • Proposed multi-core platform architecture • RISC cluster • Hardware operating system kernel • Computation coprocessor architecture • Communication architecture with two separated networks • Design flow for application mapping • Experimental result • H.264/AVC 720p high profile decoder implementations • Future work
High-performance Video Systems — four CONFLICTING requirements must all be satisfied:
• Huge computation load: 60 GOPS to decode 1080p at 30 fps → dedicated H/W blocks (for high-end applications)
• Multiple and new standards: MPEG-2/4, H.264, DivX, VC-1, etc. → software on RISC, DSP, or SIMD processors
• Embedded in mobile devices (PMPs, smart phones, etc.): area and energy efficiency are critical
• Large data transfers and memories: at least 96 MB for 1080p decoders → application-specific optimized communication and memory architectures
⇒ A flexible high-performance platform is needed.
Proposed Multi-core Platform Architecture
• An array of RISC clusters with coprocessors, connected through two separated networks: control and data
• Each RISC cluster consists of up to 4 cores, a shared I$ and D$, a HOSK, and coprocessors
A Multithreading RISC Cluster
• Multithreading: dynamic thread allocation + pre-emptive multithreading (priority- or round-robin based); scales in the number of threads and cores
• Scheduling: H/W-based task-queue management + {priority + RR}-based task scheduling; no system services run on the individual cores
• Context switching: fast context switching in 4–17 cycles; context memory is on-chip or off-chip memory based
• Load balancing: thread migration without compulsory cache misses; no cache fragmentation
• Communication and synchronization:
• Coherent shared memory: no cache-coherency problem; larger SRAMs can be used
• Message passing (channel access): channel access with a single coprocessor instruction; H/W-based mutex/semaphore; thread suspend or wake-up without software intervention
• Implementation: area (complexity) is the main cost; a shared multiplier unit reduces it
• The number of cores in a cluster is limited due to cache sharing.
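The {priority + RR} policy above can be sketched in software. This is a toy model of the hardware scheduler's decision rule, not the actual RTL; the class and method names are illustrative.

```python
from collections import deque

class TaskScheduler:
    """Toy model of {priority + round-robin} task scheduling:
    pick the highest-priority non-empty ready queue, then rotate
    round-robin among the threads at that priority level."""

    def __init__(self, num_priorities=4):
        # one FIFO ready-queue per priority level (0 = highest)
        self.ready = [deque() for _ in range(num_priorities)]

    def enqueue(self, thread_id, priority):
        self.ready[priority].append(thread_id)

    def pick_next(self):
        for queue in self.ready:          # highest priority first
            if queue:
                thread = queue.popleft()  # round-robin within the level
                queue.append(thread)      # rotate to the back
                return thread
        return None                       # no runnable thread

s = TaskScheduler()
s.enqueue("entropy", priority=0)
s.enqueue("deblock", priority=1)
print(s.pick_next())  # the priority-0 thread always wins while runnable
```

In the platform this decision is made in hardware by the HOSK, so no core ever executes scheduler code.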
Hardware Operating System Kernel (HOSK) — task scheduling and semaphore control
• Main controller: receives service requests and controls the other blocks
• Context manager: pre-fetches or saves contexts in the background (context memory in SDRAM or SRAM), in the order R15, R14, R13, …
• Thread manager: schedules tasks and controls semaphores
• Context transfer latency: 32-bit bus: 17 cycles; 64-bit bus: 9 cycles; 544-bit bus: 4 cycles
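The bus-width figures are consistent with a 544-bit context (17 registers of 32 bits, e.g. R0–R15 plus a status word — my assumption, not stated on the slide) streamed over the bus one beat at a time:

```python
from math import ceil

CONTEXT_BITS = 17 * 32  # assumed: 16 GPRs (R0-R15) + 1 status word = 544 bits

def transfer_beats(bus_width_bits):
    """Bus beats needed to move one full thread context."""
    return ceil(CONTEXT_BITS / bus_width_bits)

for width in (32, 64, 544):
    print(width, transfer_beats(width))  # 17, 9, and 1 beats respectively
```

The beat counts reproduce the 17- and 9-cycle figures directly; the 544-bit bus needs only one beat, so its reported 4 cycles are presumably dominated by fixed control/handshake overhead.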
Computation Coprocessors
• Implemented for the computation-intensive parts of the video algorithms that cannot run on the RISC cores
• General coprocessor interface: command queues for issuing non-blocking coprocessor commands
• A pool of software threads on the RISC cores, and a pool of hardware threads (T0 … Tn) in the coprocessor
• The coprocessor task manager selects an available hardware thread for each outstanding coprocessor command
• The local memory is accessed by both the RISC cores and the computation coprocessors
• Cluster organization: Core 0 (control) and Cores 1–3 (data), with an arbiter, a command-queue manager, and the local memory
Communication Network Architecture
• Among RISC clusters, two separated communication networks:
• Control network: smaller data sizes and synchronization information; based on conventional message passing; employs point-to-point hardware FIFOs; also provides a new path to transfer data
• Data network: larger data sizes; based on remote DMA operations in a bus-based style; employs memory (local or global) and hardware FIFOs; handles the high-rate data transfers of stream-based applications
Control Network: point-to-point FIFO based
• Fully programmable connectivity through FIFO groups
• Two-level distributed identification for FIFOs: each control transaction is initiated by a control core with a clusterID and a fifoID
• A control core can issue a command to the communication coprocessor in a single cycle for a control transaction
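The two-level identification can be modeled as routing first by clusterID, then by fifoID within that cluster's FIFO group; this is a behavioral sketch with illustrative names, not the hardware interface.

```python
class ControlNetwork:
    """Toy model of two-level FIFO identification: a control
    message is routed by clusterID to a FIFO group, then by
    fifoID to one hardware FIFO inside that group."""

    def __init__(self, num_clusters, fifos_per_group):
        self.groups = [[[] for _ in range(fifos_per_group)]
                       for _ in range(num_clusters)]

    def send(self, cluster_id, fifo_id, word):
        # models the single-cycle command to the communication coprocessor
        self.groups[cluster_id][fifo_id].append(word)

    def recv(self, cluster_id, fifo_id):
        return self.groups[cluster_id][fifo_id].pop(0)

net = ControlNetwork(num_clusters=6, fifos_per_group=8)
net.send(cluster_id=2, fifo_id=5, word=0xCAFE)
assert net.recv(2, 5) == 0xCAFE  # FIFO ordering preserved per (cluster, fifo) pair
```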
Data Communication Network
• Local data between two RISC clusters is exchanged through a shared local memory
• The platform provides nC2 (= n(n−1)/2) local data links among n clusters
• Streaming data is stored in either a local memory or a global memory, depending on the size of the data
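The nC2 figure is simply the number of unordered cluster pairs, since each pair shares one local memory link:

```python
from math import comb

def local_data_links(n_clusters):
    """One shared-local-memory link per unordered pair of clusters."""
    return comb(n_clusters, 2)  # nC2 = n(n-1)/2

print(local_data_links(4))  # 6 links for a 4-cluster mapping (CIF decoder)
print(local_data_links(6))  # 15 links for a 6-cluster mapping (720p decoder)
```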
Global Data Communication with a DMAC
(Figure: two global memories — streaming data and I/D-cache data — behind memory controllers, reached via Global Data Networks 1 and 2 and a DMA controller with a Multimedia Address Translator (MAT) and a Data Recombination Unit (DRU).)
• The two global data networks, for streaming data and for I/D-cache data, can be either unified or separated, depending on the configuration of the memory controllers
• A small buffer is used between the DMA controller and each RISC cluster for DMA operations
• A centralized DMA controller performs address translation (MAT), DMA request-queue management, and data arrangement (DRU), so that the data cores are free from tasks related to data transfers
Design Flow for Application Mapping
Starting with an application model and a platform model with constraints (video specification; area and power; operating frequency; number of clusters):
1. Application profiling: multithreaded TLM modeling & function profiling
2. Cluster partitioning: function partitioning & clustering
3. Communication mapping: configurable network; SystemC simulation in TLM
4. HW/SW thread partitioning & mapping: code generation (for RISC clusters); RTL coding or generation (for coprocessors)
5. Performance estimation: core count and cache sizing for each cluster; sizing of local memories
6. Verification: FPGA prototyping
Partitioning into Clusters
• According to the profiling results for the reference software, the application is first partitioned into grouped functions
• Each grouped function is mapped onto a RISC cluster
• Assumptions: RISC clusters with 4 cores @ 200 MHz; utilization rate = 0.7; upper MIPS bound for a 4-core cluster = 560 MIPS
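The 560-MIPS bound follows directly from the stated assumptions, with one instruction per cycle assumed:

```python
def cluster_mips_budget(cores, clock_mhz, utilization):
    """Upper MIPS bound for one RISC cluster (1 instruction/cycle assumed)."""
    return cores * clock_mhz * utilization

budget = cluster_mips_budget(cores=4, clock_mhz=200, utilization=0.7)
print(budget)  # 560.0 MIPS, matching the bound used for cluster partitioning
```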
Cluster Partitioning
Example: an H.264/AVC CIF decoder is mapped onto 4 RISC clusters.
(Figure: decoder pipeline — entropy decoding → inverse quantization → reconstruction, with intra 16x16 prediction from neighbor reference pixels, inter prediction from multiple reference frames, and a deblocking filter before frame output — annotated with per-stage loads of 45, 113, 231, 259, 356, and 1087 MIPS and the cluster boundaries.)
Cluster Partitioning
Example: an H.264/AVC 720p decoder is mapped onto 6 RISC clusters. (Figure: cluster partitioning diagram.)
Communication Mapping
1. Identify the control and data flows among the clusters
2. Map each control flow onto a specific FIFO in a FIFO group
3. Map each streaming data flow onto a local data network or the global data network according to its bandwidth requirement
4. Map the data flows for the I/D caches onto the global memory
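Steps 3 and 4 can be sketched as a simple threshold rule. The bandwidth limit and the flow names here are my assumptions for illustration; only the decision structure comes from the slide.

```python
def map_data_flow(flow, local_bw_limit_mbps=100.0):
    """Toy version of mapping steps 3-4: streaming flows under an
    (assumed) local-link bandwidth limit use a local data link;
    heavier streams and all I/D-cache traffic go through the
    global memory side."""
    kind, bw = flow["kind"], flow["mbps"]
    if kind == "cache":
        return "global memory"           # step 4
    if bw <= local_bw_limit_mbps:
        return "local data network"      # step 3, small streams
    return "global data network"         # step 3, large streams

flows = [
    {"name": "residuals",        "kind": "stream", "mbps": 21.6},
    {"name": "reference frames", "kind": "stream", "mbps": 415.63},
    {"name": "code fetch",       "kind": "cache",  "mbps": 196.0},
]
for f in flows:
    print(f["name"], "->", map_data_flow(f))
```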
Example 1: Control Network Mapping for an H.264/AVC CIF high-profile decoder (transactions and sizes)
Example 1: Data Network Mapping for an H.264/AVC CIF high-profile decoder (transactions and sizes)
Example 2: Control Network Mapping for an H.264/AVC 720p high-profile decoder (transactions and sizes)
Example 2: Data Network Mapping for an H.264/AVC 720p high-profile decoder (transactions and sizes)
HW/SW Thread Partitioning & Mapping
For each RISC cluster (cores and coprocessors):
1. Profile the required MIPS of each thread from TLM modeling
2. Select the number of RISC cores and of HW threads in the coprocessor
3. Allocate the threads to the cores or to the coprocessor in the cluster
4. Go back to step 2 if the result is not good enough
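One simple stand-in for the iteration in steps 2–4 is a greedy rule: move the heaviest hardware-capable threads to the coprocessor until the remaining software load fits the cluster's MIPS budget. The thread names and loads below are hypothetical; the real flow explores several partitions.

```python
def partition_threads(thread_mips, hw_capable, core_budget=560.0):
    """Greedy sketch of HW/SW thread partitioning: offload the
    heaviest HW-capable threads until the SW load fits the
    cluster's MIPS budget (560 MIPS for a 4-core cluster)."""
    sw = dict(thread_mips)   # threads left in software
    hw = []                  # threads moved to the coprocessor
    for name in sorted(hw_capable, key=lambda n: -thread_mips[n]):
        if sum(sw.values()) <= core_budget:
            break
        hw.append(name)
        del sw[name]
    return sw, hw

threads = {"parse": 120.0, "mc_luma": 600.0, "mc_chroma": 250.0, "copy": 80.0}
sw, hw = partition_threads(threads, hw_capable=["mc_luma", "mc_chroma"])
print(hw, sum(sw.values()))  # offloading mc_luma leaves 450 MIPS in software
```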
Example: Thread Partitioning & Mapping for Intra Prediction (1)
• ~480 MIPS is required for intra prediction in the 720p decoder; the upper bound for a 4-core cluster is 560 MIPS
• Mapping all threads to SW: thread-level parallelism is limited by the dependencies among the threads, which limits core utilization
Example: Thread Partitioning & Mapping for Intra Prediction (2)
• Dependency and intra-prediction order within a macroblock: 4x4 luma intra prediction processes the sixteen 4x4 luma blocks in the order
  0 1 4 5
  2 3 6 7
  8 9 12 13
  10 11 14 15
  with each block depending on its previously decoded neighbors
• Core utilization is limited because of this limited parallelism, so the number of cores is reduced from 4 to 3
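The parallelism limit can be made concrete with a wavefront schedule. This simplifies the real H.264 intra dependencies (which also involve top-right neighbors) to left and top only, so it is an upper-bound sketch rather than the exact schedule.

```python
# Wavefront schedule for the sixteen 4x4 luma blocks, assuming each
# block waits only for its left and top neighbours (a simplification
# of the real H.264 intra-prediction dependencies).
def wavefront_schedule(rows=4, cols=4):
    wave = {}
    for r in range(rows):
        for c in range(cols):
            deps = [wave[(r, c - 1)]] if c else []
            if r:
                deps.append(wave[(r - 1, c)])
            wave[(r, c)] = 1 + max(deps, default=0)   # earliest possible wave
    return wave

wave = wavefront_schedule()
n_waves = max(wave.values())  # critical-path length in waves
widths = [sum(1 for w in wave.values() if w == k) for k in range(1, n_waves + 1)]
print(n_waves, widths)  # 7 waves, widths [1, 2, 3, 4, 3, 2, 1]
```

Sixteen blocks over seven waves gives an average parallelism of about 2.3 blocks per wave, which is consistent with reducing the cluster from 4 cores to 3.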
Example: Thread Partitioning & Mapping for Inter Prediction
• Inter-prediction case in the 720p decoder; the upper bound for a 4-core cluster is 560 MIPS
• One of several possible SW/HW partitions is selected
A Software-Centric Solution For H.264/AVC 720p High-Profile Decoder
Complexity of the 720p High-profile Decoder
• Logic gate count and memory usage
• Synthesis conditions: 0.18-um CMOS technology; 200 MHz for the RISC clusters and 100 MHz for the others
Communication Network
(Figure: network diagram annotated with per-link bandwidths of 21.6, 310.2, 196, and 415.63 MB/sec.)
Core Utilization (@200 MHz)
(thread count, context switches per MB): ED cluster (3, 2); ITQ cluster (4, 7); INTRA cluster (3, 0); INTER cluster (4, 0); RECON cluster (1, 0); DF cluster (1, 0)
Design Space Exploration
• Seven mappings of an H.264 720p decoder, ranging from software-centric to hardware-centric, all with the same networks for control and data communication
Future Works
• More codec implementations: an H.264/AVC 720p high-profile encoder; a VC-1 720p advanced-profile decoder
• Flexible coprocessors: coarse-grained reconfigurable architecture (CGRA)