Introducing the MIPS32® 1004K™ Coherent Processing System
Industry’s First Embedded Multi-Threaded Multiprocessor IP Core
March 2010
MIPS32® 1004K™ Coherent Processing System Highlights
• A licensable IP block for coherent multi-core applications:
  • Extends multi-core system performance via use of multi-threaded cores
  • Extracts maximum processing performance, given inherent throughput bottleneck between CPU and main memory
  • Same OS, same programming model, negligible cost adder
  • H/W and/or optimized S/W coherency for I/O peripherals
• Configuration and scalability addressing a broad range of price/performance implementation points
  • Optimal product implementations
  • Broad use of standard IP
MIPS32 1004K CPS: Market Impact
• Address device convergence – more complex tasks running in parallel
• Applications – a sampling:
  • Entertainment Devices/Set Top Boxes
    • Sophisticated User Interfaces
    • Multimedia stream processing, audio processing without a DSP
    • File systems/Storage R/W, Secure content
    • Multiple network connectivity, client and server middleware
  • Residential Gateway
    • Higher data rates, increased security, data traffic prioritization
    • More integrated features (Single chip Router, NAS, MFP server, Media Hub)
  • MFP/OA Products
    • Fax/Print/Scan
    • SOHO/Enterprise class print server
    • Camera/Card reader/print
    • USB/Ethernet/WiFi/Bluetooth connectivity
• Enabling Convergence:
  • Advanced User Interfaces (UIs)
  • More complex operating systems for concurrent operation
    • Full-featured SMP versions of Linux, Win CE, RTOSs
  • Improved silicon design flexibility
    • Add CPUs to scale performance to application and product line requirements
MIPS® 32-Bit Processor Core Families
Broad range of synthesizable processors, optimized for low-power convergent consumer multimedia applications
• 1004K: Multi-threaded (34K), Multiprocessor (1-4 cores), Coherence Management Unit; 1.3+ GHz (40nm), 1.5 DMIPS/MHz/Core
• 74K: Superscalar, 15-stage pipeline; 1.1 GHz in 65nm (prod’n frequency), up to 2.7 GHz in 40nm
• 34K: Multi-threading
• 24KE: DSP extensions
• 24K: 8-stage pipeline, 900 MHz (65nm)
• M14Kc, M14K: microMIPS, MIPS32; reduced interrupt latency; AHB; advanced debug
• 4KE: Cache, MMU
• M4K: MCU, Low Cost
• 4KSd: Security
MIPS: Two Paths to Great Performance Use 74K™ core when: Use 1004K core when:
1004K Coherent Processing System (CPS)
[Block diagram: four 1004K cores, each with VPE 0, VPE 1, I-Cache, D-Cache and a Coherence Port, attach through Coherent OCP links to the Coherence Manager. Other CPS blocks: Global Interrupt Controller, CPC, Debug/Trace, and the I/O Coherence Unit (OCP). The Coherence Manager has a 64-bit OCP memory interface or a 256-bit L2 interface to the optional SOC-it® L2 Level 2 Cache Controller, whose OCP port (64-bit or 256-bit) leads to the System I/F and Main Memory. A Configuration Options block is also shown.]
1004K Low Power Optimizations
• Individual CPU clock gating (in GA version)
• Individual CPU power down via core power controller (CPC)
  • CPUs can move out of, and back into, coherence domain
  • CPUs can be powered down, and back up, as needed
• External inputs for dynamic voltage and frequency scaling (DVFS) of 1004K cluster
• CAD Low Power Flow support
  • CAD and Synopsys support
[Diagram: 1004K cluster (cores 1-4, CPC, CM) with Freqsys and Voltsys inputs, on a scale from highest performance, through core power down, to lowest power (cluster level control)]
Leading Performance/Power Efficiency Dual Core Implementations with common configurations
Broad Suite of Configurable Features
• Two products of the 1004K CPS family
  • 1004Kc™ = CPS using base integer cores
  • 1004Kf™ = CPS using integer cores plus FPUs
  • Homogeneous: all cores configured identically
• Multi-core Coherence Manager (CM)
  • Foundation block for system coherency
  • Configurable for 1 to 4 single or multi-threaded cores
  • IO Coherence Unit (IOCU) provides (optional) hardware support for IO peripherals performing DMA transfers
  • Expanded CM:L2 interface with fractional clock rates
• Global Interrupt Controller (GIC) (customer use optional)
• Support for L2 cache: SOC-it® L2 cache controller (available as separate product)
  • L2 Interface supports an optimized 256-bit interface
    • To 1004K CPS and on System Interface (new MR2 feature)
• System interface compatible with MIPS OCP-based cores
1004K Base Core Architecture
• Individual Processor Core Features
  • MIPS32 compliant: leverage existing S/W for 24K, 24KE, 34K
  • Virtual Processing Elements (VPEs) enable multi-threading option
• Coherence
  • Duplicate D$ tags for background cache snooping
• Many core configuration options:
  • Single or multi-threaded (1, 2 VPEs)
  • FPU (and CPU/FPU clock ratio)
  • Inclusion and sizing for TLB, caches, Scratchpad RAM
  • User defined instructions
[Diagram: VPE 0 and VPE 1 (optional) with QoS, sharing common hardware (Fetch, Decode, Execution Unit, Caches); Coherent Ports (OCP 3.0) carry Request/Response and Snoop traffic]
1004K Coherence Manager (CM) High-Level Architecture
• Serializes CPU messages
• Initiates intervention traffic
• Controls memory requests
• Routes Memory Response
• Routes non-coherent requests
[Block diagram of the 1004K Coherence Manager:
  • Request Unit (RQU): OCP slaves for CPU0-3 Request OCP and IOCUs
  • Intervention Unit (IVU): OCP masters for CPU0-3 Intervention OCP and IOCUs
  • Memory Interface Unit (MIU): Memory OCP
  • Response Unit (RSU): OCP slaves for CPU0-3 Response OCP and IOCUs]
HW I/O Coherent Multi-core System
• Extra CM request port similar to CPU port
• IOCU mates IO and CPU buses
  • Breaks up bursts and unaligned accesses into cacheline/dword transactions
• Per transaction attributes for max performance
  • Requests can be tagged to snoop L1+L2, L2 only, or neither
• Side benefit: compared to a SW IOC architecture, there is one less level between CPUs and memory
[Diagram: inside the 1004K CPS, CPUs (R/W from CPU) and the I/O Coherence Unit (coherent R/W from I/O) feed the Coherence Manager, which connects through the optional L2 to main memory; I/O devices reach the I/O Coherence Unit through the I/O interconnect]
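The slide says the IOCU breaks bursts and unaligned accesses into cacheline/dword transactions. The following is only a software sketch of the line-boundary part of that idea, not a description of the hardware: `split_burst` is a hypothetical helper, and the 32-byte line size is an assumption chosen for illustration.

```python
def split_burst(addr: int, length: int, line_size: int = 32) -> list[tuple[int, int]]:
    """Split [addr, addr + length) into pieces that never cross a cache-line boundary.

    line_size=32 is an assumed value for illustration, not taken from the slides.
    """
    if addr < 0 or length < 0 or line_size <= 0:
        raise ValueError("addr and length must be non-negative, line_size positive")
    pieces = []
    end = addr + length
    while addr < end:
        # First address of the cache line that follows the one containing `addr`.
        next_line = (addr // line_size + 1) * line_size
        chunk_end = min(end, next_line)
        pieces.append((addr, chunk_end - addr))
        addr = chunk_end
    return pieces


if __name__ == "__main__":
    # An unaligned 80-byte burst becomes one partial line plus two full lines.
    for start, size in split_burst(0x1010, 0x50):
        print(f"addr=0x{start:04x} size={size}")
```

Each resulting piece touches exactly one line, so a per-transaction snoop attribute (L1+L2, L2 only, or neither) could be applied to it independently.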
Global Interrupt Controller (GIC) and ITC
• CPU access to GIC through relocatable memory mapped address range
  • Can connect to Coherence Manager or elsewhere in system
• Interrupt capabilities support:
  • System level and inter-processor interrupts
  • Routing of interrupts to particular core or VPE
  • Configurable # of system interrupts (up to 256)
• ITC (Inter-Thread Communication)
  • Now supports communication between VPEs in different cores
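To make "routing of interrupts to a particular core or VPE" concrete, here is a minimal software model of a routing table. It is a sketch under stated assumptions: `InterruptRouter` and its methods are hypothetical names, and nothing here reflects the GIC's actual register layout or programming interface. Only the limits (up to 256 interrupts, 1 to 4 cores, 1 or 2 VPEs per core) come from the slides.

```python
class InterruptRouter:
    """Hypothetical routing table; NOT the GIC's real register interface."""

    def __init__(self, num_interrupts: int = 256, num_cores: int = 4, vpes_per_core: int = 2):
        if not 1 <= num_interrupts <= 256:
            raise ValueError("the slides state up to 256 system interrupts")
        if not 1 <= num_cores <= 4 or vpes_per_core not in (1, 2):
            raise ValueError("the slides state 1-4 cores with 1 or 2 VPEs each")
        self.num_interrupts = num_interrupts
        self.num_cores = num_cores
        self.vpes_per_core = vpes_per_core
        self.routes: dict[int, tuple[int, int]] = {}

    def route(self, irq: int, core: int, vpe: int) -> None:
        """Steer interrupt `irq` to the given (core, vpe) pair."""
        if not 0 <= irq < self.num_interrupts:
            raise ValueError(f"irq {irq} out of range")
        if not 0 <= core < self.num_cores or not 0 <= vpe < self.vpes_per_core:
            raise ValueError(f"no such target: core {core}, VPE {vpe}")
        self.routes[irq] = (core, vpe)

    def target(self, irq: int):
        """Return the (core, vpe) that receives `irq`, or None if unrouted."""
        return self.routes.get(irq)


if __name__ == "__main__":
    router = InterruptRouter()
    router.route(5, core=2, vpe=1)
    print(router.target(5), router.target(6))
```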
Multi-Threading Theory – How it Works
• MT ASE: introduced in the 34K™ core, now also in the 1004K™ core
  • Adds multi-threaded enhancements in demanding multi-processor applications
  • Each CPU can switch to another thread while waiting
• Utilizes the idle time as the processor waits:
  • To increase system performance by delivering more throughput
  • To reduce system cost by absorbing application(s) typically handled by another DSP or ASIC in the customer's design
• Guarantees real-time with QoS
  • Ability to allocate processing bandwidth to real-time tasks
[Figure: Process 1 and Process 2 timelines compared on a multi-threaded processor and on a single-threaded processor with a process swap; labeled durations: 8 units and 17 units]
As processor speed increases, more time is wasted waiting for memory. Typically, half (or more) of the processor time is wasted.
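The idle-time argument can be illustrated with a toy timing model. This is a sketch, not a model of the 1004K pipeline: each thread is an assumed list of (compute cycles, memory-stall cycles) bursts, the single-threaded case runs threads back to back with no swap cost, and the multi-threaded case issues from the first ready thread each cycle. All function names and the workload are hypothetical.

```python
Burst = tuple[int, int]  # (compute_cycles, stall_cycles); an assumed workload shape


def single_threaded_time(threads: list[list[Burst]]) -> int:
    """Run threads back to back; every memory stall leaves the pipeline idle."""
    return sum(compute + stall for bursts in threads for compute, stall in bursts)


def multi_threaded_time(threads: list[list[Burst]]) -> int:
    """Each cycle, issue from the first thread that is not waiting on memory."""
    for bursts in threads:
        for compute, stall in bursts:
            if compute < 1 or stall < 0:
                raise ValueError("need compute >= 1 and stall >= 0 in every burst")
    queues = [list(bursts) for bursts in threads]
    left = [q[0][0] if q else 0 for q in queues]  # compute cycles left in current burst
    ready_at = [0] * len(queues)                  # cycle at which each thread may issue again
    cycle = 0
    while any(queues):
        for i, q in enumerate(queues):
            if q and ready_at[i] <= cycle:
                left[i] -= 1
                if left[i] == 0:
                    # Burst finished: the thread now waits on memory while others run.
                    _, stall = q.pop(0)
                    ready_at[i] = cycle + 1 + stall
                    if q:
                        left[i] = q[0][0]
                break
        cycle += 1
    # Completion time includes the final outstanding memory stall.
    return max(ready_at, default=0)


if __name__ == "__main__":
    workload = [[(2, 2), (2, 2)], [(2, 2), (2, 2)]]
    print("single-threaded:", single_threaded_time(workload))
    print("multi-threaded: ", multi_threaded_time(workload))
```

With this workload the multi-threaded model finishes sooner because one thread's compute overlaps the other's stalls; the exact ratio depends entirely on the assumed burst lengths.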
Multi-Threaded Multiprocessing Makes Sense
• Coherent MP Overhead
  • Overhead exists at system level in a coherent multiprocessor (MP) system
  • The performance benefits of additional cores in a coherent system far outweigh the overhead
• Multi-threading (MT)
  • Optimizing each CPU’s pipeline utilization improves an MP system’s efficiency
  • MT can reduce the need for additional CPUs in an MP system
• MT and MP = Complementary technologies
  • Both help to solve the same problem: improving processor efficiency with a cost-effective memory system
  • Same programming model, running under SMP OS
Multi-Tasking - Scheduling into one Linux CPU
[Diagram: eight software processes feed the Linux Kernel Scheduler, which dispatches to one Linux CPU (one pipeline). Example: 24K or 74K core. Note: “ready” processes are marked in green.]
Multi-Tasking - Scheduling into multiple Linux CPUs
[Diagram: eight software processes feed the SMP Linux Kernel Scheduler, which dispatches to three Linux CPUs, each with its own pipeline. Example: 1004K system, 3 cores, 1 VPE/Core. Note: “ready” processes are marked in green.]
Multi-Tasking - Scheduling into Multiple Virtual Linux CPUs
[Diagram: eight software processes feed the SMVP Linux Kernel Scheduler, which dispatches to eight “Linux CPUs” sharing four pipelines. Example: 1004K system, 4 cores, 2 VPEs/Core. Note: “ready” processes are marked in green.]
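The three scheduling slides differ only in how many "Linux CPUs" the kernel sees: one per VPE, so cores × VPEs per core. The sketch below enumerates those logical CPUs and hands ready processes to them. It is illustrative only: `logical_cpus` and `dispatch` are hypothetical helpers, and the first-come assignment is far simpler than a real SMP Linux scheduler.

```python
def logical_cpus(cores: int, vpes_per_core: int) -> list[tuple[int, int]]:
    """List the (core, vpe) pairs that an SMP OS would see as separate CPUs."""
    if not 1 <= cores <= 4:
        raise ValueError("the slides describe 1 to 4 cores")
    if vpes_per_core not in (1, 2):
        raise ValueError("the slides describe 1 or 2 VPEs per core")
    return [(core, vpe) for core in range(cores) for vpe in range(vpes_per_core)]


def dispatch(ready: list[str], cpus: list[tuple[int, int]]):
    """Give each logical CPU one ready process; the rest keep waiting."""
    running = dict(zip(cpus, ready))
    waiting = list(ready[len(cpus):])
    return running, waiting


if __name__ == "__main__":
    ready = [f"proc{i}" for i in range(5)]
    for cores, vpes in [(1, 1), (3, 1), (4, 2)]:
        cpus = logical_cpus(cores, vpes)
        running, waiting = dispatch(ready, cpus)
        print(f"{cores} core(s) x {vpes} VPE(s): {len(cpus)} Linux CPUs, "
              f"{len(running)} running, {len(waiting)} waiting")
```

The point of the comparison is that the scheduler code path is the same in all three cases; only the CPU count changes.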
Multi-Threading + Multiprocessing Advantage
• MT cost/benefit in 2 and 4 core CMP implementations:
  • Performance: ~25% increase over ST core on Coremark; ~20% increase over ST core on MultiBench
  • Area: 7-8% increase, depending on configuration
  • Benefit: > 1:1 performance/cost ratio
• Extra Performance - Multi-threading in a CMP
  • In TSMC 40GS LVt process: +2300 Coremark (14400 vs 12100)
• MT benefits are clear in CMP, even without tuning
  • SMP Linux managed all software task distribution
  • MT leverages same programming model and OS as MP
  • From S/W perspective the extra performance came for “free”
  • EEMBC MultiBench code was not specifically tuned for a multi-threaded pipeline
  • Customers who tune their application software for MT have seen more dramatic gains in many cases
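The performance/cost claim can be checked with simple arithmetic on the figures quoted on the slide. The sketch below uses only those numbers (14400 vs 12100 Coremark, 7-8% area); `mt_gain` is a hypothetical helper. Note that this particular data point works out to roughly 19%, below the "~25%" headline figure; the slide does not say which configuration each number refers to.

```python
def mt_gain(mt_score: float, st_score: float) -> tuple[float, float]:
    """Return (absolute gain, percentage gain) of the MT score over the ST score."""
    delta = mt_score - st_score
    return delta, 100.0 * delta / st_score


# Figures quoted on the slide for the TSMC 40GS LVt implementation.
delta, pct = mt_gain(14400, 12100)

if __name__ == "__main__":
    print(f"gain: {delta:.0f} Coremark ({pct:.1f}%)")
    for area_pct in (7.0, 8.0):
        # Ratio of performance gain to area cost; above 1 means gain exceeds cost.
        print(f"area +{area_pct:.0f}% -> performance/cost ratio {pct / area_pct:.2f}:1")
```

Even at the upper end of the quoted area range, the ratio stays above 1:1, which is consistent with the slide's "> 1:1 performance/cost ratio" bullet.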
Latest Debug Capabilities
• On-chip or off-chip PDtrace buffer support
• Support:
  • Core access to on-chip trace buffer
  • PC trace disable
  • Trace perfcounters
  • Filtered data trace mode
• Relocatable Debug vector
  • In DRAM, not probe, to improve performance
• High-speed data transfer with probe
  • Buffered and bi-directional
Summary
• 1004K Core Family
  • Successfully GA released to first customers CY08Q2
  • First embedded multi-threaded multiprocessor licensable IP
  • Extending performance efficiency beyond traditional multi-core alternatives
• Maximum flexibility in configuring SoC processing requirements:
  • # of cores, multi-threading, FPUs, TLB/cache/RAM sizes, custom instructions…
[Diagram: MIPS32® 1004K™ Coherent Processing System, Licensee Optimization to System Requirements]
Thank You!
At the core of the user experience®
MIPS, MIPS32, MIPS64, MIPS-Based, MIPS-Verified, MIPS Technologies logo are trademarks of MIPS Technologies, Inc. and registered in the U.S. Patent and Trademark Office. MIPS, MIPS32, MIPS64, MIPS-Based, MIPS Logo, MIPS Technologies Logo, CorExtend, Pro Series, M4K, 4KE, 4KEc, 24K, 24KE, 34K, 74K, 1004K, MIPS Navigator, and FS2 are trademarks or registered trademarks of MIPS Technologies, Inc. in the United States and other countries.