MPOC “Many Processors, One Chip”

MPOC “Many Processors, One Chip” presented by N. Özlem ÖZCAN

AGENDA • Introduction • Project Team • Design Aims • 4-stage Pipeline • Memory • Result

INTRODUCTION • Project at Hewlett-Packard's Palo Alto Research Lab • Ran from 1998 to 2001 • The name stands for “Many Processors, One Chip”

INTRODUCTION • A single-chip community of identical high-speed RISC processors surrounding a large common storage area • Each processor has its own clock, cache and program counter • To minimize power consumption, each processor is kept small and simple, yet it must still run very fast within that minimal power budget

PROJECT TEAM • Stuart Siu, circuit design • Stephen Richardson, logic design and project management • Gary Vondran, Project Lead and logic design • Paul Keltcher, processor simulation, application development • Shankar Venkataraman, OS and application development • Krishnan Venkitakrishnan, logic design and bus architecture • Joseph Ku, circuit design • Manohar Prabhu and Ayodele Embry, interns

DESIGN AIMS 1) novel funding for microprocessor research; 2) introducing multiprocessing to the embedded market; 3) trading design complexity for coarse grain parallelism; 4) a novel 4-stage microprocessor pipeline; 5) using co-resident on-chip DRAM to supply chip multiprocessor memory needs.

4-STAGE PIPELINE The starting point is the classic five-stage RISC pipeline, from which MPOC removes the M-stage: • F-stage to fetch the instruction from the instruction cache • D-stage to decode the instruction • E-stage to calculate arithmetic results and/or a memory address • M-stage during which the processor can access the data cache • W-stage during which the processor writes results back to its register file

4-STAGE PIPELINE Reasons for eliminating the M-stage: • Small first-level caches can be accessed in a single cycle • The simple base-plus-offset addressing scheme of the MIPS instruction set allows addresses to be calculated in the second half of the D-stage
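
The base-plus-offset scheme mentioned above is simple enough to show in a few lines. The sketch below is only an illustration of the standard MIPS I-type encoding (base register in bits 25..21, signed 16-bit offset in bits 15..0); the function names and the sample register value are assumptions made for this example and say nothing about how the MPOC hardware itself is built.

```python
LW_OPCODE = 0x23  # MIPS "load word" opcode


def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(instruction, registers):
    """Return the 32-bit base-plus-offset address of an I-type load or store."""
    base_reg = (instruction >> 21) & 0x1F  # rs field selects the base register
    offset = sign_extend_16(instruction)   # low 16 bits hold the signed offset
    return (registers[base_reg] + offset) & 0xFFFFFFFF


# lw $t0, -4($sp): opcode 0x23, rs = 29 ($sp), rt = 8 ($t0), imm = 0xFFFC (-4)
instruction = (LW_OPCODE << 26) | (29 << 21) | (8 << 16) | 0xFFFC
registers = [0] * 32
registers[29] = 0x7FFFF000  # hypothetical stack pointer value

print(hex(effective_address(instruction, registers)))  # prints 0x7fffeffc
```

Because the address is a single add of a register and a constant taken directly from the instruction word, it needs no separate addressing step, which is what makes it cheap enough to compute early in the pipeline.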

4-STAGE PIPELINE / LOAD & STORE

4-STAGE PIPELINE / LOAD & STORE • The hit or miss signal for a data cache access does not appear until late in the E stage of the pipeline.

4-STAGE PIPELINE / LOAD & STORE • LOAD => On a miss, the data has already been fetched from the cache, but the miss signal can simultaneously halt the pipeline and keep the incorrect data from being written to the register file until the correct data arrives.

4-STAGE PIPELINE / LOAD & STORE • STORE => The miss signal arrives in time to prevent incorrect data from going into the cache until the tags can be updated, dirty data can be written out to memory if necessary, and the pipeline can be restarted.

4-STAGE PIPELINE / LOAD & STORE • STORE THEN LOAD => Single bubble!

4-STAGE PIPELINE / LOAD & STORE • ALU THEN LOAD => Single bubble!

4-STAGE PIPELINE / LOAD & STORE • LOAD THEN A DEPENDENT INST => NO bubble!
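
The three cases above can be collected into a toy cycle counter. This is a minimal sketch, not a model of the real MPOC pipeline: it assumes every cache access hits, it looks only at the kinds of adjacent instructions (not at which registers they actually use), it stands in for "a dependent instruction" with an ALU operation, and it assumes any pair the slides do not mention needs no bubble. The names `BUBBLES`, `count_bubbles` and `total_cycles` are made up for this example.

```python
PIPELINE_STAGES = 4  # F, D, E, W

# Bubbles inserted between an instruction and the one right after it,
# copied from the three cases on the slides. Any pair not listed here
# is ASSUMED to need no bubble.
BUBBLES = {
    ("store", "load"): 1,  # store then load: single bubble
    ("alu", "load"): 1,    # ALU then load: single bubble
    ("load", "alu"): 0,    # load then a dependent instruction: no bubble
}


def count_bubbles(kinds):
    """Sum the bubbles over every pair of adjacent instructions."""
    return sum(BUBBLES.get(pair, 0) for pair in zip(kinds, kinds[1:]))


def total_cycles(kinds):
    """Cycles to run the sequence: pipeline fill, one per instruction, plus bubbles."""
    if not kinds:
        return 0
    return PIPELINE_STAGES + len(kinds) - 1 + count_bubbles(kinds)


program = ["alu", "load", "alu", "store", "load"]
print(count_bubbles(program))  # prints 2
print(total_cycles(program))   # prints 10
```

In the sample sequence the two bubbles come from the ALU-then-load pair at the start and the store-then-load pair at the end; the load-then-ALU pair in between costs nothing.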

4-STAGE PIPELINE / BRANCH • The instruction following a branch is always executed, regardless of the direction eventually taken by a branch. • This extra instruction allows the pipeline to calculate the target of the branch, so that it can speculatively fetch the target in the following cycle.

4-STAGE PIPELINE / BRANCH • NO penalty for a taken branch • 1-cycle penalty for a non-taken branch
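
The two penalties above translate directly into an average cost per branch. The sketch below is simple arithmetic on those two numbers; the function names and the sample taken fraction are assumptions chosen for illustration, not measurements from the MPOC project.

```python
TAKEN_PENALTY = 0      # cycles lost when the branch is taken
NOT_TAKEN_PENALTY = 1  # cycles lost when the branch is not taken


def expected_branch_penalty(taken_fraction):
    """Average cycles lost per branch, given the fraction of branches taken."""
    if not 0.0 <= taken_fraction <= 1.0:
        raise ValueError("taken_fraction must be between 0 and 1")
    return (taken_fraction * TAKEN_PENALTY
            + (1.0 - taken_fraction) * NOT_TAKEN_PENALTY)


def branch_stall_cycles(outcomes):
    """Total cycles lost for a list of branch outcomes (True means taken)."""
    return sum(TAKEN_PENALTY if taken else NOT_TAKEN_PENALTY for taken in outcomes)


print(expected_branch_penalty(0.75))                   # prints 0.25
print(branch_stall_cycles([True, True, False, True]))  # prints 1
```

Since the target is fetched speculatively, the cost falls entirely on the branches that fall through, so code whose branches are mostly taken (loops, for example) pays the least.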

4-STAGE PIPELINE / BRANCH

MEMORY • In MPOC’s original plan, 1 MB to 4 MB of DRAM was to be placed on the same silicon die as the four processors.

MEMORY Ways of managing local memory: • Cache organized as set of data lines • Unified physical address space that includes both local and remote memory

RESULT • Final status of the design:

REFERENCE • Stephen Richardson. MPOC: A Chip Multiprocessor for Embedded Systems, Internet Systems and Storage Laboratory, HP Laboratories Palo Alto
