The Kill Rule for Multicore
Anant Agarwal, MIT and Tilera Corp.
Multicore is Moving Fast
• Corollary of Moore's Law: the number of cores will double every 18 months
• What must change to enable this growth?
Multicore Drivers Suggest Three Directions
• Diminishing returns
  • Smaller structures
• Power efficiency
  • Smaller structures
  • Slower clocks, voltage scaling
• Wire delay
  • Distributed structures
• Multicore programming
These drivers point at three directions:
1. How we size core resources
2. How we connect the cores
3. How programming will evolve
How We Size Core Resources
Per-core model: IPC = 1 / (1 + m × latency), where IPC is instructions per cycle, m is the cache miss rate, and latency is the miss penalty.
Diagram: three configurations and their chip-level throughput (chip IPC = cores × core IPC):
• 3 cores with small caches: core IPC = 1, core area = 1 → chip IPC = 3
• 4 cores with small caches: core IPC = 1, core area = 1 → chip IPC = 4
• 3 cores with big caches: core IPC = 1.2, core area = 1.3 → chip IPC = 3.6
Spending the same area on a fourth small-cache core beats spending it on bigger caches.
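A minimal sketch of the arithmetic behind this comparison (the IPC model and the core counts, areas, and IPCs are taken from the slide; the function names and the miss-rate example at the end are illustrative assumptions):

```c
#include <stdio.h>

/* Per-core IPC model used on the slide: IPC = 1 / (1 + m * latency),
   where m is the cache miss rate and latency is the miss penalty in cycles. */
static double core_ipc(double m, double latency_cycles) {
    return 1.0 / (1.0 + m * latency_cycles);
}

/* Chip-level throughput is simply cores * per-core IPC. */
static double chip_ipc(int cores, double ipc_per_core) {
    return cores * ipc_per_core;
}

int main(void) {
    /* The slide's comparison: a bigger cache buys 20% more per-core IPC
       but costs 30% more area, so only 3 such cores fit instead of 4. */
    printf("4 cores, small cache: chip IPC = %.1f\n", chip_ipc(4, 1.0)); /* 4.0 */
    printf("3 cores, big cache:   chip IPC = %.1f\n", chip_ipc(3, 1.2)); /* 3.6 */
    printf("3 cores, small cache: chip IPC = %.1f\n", chip_ipc(3, 1.0)); /* 3.0 */

    /* Example use of the per-core model (illustrative numbers, not from
       the slide): a 1% miss rate with a 50-cycle penalty gives IPC = 2/3. */
    printf("core_ipc(0.01, 50) = %.2f\n", core_ipc(0.01, 50));
    return 0;
}
```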
“KILL Rule” for Multicore: Kill If Less than Linear
• A resource in a core must be increased in area only if the core's performance improvement is at least proportional to the core's area increase
• Put another way: increase a resource's size only if every 1% increase in core area brings at least a 1% increase in core performance
• Leads to power-efficient multicore design
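Read as a predicate, the rule looks roughly like the sketch below (the function name is illustrative, not from any published tool):

```c
#include <stdbool.h>

/* KILL rule (Kill If Less than Linear): grow a core resource only if the
   relative core performance gain at least matches the relative area cost,
   i.e., at least 1% more IPC for every 1% more core area. */
static bool kill_rule_allows_growth(double area_before, double area_after,
                                    double ipc_before,  double ipc_after)
{
    double area_growth = (area_after - area_before) / area_before;
    double ipc_growth  = (ipc_after  - ipc_before)  / ipc_before;
    return ipc_growth >= area_growth;
}
```

The next slide applies exactly this test to a sweep of cache sizes.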
Kill Rule for Cache Size Using a Video Codec

Cache size | Core area | Core IPC | Cores per chip | Chip IPC
512B       | 1.00      | 0.04     | 100            | 4
2KB        | 1.03      | 0.17     | 97             | 17
4KB        | 1.07      | 0.25     | 93             | 23
8KB        | 1.15      | 0.29     | 87             | 25
16KB       | 1.31      | 0.31     | 76             | 24
32KB       | 1.63      | 0.32     | 61             | 19

Step by step, each doubling of the cache costs 2%, 4%, 7%, 14%, and 24% more core area in exchange for 325%, 47%, 16%, 7%, and 3% more core IPC. The KILL rule first fires at 16KB (7% IPC gain for 14% more area), and chip IPC indeed peaks at the 8KB design.
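A short sketch that replays the slide's sweep (the cache sizes, core areas, per-core IPCs, and core counts are the slide's numbers; the struct and loop are illustrative):

```c
#include <stdio.h>

struct design {
    const char *cache;
    double core_area;   /* relative to the 512B baseline */
    double core_ipc;
    int    cores;       /* cores that fit on the chip */
};

int main(void) {
    /* Data points from the video-codec sweep on the slide. */
    struct design d[] = {
        { "512B", 1.00, 0.04, 100 },
        { "2KB",  1.03, 0.17,  97 },
        { "4KB",  1.07, 0.25,  93 },
        { "8KB",  1.15, 0.29,  87 },
        { "16KB", 1.31, 0.31,  76 },
        { "32KB", 1.63, 0.32,  61 },
    };
    int n = sizeof d / sizeof d[0];

    for (int i = 0; i < n; i++) {
        double chip_ipc = d[i].cores * d[i].core_ipc;
        printf("%-5s cache: chip IPC = %4.1f", d[i].cache, chip_ipc);
        if (i > 0) {
            /* KILL rule check against the previous step: keep growing the
               cache only while the %IPC gain >= the %area gain. */
            double area_growth = d[i].core_area / d[i-1].core_area - 1.0;
            double ipc_growth  = d[i].core_ipc  / d[i-1].core_ipc  - 1.0;
            printf("  (%+.0f%% area, %+.0f%% IPC -> %s)",
                   100 * area_growth, 100 * ipc_growth,
                   ipc_growth >= area_growth ? "grow" : "kill");
        }
        printf("\n");
    }
    /* Chip IPC peaks at the 8KB point (about 25), exactly where the rule
       first fires: 16KB adds about 14% area for only about 7% more IPC. */
    return 0;
}
```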
Well Beyond Diminishing Returns
Die photo of the Madison Itanium 2 cache system: the L3 cache dominates the die. (Photo courtesy Intel Corp.)
Slower Clocks Suggest Even Smaller Caches
Insight: maintain constant instructions per cycle (IPC), with IPC = 1 / (1 + m × latency in cycles).
• 4 GHz core, 200-cycle miss penalty: IPC = 1 / (1 + 0.5% × 200) = 1/2
• 1 GHz core, 50-cycle miss penalty: IPC = 1 / (1 + 2.0% × 50) = 1/2
At 1 GHz the same miss latency costs 4× fewer cycles, so the miss rate can be 4× higher for the same IPC. Since miss rate falls roughly with the square root of cache size, this implies the cache can be 16× smaller.
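A small sketch checking this arithmetic (the miss rates and penalties are from the slide; the 1/√(cache size) miss-rate rule of thumb used in the last step is an assumption implied, not stated, on the slide):

```c
#include <stdio.h>

/* IPC model from the slide: IPC = 1 / (1 + miss_rate * miss_penalty_cycles). */
static double ipc(double miss_rate, double penalty_cycles) {
    return 1.0 / (1.0 + miss_rate * penalty_cycles);
}

int main(void) {
    /* 4 GHz core: 0.5% miss rate, 200-cycle miss penalty. */
    printf("4 GHz: IPC = %.2f\n", ipc(0.005, 200));   /* 0.50 */

    /* 1 GHz core: the same memory latency is only 50 cycles,
       so a 2.0% miss rate still gives the same IPC. */
    printf("1 GHz: IPC = %.2f\n", ipc(0.020, 50));    /* 0.50 */

    /* Assumption (not on the slide): miss rate scales roughly as
       1/sqrt(cache size). Tolerating a 4x higher miss rate then lets
       the cache shrink by 4^2 = 16x. */
    double miss_ratio = 0.020 / 0.005;
    printf("cache can be %.0fx smaller\n", miss_ratio * miss_ratio);
    return 0;
}
```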
Multicore Drivers Suggest Three Directions (roadmap revisited; drivers as before)
1. How we size core resources
  • The KILL rule suggests smaller caches for multicore
  • If the clock is slower by x, then for constant IPC the cache can be smaller by x²
  • The KILL rule applies to all multicore resources
    • Issue width: 2-way is probably ideal [Simplefit, TPDS 7/2001]
    • Cache sizes and number of memory-hierarchy levels
2. How we connect the cores
3. How programming will evolve
Interconnect Options: Packet Routing Through Switches
Diagram: a bus multicore (processor–cache pairs on a shared bus), a ring multicore, and a mesh multicore (each core with a processor p, cache c, and switch s).
Bisection Bandwidth is Important
(Same bus / ring / mesh diagram as above.)
Concept of Bisection Bandwidth
In a mesh, bisection bandwidth increases as we add more cores. (Same bus / ring / mesh diagram.)
Meshes are Power Efficient
Chart: % energy savings of mesh vs. bus across benchmarks, plotted against the number of processors.
Meshes Offer Simple Layout: Example, MIT's Raw Multicore
• 16 cores
• Demonstrated in 2002
• 0.18 micron, IBM SA27E standard cell
• 425 MHz
• 6.8 GOPS
www.cag.csail.mit.edu/raw
Tiled Multicore Satisfies One Additional Property: Fully Distributed, No Centralized Resources
Diagram: a bus multicore with a shared L2 cache versus a tiled multicore, where every tile pairs a processor and cache with its own switch.
A multicore is a single chip with multiple processing units and multiple, independent threads of control (program counters), i.e., MIMD.
Multicore Drivers Suggest Three Directions (roadmap revisited; drivers as before)
1. How we size core resources
2. How we connect the cores
  • Mesh-based tiled multicore
3. How programming will evolve
Multicore Programming Challenge
• “Multicore programming is hard.” Why?
  • It is new
  • It is misunderstood: some sequential programs are harder
  • Current tools are where VLSI design tools were in the mid-80s
  • Standards are needed (tools, ecosystems)
• This problem will be solved soon. Why?
  • Multicore is here to stay
  • Intel webinar: “Think parallel or perish”
  • Opportunity to create the API foundations
  • The incentives are there
Old Approaches Fall Short
• Pthreads
  • The Intel webinar likens it to the assembly language of parallel programming
  • Data races are hard to analyze
  • No encapsulation or modularity
  • But evolutionary, and OK in the interim
• DMA with external shared memory
  • DSP programmers favor DMA
  • Explicit copying from global shared memory to local store
  • Wastes pin bandwidth and energy
  • But evolutionary, simple, modular, and has a small core memory footprint
• MPI
  • Province of HPC users
  • Based on sending explicit messages between private memories
  • High overheads and a large core memory footprint
But there is a big new idea staring us in the face.
Inspiration from ASICs: Streaming
Diagram: a stream of data flowing between memories over a hardware FIFO.
• Streaming is energy efficient and fast
• The concept is familiar and well developed in hardware design and simulation languages
Streaming is Familiar – Like Sockets
Diagram: a sender process writes to Port1, data crosses an interconnect channel, and a receiver process reads from Port2.
• Basis of networking and internet software
• Familiar and popular
• Modular and scalable
• Conceptually simple
• Each process can use existing sequential code
Core-to-Core Data Transfer is Cheaper than Memory Access
• Energy
  • 32-bit network transfer over a 1mm channel: 3 pJ
  • 32KB cache read: 50 pJ
  • External memory access: 200 pJ
• Latency
  • Register to register (Raw): 5 cycles
  • Cache to cache: 50 cycles
  • DRAM access: 200 cycles
Data based on a 90nm process node.
Streaming Supports Many Models
• Pipeline
• Client-server
• Broadcast-reduce
Not great for blackboard-style shared state; but then, there is no one-size-fits-all.
Multicore Streaming Can Be Way Faster than Sockets
Diagram (illustrating MCA's CAPI standard): after a one-time connect(<send_proc, Port1>, <receive_proc, Port2>), the sender issues a stream of Put(Port1, Data) operations and the receiver a matching stream of Get(Port2, Data) operations over the interconnect channel.
• No fundamental overheads for
  • Unreliable communication
  • High-latency buffering
  • Hardware heterogeneity
  • OS heterogeneity
• Infrequent setup
• Common-case operations are fast and power efficient
• Low memory footprint
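A minimal sketch of the put/get pattern using a bounded FIFO between two threads (plain C with pthreads as a stand-in; the channel, Put, and Get names mirror the slide's pseudo-API and are not the actual CAPI calls):

```c
#include <pthread.h>
#include <stdio.h>

#define DEPTH 8   /* channel depth, like a small hardware FIFO */

/* A bounded FIFO standing in for the slide's interconnect channel. */
typedef struct {
    int buf[DEPTH];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} channel_t;

static void channel_init(channel_t *ch) {
    ch->head = ch->tail = ch->count = 0;
    pthread_mutex_init(&ch->lock, NULL);
    pthread_cond_init(&ch->not_empty, NULL);
    pthread_cond_init(&ch->not_full, NULL);
}

/* Put: block while the FIFO is full, then append one word. */
static void Put(channel_t *ch, int data) {
    pthread_mutex_lock(&ch->lock);
    while (ch->count == DEPTH) pthread_cond_wait(&ch->not_full, &ch->lock);
    ch->buf[ch->tail] = data;
    ch->tail = (ch->tail + 1) % DEPTH;
    ch->count++;
    pthread_cond_signal(&ch->not_empty);
    pthread_mutex_unlock(&ch->lock);
}

/* Get: block while the FIFO is empty, then remove one word. */
static int Get(channel_t *ch) {
    pthread_mutex_lock(&ch->lock);
    while (ch->count == 0) pthread_cond_wait(&ch->not_empty, &ch->lock);
    int data = ch->buf[ch->head];
    ch->head = (ch->head + 1) % DEPTH;
    ch->count--;
    pthread_cond_signal(&ch->not_full);
    pthread_mutex_unlock(&ch->lock);
    return data;
}

static channel_t chan;   /* the single channel connecting the two processes */

/* Sender process: streams a sequence of words into the channel. */
static void *sender(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++) Put(&chan, i);
    return NULL;
}

/* Receiver process: drains the channel using its own sequential code. */
static void *receiver(void *arg) {
    (void)arg;
    for (int i = 0; i < 5; i++) printf("got %d\n", Get(&chan));
    return NULL;
}

int main(void) {
    channel_init(&chan);                 /* the "connect" step: set up once */
    pthread_t s, r;
    pthread_create(&s, NULL, sender, NULL);
    pthread_create(&r, NULL, receiver, NULL);
    pthread_join(s, NULL);
    pthread_join(r, NULL);
    return 0;
}
```

On a tiled multicore the same Put/Get pattern maps onto the hardware FIFOs or cache-to-cache transfers shown on the next two slides, so the common-case operations skip the socket stack entirely.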
CAPI's Stream Implementation 1
Process A (e.g., FIR1) on Core 1 streams to Process B (e.g., FIR2) on Core 2 over I/O register-mapped hardware FIFOs in SoCs.
CAPI's Stream Implementation 2
Process A (e.g., FIR) on Core 1 streams to Process B (e.g., FIR) on Core 2 via on-chip cache-to-cache transfers over the on-chip interconnect in general multicores.
Conclusions
• Multicore is here to stay
  • Evolve the core and the interconnect
  • Create multicore programming standards – users are ready
• Multicore success requires
  • Reduction in core cache size
  • Adoption of a mesh-based on-chip interconnect
  • Use of a stream-based programming API
• Successful solutions will offer an evolutionary transition path