CACM July 2012
Talk: Mark D. Hill, Wisconsin, at Cornell University, 10/2012
Executive Summary
• Today's chips provide shared memory w/ HW coherence as low-level support for OS & application SW
• As #cores per chip scales?
  • Some argue HW coherence must go due to growing overheads
  • We argue it stays, by managing those overheads
• Develop a scalable on-chip coherence proof-of-concept
  • Inclusive caches first
  • Exact tracking of sharers & replacements (key to analysis)
  • Larger systems need to use hierarchy (clusters)
  • Overheads similar to today's
• Compatibility of on-chip HW coherence is here to stay
• Let's spend programmer sanity on parallelism, not lost compatibility!
Outline
• Motivation & Coherence Background
• Scalability Challenges
  • Communication
  • Storage
  • Enforcing Inclusion
  • Latency
  • Energy
• Extension to Non-Inclusive Shared Caches
• Criticisms & Summary
Academics Criticize HW Coherence
• Choi et al. [DeNovo]:
  • "Directory … coherence … extremely complex & inefficient … Directory … incurring significant storage and invalidation traffic overhead."
• Kelm et al. [Cohesion]:
  • "A software-managed coherence protocol … avoids … directories and duplicate tags, & implementing & verifying … less traffic …"
Industry Eschews HW Coherence
• Intel 48-Core IA-32 Message-Passing Processor
  • "… SW protocols … to eliminate the communication & HW overhead"
• IBM Cell processor
  • "… the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory"
• BUT …
Source: Avinash Sodani, "Race to Exascale: Challenges and Opportunities," MICRO 2011.
Define "Coherence as Scalable"
• Define a coherent system as scalable when the cost of providing coherence grows (at most) slowly as core count increases
• Our focus
  • YES: coherence
  • NO: any scalable system also requires scalable HW (interconnects, memories) and SW (OS, middleware, apps)
• Method
  • Identify each overhead & show it can grow slowly
• Expect more cores
  • Moore's Law provides more transistors
  • Power-efficiency improvements (w/o Dennard Scaling)
  • Experts disagree on how many cores are possible
Caches & Coherence
• Cache: fast, hidden memory, used to reduce
  • Latency: average memory access time
  • Bandwidth: interconnect traffic
  • Energy: cache misses cost more energy
• Caches are hidden (from software)
  • Naturally for a single-core system
  • Via a coherence protocol for multicore
• Maintain the coherence invariant
  • For a given (memory) block at a given time, either
    • Modified (M): a single core can read & write
    • Shared (S): zero or more cores can read, but not write
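The invariant can be stated as a per-block check over the states held by the private caches. Below is a minimal sketch (not from the talk; the names are hypothetical), in Python:

    # Toy check of the single-writer / multiple-reader coherence invariant.
    MODIFIED, SHARED, INVALID = "M", "S", "I"

    def invariant_holds(per_core_states):
        """For one block: at most one core in M, and if a core is in M,
        no other core holds a readable (S) copy."""
        writers = sum(1 for s in per_core_states if s == MODIFIED)
        readers = sum(1 for s in per_core_states if s == SHARED)
        return writers <= 1 and not (writers == 1 and readers > 0)

    assert invariant_holds(["M", "I", "I", "I"])      # one writer, no readers
    assert invariant_holds(["I", "S", "S", "I"])      # readers only
    assert not invariant_holds(["M", "S", "I", "I"])  # writer + reader: violation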
Baseline Multicore Chip
• Intel Core i7 like
• C = 16 cores (not 8)
• Private L1/L2 caches
• Shared last-level cache (LLC)
• 64B blocks w/ ~8B tag
• HW coherence pervasive in general-purpose multicore chips: AMD, ARM, IBM, Intel, Sun (Oracle)
[Figure: Cores 1..C, each with a private cache, connected by an interconnection network to the shared cache. Block in a private cache: state (~2 bits), tag (~64 bits), block data (~512 bits). Block in the shared cache adds tracking bits (~C bits).]
Baseline Chip Coherence
• ~2B per 64+8B L2 block to track L1 copies
• Inclusive L2 (w/ recall messages on LLC evictions)
[Figure: same chip as before; each shared-cache block carries ~C tracking bits alongside its ~2-bit state, ~64-bit tag, and ~512-bit data.]
Coherence Example Setup
• Block A in no private caches: state Invalid (I)
• Block B in no private caches: state Invalid (I)
[Figure: Cores 0-3, each with a private cache, connected to a banked shared cache (Banks 0-3). Directory entries: A: {0000} I, B: {0000} I.]
Coherence Example 1/4
• Core 0: Write A
• Block A at Core 0 exclusive read-write: Modified (M)
[Figure: Core 0's private cache holds A: M. Shared cache: A: {1000} M (was {0000} I), B: {0000} I.]
Coherence Example 2/4
• Core 1: Read B; Core 2: Read B
• Block B at Cores 1+2 shared read-only: Shared (S)
[Figure: Cores 1 & 2 hold B: S; Core 0 still holds A: M. Shared cache: A: {1000} M, B: {0110} S (was {0000} I).]
Coherence Example 3/4
• Core 3: Write A
• Block A moved from Core 0 to Core 3 (still M)
[Figure: Core 3 holds A: M; Core 0's copy is invalidated; Cores 1 & 2 still hold B: S. Shared cache: A: {0001} M (was {1000} M), B: {0110} S.]
Coherence Example 4/4
• Core 1: Write B
• Block B moved from Cores 1+2 (S) to Core 1 (M)
[Figure: Core 1 holds B: M; Core 2's copy is invalidated. Shared cache: A: {0001} M, B: {1000} M (was {0110} S).]
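A rough sketch (not the talk's protocol; class and method names are hypothetical) that replays the four example steps against a toy directory keeping a state and a sharer set per block:

    # Toy directory: per block, a state ("I", "S", "M") and the set of cores with a copy.
    class ToyDirectory:
        def __init__(self):
            self.entries = {}  # block -> (state, set of sharer core ids)

        def read(self, core, block):
            state, sharers = self.entries.get(block, ("I", set()))
            # A read adds the requester; any single writer is downgraded to a reader.
            self.entries[block] = ("S", sharers | {core})

        def write(self, core, block):
            # A write invalidates all other copies, leaving one Modified copy.
            self.entries[block] = ("M", {core})

    d = ToyDirectory()
    d.write(0, "A")                  # 1/4: A Modified at Core 0   -> ("M", {0})
    d.read(1, "B"); d.read(2, "B")   # 2/4: B Shared at Cores 1+2  -> ("S", {1, 2})
    d.write(3, "A")                  # 3/4: A moves to Core 3 (M)  -> ("M", {3})
    d.write(1, "B")                  # 4/4: B moves to Core 1 (M)  -> ("M", {1})
    print(d.entries)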
Outline
• Motivation & Coherence Background
• Scalability Challenges
  • Communication: extra bookkeeping messages (longer section)
  • Storage: extra bookkeeping storage
  • Enforcing Inclusion: extra recall messages (subtle)
  • Latency: indirection on some requests
  • Energy: dynamic & static overhead
• Extension to Non-Inclusive Shared Caches (subtle)
• Criticisms & Summary
1. Communication: (a) No Sharing, Dirty
• W/o coherence: Request → Data → Data (writeback)
• W/ coherence: Request → Data → Data (writeback) → Ack
• Overhead = 8/(8+72+72) = 5% (independent of #cores!)
[Figure key: green = required, red = overhead; thin arrows = 8-byte control messages, thick arrows = 72-byte data messages.]
1. Communication: (b) No Sharing, Clean
• W/o coherence: Request → Data (no writeback traffic)
• W/ coherence: Request → Data → (Evict) → Ack
• Overhead = 16/(8+72) = 10-20% (independent of #cores!)
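A quick back-of-the-envelope check of the two overhead numbers above (a sketch, using the 8-byte control / 72-byte data message sizes from the figure key):

    CTRL, DATA = 8, 72   # message sizes in bytes, per the figure key

    # (a) No sharing, dirty: coherence adds one control Ack to Request + Data + Writeback.
    base_dirty = CTRL + DATA + DATA
    print(CTRL / base_dirty)        # ~0.05 -> ~5% overhead

    # (b) No sharing, clean: coherence adds an Evict notification and (optionally)
    # an Ack to Request + Data, hence the 10-20% range.
    base_clean = CTRL + DATA
    print(CTRL / base_clean)        # 0.10 -> 10% (Evict only)
    print(2 * CTRL / base_clean)    # 0.20 -> 20% (Evict + Ack)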
1. Communication: (c) Sharing, Read
• To memory: Request → Data
• To one other core: Request → Forward → Data → (Cleanup)
• Charge 1-2 control messages (independent of #cores!)
1. Communication: (d) Sharing, Write
• If Shared at C other cores:
  • Request → {Data, C Invalidations + C Acks} → (Cleanup)
• Needed since most directory protocols send invalidations to caches that have, & sometimes do not have, copies
• Not scalable
1. Communication: Extra Invalidations
• Core 1 Read: Request → Data
• Core C Write: Request → {Data, 2 Inv + 2 Acks} → (Cleanup)
• Charge the write for all necessary & unnecessary invalidations
• What if all invalidations are necessary? Charge the reads that get data!
[Figure: coarse sharer vector w/ one bit per pair of cores {1|2, 3|4, …, C-1|C}; Core 1's read sets the first bit, so Core C's write must invalidate both cores of that pair.]
1. Communication: No Extra Invalidations
• Core 1 Read: Request → Data + {Inv + Ack} (in the future)
• Core C Write: Request → Data → (Cleanup)
• If all invalidations are necessary, coherence adds bounded overhead to each miss, independent of #cores!
[Figure: exact sharer vector w/ one bit per core {1, 2, 3, 4, …, C-1, C}; only Core 1's bit is set, so Core C's write invalidates exactly one copy.]
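A toy accounting (my sketch, not the talk's model) of why coarse sharer tracking sends extra invalidations while exact tracking does not:

    def invalidations_sent(sharers, granularity):
        """Invalidations a write must send when the directory tracks sharers
        at a given granularity (1 = exact bit per core, 2 = bit per core pair, ...)."""
        marked_groups = {core // granularity for core in sharers}
        return len(marked_groups) * granularity  # every core in a marked group gets an Inv

    print(invalidations_sent({0}, granularity=1))  # exact: 1 Inv, all necessary
    print(invalidations_sent({0}, granularity=2))  # coarse pairs: 2 Invs, 1 unnecessary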
1. Communication Overhead
(1) Communication overhead bounded & scalable
  (a) Without sharing & dirty
  (b) Without sharing & clean
  (c) Shared read miss (charge future Inv + Ack)
  (d) Shared write miss (not charged for Invs + Acks)
• But depends on tracking exact sharers (next)
Total Communication
[Chart: total communication vs. read misses per write miss, comparing exact tracking (unbounded storage) with inexact tracking (32-bit coarse vector).]
• How to get the performance of "exact" tracking w/ reasonable storage?
Outline
• Motivation & Coherence Background
• Scalability Challenges
  • Communication: extra bookkeeping messages (longer section)
  • Storage: extra bookkeeping storage
  • Enforcing Inclusion: extra recall messages
  • Latency: indirection on some requests
  • Energy: dynamic & static overhead
• Extension to Non-Inclusive Shared Caches
• Criticisms & Summary
2. Storage Overhead (Small Chip)
• Track up to C = #readers (cores) per LLC block
• Small #cores: C-bit vector acceptable
  • e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%
[Figure: shared-cache block w/ ~C tracking bits alongside its ~2-bit state, ~64-bit tag, and ~512-bit data.]
2. Storage Overhead (Larger Chip)
• Use hierarchy!
[Figure: K clusters of K cores each; every core has a private cache, and each cluster has an intra-cluster interconnection network plus a cluster cache whose blocks track that cluster's cores. An inter-cluster interconnection network connects the clusters to a shared last-level cache whose blocks track clusters, not individual cores.]
2. Storage Overhead (Larger Chip)
• Medium-large #cores: use hierarchy!
  • Cluster: K1 cores with an L2 cluster cache
  • Chip: K2 clusters with an L3 global cache
  • Enables K1*K2 cores
• E.g., 16 16-core clusters
  • 256 cores (16*16)
  • 3% storage overhead!!
• More generally?
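A minimal sketch of the storage arithmetic behind the 3% claims (my numbers, assuming 64B blocks with ~8B tags and one tracking bit per tracked core or cluster):

    BLOCK_BITS = (64 + 8) * 8       # 64B data + ~8B tag per cached block = 576 bits

    def flat_overhead(num_cores):
        """Tracking bits per shared-cache block w/ one bit per core."""
        return num_cores / BLOCK_BITS

    def hierarchical_overhead(cores_per_cluster, num_clusters):
        """Each level tracks only its direct children, so overhead stays flat."""
        cluster_level = cores_per_cluster / BLOCK_BITS  # cluster cache tracks its cores
        global_level = num_clusters / BLOCK_BITS        # global cache tracks clusters
        return max(cluster_level, global_level)

    print(flat_overhead(16))              # ~0.03 -> ~3% for 16 cores
    print(flat_overhead(256))             # ~0.44 -> 44%: a flat vector does not scale
    print(hierarchical_overhead(16, 16))  # ~0.03 -> ~3% for 256 cores w/ hierarchy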
Storage Overhead for Scaling
(2) Hierarchy enables scalable storage
[Chart: storage overhead as core count scales, shown for 16 clusters of 16 cores each.]
Outline
• Motivation & Coherence Background
• Scalability Challenges
  • Communication: extra bookkeeping messages (longer section)
  • Storage: extra bookkeeping storage
  • Enforcing Inclusion: extra recall messages (subtle)
  • Latency: indirection on some requests
  • Energy: dynamic & static overhead
• Extension to Non-Inclusive Shared Caches (subtle)
• Criticisms & Summary
3. Enforcing Inclusion (Subtle)
• Inclusion: block in a private cache ⇒ block in the shared cache
  + Augment the shared cache to track private-cache sharers (as assumed)
  - Replace in the shared cache ⇒ replace in the private caches
• Make recalls impossible?
  • Requires too much shared-cache associativity
  • E.g., 16 cores w/ 4-way caches ⇒ 64-way assoc
• Use recall messages
  • Make recall messages necessary & rare
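As a sketch of why avoiding recalls by construction is impractical (my arithmetic, following the 16-core, 4-way example above): to guarantee every privately cached block also fits in its shared-cache set, the shared cache needs associativity of at least #cores times the private associativity.

    def assoc_to_avoid_recalls(num_cores, private_assoc):
        # Worst case: every core fills one private set with blocks that all map
        # to the same shared-cache set, and inclusion must keep all of them there.
        return num_cores * private_assoc

    print(assoc_to_avoid_recalls(16, 4))    # 64-way, as on the slide
    print(assoc_to_avoid_recalls(256, 4))   # 1024-way: clearly impractical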
Inclusion Recall Example
• A core issues Write C: shared-cache miss to new block C
• Needs to replace (victimize) block B in the shared cache
• Inclusion forces replacement of B in the private caches
[Figure: same 4-core chip as the earlier example; Core 0 holds A: M, Cores 1 & 2 hold B: S. Shared cache: A: {1000} M, B: {0110} S.]
Make All Recalls Necessary
• Exact state tracking (covered earlier)
  + L1/L2 replacement messages (even for clean blocks)
  = Every recall message finds a cached block
• Every recall message is necessary & occurs after a cache miss (bounded overhead)
Make Necessary Recalls Rare
• Assume misses to random sets [Hill & Smith 1989]
• Recalls naturally rare when shared cache size / Σ private cache sizes > 2
(3) Recalls made rare
[Chart: recall rate vs. the ratio of shared cache size to aggregate private cache size, with the Core i7 marked as a reference point.]
Outline
• Motivation & Coherence Background
• Scalability Challenges
  • Communication: extra bookkeeping messages (longer section)
  • Storage: extra bookkeeping storage
  • Enforcing Inclusion: extra recall messages
  • Latency: indirection on some requests
  • Energy: dynamic & static overhead
• Extension to Non-Inclusive Shared Caches
• Criticisms & Summary
4. Latency Overhead: Often None
1. None: private hit
2. "None": private miss + "direct" shared-cache hit
3. "None": private miss + shared-cache miss
• BUT …
4. Latency Overhead: Some
4. 1.5-2X: private miss + shared-cache hit with indirection(s)
• How bad?
4. Latency Overhead: Indirection
4. 1.5-2X: private miss + shared-cache hit with indirection(s)
  (interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect)
• Acceptable today
• Relative latency similar w/ more cores/hierarchy
• Vs. magically having data at the shared cache
(4) Latency overhead bounded & scalable
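A small sketch of the ratio above with illustrative latencies (my numbers; actual values depend on the chip):

    def indirection_ratio(net_hop, cache_access):
        direct = net_hop + cache_access + net_hop                              # request, hit, reply
        indirect = net_hop + cache_access + net_hop + cache_access + net_hop   # extra hop + extra lookup
        return indirect / direct

    print(indirection_ratio(net_hop=20, cache_access=10))  # 1.6, within the 1.5-2X range
    print(indirection_ratio(net_hop=10, cache_access=10))  # ~1.7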
5. Energy Overhead
• Dynamic: small
  • Extra message energy: traffic increase small/bounded
  • Extra state lookup: small relative to a cache-block lookup
  • …
• Static: also small
  • Extra state: state increase small/bounded
  • …
• Little effect on energy-intensive cores, cache data arrays, off-chip DRAM, secondary storage, …
(5) Energy overhead bounded & scalable
Outline
• Motivation & Coherence Background
• Scalability Challenges
  • Communication: extra bookkeeping messages (longer section)
  • Storage: extra bookkeeping storage
  • Enforcing Inclusion: extra recall messages (subtle)
  • Latency: indirection on some requests
  • Energy: dynamic & static overhead
• Extension to Non-Inclusive Shared Caches (subtle)
  • Apply the analysis to caches used by AMD
• Criticisms & Summary
Review Inclusive Shared Cache
• Inclusive shared cache:
  • Block in a private cache ⇒ block in the shared cache
  • Blocks must be cached redundantly
[Figure: shared-cache block w/ tracking bits (~1 bit per core), state (~2 bits), tag (~64 bits), and block data (~512 bits).]
Non-Inclusive Shared Cache
1. Non-inclusive shared cache
  • state (~2 bits) + tag (~64 bits) + block data (~512 bits)
  • Any size or associativity
  • Avoids redundant caching
  • Allows victim caching
2. Inclusive directory (probe filter)
  • tracking bits (~1 bit per core) + state (~2 bits) + tag (~64 bits)
  • Dataless
  • Ensures coherence
  • But duplicates tags
Non-Inclusive Shared Cache
• Non-inclusive shared cache: data block + tag (any configuration)
• Inclusive directory: tag (again) + state
• Inclusive directory == coherence state overhead
• WITH TWO LEVELS
  • Directory size proportional to the sum of private cache sizes
  • 64b/(48b+512b) * 2 (for rare recalls) = 22% * Σ L1 size
  • Coherence overhead higher than w/ inclusion
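Checking the 22% figure (a sketch using the slide's per-entry sizes; the exact bit counts are illustrative):

    L1_BLOCK_BITS = 48 + 512        # ~48b tag + 512b data per L1 block
    DIR_ENTRY_BITS = 64             # dataless directory entry: duplicated tag + state
    OVERSIZE = 2                    # directory made ~2x larger so recalls stay rare

    overhead = DIR_ENTRY_BITS / L1_BLOCK_BITS * OVERSIZE
    print(overhead)                 # ~0.23 -> roughly the 22% quoted on the slide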
Non-Inclusive Shared Caches WITH THREE LEVELS
• Each cluster has an L2 cache & a cluster directory
  • The cluster directory points to cores w/ the L1 block (as before)
  • (1) Size = 22% * Σ L1 sizes
• The chip has an L3 cache & a global directory
  • The global directory points to clusters w/ the block in
    • (2) a cluster directory: size 22% * Σ L1 sizes, plus
    • (3) a cluster L2 cache: size 22% * Σ L2 sizes
• Hierarchical overhead higher than w/ inclusion
Outline
• Motivation & Coherence Background
• Scalability Challenges
  • Communication: extra bookkeeping messages (longer section)
  • Storage: extra bookkeeping storage
  • Enforcing Inclusion: extra recall messages (subtle)
  • Latency: indirection on some requests
  • Energy: dynamic & static overhead
• Extension to Non-Inclusive Shared Caches (subtle)
• Criticisms & Summary
Some Criticisms
(1) Where are the workload-driven evaluations?
  • We focused on a robust analysis of first-order effects
(2) What about non-coherent approaches?
  • We showed that compatibility-preserving coherence scales
(3) What about protocol complexity?
  • We have such protocols today (& ideas for better ones)
(4) What about multi-socket systems?
  • Apply the non-inclusive approaches
(5) What about software scalability?
  • Hard SW work need not re-implement coherence
Executive Summary
• Today's chips provide shared memory w/ HW coherence as low-level support for OS & application SW
• As #cores per chip scales?
  • Some argue HW coherence must go due to growing overheads
  • We argue it stays, by managing those overheads
• Develop a scalable on-chip coherence proof-of-concept
  • Inclusive caches first
  • Exact tracking of sharers & replacements (key to analysis)
  • Larger systems need to use hierarchy (clusters)
  • Overheads similar to today's
• Compatibility of on-chip HW coherence is here to stay
• Let's spend programmer sanity on parallelism, not lost compatibility!