MOSAIC: The Case for a Scalable Coherence Protocol for Complex On-Chip Cache Hierarchies in Many-Core Systems • Lucía G. Menezo, Valentín Puente, José Ángel Gregorio • University of Cantabria (Spain)
Outline • Motivation • Directory Schemas • In-cache • Sparse • MOSAIC Coherence Protocol • Examples • Evaluation Results • Conclusions
Motivation • Performance improvement: more processors per chip • Major challenge: the off-chip bandwidth wall • More cache is brought onto the chip • Complex on-chip cache hierarchies • Coherence protocol: fundamental role to play
Motivation • Which coherence protocol to use with a large number of cores? • Broadcast-based protocols: high energy requirements • Directory-based protocols: more storage needed for sharing information • MOSAIC: a new coherence protocol • Directory without inclusiveness • Token Coherence to guarantee correctness
Directory schemas: In-cache • Each block in the LLC includes tag, data and the sharer information • The LLC receives the requests, so it needs precise knowledge of the sharers • Inclusiveness is necessary: any block in the private levels must also be allocated in the LLC • Advantage: less complex coherence protocol • Disadvantage: every LLC block carries storage overhead
Directory schemas: In-cache [Figure, two builds: LLC with in-cache directory, interconnection network, processors (P) and private caches; the directory storage is flagged "Overhead!!!"]
Directory schemas: Sparse • Directory entries are separated from the data • Allocated on demand • Overhead proportional to the aggregate size of the private levels (not of the LLC) • Capacity and associativity have to be sufficient to keep the private-level cache tags
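The difference between the two overheads can be made concrete with a small back-of-the-envelope calculation. This is only a sketch: the 8MB LLC, 64-byte blocks, 36-bit tags and full bit-vector sharer encoding are illustrative assumptions, not figures taken from the slides.

```python
KB = 1024
MB = 1024 * KB

def in_cache_sharer_bits(llc_bytes, block_bytes, num_cores):
    """In-cache directory: a full bit-vector is attached to every LLC block."""
    llc_blocks = llc_bytes // block_bytes
    return llc_blocks * num_cores

def sparse_dir_bits(private_bytes_per_core, block_bytes, num_cores, tag_bits):
    """Sparse directory: one entry (tag + bit-vector) per block that the
    private levels can hold in aggregate."""
    private_blocks = num_cores * (private_bytes_per_core // block_bytes)
    return private_blocks * (tag_bits + num_cores)

# Hypothetical configuration (assumed values, see lead-in).
cores = 16
in_cache = in_cache_sharer_bits(8 * MB, 64, cores)
sparse = sparse_dir_bits(32 * KB, 64, cores, tag_bits=36)

print("in-cache sharer storage:", in_cache // 8 // KB, "KB")
print("sparse directory storage:", sparse // 8 // KB, "KB")
```

The in-cache figure grows with the LLC size, while the sparse figure grows only with the private levels, which is the point made on the slide.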
Directory schemas: Sparse [Figure: LLC with a separate sparse directory, interconnection network, processors (P) and private caches]
Directory schemas: Sparse • Duplicate-tag directory: holds all the tags of the private levels • Associativity = # cores * private cache associativity • # sets = # private cache sets • Example: 16 cores with 4-way 32KB L1: 64-way directory [Figure: grid of directory tags]
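The duplicate-tag geometry follows directly from the two formulas on the slide. The sketch below works through the 16-core example; the 64-byte block size is an assumption (the slide does not state it), so the resulting set count is illustrative only.

```python
def duplicate_tag_geometry(num_cores, private_ways, private_bytes, block_bytes):
    """Geometry of a duplicate-tag directory that mirrors every private cache.

    ways = # cores * private cache associativity
    sets = # private cache sets
    """
    private_sets = private_bytes // (private_ways * block_bytes)
    ways = num_cores * private_ways
    return {"ways": ways, "sets": private_sets, "entries": ways * private_sets}

# Slide example: 16 cores, 4-way 32KB L1. Block size of 64 bytes is assumed.
geometry = duplicate_tag_geometry(num_cores=16, private_ways=4,
                                  private_bytes=32 * 1024, block_bytes=64)
print(geometry)
```

With these numbers the directory needs 64 ways, matching the slide; the high associativity is what makes this organization expensive as the core count grows.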
Directory schemas: Sparse • Decrease associativity: now << # cores * private cache associativity • Increase number of sets • One tag may be in various private caches: each entry stores a tag plus its sharers • More than 1 tag per entry: conflicts • Inclusiveness needed: invalidate private data (recall messages) [Figure: directory sets with tag and sharers fields]
MOSAIC Protocol • In-cache or sparse: it does not matter • No inclusiveness • No invalidations of data in private caches • Reconstruction of sharing information on demand • Uses token counting to avoid extra traffic and guarantee correctness • Token Coherence protocol: • Initially each block has a fixed number of tokens (= # processors) • Read request: data and 1 token • Write request: data and all tokens
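The token rules above can be sketched in a few lines. This is a minimal model of token counting for a single block, not the MOSAIC implementation: the class name `TokenBlock`, the `"home"` holder standing in for the LLC or memory, and the method names are all hypothetical.

```python
class TokenBlock:
    """Token counting for one block: tokens are never created or destroyed."""

    def __init__(self, num_procs):
        self.total = num_procs
        # Hypothetical "home" holder (LLC/memory) starts with every token.
        self.tokens = {"home": num_procs}

    def can_read(self, holder):
        return self.tokens.get(holder, 0) >= 1

    def can_write(self, holder):
        return self.tokens.get(holder, 0) == self.total

    def transfer(self, src, dst, count):
        if count <= 0 or self.tokens.get(src, 0) < count:
            raise ValueError("source does not hold enough tokens")
        self.tokens[src] -= count
        self.tokens[dst] = self.tokens.get(dst, 0) + count
        # Invariant of token counting: the total is conserved.
        assert sum(self.tokens.values()) == self.total

    def read(self, requester, supplier="home"):
        """Read request: data and 1 token."""
        self.transfer(supplier, requester, 1)

    def write(self, requester):
        """Write request: data and all tokens, collected from every holder."""
        for holder in list(self.tokens):
            if holder != requester and self.tokens[holder] > 0:
                self.transfer(holder, requester, self.tokens[holder])


block = TokenBlock(num_procs=4)
block.read("P0")
block.read("P1")
block.write("P2")
print(block.tokens)
```

Because a writer must hold every token, no reader can hold one at the same time, which is how token counting guarantees correctness without relying on the directory being precise.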
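MOSAIC Conceptual Approach [Figure: processors P0, P1, P2 with private caches; each private entry shows State, Num. Tokens and Data (I / 0 / N/A, O / 2 / DATA, S / 1 / DATA); on-chip network; last-level cache split into Data_slice and Dir_slice (with Sharers field); memory controller; numbered arrows 1 to 5 mark the message flow]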
MOSAIC Key Facts • When data is not present in the LLC: broadcast for reconstruction • Private caches report the number of tokens they hold • Token counting avoids negative acknowledgements and timeouts • The reconstruction message piggybacks the type of request and the requestor • Key: the directory may replace entries silently, with no invalidations
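How a directory entry might be rebuilt from the replies to a reconstruction broadcast can be sketched as follows. This is an illustration under assumptions, not the protocol specification: the function name `reconstruct_entry`, the report format `(tokens_held, is_owner)` and the returned fields are hypothetical.

```python
def reconstruct_entry(reports, total_tokens):
    """Rebuild sharing information from private-cache replies.

    reports maps a cache id to (tokens_held, is_owner). Caches holding no
    tokens simply report 0, so no negative acknowledgement is needed.
    """
    held = sum(tokens for tokens, _ in reports.values())
    if held > total_tokens:
        raise ValueError("more tokens reported than exist for the block")
    sharers = sorted(c for c, (tokens, _) in reports.items() if tokens > 0)
    owners = [c for c, (tokens, is_owner) in reports.items()
              if is_owner and tokens > 0]
    return {
        "sharers": sharers,
        "owner": owners[0] if owners else None,
        # Tokens not held by any private cache must be at the LLC or memory.
        "tokens_elsewhere": total_tokens - held,
    }


# Hypothetical replies for a 4-processor system (4 tokens per block).
replies = {"P0": (0, False), "P1": (2, True), "P2": (1, False), "P3": (0, False)}
entry = reconstruct_entry(replies, total_tokens=4)
print(entry)
```

Since the entry can always be rebuilt this way, the directory is free to drop it silently, which is the key property stated on the slide.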
MOSAIC Read Request [Sequence diagram; participants: P0, P1, P2, P3, Dir, LLC. Recoverable labels, in extraction order: 3 tokens; 1 token; Read; Reconstruction; Invalid; State IS; Data + token; Info 1 token; State S; Sharers [P2], Owner: ?; Info 2 tokens, Owner; State O; Unblock (info 1 token); Sharers [P2, P1], Owner: P1; State C; State A; Sharers [P2, P1, P0], Owner: P1; Read; Forward GETS to Owner; Data + token; Unblock; Sharers [P2, P1, P0, P3], Owner: P1]
MOSAIC Write Request [Sequence diagram; participants: P0, P1, P2, P3, Dir, LLC. Recoverable labels, in extraction order: 3 tokens; 1 token; Write; Reconstruction; Invalid; State IS; Data + 3 tokens; 1 token; State S; State O; State C; Unblock (info all tokens); State A; Sharers [P0], Owner: P0; State IM; State M; Directory Eviction]
Evaluation methodology [Figure: tiled chip layouts built from cores, LLC slices and routers (R); labels cover Core 0 to Core 15 and Slice 0 to Slice 31]
Simulation stack and Workloads • GEMS: full-system evaluation • SLICC: Specification Language for Implementing Cache Coherence
MOSAIC Performance: Reducing associativity [Plot: normalized execution time; directory of 16K entries (8 bytes per entry), 128KB]
Number of misses [Plot: normalized number of misses; annotation: x2]
MOSAIC Performance: Reducing associativity and capacity [Plot: normalized execution time; directory sizes: 16K entries (8 bytes per entry), 128KB, and 2K entries, 16KB]
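The two directory sizes quoted on the performance slides are consistent with 8-byte entries, as a quick check shows. The sketch assumes K = 1024 and that the 2K-entry configuration also uses 8 bytes per entry, which the 16KB label suggests but the slide does not state explicitly.

```python
KB = 1024

def directory_bytes(entries, bytes_per_entry=8):
    """Total directory storage for a given number of entries."""
    return entries * bytes_per_entry

large = directory_bytes(16 * KB)   # 16K entries
small = directory_bytes(2 * KB)    # 2K entries

print(large // KB, "KB")
print(small // KB, "KB")
print("capacity reduction factor:", large // small)
```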
MOSAIC Latency [Plot; directory of 2K entries, 16KB]
MOSAIC Link Utilization [Plot: average network link utilization]
MOSAIC Scalability • 16 cores configuration [Plot: normalized link utilization]
Conclusions • MOSAIC Coherence Protocol • Low complexity and great scalability • Very low storage overhead • No noticeable energy cost • Alternative for future many-core cache-coherent CMPs • Combines the bandwidth scalability of a directory with the elegance of Token Coherence
Realistic Cache Configuration • L1: 4-way 32KB / L2: 8-way 256KB • Same experiment with BASE: 20% impact in some cases [Plot: normalized execution time; directory sizes: x2 full dir and 1/10 full dir]
MOSAIC Energy [Plot: normalized dynamic energy]