Using Prediction to Accelerate Coherence Protocols Shubhendu S. Mukherjee and Mark D. Hill, University of Wisconsin-Madison
The topic once again • Using Prediction to Accelerate Coherence Protocols • Discuss the concept of using prediction in a coherence protocol • See how it can be used to accelerate the protocol
Organization • Introduction • Background • Directory Protocol • Two-level Branch Predictor • Cosmos • Basic Structure • Obtaining Predictions • Implementation Issues • Integration with a Coherence Protocol • How and When to act on the predictions • Handling Mis-predictions • Performance • Evaluation • Benchmarks • Results • Summary and Conclusions
Introduction • Large shared-memory multiprocessors suffer from long latencies for misses to remotely cached blocks • Proposals to lessen these latencies • Multithreading • Non-blocking caches • Application-specific coherence protocols • Predict future sharing patterns, overlap execution with coherence work • Drawbacks • More complex programming model • Require sophisticated compilers • Existing predictors are directed at specific sharing patterns known a priori • Need for a general predictor, hence this paper!
Introduction • So what would a general predictor look like? It would sit beside a standard directory or cache module, monitor coherence activity, and take appropriate actions • See the design of Cosmos, a coherence message predictor • Evaluate Cosmos on some scientific applications • All's well that ends well? Summarize and conclude
Background: 6810 strikes back! • Structure of a Directory Protocol • Distributed-memory multiprocessor • Hardware-based cache coherence • Directory and memory distributed among the processors • The physical address gives the location of the memory (its home node) • Nodes connected to each other via a scalable interconnect • Messages routed from sender to receiver • The directory keeps track of sharing states, which are?
Directory Structure (diagram): each node contains a processor with caches, memory, I/O, and a directory; the nodes are connected by an interconnection network.
Example: Coherence Protocol Actions (diagram): Processor 1 issues Wr A; block A is also cached at Processor 2. What messages must flow, and in what order?
Example: Coherence Protocol Actions • 1. P1 sends a Wr request to Dir 1 • 2. Dir 1 sends an Inval request to Dir 2 • 3. Dir 2 invalidates the cached copy at P2 • 4. Dir 2 sends an Inval response to Dir 1 • 5. Dir 1 sends a Wr response to P1
Point to ponder: multiple long-latency operations, performed sequentially
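To make the cost concrete, here is a minimal Python sketch of that sequential message chain; the message labels follow the example above, while the per-hop latency is an illustrative assumption, not a figure from the paper:

    # Sequential invalidation chain for a write to a remotely cached block.
    # HOP_LATENCY is a hypothetical value chosen only for illustration.
    HOP_LATENCY = 100  # cycles per step (assumed)

    def write_to_remotely_cached_block():
        chain = [
            ("P1",    "Dir 1", "Wr request"),
            ("Dir 1", "Dir 2", "Inval request"),
            ("Dir 2", "P2",    "invalidate cached copy"),
            ("Dir 2", "Dir 1", "Inval response"),
            ("Dir 1", "P1",    "Wr response"),
        ]
        # Each message waits for the previous one, so the latencies add up.
        return chain, len(chain) * HOP_LATENCY

    if __name__ == "__main__":
        chain, latency = write_to_remotely_cached_block()
        for src, dst, msg in chain:
            print(f"{src:>6} -> {dst:<6}: {msg}")
        print(f"critical-path latency ~ {latency} cycles, all sequential")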
Background: 6810 strikes back! • Branch predictor • Need: execute probable instructions without waiting, thus improving performance • Two-level predictor • Basically a local predictor • Use the PC of the branch to index into a Branch History Table (local histories) • Use this BHT entry to index into a per-branch Pattern History Table to obtain a branch prediction
Two-Level Predictor (diagram) • Use 6 bits of the branch PC to index into a branch history table of 64 entries, each holding a 14-bit history for a single branch (e.g., 10110111011001) • The 14-bit history indexes into the next level, a Pattern History Table of 16K entries of 2-bit saturating counters, which yields the prediction
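Below is a minimal Python sketch of this two-level scheme. The field widths (6 PC bits, 14-bit histories, 16K 2-bit counters) follow the slide; for simplicity the sketch uses a single shared pattern table rather than one per branch, and the rest of the structure is an illustrative assumption:

    class TwoLevelPredictor:
        def __init__(self, pc_bits=6, history_bits=14):
            self.bht = [0] * (1 << pc_bits)        # level 1: per-branch 14-bit histories
            self.pht = [2] * (1 << history_bits)   # level 2: 2-bit counters, start weakly taken
            self.pc_mask = (1 << pc_bits) - 1
            self.hist_mask = (1 << history_bits) - 1

        def predict(self, pc):
            history = self.bht[pc & self.pc_mask]  # PC bits select the branch's history
            return self.pht[history] >= 2          # the history selects the counter

        def update(self, pc, taken):
            idx = pc & self.pc_mask
            history = self.bht[idx]
            counter = self.pht[history]
            self.pht[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
            self.bht[idx] = ((history << 1) | int(taken)) & self.hist_mask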
What in the Universe is Cosmos? • Cosmos is a Coherence Message Predictor • Predicts the sender and type of the next incoming coherence message for a particular cache block • Structure: similar to a two-level branch predictor
Structure of Cosmos (diagram): the Message History Table (MHT) holds one Message History Register (MHR) per cache-block address; each MHR is a sequence of <sender, type> tuples, and the number of tuples per MHR constitutes its depth; each MHR indexes into a per-block-address Pattern History Table (PHT).
Structure of Cosmos • The first-level table is called the Message History Table (MHT) • An MHT consists of a series of Message History Registers (MHRs), one per cache-block address • An MHR contains a sequence of <sender, type> tuples; the number of tuples is the MHR's depth • The second-level table is called the Pattern History Table (PHT) • There is one PHT for each MHR • The PHT is indexed by the contents of the MHR • Each PHT contains prediction tuples corresponding to MHR entries
An Example: Producer - Consumer

    repeat
        ...
        if (producer)
            private_counter++
            shared_counter = private_counter
            barrier
        else if (consumer)
            barrier
            private_counter = shared_counter
        else
            barrier
        endif
        ...
    until done
An Example: Producer - Consumer (diagram): Processor 1 is the producer and Processor 2 is the consumer; each node has its own caches, memory, I/O, and directory, connected by the interconnection network.
An Example: Producer - Consumer • Messages seen by the Producer Cache (from the directory): 1. Get Wr response 2. Invalidate Wr request
An Example: Producer - Consumer • Messages seen by the Consumer Cache (from the directory): 1. Get Rd response 2. Invalidate Rd request
An Example: Producer - Consumer • Messages seen by the Directory: 1. Get Wr request from the producer 2. Invalidate Rd response from the consumer 3. Get Rd request from the consumer 4. Invalidate Wr response from the producer
An Example: Producer - Consumer • Sharing Pattern Signature: predictable message patterns • Producer: send Get Wr request to directory; receive Get Wr response from directory; receive Invalidate Wr request from directory; send Invalidate Wr response to directory • Consumer: send Get Rd request to directory; receive Get Rd response from directory; receive Invalidate Rd request from directory; send Invalidate Rd response to directory
Back to Cosmos (diagram) • The directory receives a get Rd request from the consumer (P1: producer, P2: consumer) • In the Message History Table, the MHR for the global address of shared_counter now holds <P2, get Rd request> • In the Pattern History Table for shared_counter, the entry indexed by <P2, get Rd request> predicts <P1, Inval Wr response> as the next incoming message
Back to Cosmos • Obtaining predictions • Index into the MHR table (the MHT) with the address of the cache block • Use the MHR entry to index into the corresponding PHT • Return the prediction (if one exists) from the PHT; the prediction is of the form <sender, message type> • Updating Cosmos • Index into the MHR table with the address of the cache block • Use the MHR entry to index into the corresponding PHT • Write the new <sender, message type> tuple as the prediction for the index corresponding to the MHR entry • Insert the <sender, message type> tuple into the MHR for the cache block
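A minimal Python sketch of these lookup and update steps (the MHT/MHR/PHT names follow the slides; the dictionary-based layout and the example trace are illustrative assumptions, not the paper's implementation):

    from collections import defaultdict, deque

    class Cosmos:
        # One MHR per cache-block address (the MHT), and one PHT per block,
        # indexed by the MHR contents.
        def __init__(self, depth=1):
            self.mht = defaultdict(lambda: deque(maxlen=depth))  # block addr -> MHR
            self.pht = defaultdict(dict)                         # block addr -> {MHR tuple: prediction}

        def predict(self, block_addr):
            # Index the MHT with the block address, then use the MHR contents
            # to index the per-block PHT; returns <sender, type> or None.
            mhr = tuple(self.mht[block_addr])
            return self.pht[block_addr].get(mhr)

        def update(self, block_addr, sender, msg_type):
            # Write the new tuple as the prediction for the current MHR index,
            # then insert the tuple into the MHR.
            mhr = tuple(self.mht[block_addr])
            self.pht[block_addr][mhr] = (sender, msg_type)
            self.mht[block_addr].append((sender, msg_type))

    # Usage on the directory's producer-consumer signature from the example:
    cosmos = Cosmos(depth=1)
    trace = [("P1", "Get Wr request"), ("P2", "Inval Rd response"),
             ("P2", "Get Rd request"), ("P1", "Inval Wr response")] * 3
    hits = 0
    for sender, mtype in trace:
        if cosmos.predict("shared_counter") == (sender, mtype):
            hits += 1
        cosmos.update("shared_counter", sender, mtype)
    print(f"correct predictions: {hits} of {len(trace)}")  # the signature is learned after one pass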
How Cosmos adapts to complex signatures • Consider one producer and two consumers, P1 and P2 • Two get Rd requests arrive out of order; the PHT will then be as shown below:

    Index                 -> Prediction
    <P1, get Rd request>  -> <P2, get Rd request>
    <P2, get Rd request>  -> <P1, get Rd request>
How Cosmos adapts to complex signatures • With an MHR depth greater than 1 (here, depth 2):

    Index                                       -> Prediction
    <P1, get Rd request> <P3, get Rd request>   -> <P2, get Rd request>
    <P2, get Rd request> <P1, get Rd request>   -> <P3, get Rd request>
    <P3, get Rd request> <P2, get Rd request>   -> <P1, get Rd request>
Implementation issues • Storage issues • Possible to merge the first-level table with the cache-block state at the cache and the directory? • The second-level table will need more memory to capture the pattern histories for each cache block • If the number of pattern histories per cache block is found to be low, pre-allocate memory for the pattern histories • If more pattern histories are needed, allocate them from a common pool of dynamically allocated memory • Higher prediction accuracies require greater MHR depths, which may result in large amounts of memory
Integration with a Coherence Protocol • Predictors sit beside the cache and directory modules and accelerate coherence activity in two steps: • Step 1: monitor message activity and make a prediction • Step 2: invoke an action based on the prediction • Key challenges: • Knowing how and when to act on the predictions • Handling mis-predictions • Performance
How to act on predictions • Some Examples
Detecting and Handling Mis-predictions • The usual problem with prediction • Mis-predictions may leave the processor state or protocol state inconsistent • Actions taken after predictions can be classified into three categories • Actions that move the protocol between two legal states • Actions that move the protocol to a future state, but do not expose this state to the processor • Actions that allow both the processor and the protocol to move to future states
Handling Mis-Predictions • Actions that move the protocol between two legal states • Example: replacement of a cache block that moves the block from the "exclusive" to the "invalid" state • No explicit recovery is needed in this case • (Timeline diagram: P1 cache, directory, and P2 cache exchanging a Get Wr request, Inval Wr response, and Get Wr response over time)
Handling Mis-Predictions • Actions that move the protocol to a future state, but do not expose this state to the processor • On a mis-prediction, simply discard the future state • If the prediction is correct, commit the future state and expose it to the processor • (Timeline diagram: one node predicts, updates protocol state, and generates a message in advance; the message is sent when the prediction is confirmed, overlapping with the Get Wr request, Inval Wr request, Inval Wr response, and Get Wr response exchange)
Handling Mis-Predictions • (Timeline diagram, mis-prediction case: a node predicts, updates protocol state, and generates a message; on detecting the mis-prediction, it sends the correct response instead)
Handling Mis-Predictions • Actions that allow both the processor and the protocol to move to future states • Need greater support for recovery • Before speculation, both the processor and the protocol can checkpoint their states • On detecting a mis-prediction, they roll back to the checkpointed states • On a correct prediction, the current protocol and processor states must be committed
Performance • How prediction affects runtime • A simplistic execution model is as follows. Let: p be the prediction accuracy for each message; f be the fraction of delay incurred on messages predicted correctly (e.g., f = 0 means that the time of a correctly predicted message is completely overlapped with other delays); and r be the penalty due to a mis-predicted message (e.g., r = 0.5 implies a mis-predicted message takes 1.5 times the delay of a message without prediction).
Performance • How prediction affects runtime • With p (prediction accuracy per message), f (fraction of delay on correctly predicted messages), and r (mis-prediction penalty) as defined above, if performance is completely determined by the number of messages on the critical path of a parallel program, then the speedup due to prediction is:

    speedup = time(without prediction) / time(with prediction) = 1 / (p*f + (1-p)*(1+r))
Performance • Example: for a prediction accuracy of 80% (p = 0.8), a mis-prediction penalty of 100% (r = 1), and a prediction-success benefit of 30% (f = 0.3), the speedup is about 56%
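The same arithmetic, re-evaluated as a short Python check of the example above (the helper function is ours, just restating the formula):

    def speedup(p, f, r):
        # speedup = time(without prediction) / time(with prediction)
        return 1.0 / (p * f + (1 - p) * (1 + r))

    s = speedup(p=0.8, f=0.3, r=1.0)
    print(f"{s:.4f}x faster, i.e. about {(s - 1) * 100:.0f}% speedup")  # 1.5625x, ~56%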
Evaluation • Cosmos’ prediction accuracy is evaluated using traces of coherence messages obtained from the Wisconsin Stache protocol running five parallel scientific applications • Wisconsin Stache protocol Stache is a software, full-map,and write-invalidate directory protocol that uses part of local memory as a cache for remote data. • Benchmarks Five parallel scientific applications: appbt, barnes, dsmc, moldyn, unstructured
Benchmarks • Appbt: a parallel three-dimensional computational fluid dynamics application • Barnes: simulates the interaction of a system of bodies in three dimensions using the Barnes-Hut hierarchical N-body method • Dsmc: studies the properties of a gas by simulating the movement and collision of a large number of particles in a three-dimensional domain, using the direct simulation Monte Carlo method
Benchmarks • Moldyn: a molecular dynamics application • Unstructured: a computational fluid dynamics application that uses an unstructured mesh to model a physical structure, such as an airplane wing or body
Results (prediction-accuracy charts) • Legend: C = cache prediction rate, D = directory prediction rate, O = overall prediction rate
Results: Observations • Overall prediction accuracy: 62% to 86% • Higher accuracy for the cache than for the directory: why? • Prediction accuracy increases with increasing MHR depth • However, there is not much improvement beyond an MHR depth of 3 • Appbt: high prediction accuracy; producer-consumer sharing pattern (producer reads and writes, consumer reads) • Barnes: lower accuracy than the other applications; nodes of the octree are assigned different shared-memory addresses in different iterations
Results: Observations • Dsmc: highest accuracy among all the applications; producer-consumer sharing pattern (producer writes, consumer reads); why higher than Appbt? • Moldyn: high accuracy; migratory and producer-consumer sharing patterns • Unstructured: different dominant signatures for the same data structures in different phases of the application; migratory and producer-consumer sharing patterns
Effects of noise filters • Remember them? • The Cosmos noise filter is a saturating counter that counts from 0 to MAXCOUNT (here MAXCOUNT = 2) • For MHR depths greater than 2, the filters do not help much: why? • Predictors with MHR depth greater than 1 can adapt to noise themselves, giving greater accuracy for repeating noise
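A minimal sketch of a saturating-counter noise filter for a single PHT entry; the slide states only that the counter saturates at MAXCOUNT = 2, so the replace-only-when-saturated policy below is an assumption about how such a filter would typically be applied:

    MAXCOUNT = 2  # from the slide; the policy around it is assumed

    class FilteredPHTEntry:
        def __init__(self):
            self.prediction = None  # current <sender, type> prediction
            self.counter = 0        # disagreement count for a competing message

        def update(self, observed):
            if self.prediction is None or observed == self.prediction:
                self.prediction = observed
                self.counter = 0                 # agreement resets the filter
            else:
                self.counter = min(self.counter + 1, MAXCOUNT)
                if self.counter == MAXCOUNT:     # replace only after repeated disagreement
                    self.prediction = observed
                    self.counter = 0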
Summary and Conclusions • Comparison with directed optimizations • Worse: less cost-effective, since more hardware is required • Better: composing the predictors of several directed optimizations in a single protocol would be more complex than a single Cosmos predictor • Better: Cosmos can discover application-specific sharing patterns that are not known a priori