ADVANCED COMPUTER ARCHITECTURE CS 4330/6501 Memory Samira Khan University of Virginia Mar 20, 2019 The content and concept of this course are adapted from CMU ECE 740
AGENDA • Logistics • Review from the last class • Memory
LOGISTICS • Mar 25: Project milestone presentations • 8 mins + 2 mins • Brief description of background and motivation • Discussion of current state • Next steps • Apr 1: Student Presentation 3
PREDICTING BRANCHES BASED ON CORRELATIONS • Last-time and 2BC predictors exploit “last-time” predictability • Realization 1: A branch’s outcome can be correlated with other branches’ outcomes • Global branch correlation • Realization 2: A branch’s outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch “last-time” it was executed) • Local branch correlation
TWO LEVEL GLOBAL BRANCH PREDICTION • First level: Global branch history register (N bits) • The direction of last N branches • Second level: Table of saturating counters for each history entry • The direction the branch took the last time the same history was seen [Figure: the GHR (global history register) supplies the index into the Pattern History Table (PHT), which has one entry per history pattern from 00…00 to 11…11] Yeh and Patt, “Two-Level Adaptive Training Branch Prediction,” MICRO 1991.
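The two levels described above can be sketched in a few lines of Python. This is a minimal illustration, not the design from the Yeh and Patt paper: the class name, the 4-bit default history length, the weakly-not-taken initial counter value, and the `run` helper are all assumptions made for the example.

```python
class TwoLevelGlobalPredictor:
    """GHR of N bits indexes a PHT of 2-bit saturating counters (illustrative sketch)."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.ghr = 0  # directions of the last N branches, newest in bit 0
        # One 2-bit counter per history pattern; 1 = weakly not-taken (assumed start state).
        self.pht = [1] * (1 << history_bits)

    def predict(self):
        # Counter values 2 and 3 mean "predict taken".
        return self.pht[self.ghr] >= 2

    def update(self, taken):
        idx = self.ghr
        if taken:
            self.pht[idx] = min(3, self.pht[idx] + 1)
        else:
            self.pht[idx] = max(0, self.pht[idx] - 1)
        # Shift the new outcome into the global history register.
        mask = (1 << self.history_bits) - 1
        self.ghr = ((self.ghr << 1) | int(taken)) & mask


def run(predictor, outcomes):
    """Return how many outcomes in the stream were predicted correctly."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict() == taken
        predictor.update(taken)
    return correct
```

On a strictly alternating taken/not-taken stream, a last-time predictor mispredicts constantly, while this sketch learns the pattern after a short warm-up because each history value is always followed by the same direction.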
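TWO-LEVEL GSHARE PREDICTOR [Figure: the global branch history (which direction earlier branches went) is XORed with the program counter (address of the current instruction) to index the direction predictor (2-bit counters), which answers “taken?”; in parallel, the PC looks up the cache of target addresses (BTB: Branch Target Buffer), which answers “hit?” and supplies the target address; the next fetch address is chosen between PC + inst size and the target address]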
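A rough sketch of the gshare front end follows. It assumes a fixed 4-byte instruction size, uses a Python dict as a stand-in for the BTB, and starts counters at weakly not-taken; `GshareFrontEnd` and its method names are hypothetical, chosen only for this illustration.

```python
INST_SIZE = 4  # assumed fixed instruction size in bytes


class GshareFrontEnd:
    """Direction predictor indexed by PC XOR global history, plus a toy BTB."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0
        self.counters = [1] * (1 << index_bits)  # 2-bit counters, weakly not-taken
        self.btb = {}  # pc -> target address; a dict stands in for a real BTB

    def _index(self, pc):
        # XOR the instruction-granular PC with the global history.
        return ((pc // INST_SIZE) ^ self.ghr) & self.mask

    def next_fetch_address(self, pc):
        taken = self.counters[self._index(pc)] >= 2
        if taken and pc in self.btb:  # predicted taken and BTB hit
            return self.btb[pc]
        return pc + INST_SIZE  # otherwise fall through

    def update(self, pc, taken, target):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
            self.btb[pc] = target  # remember where the taken branch went
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

XORing the PC into the index lets two branches that see the same global history use different counters, which reduces interference compared with indexing by history alone.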
CAN WE DO BETTER? • Last-time and 2BC predictors exploit “last-time” predictability • Realization 1: A branch’s outcome can be correlated with other branches’ outcomes • Global branch correlation • Realization 2: A branch’s outcome can be correlated with past outcomes of the same branch (other than the outcome of the branch “last-time” it was executed) • Local branch correlation
TWO LEVEL LOCAL BRANCH PREDICTION • First level: A set of local history registers (N bits each) • Select the history register based on the PC of the branch • Second level: Table of saturating counters for each history entry • The direction the branch took the last time the same history was seen [Figure: the branch PC selects one of the local history registers, whose value supplies the index into the Pattern History Table (PHT), which has one entry per history pattern from 00…00 to 11…11] Yeh and Patt, “Two-Level Adaptive Training Branch Prediction,” MICRO 1991.
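TWO-LEVEL LOCAL HISTORY PREDICTOR [Figure: the program counter (address of the current instruction) selects the history of which directions earlier instances of *this branch* went, which indexes the direction predictor (2-bit counters) to answer “taken?”; in parallel, the PC looks up the cache of target addresses (BTB: Branch Target Buffer), which answers “hit?” and supplies the target address; the next fetch address is chosen between PC + inst size and the target address]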
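The local variant differs from the global one only in where the history comes from. The sketch below assumes 4-byte instructions, 16 local history registers selected by low PC bits, and a single shared PHT; all of these sizes and the class name are illustrative assumptions.

```python
class TwoLevelLocalPredictor:
    """Per-branch history registers index a shared PHT of 2-bit counters."""

    def __init__(self, history_bits=4, num_histories=16):
        self.history_bits = history_bits
        self.num_histories = num_histories
        self.local_histories = [0] * num_histories  # first level
        self.pht = [1] * (1 << history_bits)        # second level, weakly not-taken

    def _slot(self, pc):
        # Select the history register from the branch PC (assumes 4-byte instructions).
        return (pc >> 2) % self.num_histories

    def predict(self, pc):
        history = self.local_histories[self._slot(pc)]
        return self.pht[history] >= 2

    def update(self, pc, taken):
        slot = self._slot(pc)
        history = self.local_histories[slot]
        if taken:
            self.pht[history] = min(3, self.pht[history] + 1)
        else:
            self.pht[history] = max(0, self.pht[history] - 1)
        mask = (1 << self.history_bits) - 1
        self.local_histories[slot] = ((history << 1) | int(taken)) & mask
```

A loop branch that is taken three times and then falls through is a natural fit: each of the four 4-bit local histories in the steady state is followed by exactly one direction, so the exit iteration becomes predictable.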
SOME OTHER BRANCH PREDICTOR TYPES • Loop branch detector and predictor • Loop iteration count detector/predictor • Works well for loops, where iteration count is predictable • Used in Intel Pentium M • Perceptron branch predictor • Learns the direction correlations between individual branches • Assigns weights to correlations • Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” HPCA 2001. • Hybrid history length based predictor • Uses different tables with different history lengths • Seznec, “Analysis of the O-Geometric History Length branch predictor,” ISCA 2005.
Perceptron Branch Predictor (I) • Idea: Use a perceptron to learn the correlations between branch history register bits and branch outcome • A perceptron learns a target Boolean function of N inputs • Jimenez and Lin, “Dynamic Branch Prediction with Perceptrons,” HPCA 2001. • Rosenblatt, “Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms,” 1962 • Each branch associated with a perceptron • A perceptron contains a set of weights wi • Each weight corresponds to a bit in the GHR • How much the bit is correlated with the direction of the branch • Positive correlation: large + weight • Negative correlation: large - weight • Prediction: • Express GHR bits as 1 (T) and -1 (NT) • Take dot product of GHR and weights • If output > 0, predict taken
Perceptron Branch Predictor (I) [Figure: the GHR (global history register) bits 1, -1, …, 1, -1 are paired with the weights w1, w2, …, wn-1, wn; the dot product of GHR and perceptron weights is combined with a bias weight (bias of branch independent of the history), and the output is compared to 0] • Each weight corresponds to a bit in the GHR • How much the bit is correlated with the direction of the branch • Positive correlation: large + weight • Negative correlation: large - weight • Prediction: • Express GHR bits as 1 (T) and -1 (NT) • Take dot product of GHR and weights • If output > 0, predict taken
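Perceptron Branch Predictor (II) • Prediction function: $y = w_0 + \sum_{i=1}^{n} x_i w_i$, i.e., the dot product of GHR and perceptron weights plus the bias weight $w_0$ (bias of branch independent of the history); output compared to 0 • Training function (as given in Jimenez and Lin, HPCA 2001): if $\mathrm{sign}(y) \neq t$ or $|y| \le \theta$, then $w_i \leftarrow w_i + t\,x_i$ for $i = 0, \dots, n$, where $t = \pm 1$ is the actual outcome, $x_0 = 1$, and $\theta$ is a training threshold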
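The prediction and training steps can be sketched as below. The training rule used here is the standard perceptron update described by Jimenez and Lin (adjust the weights on a misprediction or when the output magnitude is at or below a threshold). The table size, the PC hashing, the default threshold, and the unbounded Python integers used as weights are assumptions for illustration; real hardware would use small saturating weights.

```python
class PerceptronPredictor:
    """One perceptron per (hashed) branch; inputs are GHR bits encoded as +1 / -1."""

    def __init__(self, history_bits=8, num_perceptrons=64, theta=None):
        self.n = history_bits
        self.num = num_perceptrons
        # weights[p][0] is the bias weight w0; weights[p][1..n] pair with GHR bits.
        self.weights = [[0] * (history_bits + 1) for _ in range(num_perceptrons)]
        self.ghr = [-1] * history_bits  # newest outcome first; +1 = T, -1 = NT
        # Training threshold: a tunable parameter; this default is an assumption.
        self.theta = int(1.93 * history_bits + 14) if theta is None else theta

    def _perceptron(self, pc):
        return self.weights[(pc >> 2) % self.num]  # assumes 4-byte instructions

    def _output(self, pc):
        w = self._perceptron(pc)
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.ghr))

    def predict(self, pc):
        return self._output(pc) > 0  # output > 0 means predict taken

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        w = self._perceptron(pc)
        # Train on a misprediction or when the output is not yet confident.
        if (y > 0) != taken or abs(y) <= self.theta:
            w[0] += t
            for i, xi in enumerate(self.ghr, start=1):
                w[i] += t * xi
        self.ghr = [t] + self.ghr[:-1]
```

A weight grows positive when its history bit tends to agree with the outcome and negative when it tends to disagree, which matches the "large + weight" and "large - weight" reading of correlation.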
Perceptron Branch Predictor (III) • Advantages + More sophisticated learning mechanism → better accuracy • Disadvantages -- Hard to implement -- Can learn only linearly-separable functions, e.g., cannot learn XOR type of correlation between 2 history bits and branch outcome
Prediction Using Multiple History Lengths • Observation: Different branches require different history lengths for better prediction accuracy • Idea: Have multiple PHTs indexed with GHRs with different history lengths and intelligently allocate PHT entries to different branches Seznec and Michaud, “A case for (partially) tagged Geometric History Length Branch Prediction,” JILP 2006.
TAGE: Tagged & prediction by the longest history matching entry [Figure: a tagless base predictor indexed by pc, plus tagged tables indexed by pc together with histories h[0:L1], h[0:L2], and h[0:L3]; each tagged entry holds ctr, tag, and u fields; tag comparisons (=?) select which table supplies the prediction] Andre Seznec, “TAGE-SC-L branch predictors again,” CBP 2016.
TAGE: Multiple Tables [Figure: the tag comparison (=?) in each table yields a hit or a miss; one hitting table supplies Pred and another hitting table supplies Altpred, the alternative prediction] Andre Seznec, “TAGE-SC-L branch predictors again,” CBP 2016.
TAGE: Which Table to Use? • General case: • Longest history-matching component provides the prediction • Special case: • Many mispredictions on newly allocated entries: weak Ctr • On many applications, Altpred more accurate than Pred • Property dynamically monitored through 4-bit counters Andre Seznec, “TAGE-SC-L branch predictors again,” CBP 2016.
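The selection rule above can be sketched as a small function. This is a simplified illustration rather than Seznec's implementation: it assumes a signed prediction counter where ctr >= 0 means taken and values 0 and -1 count as "weak", and it collapses the dynamically monitored 4-bit counters into a single hypothetical boolean, `prefer_alt_on_weak`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaggedEntry:
    tag: int
    ctr: int  # signed prediction counter; >= 0 means taken (assumed encoding)
    u: int    # usefulness field (not used by this selection sketch)


def tage_select(base_taken: bool,
                hits: List[Optional[TaggedEntry]],
                prefer_alt_on_weak: bool) -> bool:
    """Pick the final prediction.

    hits is ordered from shortest to longest history; None marks a tag miss.
    """
    pred = base_taken     # falls back to the tagless base predictor
    altpred = base_taken
    provider = None
    for entry in hits:
        if entry is not None:
            altpred = pred          # previous best becomes the alternative
            pred = entry.ctr >= 0   # longer matching history overrides
            provider = entry
    weak = provider is not None and provider.ctr in (0, -1)
    if weak and prefer_alt_on_weak:
        return altpred  # newly allocated entries tend to carry weak counters
    return pred
```

For example, with a strongly not-taken hit in a short-history table and a weak hit in a longer one, the function returns the longer table's direction in the general case and the shorter table's direction when the alternative is preferred.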
TAGE: Tagged & prediction by the longest history matching entry • Advantages + Prediction based on more sophisticated history length → better accuracy + Does not need any complex operation (e.g., dot product) + Easier to implement • Disadvantages -- Multiple lookups per cycle -- Predicting which history to use is not straightforward
THE MAIN MEMORY SYSTEM • Main memory is a critical component of all computing systems: server, mobile, embedded, desktop, sensor • Main memory system must scale (in size, technology, efficiency, cost, and management algorithms) to maintain performance growth and technology scaling benefits [Figure: processor and caches, main memory, and storage (SSD/HDD)]
STATE OF THE MAIN MEMORY SYSTEM • Recent technology, architecture, and application trends • lead to new requirements • exacerbate old requirements • DRAM and memory controllers, as we know them today, are (will be) unlikely to satisfy all requirements • Some emerging non-volatile memory technologies (e.g., PCM) enable new opportunities: memory+storage merging • We need to rethink the main memory system • to fix DRAM issues and enable emerging technologies • to satisfy all requirements
MAJOR TRENDS AFFECTING MAIN MEMORY (I) • Need for main memory capacity, bandwidth, QoS increasing • Main memory energy/power is a key system design concern • DRAM technology scaling is ending
MAJOR TRENDS AFFECTING MAIN MEMORY (II) • Need for main memory capacity, bandwidth, QoS increasing • Multi-core: increasing number of cores • Data-intensive applications: increasing demand/hunger for data • Consolidation: cloud computing, GPUs, mobile • Main memory energy/power is a key system design concern • DRAM technology scaling is ending
EXAMPLE TREND: MANY CORES ON CHIP • Simpler and lower power than a single large core • Large scale parallelism on chip • Examples: Tilera TILE Gx (100 cores, networked), Intel Core i7 (8 cores), IBM Cell BE (8+1 cores), AMD Barcelona (4 cores), IBM POWER7 (8 cores), Intel SCC (48 cores, networked), Sun Niagara II (8 cores), Nvidia Fermi (448 “cores”)
CONSEQUENCE: THE MEMORY CAPACITY GAP • Core count doubling ~ every 2 years • DRAM DIMM capacity doubling ~ every 3 years • Memory capacity per core expected to drop by 30% every two years • Trends worse for memory bandwidth per core!
MAJOR TRENDS AFFECTING MAIN MEMORY (III) • Need for main memory capacity, bandwidth, QoS increasing • Main memory energy/power is a key system design concern • ~40-50% energy spent in off-chip memory hierarchy [Lefurgy, IEEE Computer 2003] • DRAM consumes power even when not used (periodic refresh) • DRAM technology scaling is ending
MAJOR TRENDS AFFECTING MAIN MEMORY (IV) • Need for main memory capacity, bandwidth, QoS increasing • Main memory energy/power is a key system design concern • DRAM technology scaling is ending • ITRS projects DRAM will not scale easily below X nm • Scaling has provided many benefits: • higher capacity (density), lower cost, lower energy
THE DRAM SCALING PROBLEM • DRAM stores charge in a capacitor (charge-based memory) • Capacitor must be large enough for reliable sensing • Access transistor should be large enough for low leakage and high retention time • Scaling beyond 40-35nm (2013) is challenging [ITRS, 2009] • DRAM capacity, cost, and energy/power hard to scale
SOLUTIONS TO THE DRAM SCALING PROBLEM • Two potential solutions • Tolerate DRAM (by taking a fresh look at it) • Enable emerging memory technologies to eliminate/minimize DRAM • Do both • Hybrid memory systems
SOLUTION 1: TOLERATE DRAM • Overcome DRAM shortcomings with • System-DRAM co-design • Novel DRAM architectures, interface, functions • Better waste management (efficient utilization) • Key issues to tackle • Reduce refresh energy • Improve bandwidth and latency • Reduce waste • Enable reliability at low cost
SOLUTION 2: EMERGING MEMORY TECHNOLOGIES • Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile) • Example: Phase Change Memory • Expected to scale to 9nm (2022 [ITRS]) • Expected to be denser than DRAM: can store multiple bits/cell • But, emerging technologies have shortcomings as well • Can they be enabled to replace/augment/surpass DRAM?
HYBRID MEMORY SYSTEMS [Figure: a CPU with a DRAM controller attached to DRAM and a PCM controller attached to Phase Change Memory (or Tech. X)] • DRAM: fast, durable, but small, leaky, volatile, high-cost • Phase Change Memory (or Tech. X): large, non-volatile, low-cost, but slow, wears out, high active energy • Hardware/software manage data allocation and movement to achieve the best of multiple technologies
THE PROMISE OF EMERGING TECHNOLOGIES • Likely need to replace/augment DRAM with a technology that is • Technology scalable • And at least similarly efficient, high performance, and fault-tolerant • or can be architected to be so • Some emerging resistive memory technologies appear promising • Phase Change Memory (PCM)? • Spin Torque Transfer Magnetic Memory (STT-MRAM)? • Memristors? • And, maybe there are other ones • Can they be enabled to replace/augment/surpass DRAM?
CHARGE VS. RESISTIVE MEMORIES • Charge Memory (e.g., DRAM, Flash) • Write data by capturing charge Q • Read data by detecting voltage V • Resistive Memory (e.g., PCM, STT-MRAM, memristors) • Write data by pulsing current dQ/dt • Read data by detecting resistance R
LIMITS OF CHARGE MEMORY • Difficult charge placement and control • Flash: floating gate charge • DRAM: capacitor charge, transistor leakage • Reliable sensing becomes difficult as charge storage unit size reduces