Advanced Microarchitecture Lecture 12: Caches and Memory
SRAM {Over|Re}view
• "6T SRAM" cell: 2 access gates, 2T per inverter, bitlines b and b̄
• Chained inverters maintain a stable state
• Access gates provide access to the cell
• Writing to a cell involves over-powering the two small storage inverters
64×1-bit SRAM Array Organization
• Figure: 8×8 cell array; one 1-of-8 decoder selects the "wordline" (row), a second 1-of-8 decoder drives the "column mux" that picks one pair of "bitlines"
• Why are we reading both b and b̄?
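A minimal C sketch of that address split, assuming the 8×8 organization above (names are illustrative, not from the slides): the upper 3 address bits feed the wordline decoder and the lower 3 bits feed the column mux.

#include <stdint.h>

/* Hypothetical 64x1-bit array organized as 8 rows x 8 columns. */
#define ROW_BITS 3
#define COL_BITS 3

typedef struct {
    uint8_t cell[1 << ROW_BITS][1 << COL_BITS];   /* one bit per cell */
} sram64x1_t;

/* The upper address bits drive the 1-of-8 wordline decoder (pick a row);
 * the lower bits drive the 1-of-8 column mux (pick one bitline pair). */
static uint8_t sram_read(const sram64x1_t *a, uint8_t addr6)
{
    uint8_t row = (addr6 >> COL_BITS) & ((1 << ROW_BITS) - 1);
    uint8_t col = addr6 & ((1 << COL_BITS) - 1);
    return a->cell[row][col] & 1;
}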
SRAM Density vs. Speed
• 6T cell must be as small as possible to have dense storage → bigger caches
• Smaller transistors → slower transistors, so the dinky inverters cannot drive their outputs very quickly…
• The bitline is a *long* metal line with a lot of parasitic loading
Sense Amplifiers
• Type of differential amplifier: two inputs X and Y, output = a × (X − Y) + Vbias
• Bitlines precharged to Vdd, wordline enabled; the small cell discharges its bitline very slowly
• Sense amp "sees" the difference between b and b̄ quickly and outputs b's value
• Sometimes precharge bitlines to Vdd/2, which makes a bigger "delta" for faster sensing
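A purely numeric illustration of the a × (X − Y) + Vbias relationship above; the gain and bias values are made up.

#include <stdio.h>

/* out = a * (X - Y) + Vbias, per the slide; gain/bias values are illustrative. */
static double sense_amp(double x, double y, double gain, double vbias)
{
    return gain * (x - y) + vbias;
}

int main(void)
{
    /* b stays near the 1.0 V precharge while b-bar sags to 0.95 V:
     * a tiny bitline delta becomes a large output swing. */
    printf("sensed output = %.2f V\n", sense_amp(1.0, 0.95, 20.0, 0.0));
    return 0;
}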
Multi-Porting
• Figure: two-ported cell with Wordline1 (bitlines b1, b̄1) and Wordline2 (bitlines b2, b̄2)
• Wordlines = 2 × ports, Bitlines = 4 × ports
• Area = O(ports²)
Port Requirements
• ARF, PRF, RAT all need many read and write ports to support superscalar execution
• Luckily, these have a limited number of entries/bytes
• Caches also need multiple ports
• Not as many ports, but the overall size is much larger
Delay Of Regular Caches
• I$
  • low port requirement (one fetch group/$-line per cycle)
  • latency only exposed on branch mispredict
• D$
  • higher port requirement (multiple LD/ST per cycle)
  • latency often on critical path of execution
• L2
  • lower port requirement (most accesses hit in L1)
  • latency less important (only observed on L1 miss)
  • optimizing for hit rate usually more important than latency
  • difference between L2 latency and DRAM latency is large
Banking
• Figure: a big 4-ported L1 data cache (one SRAM array with 4 decoders, 4 sense amps, and column muxing) is slow due to quadratic area growth
• Instead use 4 banks with 1 port each; each bank is much faster
Bank Conflicts
• Banking provides high bandwidth, but only if all accesses are to different banks
• Banks typically address interleaved
  • For N banks: Addr → bank[Addr % N]
  • Addr on cache line granularity
• For 4 banks and 2 accesses, chance of conflict is 25%
• Need to match # banks to access patterns/BW
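A small C sketch of the interleaving and the 25% figure, assuming 64-byte cache lines (the line size is an assumption):

#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64   /* assumed cache line size */
#define NUM_BANKS   4

/* Cache-line-granularity interleaving: consecutive lines go to consecutive banks. */
static unsigned bank_of(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_BANKS);
}

int main(void)
{
    /* Two accesses to independent random lines collide with probability 1/N. */
    printf("conflict probability = %.0f%%\n", 100.0 / NUM_BANKS);
    printf("bank(0x1000) = %u, bank(0x1040) = %u\n", bank_of(0x1000), bank_of(0x1040));
    return 0;
}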
Associativity
• You should know this already
• Figure: looking up "foo" → "foo's value" in a direct-mapped cache (RAM), a fully-associative cache (CAM), and a set-associative cache (a CAM/RAM hybrid?)
Set-Associative Caches
• Set-associativity good for reducing conflict misses
• Cost: slower cache access, often dominated by the tag array comparisons (a 40-50 bit comparison per way!)
  • Basically mini-CAM logic
• Must trade off:
  • Smaller cache size
  • Longer latency
  • Lower associativity
• Every option hurts performance
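A minimal C sketch of a set-associative lookup; the 4-way, 128-set, 64-byte-line geometry is only an assumption for illustration.

#include <stdbool.h>
#include <stdint.h>

#define WAYS      4
#define SET_BITS  7                    /* 128 sets -- illustrative */
#define LINE_BITS 6                    /* 64-byte lines -- illustrative */
#define SETS      (1u << SET_BITS)

typedef struct {
    bool     valid;
    uint64_t tag;                      /* the 40-50 bit value being compared */
} tag_entry_t;

static tag_entry_t tags[SETS][WAYS];

/* Returns the hitting way, or -1 on a miss.  Hardware runs all WAYS tag
 * comparators in parallel; the loop just models them one at a time. */
static int sa_lookup(uint64_t addr)
{
    uint32_t set = (uint32_t)((addr >> LINE_BITS) & (SETS - 1));
    uint64_t tag = addr >> (LINE_BITS + SET_BITS);

    for (int w = 0; w < WAYS; w++)
        if (tags[set][w].valid && tags[set][w].tag == tag)
            return w;
    return -1;
}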
Way-Prediction
• If figuring out the way takes too long, then just guess!
• Figure: the load PC indexes a way predictor; only the predicted way's payload is read out, but the tag check still occurs to validate the way prediction
• May be hard to predict the way if the same load accesses different addresses
Way-Prediction (2)
• Organize the data array s.t. the left-most way is the MRU
• Way-predict the MRU way; as long as accesses keep hitting the MRU way, way-prediction keeps hitting
• On a way-miss (but cache hit), move the block to the MRU position so way-prediction continues to hit
• Complication: the data array needs a datapath for swapping blocks (maybe 100's of bits); normally you just update a few LRU bits in the tag array (< 10 bits?)
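A rough C sketch of this MRU way-prediction flow (the names and 4-way geometry are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4   /* illustrative */

typedef struct { bool valid; uint64_t tag; uint64_t data; } line_t;

/* One set, maintained so that way[0] is always the MRU block. */
typedef struct { line_t way[WAYS]; } set_t;

/* Way-predict the MRU way (way 0).  The full tag check still runs to
 * validate the prediction; on a way-miss that is still a cache hit, the
 * block is swapped into the MRU slot -- the wide-datapath cost noted above. */
static bool mru_load(set_t *s, uint64_t tag, uint64_t *out, bool *way_miss)
{
    *way_miss = false;
    if (s->way[0].valid && s->way[0].tag == tag) {      /* way prediction correct */
        *out = s->way[0].data;
        return true;
    }
    for (int w = 1; w < WAYS; w++) {
        if (s->way[w].valid && s->way[w].tag == tag) {  /* cache hit, wrong way */
            line_t hit = s->way[w];
            s->way[w]  = s->way[0];                     /* swap into MRU position */
            s->way[0]  = hit;
            *out       = hit.data;
            *way_miss  = true;
            return true;
        }
    }
    return false;                                       /* genuine cache miss */
}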
Partial Tagging
• Like BTBs, just use part of the tag → the tag array lookup is now much faster!
• Partial tags lead to false hits: tag 0x45120001 looks like a hit for address 0x3B120001
• Similar to way-prediction, a full tag comparison is still needed to verify the "real" hit, but it is not on the critical path
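A tiny C illustration of the false-hit example above; the 24-bit partial-tag width is an assumption.

#include <stdint.h>
#include <stdio.h>

#define PARTIAL_MASK 0x00FFFFFFu   /* keep only the low 24 tag bits (assumed width) */

int main(void)
{
    uint32_t stored_tag = 0x45120001;   /* tag sitting in the cache */
    uint32_t lookup_tag = 0x3B120001;   /* tag of the incoming address */

    /* The fast partial compare says "hit"... */
    printf("partial match: %d\n",
           (stored_tag & PARTIAL_MASK) == (lookup_tag & PARTIAL_MASK));
    /* ...the slower full compare, off the critical path, exposes the false hit. */
    printf("full match:    %d\n", stored_tag == lookup_tag);
    return 0;
}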
… in the LSQ
• Partial tagging can be used in the LSQ as well
  • Do the address check on partial addresses only
  • On a partial hit, forward the data
  • The slower complete tag check verifies the match/no-match; replay or flush as needed
• If a store finds a later partially-matched load, don't do the pipeline flush right away
  • The penalty is too severe; wait for the slow check before flushing the pipe
Interaction With Scheduling
• Bank conflicts, way-mispredictions, and partial-tag false hits all change the latency of the load instruction
• Increases the frequency of replays: more "replay conditions" exist/are encountered
• Need a careful tradeoff between:
  • performance (reducing effective cache latency)
  • performance (frequency of replaying instructions)
  • power (frequency of replaying instructions)
Alternatives to Adding Associativity
• More set-associativity is needed when the number of items mapping to the same cache set > the number of ways
• Not all sets suffer from high conflict rates
• Idea: provide a little extra associativity, but not for each and every set
Victim Cache
• Figure: one set cycles through blocks A B C D E, another through J K L M N — every access is a miss! ABCDE and JKLMN do not "fit" in a 4-way set-associative cache, while other sets (X Y Z, P Q R) fit fine
• The victim cache provides a "fifth way" so long as only four sets overflow into it at the same time
• Can even provide 6th or 7th … ways
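A hedged C sketch of the victim-cache mechanics: evicted blocks go into a small fully-associative buffer that is probed on a main-cache miss. The FIFO replacement and 8-entry size are assumptions.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define VICTIM_ENTRIES 8   /* small and fully associative -- size is an assumption */

typedef struct { bool valid; uint64_t tag; } ventry_t;
static ventry_t victim[VICTIM_ENTRIES];

/* Called when the main cache evicts a block: keep it around as an extra
 * "way" that is shared by all sets. */
static void victim_insert(uint64_t tag)
{
    memmove(&victim[1], &victim[0], (VICTIM_ENTRIES - 1) * sizeof(ventry_t));
    victim[0].valid = true;
    victim[0].tag   = tag;
}

/* Called on a main-cache miss, before going to the next level. */
static bool victim_probe(uint64_t tag)
{
    for (int i = 0; i < VICTIM_ENTRIES; i++)
        if (victim[i].valid && victim[i].tag == tag)
            return true;   /* hit: swap this block back into the main cache */
    return false;
}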
Skewed Associativity
• Figure: in a regular set-associative cache, blocks A B C D and W X Y Z pile into the same sets → lots of misses
• In a skewed-associative cache, each way is indexed with a different hash of the address, so the same blocks spread across different sets → fewer misses
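A small C sketch of the skewing idea: each way uses a different index hash, so blocks that conflict in one way usually land in different sets of the other way. The XOR-folding hash and geometry are just illustrative choices.

#include <stdint.h>

#define LINE_BITS 6                 /* 64-byte lines -- illustrative */
#define SET_BITS  7                 /* 128 sets per way -- illustrative */
#define SETS      (1u << SET_BITS)

/* Way 0 uses the conventional index bits. */
static uint32_t index_way0(uint64_t addr)
{
    return (uint32_t)((addr >> LINE_BITS) & (SETS - 1));
}

/* Way 1 uses a different hash of the same address bits, so two blocks
 * that collide in way 0 usually map to different sets in way 1. */
static uint32_t index_way1(uint64_t addr)
{
    uint64_t line = addr >> LINE_BITS;
    return (uint32_t)((line ^ (line >> SET_BITS)) & (SETS - 1));
}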
Required Associativity Varies
• Program stack needs very little associativity
  • spatial locality
  • stack frame is laid out sequentially
  • a function usually only refers to its own stack frame
• Figure: call stack f() → g() → h() → j() → k() with addresses laid out in a linear organization; in a 4-way cache the frames just march from MRU to LRU, so the associativity is not being used effectively
Stack Cache
• Figure: "nice" stack accesses (f(), g(), h(), j(), k()) go to a separate stack cache, while disorganized heap accesses go to the "regular" cache — mixing them in one cache causes lots of conflicts!
Stack Cache (2)
• Stack cache portion can be a lot simpler due to its direct-mapped structure
  • relatively easy to prefetch for by monitoring call/retn's
• "Regular" cache portion can have lower associativity
  • doesn't have conflicts due to stack/heap interaction
Stack Cache (3)
• Which cache does a load access?
• Many ISA's have a "default" stack-pointer register
  • LDQ 0[$sp], LDQ 12[$sp], LDQ 24[$sp] → stack cache
  • LDQ 0[$t1] → regular cache
  • After MOV $t3 = $sp, LDQ 8[$t3] also touches the stack; routing by register name alone picks the wrong cache → replay
• Need stack base and offset information, and then need to check each cache access against these bounds
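A minimal C sketch of the bounds check that decides (or verifies) which cache a load should access; the stack bounds are placeholder values.

#include <stdbool.h>
#include <stdint.h>

/* Placeholder bounds for the region currently mapped by the stack cache. */
static uint64_t stack_lo = 0x7fff0000ull;
static uint64_t stack_hi = 0x80000000ull;

typedef enum { STACK_CACHE, REGULAR_CACHE } which_cache_t;

/* Check the effective address against the stack bounds.  If the cache was
 * instead picked early by looking at the base register (e.g. $sp), this
 * same check is what detects a wrong-cache access and forces the replay. */
static which_cache_t route_access(uint64_t effective_addr)
{
    if (effective_addr >= stack_lo && effective_addr < stack_hi)
        return STACK_CACHE;
    return REGULAR_CACHE;
}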
Multi-Lateral Caches
• A normal cache is "uni-lateral" in that everything goes into the same place
• The stack cache is an example of a "multi-lateral" cache
  • multiple cache structures with disjoint contents
• I$ vs. D$ could be considered multi-lateral
Access Patterns
• The stack cache showed how different loads exhibit different access patterns:
  • Stack (multiple push/pop's of frames)
  • Heap (heavily data-dependent access patterns)
  • Streaming (linear accesses with low/no reuse)
Low-Reuse Accesses
• Streaming
  • once you're done decoding an MPEG frame, no need to revisit it
• Other: e.g. the tree-traversal code below — parent->valid is accessed once and then not used again, and the fields map to different cache lines

struct tree_t {
  int valid;
  int other_fields[24];
  int num_children;
  struct tree_t * children;
};

while (some condition) {
  struct tree_t * parent = getNextRoot(…);
  if (parent->valid) {          /* accessed once, then not used again */
    doTreeTraversalStuff(parent);
    doMoreStuffToTree(parent);
    pickFruitFromTree(parent);
  }
}
Filter Caches
• Several proposed variations: annex cache, pollution control cache, etc.
• First-time misses are placed in a small filter cache (fill on miss)
  • If accessed again, promote to the main cache
  • If not accessed again, eventually LRU'd out
• The main cache only contains lines with proven reuse; one-time-use lines have been filtered out
• Can be thought of as the "dual" of the victim cache
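A rough C sketch of the fill/promote policy described above; the direct-mapped geometry and tag handling are simplified assumptions.

#include <stdbool.h>
#include <stdint.h>

#define MAIN_SETS   256   /* direct-mapped and tiny, purely for illustration */
#define FILTER_SETS 16

static uint64_t main_line[MAIN_SETS];     static bool main_valid[MAIN_SETS];
static uint64_t filter_line[FILTER_SETS]; static bool filter_valid[FILTER_SETS];

/* A first-time miss fills only the small filter cache; a second touch
 * proves reuse and promotes the line into the main cache, so one-time-use
 * (streaming) lines never pollute the main cache. */
static bool access_line(uint64_t line)
{
    unsigned m = (unsigned)(line % MAIN_SETS);
    unsigned f = (unsigned)(line % FILTER_SETS);

    if (main_valid[m] && main_line[m] == line)
        return true;                                    /* main-cache hit */
    if (filter_valid[f] && filter_line[f] == line) {
        main_valid[m] = true; main_line[m] = line;      /* promote on reuse */
        return true;                                    /* filter-cache hit */
    }
    filter_valid[f] = true; filter_line[f] = line;      /* fill filter on miss */
    return false;
}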
Trouble w/ Multi-Lateral Caches
• More complexity
  • a load may need to be routed to different places
  • may require some form of prediction to pick the right one
    • guessing wrong can cause replays
  • or access multiple in parallel: increases power, no bandwidth benefit
  • more sources to bypass from: costs both latency and power in the bypass network
Memory-Level Parallelism (MLP)
• What if memory latency is 10,000 cycles?
  • Not enough traditional ILP to cover this latency
  • Runtime dominated by waiting for memory
  • What matters is overlapping memory accesses
• MLP: "number of outstanding cache misses [to main memory] that can be generated and executed in an overlapped manner."
• ILP is a property of a DFG; MLP is a metric
  • ILP is independent of the underlying execution engine
  • MLP is dependent on the microarchitecture assumptions
  • You can measure MLP for a uniprocessor, CMP, etc.
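A back-of-the-envelope illustration (in C) of why MLP matters with a 10,000-cycle memory: overlapping independent misses divides the total stall time. The miss count and MLP value are made up.

#include <stdio.h>

int main(void)
{
    int mem_latency = 10000;   /* cycles, from the slide */
    int misses      = 8;       /* made-up miss count */
    int mlp         = 4;       /* made-up: 4 misses overlapped at a time */

    int serial     = misses * mem_latency;          /* MLP = 1: one miss at a time */
    int overlapped = (misses / mlp) * mem_latency;  /* assuming perfect overlap */

    printf("serial: %d cycles, with MLP=%d: %d cycles\n", serial, mlp, overlapped);
    return 0;
}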
uArchs for MLP
• WIB – Waiting Instruction Buffer
• Figure: a load miss stalls its forward slice; no instructions in the forward slice can execute, while independent insts continue to issue
• Eventually all independent insts issue and the scheduler contains only insts in the (stalled) forward slice…
• Move the forward slice to a separate buffer (the WIB); new insts keep the scheduler busy and eventually expose other independent load misses (MLP)
WIB Hardware
• Similar to replay – continue issuing dependent instructions, but need to shunt them to the WIB
• WIB hardware can potentially be large
• WIB doesn't do scheduling – no CAM logic needed
• Need to redispatch from the WIB back into the RS when the load comes back from memory
  • like redispatching from the replay-queue