Memory Hierarchy in Computer Organization: Exploiting Speed and Size

Computer Organization and Architecture • Chapter 7 Large and Fast: Exploiting Memory Hierarchy • Yu-Lun Kuo • Computer Sciences and Information Engineering, University of Tunghai, Taiwan • sscc6991@gmail.com

Major Components of a Computer • Processor: Control, Datapath • Memory • Devices: Input, Output

Processor-Memory Performance Gap • µProc: 55%/year (2X/1.5 yr), “Moore’s Law” • DRAM: 7%/year (2X/10 yrs) • Processor-memory performance gap grows 50%/year

Introduction • The Principle of Locality • Programs access a relatively small portion of the address space at any instant of time • Two Different Types of Locality • Temporal Locality (Locality in Time) • If an item is referenced, it will tend to be referenced again soon • e.g., loops, subroutines, the stack, counter variables • Spatial Locality (Locality in Space) • If an item is referenced, items whose addresses are close by tend to be referenced soon • e.g., array accesses, data accessed sequentially

Memory Hierarchy • Memory Hierarchy • A structure that uses multiple levels of memories; as the distance from the CPU increases, the size of the memories and the access time both increase • Locality + smaller HW is faster = memory hierarchy • Levels • each smaller, faster, more expensive/byte than the level below • Inclusive • data found in an upper level is also found in the levels below it

Three Primary Technologies • Building Memory Hierarchies • Main Memory • DRAM (dynamic random access memory) • Caches (closer to the processor) • SRAM (static random access memory) • DRAM vs. SRAM • Speed: DRAM is slower than SRAM • Cost per bit: DRAM is cheaper than SRAM

Introduction • Cache memory • Built from SRAM (Static RAM) • Small amount of fast memory • Sits between normal main memory and CPU • May be located on CPU chip or module

Introduction • Cache memory

A Typical Memory Hierarchy c.2008 • CPU with multiported register file (RF, part of CPU) • Split instruction & data primary caches: L1 Instruction Cache, L1 Data Cache (on-chip SRAM) • Large unified secondary cache: Unified L2 Cache (on-chip SRAM) • Multiple interleaved memory banks (off-chip DRAM)

A Typical Memory Hierarchy • By taking advantage of the principle of locality • Can present the user with as much memory as is available in the cheapest technology • at the speed offered by the fastest technology • Figure: On-Chip Components (Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache), Second Level Cache (SRAM), eDRAM, Main Memory (DRAM), Secondary Memory (Disk) • Speed (cycles), from nearest to farthest level: ½’s, 1’s, 10’s, 100’s, 1,000’s • Size (bytes), from nearest to farthest level: 100’s, K’s, 10K’s, M’s, G’s to T’s • Cost per byte: highest nearest the processor, lowest farthest from it

Characteristics of Memory Hierarchy • Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory (MM), which is a subset of what is in secondary memory (SM) • Transfer unit between Processor and L1$: 4-8 bytes (word) • Between L1$ and L2$: 8-32 bytes (block) • Between L2$ and Main Memory: 1 to 4 blocks • Between Main Memory and Secondary Memory: 1,024+ bytes (disk sector = page) • Access time increases with distance from the processor • (Relative) size of the memory increases at each level

Memory Hierarchy List • Registers • L1 Cache • L2 Cache • L3 cache • Main memory • Disk cache • Disk (RAID) • Optical (DVD) • Tape

Why are separate instruction (IC) and data (DC) caches needed?

The Memory Hierarchy: Terminology • Figure: the processor exchanges data with the upper level memory (Blk X), which exchanges blocks with the lower level memory (Blk Y) • Hit: data is in some block in the upper level (Blk X) • Hit Rate: the fraction of memory accesses found in the upper level • Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss

The Memory Hierarchy: Terminology • Miss: data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y) • Miss Rate = 1 - (Hit Rate) • Miss Penalty • Time to replace a block in the upper level + time to deliver the block to the processor • Hit Time << Miss Penalty
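The terminology above can be tried out numerically. The sketch below is illustrative only: the function names are made up for this example, and the average-access-time formula (hit time plus miss rate times miss penalty) is the usual textbook way of combining these terms, not something stated on the slide.

```python
def miss_rate(hits: int, misses: int) -> float:
    """Miss Rate = 1 - Hit Rate, computed from raw access counts."""
    accesses = hits + misses
    hit_rate = hits / accesses
    return 1 - hit_rate


def average_access_time(hit_time: float, miss_rate_value: float,
                        miss_penalty: float) -> float:
    """Standard textbook estimate (an assumption here, not from the slide):
    every access pays the hit time, and misses additionally pay the penalty."""
    return hit_time + miss_rate_value * miss_penalty


# Hypothetical numbers: 2 hits and 6 misses, 1-cycle hit, 100-cycle penalty.
rate = miss_rate(hits=2, misses=6)
print(rate, average_access_time(1, rate, 100))
```

Because Hit Time << Miss Penalty, even a small miss rate dominates the average.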

How is the Hierarchy Managed? • registers ↔ memory • by compiler (programmer?) • cache ↔ main memory • by the cache controller hardware • main memory ↔ disks • by the operating system (virtual memory) • virtual to physical address mapping assisted by the hardware (TLB) • by the programmer (files)

7.2 The basics of Caches • Simple cache • The processor requests are each one word • The block size is one word of data • Two questions to answer (in hardware): • Q1: How do we know if a data item is in the cache? • Q2: If it is, how do we find it?

Caches • Direct Mapped • Assign the cache location based on the address of the word in memory • Address mapping: (block address) modulo (# of blocks in the cache) • First consider block sizes of one word
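As a minimal sketch of the address mapping above, assuming one-word blocks: the cache index is the block address modulo the number of blocks, and the quotient is the part of the address that must be stored alongside the data to tell apart the blocks that share a slot. The function name is hypothetical.

```python
def direct_mapped_slot(block_address: int, num_blocks: int) -> tuple[int, int]:
    """Return (index, upper_bits) for a direct-mapped cache.

    index      = (block address) modulo (# of blocks in the cache)
    upper_bits = the remaining high-order part of the block address
    """
    index = block_address % num_blocks
    upper_bits = block_address // num_blocks
    return index, upper_bits


# With 4 blocks, addresses 0 and 4 compete for the same slot (index 0).
print(direct_mapped_slot(0, 4), direct_mapped_slot(4, 4), direct_mapped_slot(15, 4))
```

When the number of blocks is a power of two, the modulo is simply the low-order bits of the block address.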

Direct Mapped (Mapping) Cache

Caches • Tag • Contains the address information required to identify whether a word in the cache corresponds to the requested word • Valid bit • After executing many instructions, some of the cache entries may still be empty • Indicates whether an entry contains a valid address • If valid bit = 0, there cannot be a match for this block

Direct Mapped Cache • Consider the main memory word reference string 0 1 2 3 4 3 4 15 • Start with an empty cache of four one-word blocks, all initially marked as not valid • 0 miss: cache holds 00 Mem(0) • 1 miss: cache holds 00 Mem(0), 00 Mem(1) • 2 miss: cache holds 00 Mem(0), 00 Mem(1), 00 Mem(2) • 3 miss: cache holds 00 Mem(0), 00 Mem(1), 00 Mem(2), 00 Mem(3) • 4 miss: 01 Mem(4) replaces 00 Mem(0) • 3 hit • 4 hit • 15 miss: 11 Mem(15) replaces 00 Mem(3) • 8 requests, 6 misses
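The trace above can be replayed with a small simulation. This is a sketch under the slide's assumptions (four one-word blocks, empty cache at the start); the function and variable names are invented for illustration.

```python
def simulate_direct_mapped(references, num_blocks=4):
    """Replay word references against a direct-mapped cache of one-word blocks."""
    valid = [False] * num_blocks   # valid bit per entry, all cleared at start
    tags = [0] * num_blocks        # tag per entry (meaningless until valid)
    outcomes = []
    for address in references:
        index = address % num_blocks
        tag = address // num_blocks
        if valid[index] and tags[index] == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            # Fetch the word from memory and overwrite whatever was there.
            valid[index] = True
            tags[index] = tag
    return outcomes


outcomes = simulate_direct_mapped([0, 1, 2, 3, 4, 3, 4, 15])
print(outcomes, outcomes.count("miss"))
```

Only the two repeated references (3 and 4) hit; every first touch is a miss, and address 4 evicts address 0 because both map to index 0.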

Hits vs. Misses • Read hits • this is what we want! • Read misses • stall the CPU, fetch block from memory, deliver to cache, restart • Write hits • can replace data in cache and memory (write-through) • write the data only into the cache (write-back the cache later) • Write misses • read the entire block into the cache, then write the word

What happens on a write? • Writes work somewhat differently • Suppose that, on a store instruction, we • Write the data into only the data cache • Memory would then have a different value • The cache & memory are “inconsistent” • Keep the main memory & cache consistent • Always write the data into both the memory and the cache • Called write-through

What happens on a write? • Although this design handles writes simply, it does not provide very good performance • Every write causes the data to be written to main memory • This takes a long time • Ex. 10% of the instructions are stores • CPI without cache misses: 1.0 • Spending 100 extra cycles on every write • CPI = 1.0 + 100 x 10% = 11 • Performance is reduced by more than a factor of 10
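The CPI arithmetic in the example can be checked with a one-line model. This is a sketch that assumes every store stalls for the full write latency and nothing else stalls; the function name and parameters are made up for illustration.

```python
def cpi_with_write_stalls(base_cpi: float, store_percent: int,
                          write_stall_cycles: int) -> float:
    """Effective CPI when every store stalls for write_stall_cycles.

    store_percent is the share of instructions that are stores, in percent.
    """
    return base_cpi + write_stall_cycles * store_percent / 100


# The slide's numbers: base CPI 1.0, 10% stores, 100 extra cycles per write.
print(cpi_with_write_stalls(1.0, 10, 100))
```

Changing the store share or the stall length shows how quickly naive write-through dominates execution time.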

Write Buffer for Write Through • Figure: Processor → Cache → Write Buffer → DRAM • A Write Buffer is needed between the Cache and Memory (TLB: Translation Lookaside Buffer) • A queue that holds data while the data are waiting to be written to memory • Processor: • writes data into the cache and the write buffer • Memory controller: • writes the contents of the buffer to memory

What happens on a write? • Write back • The new value is written only to the block in the cache • The modified block is written to the lower level of the hierarchy when it is replaced

What happens on a write? • Write Through • All writes go to main memory as well as cache • Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date • Lots of traffic • Slows down writes • Write Back • Updates initially made in cache only • Update bit for cache slot is set when update occurs • If block is to be replaced, write to main memory only if update bit is set • Other caches get out of sync
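To make the traffic difference concrete, the sketch below counts main-memory writes for a sequence of stores under each policy. It assumes a direct-mapped cache of one-word blocks and counts only writes to memory (block fetches on a miss are ignored); the names, including the `dirty` list standing in for the update bit, are hypothetical.

```python
def memory_writes(stores, policy, num_blocks=4):
    """Count writes reaching main memory for a list of store addresses."""
    if policy == "write-through":
        # Every store goes to memory as well as to the cache.
        return len(stores)
    if policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")

    tags = [None] * num_blocks     # which block each slot currently holds
    dirty = [False] * num_blocks   # update bit per slot
    writes = 0
    for address in stores:
        index = address % num_blocks
        tag = address // num_blocks
        if tags[index] != tag:
            # Replacing a block: write it to memory only if its update bit is set.
            if dirty[index]:
                writes += 1
            tags[index] = tag
        dirty[index] = True        # the store updates the cache only
    return writes


stores = [0, 0, 0, 4, 4, 0]
print(memory_writes(stores, "write-through"), memory_writes(stores, "write-back"))
```

Repeated stores to the same block are absorbed by the write-back cache, so only replacements of modified blocks reach memory; blocks still marked as updated when the sequence ends are not counted here.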

Memory System to Support Caches • It is difficult to reduce the latency to fetch the first word from memory • We can reduce the miss penalty if we increase the bandwidth from the memory to the cache • Figure: three organizations • One-word-wide memory: CPU, cache, bus, memory • Wide memory: CPU, multiplexor, cache, bus, memory • Interleaved memory: CPU, cache, bus, memory banks 0 to 3

One-word-wide memory organization • Assume • A cache block of 4 words • 1 memory bus clock cycle to send the address • 15 clock cycles for each DRAM access initiated • 1 memory bus clock cycle to return a word of data • Miss penalty: 1 + 4 x 15 + 4 x 1 = 65 clock cycles • Number of bytes transferred per bus clock cycle for a single miss • 4 x 4 / 65 = 0.25 bytes/clock • Figure: CPU, cache, bus, memory

Wide memory organization • Assume • A cache block of 4 words • 1 memory bus clock cycle to send the address • 15 clock cycles for each DRAM access initiated • 1 memory bus clock cycle to return a word of data • Two words wide • Miss penalty: 1 + 2 x 15 + 2 x 1 = 33 clock cycles • 4 x 4 / 33 = 0.48 bytes/clock • Four words wide • Miss penalty: 1 + 1 x 15 + 1 x 1 = 17 clock cycles • 4 x 4 / 17 = 0.94 bytes/clock • Figure: CPU, multiplexor, cache, bus, memory

Interleaved memory organization • Assume • A cache block of 4 words • 1 memory bus clock cycle to send the address • 15 clock cycles for each DRAM access initiated • 1 memory bus clock cycle to return a word of data • Each memory bank: 1 word wide • Advantage: the banks are accessed in parallel, so the access latency is paid only once • Miss penalty: 1 + 1 x 15 + 4 x 1 = 20 clock cycles • 4 x 4 / 20 = 0.8 bytes/clock • About 3 times the bandwidth of the one-word-wide organization • Figure: CPU, cache, bus, memory banks 0 to 3
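The miss penalties for the three memory organizations follow one pattern, which the sketch below reproduces using the slides' timing assumptions. The constant and function names are invented for illustration, and the 4 bytes per word is inferred from the "4 x 4" in the bandwidth figures.

```python
BLOCK_WORDS = 4          # words per cache block
BYTES_PER_WORD = 4       # inferred from the "4 x 4" byte counts
ADDRESS_CYCLES = 1       # bus cycles to send the address
DRAM_ACCESS_CYCLES = 15  # cycles per DRAM access initiated
TRANSFER_CYCLES = 1      # bus cycles to return one transfer


def wide_miss_penalty(width_words: int) -> int:
    """Memory and bus are width_words wide; width 1 is the one-word-wide case."""
    accesses = BLOCK_WORDS // width_words
    return ADDRESS_CYCLES + accesses * DRAM_ACCESS_CYCLES + accesses * TRANSFER_CYCLES


def interleaved_miss_penalty() -> int:
    """Four one-word banks are accessed in parallel, so the DRAM latency is
    paid once; the words still return one per bus cycle."""
    return ADDRESS_CYCLES + DRAM_ACCESS_CYCLES + BLOCK_WORDS * TRANSFER_CYCLES


def bytes_per_cycle(penalty: int) -> float:
    """Bandwidth for a single miss: block size in bytes over the miss penalty."""
    return BLOCK_WORDS * BYTES_PER_WORD / penalty


for name, penalty in [("one-word-wide", wide_miss_penalty(1)),
                      ("two-word-wide", wide_miss_penalty(2)),
                      ("four-word-wide", wide_miss_penalty(4)),
                      ("interleaved", interleaved_miss_penalty())]:
    print(f"{name}: {penalty} cycles, {bytes_per_cycle(penalty):.2f} bytes/clock")
```

Interleaving comes close to the four-word-wide design while keeping the narrow one-word bus.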

Q & A
