Chapter 7: Memory System Design

Chapter 7: Memory System Design Topics Introduction: The Components of the Memory System The CPU - Memory Interface Memory Structure Memory Boards and Modules Two-Level Memory Hierarchy The Cache Virtual Memory

Components of a Computer System Memory System Input/Output System CPU

Memory System • The memory system houses the information that is processed by the computer system: • Data - variables, images, documents, etc. • Programs - operating system, applications • The memory system has many levels and categories: • CPU registers, cache, main memory, disk, tape • Volatile, Non-volatile • Speed • Access pattern • Capacity

Component • Access patternRandom Random Random Direct Sequential • access access access access access • Capacity, bytes64–1024 8–1024KB 8–64 MB 1–10 GB 1 TB • Latency1–10 ns 20 ns 50 ns 10 ms 10 ms–10 s • Block size 1 word 16 words 16 words 4 KB 4 KB • Bandwidth System 8 MB/s 1 MB/s 1 MB/s 1 MB/s • clock • rate • Cost/MBHigh $500 $30 $0.25 $0.02 C P U T a p e C a c h e M a i n M e m o r y D i s k M e m o r y M e m o r y Tbl 7.3 The Memory Hierarchy, Cost, and Performance Some Typical Values:

The Main Memory System • The system’s disk space and tape backup facility are part of the memory system • Information storage (programs, documents, etc.) • Increasing the amount of storage space available to programs (virtual memory) • However, for this discussion, we will concentrate on the construction and operation of the physical memory in the system (cache, main memory) • Interfacing memory to the CPU • Memory organization • Implementation

Components of the Main Memory System • The main memory system typically consists of two types of memory • ROM - read only memory - used to hold the non-volatile part of the system information • RAM - random access memory - used to hold the system information that can be lost when the system powers down • The term RAM for the volatile portion of the memory is somewhat a misnomer as the ROM memory is also random access, but that is the term that has become standard • Every system must contain some amount of non-volatile ROM to tell the CPU what to do when it first powers up • ROM contains the entire program that the CPU executes - RAM is used only for temporary data storage (dedicated microcontrollers) • ROM holds a small program that loads the main program (an application or operating system) into RAM and executes it - called a “boot ROM” or “boot strap” program (complex embedded systems, desktop systems mainframes)

Tbl 7.1 Some Memory Properties Symbol Definition Intel Intel PowerPC 8088 8086 601w CPU word size 16 bits 16 bits 64 bits m Bits in a logical memory address 20 bits 20 bits 32 bits s Bits in smallest addressable unit 8 bits 8 bits 8 bits b Data bus size 8 bits 16 bits 64 bits 2m Memory word capacity, s-sized wds 220 words220 words232 words 2mxs Memory bit capacity 220 x 8 bits 220 x 8 bits 232 x 8 bits

Big-Endian and Little-Endian Storage • When data types having a word size larger than the smallest addressable unit are stored in memory (eg. a 32 bit floating point number stored in a memory addressed in 16 bit words) there are two ways to do it: • Store the least significant part of the word at the lowest address (little-Endian, little end first) • Store the most significant part of the word at the lowest address (big-Endian, big end first) • Example: The hexadecimal 16-bit number 0xABCD, stored at address 0 in a byte addressable memory: msb ... lsb AB CD Little-Endian Big-Endian AB 1 1 CD CD 0 0 AB

Tbl 7.2 Memory Performance Parameters Symbol Definition Units Meaning ta Access time time Time to access a memory word tc Cycle time time Time from start of access to start of next access k Block size words Number of words per block  Bandwidth words/time Word transmission rate tl Latency time Time to access first word of a sequence of words tbl = Block time Time to access an entire block of wordstl + k/ access time (Information is often stored and moved in blocks at the cache and disk level.)

D a t a b u s A d d r e s s b u s C P U M a i n m e m o r y m s A d d r e s s m 0 M A R A – A 0 m – 1 w 1 D – D b 0 b – 1 2 M D R 3 w R / W R e g i s t e r f i l e m 2 – 1 R E Q U E S T C O M P L E T E C o n t r o l s i g n a l s –more– Fig 7.1 The CPU–Memory Interface • Sequence of events: • Read: • 1. CPU loads MAR, issues Read, and REQUEST • 2. Main memory transmits words to MDR • 3. Main memory asserts COMPLETE • Write: • 1. CPU loads MAR and MDR, asserts Write, and REQUEST • 2. Value in MDR is written into address specified by MAR • 3. Main memory asserts COMPLETE

Fig 7.1 The CPU–Memory Interface (cont’d.) D a t a b u s A d d r e s s b u s C P U M a i n m e m o r y m s A d d r e s s m 0 M A R A – A 0 m Ð 1 w 1 D – D b 0 b Ð 1 2 M D R 3 w R / W R e g i s t e r f i l e m 2 – 1 R E Q U E S T C O M P L E T E C o n t r o l s i g n a l s • Additional points: • If b < w, main memory must make w/b b-bit transfers • Some CPUs allow reading and writing of word sizes < w • Example: Intel 8088: m = 20, w = 16, s = b = 8 • 8- and 16-bit values can be read and written • If memory is sufficiently fast, or if its response is predictable,then COMPLETE may be omitted • Some systems use separate R and W lines, and omit REQUEST

Specific Example - the SRC Memory Interface R W ADDR Memory Subsystem SRC CPU Done MEMStrobe DATA • Memory Read Operation • CPU places address on ADDR bus and asserts Read line • Memory places valid data on the DATA bus and asserts Done and MEMStrobe • Memory Write Operation • CPU places address on ADDR bus, data on the DATA bus, and asserts Write line • Memory stores data and asserts Done

System Clock Valid data from memory to CPU DATA Valid address from CPU to memory ADDR Start of memory read operation R End of memory read operation W Done MEMStrobe 1 to n clock cycles SRC Memory Read Bus Cycle

System Clock DATA Valid data from CPU to memory Valid address from CPU to memory ADDR R Start of memory write operation End of memory write operation W Done MEMStrobe 1 to n clock cycles SRC Memory Write Bus Cycle

Wait States • Memory operations can take from 1 to n clock cycles • If the memory is fast enough to perform the operation in a single clock cycle, the COMPLETE signal (Done and MEMStrobe in the SRC) can be asserted in the same clock cycle • If the memory can not complete the request in a single clock cycle, it simply continues working on it and delays sending the COMPLETE signal. • If the memory operation takes n>1 clock cycles to complete, the memory is said to have inserted n-1 “wait states” into the operation • n=1 is a zero wait state access- this is usually not possible given the increasing speed of processors vs. the speed of memory

Zero Wait States (n=1) One Wait State (n=2) SRC Wait State Examples

Three Wait States (n=4) Many Wait States More SRC Wait State Examples

Addressing Memory - Decoding the Chip Selects • The main memory system is typically made up of “blocks” of memory implemented as physically separate units • Each block must be activated based on its portion of the address space of the CPU - this is accomplished using individual Chip Selects (CS) to activate each block • Activation of the CS is accomplished by decoding the address from the CPU - one of the functions of the memory controller Memory Subsystem CS R ROM W Memory Controller Done MEMStrobe CS SRC CPU RAM1 ADDR CS DATA RAM2

Building an Address Decoder • Determine how you want the various ROM and RAM memory modules arranged in the CPU’s address space • Calculate the starting and ending address for each block of memory • Given a starting address of s and a memory block size of k, the ending address, e = s + (k-1) • If the ending address of a previous memory block is e, the starting address of the next block, s = e+1 • Calculate the number of lower address bits that must be connected to the block to address the memory within it • Determine the unique pattern on the upper address bits that will signify activity on that block • Construct logic that will activate the CS for that block given the above pattern

Memory Map Addresses 0x00000000 32K ROM block 32K = 215 - 15 bits to address memory within this block 0x00007FFF 0x00008000 16K RAM block 16K = 214 - 14 bits to address memory within this block 0x0000BFFF 0x0000C000 16K RAM block 16K = 214 - 14 bits to address memory within this block 0x0000FFFF Building an Address Decoder - An Example • Consider a processor with a 32 bit address bus which uses word addressing connected to the following memory: • 32K word EEPROM starting at address 0x00000000 • 16K word RAM module above the EEPROM • a second 16KRAM module above the first

ADDR(31:15) CS_ROM1 to ROM1’s address lines ADDR(14:0) ADDR(31:16) ADDR(15) CS_RAM1 ADDR(14) to RAM1’s address lines ADDR(13:0) ADDR(31:16) ADDR(15) CS_RAM2 ADDR(14) to RAM2’s address lines ADDR(13:0) Address Decoding Example (cont.) • First block of ROM uses bits (14:0) to address it • Starts at address 0x00000000 and goes to 0X00007FFF • Selected when all bits above (14) are ‘0’ • First block of RAM uses bits (13:0) to address it • Starts at address 0x00008000 and goes to 0x0000BFFF • Selected when bits (32:16) are ‘0’, and bits (15:14) are “10” • Second block of RAM uses bits (13:0) to address it • Starts at address 0x0000C000 and goes to 0x0000FFFF • Selected when bits (32:16) are ‘0’, and bits (15:14) are “11”

Memory Map Addresses 0x0000 2K ROM block 2K = 211 - 11 bits to address memory within this block 0x07FF 0x0800 16K RAM block 16K = 214 - 14 bits to address memory within this block 0x47FF Another Address Decoding Example • Decoding is more complex when a larger block sits above a smaller one • Consider a processor with a 16 bit address bus which uses word addressing connected to the following memory: • 2K word EEPROM starting at address 0x00000000 • 16K word RAM module above the EEPROM

Another Decoding Example (cont.) • Selecting first ROM is simple • Selected when all bits above ADDR(10) are ‘0’ • Selecting (and addressing within) RAM is more complex • Starting address of RAM is 0x0800 - 0b 0000 1000 0000 0000 • However, ADDR(13:0) are used to address within the RAM and must be ‘0’ to address the first RAM location • Now we have to subtract 0b 1000 0000 0000 from ADDR(13:0) value to get proper address within RAM - too complex • Solution - adjust memory map so that all blocks are “aligned” on size of largest block Memory Map Addresses 0x0000 2K ROM block 2K = 211 - 11 bits to address memory within this block 0x07FF 14K block unused 0x4000 16K RAM block 16K = 214 - 14 bits to address memory within this block 0x7FFF

ADDR(15:11) CS_ROM1 to ROM1’s address lines ADDR(10:0) ADDR(15) CS_RAM1 ADDR(14) to RAM1’s address lines ADDR(13:0) Another Decoding Example (cont.) • First block of ROM uses bits (10:0) to address it • Starts at address 0x0000 and goes to 0X07FF • Selected when all bits above (10) are ‘0’ • First block of RAM uses bits (13:0) to address it • Starts at address 0x4000 and goes to 0x7FFF • Selected when bit (15) is ‘0’, and bit (14) is ‘1’

Additional Functions of the Memory Controller • As just discussed, the memory controller must decode the addresses from the CPU and select the proper memory block • In addition, the memory controller must provide the proper handshaking between the CPU and the memory • Most memory chips are asynchronous (don’t work on a clock) and do not have any outputs which tell when the data is valid - you simply have to apply the correct inputs and wait the proper amount of time • Some memory chips require a sequence of signals to be applied for different operations • Each CPU type had a different set of control signals and a different bus cycle • Finally, some memory controllers are required to perform complex address calculations to translate addresses from the CPU into a different format that the memory system requires

S e l e c t D a t a I n D a t a O u t R / W This “static” RAM cell is unrealistic in practice, but it is functionally correct. We will discuss more practical designs later. Fig 7.3 Conceptual Structure of a Memory Cell Regardless of the technology, all RAM memory cells must provide these four functions: Select, DataIn, DataOut, and R/W. Select  DataIn DataOut R/W

The entire register is selected with one select line, and uses one R/W line Fig 7.4 An 8-Bit Register as a1-D RAM Array Data bus is bidirectional and buffered

2-4 line decoder selects one of the four 8-bit arrays 2 – 4 D D D D D D D D d e c o d e r 2-bit address A 1 D D D D D D D D A 0 D D D D D D D D D D D D D D D D • This type of N to 2N decoder requires 2N gates, which is practical for only very small arrays R / W d d d d d d d d 0 1 2 3 4 5 6 7 R/W is common to all Bidirectional 8-bit buffered data bus Fig 7.5 A 4 x 8 2-D Memory Cell Array

Fig 7.6 A 64 K x 1 Static RAM Chip ~square array fits IC design paradigm R o w a d d r e s s : 8 2 5 6 8 – 2 5 6 A – A 2 5 6   2 5 6 0 7 r o w c e l l a r r a y d e c o d e r Selecting rows separately from columns means only 256 x 2 = 512 circuit elements instead of 65536 circuit elements! 2 5 6 8 C o l u m n a d d r e s s : 1 2 5 6 – 1 m u x A – A 1 1 – 2 5 6 d e m u x 8 1 5 CS, Chip Select, allows chips in arrays to be selected individually R / W 1 C S This chip requires 21 pins including power and ground, and so will fit in a 22-pin package.

Fig 7.7 A 16 K x 4 SRAM Chip R o w a d d r e s s : 8 2 5 6 8 – 2 5 6 4 6 4   2 5 6 A – A 0 7 r o w c e l l a r r a y s d e c o d e r There is little difference between this chip and the previous one, except that there are4 64-1 multiplexers instead of1 256-1 multiplexer. 6 4 e a c h 6 C o l u m n a d d r e s s : – 4 6 4 1 m u x e s A – A 4 1 – 6 4 d e m u x e s 8 1 3 R / W 4 C S This chip requires 24 pins including power and ground, and so will require a 24-pin package. Package size and pin count can dominate chip cost.

m m m m 0 4 8 1 2 m m m m m m 0 4 1 5 9 1 3 d x e 0 c m m 2 m m m m o – 1 5 2 6 1 0 1 4 d x 4 e 1 r d x e m m m m 2 c 0 m m o – 3 7 1 1 1 5 2 6 d 4 x e r 1 m m 3 7 2 – 4 d e c o d e r x x x x 2 2 2 3 Fig 7.8 Matrix and Tree Decoders • 2-level decoders are limited in size because of gate fan-in.Most technologies limit fan-in to ~8. • When decoders must be built with fan-in >8, then additional levelsof gates are required. • Tree and matrix decoders are two ways to design decoders with large fan-in: 4-to-16 line matrix decoder constructed from 2-input gates. 3-to-8 line tree decoder constructed from 2-input gates.

Fig 7.9Six-Transistor Static RAM Cell D u a l r a i l d a t a l i n e s f o r r e a d i n g a n d w r i t i n g b b i i A c t i v e + 5 l o a d s S t o r a g e c e l l W o r d l i n e w i This is a more practical design than the 8-gate design shown earlier. S w i t c h e s t o c o n t r o l a c c e s s t o c e l l A d d i t i o n a l c e l l s A value is read by precharging the bit lines to a value 1/2 way between a 0 and a 1, while asserting the word line. This allows the latch to drive the bit lines to the value stored in the latch. C o l u m n s e l e c t S e n s e / w r i t e a m p l i f i e r s — ( f r o m c o l u m n s e n s e a n d a m p l i f y d a t a a d d r e s s o n R e a d , d r i v e b a n d i d e c o d e r ) b o n w r i t e i R / W C S d i

M e m o r y a d d r e s s R e a d / w r i t e C S D a t a t A A Fig 7.10 Static RAM Read Operation Access time from Address—the time required of the RAM array to decode the address and provide value to the data bus.

Fig 7.11 Static RAM Write Operations M e m o r y a d d r e s s R e a d / w r i t e Data is latched on this rising edge C S D a t a t w Write time—the time the data must be held valid in order to decode address and store value in memory cells.

S w i t c h t o c o n t r o l b a c c e s s t o c e l l i S i n g l e b i t l i n e C a p a c i t o r s t o r e s c h a r g e f o r a 1 , c n o c h a r g e f o r t a 0 W o r d l i n e w j A d d i t i o n a l c e l l s C o l u m n s e l e c t ( f r o m c o l u m n S e n s e / w r i t e a m p l i f i e r s — a d d r e s s s e n s e a n d a m p l i f y d a t a d e c o d e r ) o n R e a d , d r i v e b a n d b i i o n w r i t e R / W R W C S d i Fig 7.12 Dynamic RAM Cell Organization Capacitor will discharge in 4–15 ms. Refresh capacitor by reading (sensing) value on bit line, amplifying it, and placing it back on bit line where it recharges capacitor. Write: place value on bit line and assert word line. Read: precharge bit line, assert word line, sense value on bit line with sense/amp. This need to refresh the storage cells of dynamic RAM chips complicates DRAM system design.

r e d o c e d 1 0 2 4 d n a 1 0 2 4  1 0 2 4 c e l l a r r a y s e h c t a l w o R 1 0 2 4 1 0 C o n t r o l A – A 1 0 2 4 s e n s e / w r i t e a m p l i f i e r s 0 9 a n d c o l u m n l a t c h e s R A S C o n t r o l 1 0 2 4 l o g i c C A S 1 0 1 0 c o l u m n a d d r e s s l a t c h e s , R / W 1 – 1 0 2 4 m u x e s a n d d e m u x e s d d o i Fig 7.13 Dynamic RAM Chip Organization • Addresses are time-multiplexed on address bus using RAS and CAS as strobes of rows and columns. • CAS is normally used as the CS function. • Notice pin counts: • Without address multiplexing: 27 pins including power and ground. • With address multiplexing: 17 pins including power and ground.

Figs 7.14, 7.15DRAM Read and Write Cycles Typical DRAM Read operation Typical DRAM Write operation M e m o r y M e m o r y R o w a d d r e s s C o l u m n a d d r e s s R o w a d d r e s s C o l u m n a d d r e s s a d d r e s s a d d r e s s t t t t R A S R A S R A S P r e c h g R A S p r e c h g C A S C A S R / W W D a t a D a t a t t A D H R t t C C Data hold from RAS. Access time Cycle time Notice that it is the bit line precharge operation that causes the difference between access time and cycle time.

DRAM Refresh and Row Access • Refresh is usually accomplished by a “RAS-only” cycle. The row address is placed on the address lines and RAS asserted. This refreshes the entire row. CAS is not asserted. The absence of a CAS phase signals the chip that a row refresh is requested, and thus no data is placed on the external data lines. • Many chips use “CAS before RAS” to signal a refresh. The chip has an internal counter, and whenever CAS is asserted before RAS, it is a signal to refresh the row pointed to by the counter, and to increment the counter. • Most DRAM vendors also supply one-chip DRAM controllers that encapsulate the refresh and other functions. • Page mode, nibble mode, and static column mode allow rapid access tothe entire row that has been read into the column latches. • Video RAMS, VRAMS, clock an entire row into a shift register where it canbe rapidly read out, bit by bit, for display.

+ v A d d r e s s 00 R o w d e c o d e r C S 1 0 1 0 ROM (Non-Volatile) Memory Fig 7.16 A 2-D CMOS ROM Chip

Tbl 7.4 ROM Types ROM Cost Programmability Time to Time to Erase TypeProgram Mask- Very At factory Weeks N/A programmed inexpensive only ROM PROM Inexpensive Once, by Seconds N/A end user EPROM Moderate Many times Seconds 20 minutes Flash Expensive Many times 100 s 1 s, large EPROM block EEPROM Very Many times 100 s 10 ms, expensive byte

Memory Boards and Modules • There is a need for memories that are larger and wider than a single chip • Chips can be organized into “boards.” • Boards may not be actual, physical boards, but may consist ofstructured chip arrays present on the motherboard. • A board or collection of boards make up a memory module. • Memory modules: • Satisfy the processor–main memory interface requirements • May have DRAM refresh capability • May expand the total main memory capacity • May be interleaved to provide faster access to blocks of words

Fig 7.17 General Structureof a Memory Chip This is a slightly different view of the memory chip than previous. Multiple chip selects ease the assembly of chips into chip arrays. Usually provided by an external AND gate. C h i p s e l e c t s . . . A d d r e s s R o w d e c o d e r m . . . R / W M e m o r y . . . c e l l C S C S a r r a y R / W m s A d d r e s s I / O s s m u l t i p l e x e r D a t a s s Data

All chips have common CS, R/W, and Address lines. S e l e c t A d d r e s s R / W C S C S C S . . . R / W R / W R / W A d d r e s s A d d r e s s A d d r e s s D a t a D a t a D a t a s s s p   s Fig 7.18 Word Assembly from Narrow Chips P chips expand word size from s bits to p x s bits.

A d d r e s s m + k k k t o 2 d e c o d e r k . . . m R / W C S C S C S R / W R / W R / W A d d r e s s A d d r e s s A d d r e s s D a t a D a t a D a t a s s s s Fig 7.19 Increasing the Number of Words by a Factor of 2k The additional k address bits are used to select one of 2k chips, each one of which has 2m words: Word size remains at s bits.

Fig 7.20 Chip Matrix Using Two Chip Selects A d d r e s s m + q + k H o r i z o n t a l d e c o d e r k m R / W C S 1 C S 2 R / W A d d r e s s q D a t a r l a e c d i t o r c e e V d Multiple chip select lines are used to replace the last level of gates in this matrix decoder scheme. This scheme simplifies the decoding from use of a (q+k)-bit decoder to using one q-bit and one k-bit decoder. s m + q + k O n e o f 2 s - b i t w o r d s

Fig 7.21Three-DimensionalDynamicRAM Array E n a b l e C A S k c k 2 d e c o d e r c . . . k + k c r H i g h k a d d r e s s r r r e e d d . . . . . . o o c c e e d d r r k k 2 2 R A S R / W M u l t i p l e x e d a d d r e s s m / 2 R A S C A S R A S C A S R / W R / W • CAS is used to enable top decoder in decoder tree. • Use one 2-D array for each bit. Each 2-D array on separate board. A d d r e s s A d d r e s s D a t a D a t a D a t a w R A S C A S R / W A d d r e s s D a t a

A d d r e s s Bus Interface: k + m A d d r e s s r e g i s t e r k m C h i p / b o a r d s e l e c t i o n M o d u l e M e m o r y b o a r d s s e l e c t a n d / o r c h i p s C o n t r o l R e a d s i g n a l g e n e r a t o r W r i t e R e a d y w D a t a r e g i s t e r D a t a w Fig 7.22 A Memory Module and Its Interface • Must provide— • Read and Write signals. • Ready: memory is ready to accept commands. • Address—to be sent with Read/Write command. • Data—sent with Write or available upon Read when Ready is asserted. • Module select—needed when there is more than one module. Control signal generator: for SRAM, just strobes data on Read, Provides Ready on Read/Write For DRAM—also provides CAS, RAS, R/W, multiplexes address, generates refresh signals, and provides Ready.

A d d r e s s k + m A d d r e s s r e g i s t e r m / 2 k C h i p / b o a r d s e l e c t i o n m / 2 m / 2 R e f r e s h c o u n t e r A d d r e s s R e f r e s h m u l t i p l e x e r c l o c k a n d m / 2 c o n t r o l 2 t B o a r d a n d A d d r e s s l i n e s h s s e t e c h i p s e l e c t s u n r f q a e r e M o d u l e R G R s e l e c t D y n a m i c R A S R A M a r r a y R e a d M e m o r y C A S t i m i n g g e n e r a t o r W r i t e R / W D a t a l i n e s w R e a d y D a t a r e g i s t e r w D a t a Fig 7.23 Dynamic RAM Modulewith Refresh Control

Memory System Performance Breaking the memory access process into steps: • For all accesses: • transmission of address to memory • transmission of control information to memory (R/W, Request, etc.) • decoding of address by memory • For a Read: • return of data from memory • transmission of completion signal • For a Write: • transmission of data to memory (usually simultaneous with address) • storage of data into memory cells • transmission of completion signal The next slide shows the access process in more detail.

C o m m a n d t o m e m o r y R e a d o r w r i t e A d d r e s s t o m e m o r y R e a d o r w r i t e A d d r e s s d e c o d e C o m p l e t e P r e c h a r g e R e a d o r w r i t e W r i t e d a t a t o m e m o r y W r i t e W r i t e d a t a W r i t e R e t u r n d a t a R e a d t a t c ( a ) S t a t i c R A M b e h a v i o r R e a d o r w r i t e P r e c h a r g e R o w a d d r e s s & R A S C o l u m n a d d r e s s & C A S C o m p l e t e R / W R e a d o r w r i t e R e t u r n d a t a R e a d W r i t e d a t a t o m e m o r y W r i t e R e f r e s h P r e c h a r g e P e n d i n g r e f r e s h C o m p l e t e t a t c ( b ) D y n a m i c R A M b e h a v i o r Fig 7.26 Sequence of Steps in Accessing Memory “Hidden refresh” cycle. A normal cycle would exclude the pending refresh step. -more-

Chapter 7: Memory System Design