Computer Architecture: Peripherals. By Dan Tsafrir, 6/6/2011. Presentation based on slides by Lihu Rappoport.
Memory: reminder — Not so long ago… [Figure: CPU vs. DRAM performance, 1980–2000. CPU improved ~60% per year (2x in 1.5 yrs), DRAM ~9% per year (2x in 10 yrs); the gap grew ~50% per year.]
Not so long ago… • In 1994, in their paper “Hitting the Memory Wall: Implications of the Obvious”, William Wulf & Sally McKee said: “We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.”
More recently (2008)… [Figure: “The memory wall in the multicore era” — performance (seconds) vs. number of processor cores on a conventional architecture, from fast to slow; lower on the chart = slower.]
Memory Trade-Offs • Large (dense) memories are slow • Fast memories are small, expensive, and consume a lot of power • Goal: give the processor the feeling that it has a memory that is large (dense), fast, cheap, and low-power • Solution: a hierarchy of memories — CPU → L1 Cache → L2 Cache → L3 Cache → Memory (DRAM) • Moving away from the CPU: speed goes from fastest to slowest, size from smallest to biggest, cost from highest to lowest, power from highest to lowest
DRAM basics • DRAM = Dynamic random-access memory • Random access = access cost is the same for all locations (well, not really) • The CPU thinks of DRAM as 1-dimensional • Simpler • But DRAM is actually arranged as a 2-D grid • Need row & column addresses to access it • Given the “1-D address”, the DRAM interface splits it into row & column (see the sketch below) • Some time must elapse between the row & column accesses (10s of ns)
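A minimal sketch of how a flat “1-D” address might be split into row and column numbers. The bit widths (14 row bits, 10 column bits) are illustrative assumptions, not the parameters of any particular DRAM device.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative assumption: an array with 2^14 rows and 2^10 columns.
 * A real DRAM's geometry (and thus the bit split) differs per device. */
#define COL_BITS 10
#define ROW_BITS 14

int main(void)
{
    uint32_t addr = 0x2ABCDE & ((1u << (ROW_BITS + COL_BITS)) - 1); /* flat "1-D" address */
    uint32_t col  = addr & ((1u << COL_BITS) - 1);                  /* low bits pick the column */
    uint32_t row  = (addr >> COL_BITS) & ((1u << ROW_BITS) - 1);    /* high bits pick the row   */
    printf("addr=0x%06x -> row=%u col=%u\n", (unsigned)addr, (unsigned)row, (unsigned)col);
    return 0;
}
```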
DRAM basics • Why 2-D? Why separate, delayed row & column accesses? • Every address bit requires a physical pin • DRAMs are large (GBs nowadays) => would need many pins => more expensive • So the row and column numbers are multiplexed over the same address pins, one after the other • A DRAM array has • A row decoder — extracts the row number from the memory address • A column decoder — extracts the column number from the memory address • Sense amplifiers — hold the row when it is (1) written to, (2) read from, or (3) refreshed (see next slide)
DRAM basics • One transistor-capacitor pair per bit • Capacitors leak • => Each row must be refreshed every few ms • DRAM spends ~1% of its time refreshing (see the rough estimate below) • “Opening” a row = fetching it into the sense amplifiers = refreshing it • Is it worth making the DRAM array a rectangle (rather than a square)?
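A back-of-the-envelope check of the ~1% figure, using assumed, illustrative numbers (8192 rows per bank, a 64 ms refresh interval, ~50 ns to refresh one row); real devices vary.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions (not from the slides): */
    double rows_per_bank    = 8192;   /* rows to refresh within one interval */
    double refresh_interval = 64e-3;  /* every row refreshed within 64 ms    */
    double row_refresh_time = 50e-9;  /* ~50 ns to open + close one row      */

    double busy     = rows_per_bank * row_refresh_time;  /* time spent refreshing   */
    double overhead = busy / refresh_interval;           /* fraction of total time  */
    printf("refresh overhead = %.2f%%\n", overhead * 100);
    return 0;
}
```

With these numbers the overhead comes out around 0.6%, consistent with the slide's rough ~1% figure.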
x1 DRAM [Figure: a single memory array (rows × columns) with a row decoder on its side; the selected row is captured by the sense amplifiers, the column decoder picks one bit out of it, and the data in/out buffers drive that one bit on the pins.]
DRAM banks • Each DRAM memory array outputs one bit • DRAMs use multiple arrays to output multiple bits at a time • xN denotes a DRAM with N memory arrays • Typical today: x16, x32 • Each collection of N arrays forms a DRAM bank • Each bank can be read/written independently
x4 DRAM [Figure: four copies of the x1 structure — four memory arrays, each with its own row decoder, sense amplifiers, column decoder, and data in/out buffers — outputting four bits at a time.]
Ranks & DIMMs • DIMM • (Dual in-line) memory module (the unit we connect to the MB) • Increase bandwidth by delivering data from multiple banks • Bandwidth by one bank is limited • => Put multiple banks on DIMM • Bus has higher clock frequency than any one DRAM • Bus controls switches between banks to achieve high data rate • Increase capacity by utilizing multiple ranks • Each rank is an independent set of banks that can be accessed for the full data bit‐width, • 64 bits for non-ECC; 72 for ECC (error correction code) • Ranks cannot be accessed simultaneously • As they share the same data path
Ranks & DIMMs 1GB 2Rx8 (= 2ranks x 8 banks)
Modern DRAM organization • A system has multiple DIMMs • Each DIMM has multiple DRAM banks • Arranged in one or more ranks • Each bank has multiple DRAM arrays • Concurrency among banks increases memory bandwidth (a capacity sketch follows below)
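A quick consistency check on the “1GB 2Rx8” label above, using an assumed, illustrative per-chip geometry (8 banks × 8192 rows × 1024 columns × 8 bits); a real DIMM's row/column/bank counts may differ.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Illustrative geometry for one x8 chip (assumptions, not from the slides). */
    const uint64_t banks = 8, rows = 8192, cols = 1024, bits_per_col = 8;
    const uint64_t chips_per_rank = 8;   /* 8 x8 chips = 64-bit rank */
    const uint64_t ranks = 2;            /* the "2R" in 2Rx8         */

    uint64_t chip_bits  = banks * rows * cols * bits_per_col;       /* 512 Mb per chip */
    uint64_t dimm_bytes = ranks * chips_per_rank * chip_bits / 8;   /* 1 GiB per DIMM  */

    printf("chip = %llu Mb, DIMM = %llu MiB\n",
           (unsigned long long)(chip_bits >> 20),
           (unsigned long long)(dimm_bytes >> 20));
    return 0;
}
```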
Memory controller [Figure: the memory controller drives a shared address/command bus and data bus to the DIMM; separate chip-select signals (chip select 1, chip select 2) choose which rank responds.]
Memory controller • Functionality: executes the processor's memory requests • In earlier systems: a separate, off-processor chip • In modern systems: integrated on-chip with the processor • Interconnect with the processor: typically a bus, but can be point-to-point or through a crossbar
Lifetime of a memory access • The processor orders & queues memory requests • Request(s) are sent to the memory controller • The controller queues & orders the requests • For each request in the queue, when the time is right: • The controller waits until the requested DRAM is ready • The controller breaks the address bits into rank, bank, row, and column fields (see the sketch below) • The controller sends the chip-select signal to select the rank • The selected bank is pre-charged so the selected row can be activated • The row within the selected DRAM bank is activated, using RAS (the row-address strobe signal) • The (entire) row is sent to the sense amplifiers • The desired column is selected, using CAS (the column-address strobe signal) • The data is sent back
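A minimal sketch of the “break the address bits into fields” step. The field widths (2 rank bits, 3 bank bits, 14 row bits, 10 column bits) and their order are illustrative assumptions; real controllers choose the mapping to maximize rank/bank parallelism.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative field widths, lowest bits first: column, bank, rank, row. */
#define COL_BITS  10
#define BANK_BITS  3
#define RANK_BITS  2
#define ROW_BITS  14

struct dram_addr { unsigned rank, bank, row, col; };

static struct dram_addr decode(uint64_t addr)
{
    struct dram_addr d;
    d.col  = addr & ((1u << COL_BITS)  - 1);  addr >>= COL_BITS;
    d.bank = addr & ((1u << BANK_BITS) - 1);  addr >>= BANK_BITS;
    d.rank = addr & ((1u << RANK_BITS) - 1);  addr >>= RANK_BITS;
    d.row  = addr & ((1u << ROW_BITS)  - 1);
    return d;
}

int main(void)
{
    struct dram_addr d = decode(0x12345678ull);
    printf("rank=%u bank=%u row=%u col=%u\n", d.rank, d.bank, d.row, d.col);
    return 0;
}
```

One common design goal is to place the bits so that nearby accesses land in different banks and can proceed concurrently.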
Basic DRAM array [Figure: the address bus feeds a row latch + row address decoder (strobed by RAS#) and a column latch + column address decoder (strobed by CAS#); the decoders select a location in the memory array, which drives the data pins.] • Timing (2 phases) • Decode the row address + assert RAS# • Wait for the “RAS to CAS delay” • Decode the column address + assert CAS# • Transfer DATA
DRAM timing • CAS latency • The number of clock cycles to access a specific column of data • From the moment the memory controller issues the column address (within the currently open row) until the data is read out • RAS-to-CAS delay • The number of cycles between the row access and the column access • Row pre-charge time • The number of cycles needed to close the open row so that the next row can be opened
Addressing sequence [Timing diagram: RAS# and CAS# strobes over the address lines A[0:7]; Row i and then Col n appear on the address bus; the RAS/CAS delay separates the two strobes, the CAS latency separates CAS# from Data n, and the precharge delay and total access time are marked before the next row (Row j).] • Access sequence (a rough latency sketch follows below) • Put the row address on the address bus and assert RAS# • Wait for the RAS#-to-CAS# delay (tRCD) • Put the column address on the address bus and assert CAS# • DATA transfer • Pre-charge
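A rough sketch of how these parameters add up, using assumed example values (tRP = tRCD = CAS latency = 11 cycles on an 800 MHz bus clock, loosely DDR3-1600-like but purely illustrative). A “row hit” — an access to the already-open row, as in the paged-mode scheme on the next slide — skips the precharge and tRCD components.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative timing parameters, in bus-clock cycles (assumptions). */
    const double clock_mhz = 800.0;  /* bus clock */
    const int tRP  = 11;             /* row precharge: close the open row  */
    const int tRCD = 11;             /* RAS-to-CAS delay: open the new row */
    const int tCL  = 11;             /* CAS latency: column access to data */

    double ns_per_cycle = 1000.0 / clock_mhz;

    /* Row miss: must precharge, open the new row, then access the column. */
    double row_miss_ns = (tRP + tRCD + tCL) * ns_per_cycle;
    /* Row hit: the row is already open in the sense amplifiers. */
    double row_hit_ns  = tCL * ns_per_cycle;

    printf("row miss: %.1f ns, row hit: %.1f ns\n", row_miss_ns, row_hit_ns);
    return 0;
}
```

With these numbers, reusing the open row is roughly 3x faster than closing it and opening another — the motivation for paged-mode DRAM below.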
Improved DRAM Schemes • Paged-Mode DRAM • Multiple accesses to different columns of the same row (spatial locality) • Saves the time it takes to bring in a new row (but might be unfair) • Extended Data Output RAM (EDO RAM) • A data-output latch lets the next column address be driven in parallel with the current column's data [Timing diagrams: paged mode — one RAS# with one row address, followed by several CAS# pulses for Col n, Col n+1, Col n+2, each returning Data n, Data n+1, Data n+2; EDO — similar, but each next column address overlaps the previous data transfer, so the data beats come out back-to-back.]
Improved DRAM Schemes (cont.) • Burst DRAM • Generates the consecutive column addresses by itself [Timing diagram: one RAS#/row and a single CAS# with Col n; Data n, Data n+1, Data n+2 then stream out without further column addresses.]
Synchronous DRAM (SDRAM) • Asynchrony in plain DRAM • Due to RAS & CAS arriving at any time • Synchronous DRAM • Uses a clock to deliver requests at regular intervals • More predictable DRAM timing • => Less skew • => Faster turnaround • SDRAMs support burst-mode access • Initial performance was similar to BEDO (= Burst + EDO) • Clock scaling later enabled higher transfer rates • => DDR SDRAM => DDR2 => DDR3 (peak-bandwidth sketch below)
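A small sketch of how clock scaling translates into peak transfer rate for one 64-bit (8-byte) DDR channel. The speed grades listed (DDR-400, DDR2-800, DDR3-1600) are illustrative standard examples, not figures from the slides.

```c
#include <stdio.h>

int main(void)
{
    /* Peak bandwidth of one 64-bit channel = transfers/s * 8 bytes.
     * "DDR" = double data rate: two transfers per bus clock cycle. */
    const double bus_bytes = 8.0;                   /* 64-bit data bus */
    const char  *name[]    = {"DDR-400", "DDR2-800", "DDR3-1600"};
    const double mtps[]    = {400, 800, 1600};      /* mega-transfers per second */

    for (int i = 0; i < 3; i++)
        printf("%-10s peak = %.1f GB/s\n", name[i], mtps[i] * 1e6 * bus_bytes / 1e9);
    return 0;
}
```

DDR3-1600, for example, comes out at 12.8 GB/s per channel (the familiar PC3-12800 label).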
DRAM vs. SRAM (Random access = access time the same for all locations)