1 / 68

CS 505: Computer Structures Memory and Disk I/O

CS 505: Computer Structures Memory and Disk I/O. Thu D. Nguyen Spring 2005 Computer Science Rutgers University. Main Memory Background. Performance of Main Memory: Latency: Cache Miss Penalty Access Time: time between request and word arrives Cycle Time: time between requests

erling
Download Presentation

CS 505: Computer Structures Memory and Disk I/O

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS 505: Computer StructuresMemory and Disk I/O Thu D. Nguyen Spring 2005 Computer Science Rutgers University

  2. Main Memory Background • Performance of Main Memory: • Latency: Cache Miss Penalty • Access Time: time between request and word arrives • Cycle Time: time between requests • Bandwidth: I/O & Large Block Miss Penalty (L2) • Main Memory is DRAM: Dynamic Random Access Memory • Dynamic since needs to be refreshed periodically (8 ms) • Addresses divided into 2 halves (Memory as a 2D matrix): • RAS or Row Access Strobe • CAS or Column Access Strobe • Cache uses SRAM: Static Random Access Memory • No refresh (6 transistors/bit vs. 1 transistor) • Size: DRAM/SRAM ­ 4-8 • Cost/Cycle time: SRAM/DRAM ­ 8-16

  3. DRAM logical organization (4 Mbit) Column Decoder … D Sense Amps & I/O 1 1 Q Memory Array A0…A1 0 (2,048 x 2,048) Storage W ord Line Cell

  4. 4 Key DRAM Timing Parameters • tRAC: minimum time from RAS line falling to the valid data output. • Quoted as the speed of a DRAM when buy • A typical 4Mb DRAM tRAC = 60 ns • tRC: minimum time from the start of one row access to the start of the next. • tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns • tCAC: minimum time from CAS line falling to valid data output. • 15 ns for a 4Mbit DRAM with a tRAC of 60 ns • tPC: minimum time from the start of one column access to the start of the next. • 35 ns for a 4Mbit DRAM with a tRAC of 60 ns

  5. DRAM Performance • A 60 ns (tRAC) DRAM can • perform a row access only every 110 ns (tRC) • perform column access (tCAC) in 15 ns, but time between column accesses is at least 35 ns (tPC). • In practice, external address delays and turning around buses make it 40 to 50 ns • These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead!

  6. DRAM History • DRAMs: capacity +60%/yr, cost –30%/yr • 2.5X cells/area, 1.5X die size in ­3 years • ‘98 DRAM fab line costs $2B • DRAM only: density, leakage v. speed • Rely on increasing no. of computers & memory per computer (60% market) • SIMM or DIMM is replaceable unit => computers use any generation DRAM • Commodity, second source industry => high volume, low profit, conservative • Little organization innovation in 20 years • Order of importance: 1) Cost/bit 2) Capacity • First RAMBUS: 10X BW, +30% cost => little impact

  7. More esoteric Storage Technologies? • Tunneling Magnetic Junction RAM (TMJ-RAM): • Speed of SRAM, density of DRAM, non-volatile (no refresh) • New field called “Spintronics”: combination of quantum spin and electronics • Same technology used in high-density disk-drives • MEMs storage devices: • Large magnetic “sled” floating on top of lots of little read/write heads • Micromechanical actuators move the sled back and forth over the heads

  8. MEMS-based Storage • Magnetic “sled” floats on array of read/write heads • Approx 250 Gbit/in2 • Data rates:IBM: 250 MB/s w 1000 headsCMU: 3.1 MB/s w 400 heads • Electrostatic actuators move media around to align it with heads • Sweep sled ±50m in < 0.5s • Capacity estimated to be in the 1-10GB in 10cm2 See Ganger et all: http://www.lcs.ece.cmu.edu/research/MEMS

  9. Main Memory Performance • Simple: • CPU, Cache, Bus, Memory same width (32 or 64 bits) • Wide: • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UtraSPARC 512) • Interleaved: • CPU, Cache, Bus 1 word: Memory N Modules(4 Modules); example is word interleaved

  10. Main Memory Performance • Timing model (word size is 32 bits) • 1 to send address, • 6 access time, 1 to send data • Cache Block is 4 words • Simple M.P. = 4 x (1+6+1) = 32 • Wide M.P. = 1 + 6 + 1 = 8 • Interleaved M.P. = 1 + 6 + 4x1 = 11

  11. How Many Banks? • Number of banks  Number of clock cycles to access word in bank • otherwise will return to original bank before it can have next word ready • Increasing DRAM size => fewer chips => harder to have banks

  12. 32 8 8 2 4 1 8 2 4 1 8 2 DRAMs per PC over Time DRAM Generation ‘86 ‘89 ‘92 ‘96 ‘99 ‘02 1 Mb 4 Mb 16 Mb 64 Mb 256 Mb 1 Gb 4 MB 8 MB 16 MB 32 MB 64 MB 128 MB 256 MB 16 4 Minimum Memory Size

  13. Avoiding Bank Conflicts • Lots of banks int x[256][512]; for (j = 0; j < 512; j = j+1) for (i = 0; i < 256; i = i+1) x[i][j] = 2 * x[i][j]; • Even with 128 banks, since 512 is multiple of 128, conflict on word accesses • SW: loop interchange or declaring array not power of 2 (“array padding”) • HW: Prime number of banks • bank number = address mod number of banks • address within bank = address / number of words in bank • modulo & divide per memory access with prime no. banks? • address within bank = address mod number words in bank • bank number? easy if 2N words per bank

  14. Fast Memory Systems: DRAM specific • Multiple CAS accesses: several names (page mode) • Extended Data Out (EDO): 30% faster in page mode • New DRAMs to address gap; what will they cost, will they survive? • RAMBUS: startup company; reinvent DRAM interface • Each Chip a module vs. slice of memory • Short bus between CPU and chips • Does own refresh • Variable amount of data returned • 1 byte / 2 ns (500 MB/s per chip) • Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz) • Niche memory or main memory? • e.g., Video RAM for frame buffers, DRAM + fast serial output

  15. PotentialDRAM Crossroads? • After 20 years of 4X every 3 years, running into wall? (64Mb - 1 Gb) • How can keep $1B fab lines full if buy fewer DRAMs per computer? • Cost/bit –30%/yr if stop 4X/3 yr? • What will happen to $40B/yr DRAM industry?

  16. Main Memory Summary • Wider Memory • Interleaved Memory: for sequential or independent accesses • Avoiding bank conflicts: SW & HW • DRAM specific optimizations: page mode & Specialty DRAM • DRAM future less rosy?

  17. Virtual Memory: TB (TLB) CPU CPU CPU VA VA VA VA Tags PA Tags $ TB $ TB VA PA PA L2 $ TB $ MEM PA PA MEM MEM Overlap $ access with VA translation: requires $ index to remain invariant across translation Conventional Organization Virtually Addressed Cache Translate only on miss Synonym Problem

  18. 2. Fast hits by Avoiding Address Translation • Send virtual address to cache? Called Virtually Addressed Cacheor just Virtual Cache vs. Physical Cache • Every time process is switched logically must flush the cache; otherwise get false hits • Cost is time to flush + “compulsory” misses from empty cache • Dealing with aliases(sometimes called synonyms); Two different virtual addresses map to same physical address • I/O must interact with cache, so need virtual address • Solution to aliases • One possible solution in Wang et al.’s paper • Solution to cache flush • Addprocess identifier tagthat identifies process as well as address within process: can’t get a hit if wrong process

  19. 2. Fast Cache Hits by Avoiding Translation: Process ID impact • Black is uniprocess • Light Gray is multiprocess when flush cache • Dark Gray is multiprocess when use Process ID tag • Y axis: Miss Rates up to 20% • X axis: Cache size from 2 KB to 1024 KB

  20. 2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address • If index is physical part of address, can start tag access in parallel with translation so that can compare to physical tag • Limits cache to page size: what if want bigger caches and uses same trick? • Higher associativity one solution Page Address Page Offset Address Tag Block Offset Index

  21. Alpha 21064 • Separate Instr & Data TLB & Caches • TLBs fully associative • TLB updates in SW(“Priv Arch Libr”) • Caches 8KB direct mapped, write thru • Critical 8 bytes first • Prefetch instr. stream buffer • 2 MB L2 cache, direct mapped, WB (off-chip) • 256 bit path to main memory, 4 x 64-bit modules • Victim Buffer: to give read priority over write • 4 entry write buffer between D$ & L2$ Instr Data Write Buffer Stream Buffer Victim Buffer

  22. Alpha Memory Performance: Miss Rates of SPEC92 8K 8K 2M

  23. Alpha CPI Components • Instruction stall: branch mispredict (green); • Data cache (blue); Instruction cache (yellow); L2$ (pink) Other: compute + reg conflicts, structural conflicts

  24. Pitfall: Predicting Cache Performance from Different Prog. (ISA, compiler, ...) • 4KB Data cache miss rate 8%,12%, or 28%? • 1KB Instr cache miss rate 0%,3%,or 10%? • Alpha vs. MIPS for 8KB Data $:17% vs. 10% • Why 2X Alpha v. MIPS? D$, Tom D$, gcc D$, esp I$, gcc I$, esp I$, Tom

  25. Pitfall: Simulating Too Small an Address Trace I$ = 4 KB, B=16B D$ = 4 KB, B=16B L2 = 512 KB, B=128B MP = 12, 200

  26. Main Memory Summary • Wider Memory • Interleaved Memory: for sequential or independent accesses • Avoiding bank conflicts: SW & HW • DRAM specific optimizations: page mode & Specialty DRAM • DRAM future less rosy?

  27. Outline • Disk Basics • Disk History • Disk options in 2000 • Disk fallacies and performance • Tapes • RAID

  28. Inner Track Outer Track Sector Head Arm Platter Actuator Disk Device Terminology • Several platters, with information recorded magnetically on both surfaces (usually) • Bits recorded in tracks, which in turn divided into sectors (e.g., 512 Bytes) • Actuator moves head (end of arm,1/surface) over track (“seek”), select surface, wait for sector rotate under head, then read or write • “Cylinder”: all tracks under heads

  29. { Platters (12) Photo of Disk Head, Arm, Actuator Spindle Arm Head Actuator

  30. Disk Device Performance Inner Track Outer Track Sector Head Controller Arm Spindle • Disk Latency = Seek Time + Rotation Time + Transfer Time + Controller Overhead • Seek Time? depends no. tracks move arm, seek speed of disk • Rotation Time? depends on speed disk rotates, how far sector is from head • Transfer Time? depends on data rate (bandwidth) of disk (bit density), size of request Platter Actuator

  31. Disk Device Performance • Average distance sector from head? • 1/2 time of a rotation • 7200 Revolutions Per Minute = 120 Rev/sec • 1 revolution = 1/120 sec = 8.33 milliseconds • 1/2 rotation (revolution) = 4.16 ms • Average no. tracks move arm? • Sum all possible seek distances from all possible tracks / # possible • Assumes average seek distance is random • Disk industry standard benchmark

  32. Data Rate: Inner vs. Outer Tracks • To keep things simple, orginally kept same number of sectors per track • Since outer track longer, lower bits per inch • Competition  decided to keep BPI the same for all tracks (“constant bit density”) • More capacity per disk • More of sectors per track towards edge • Since disk spins at constant speed, outer tracks have faster data rate • Bandwidth outer track 1.7X inner track!

  33. Response time = Queue + Controller + Seek + Rot + Xfer Service time Devices: Magnetic Disks Track Sector • Purpose: • Long-term, nonvolatile storage • Large, inexpensive, slow level in the storage hierarchy • Characteristics: • Seek Time (~8 ms avg) • Transfer rate • 10-30 MByte/sec • Blocks • Capacity • Gigabytes • Quadruples every 3 years (aerodynamics) Cylinder Platter Head 7200 RPM = 120 RPS => 8 ms per rev ave rot. latency = 4 ms 128 sectors per track => 0.25 ms per sector 1 KB per sector => 16 MB / s

  34. Historical Perspective • 1956 IBM Ramac — early 1970s Winchester • Developed for mainframe computers, proprietary interfaces • Steady shrink in form factor: 27 in. to 14 in. • 1970s developments • 5.25 inch floppy disk formfactor (microcode into mainframe) • early emergence of industry standard disk interfaces • ST506, SASI, SMD, ESDI • Early 1980s • PCs and first generation workstations • Mid 1980s • Client/server computing • Centralized storage on file server • accelerates disk downsizing: 8 inch to 5.25 inch • Mass market disk drives become a reality • industry standards: SCSI, IPI, IDE • 5.25 inch drives for standalone PCs, End of proprietary interfaces

  35. Disk History Data density Mbit/sq. in. Capacity of Unit Shown Megabytes 1973: 1. 7 Mbit/sq. in 140 MBytes 1979: 7. 7 Mbit/sq. in 2,300 MBytes source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even more data into even smaller spaces”

  36. Historical Perspective • Late 1980s/Early 1990s: • Laptops, notebooks, (palmtops) • 3.5 inch, 2.5 inch, (1.8 inch formfactors) • Formfactor plus capacity drives market, not so much performance • Recently Bandwidth improving at 40%/ year • Challenged by DRAM, flash RAM in PCMCIA cards • still expensive, Intel promises but doesn’t deliver • unattractive MBytes per cubic inch • Optical disk fails on performance but finds niche (CD ROM)

  37. Disk History 1989: 63 Mbit/sq. in 60,000 MBytes 1997: 1450 Mbit/sq. in 2300 MBytes 1997: 3090 Mbit/sq. in 8100 MBytes source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even mroe data into even smaller spaces”

  38. 1 inch disk drive! • 2000 IBM MicroDrive: • 1.7” x 1.4” x 0.2” • 1 GB, 3600 RPM, 5 MB/s, 15 ms seek • Digital camera, PalmPC? • 2006 MicroDrive? • 9 GB, 50 MB/s! • Assuming it finds a niche in a successful product • Assuming past trends continue

  39. Disk Performance Model /Trends • Capacity • + 100%/year (2X / 1.0 yrs) • Transfer rate (BW) • + 40%/year (2X / 2.0 yrs) • Rotation + Seek time • – 8%/ year (1/2 in 10 yrs) • MB/$ • > 100%/year (2X / <1.5 yrs) • Fewer chips + areal density

  40. Latency = Queuing Time + Controller time + Seek Time + Rotation Time + Size / Bandwidth { per access + per byte State of the Art: Ultrastar 72ZX • 73.4 GB, 3.5 inch disk • 2¢/MB • 10,000 RPM; 3 ms = 1/2 rotation • 11 platters, 22 surfaces • 15,110 cylinders • 7 Gbit/sq. in. areal den • 17 watts (idle) • 0.1 ms controller time • 5.3 ms avg. seek • 50 to 29 MB/s(internal) Track Sector Cylinder Track Buffer Arm Platter Head source: www.ibm.com; www.pricewatch.com; 2/14/00

  41. Disk Performance Example • Calculate time to read 1 sector (512B) for UltraStar 72 using advertised performance; sector is on outer track • Disk latency = average seek time + average rotational delay + transfer time + controller overhead • = 5.3 ms + 0.5 * 1/(10000 RPM) + 0.5 KB / (50 MB/s) + 0.15 ms • = 5.3 ms + 0.5 /(10000 RPM/(60000ms/M)) + 0.5 KB / (50 KB/ms) + 0.15 ms • = 5.3 + 3.0 + 0.10 + 0.15 ms = 8.55 ms

  42. Areal Density • Bits recorded along a track • Metric is Bits Per Inch (BPI) • Number of tracks per surface • Metric is Tracks Per Inch (TPI) • Care about bit density per unit area • Metric is Bits Per Square Inch • Called Areal Density • Areal Density = BPI x TPI

  43. Areal Density • Areal Density =BPI x TPI • Change slope 30%/yr to 60%/yr about 1991

  44. Disk Characteristics in 2000

  45. Disk Characteristics in 2000

  46. Disk Characteristics in 2000

  47. Disk Characteristics in 2000

  48. Technology Trends Disk Capacity now doubles every 12 months; before 1990 every 36 motnhs • Today: Processing Power Doubles Every 18 months • Today: Memory Size Doubles Every 18-24 months(4X/3yr) • Today: Disk Capacity Doubles Every 12-18 months • Disk Positioning Rate (Seek + Rotate) Doubles Every Ten Years! The I/O GAP

  49. Fallacy: Use Data Sheet “Average Seek” Time • Manufacturers needed standard for fair comparison (“benchmark”) • Calculate all seeks from all tracks, divide by number of seeks => “average” • Real average would be based on how data laid out on disk, where seek in real applications, then measure performance • Usually, tend to seek to tracks nearby, not to random track • Rule of Thumb: observed average seek time is typically about 1/4 to 1/3 of quoted seek time (i.e., 3X-4X faster) • UltraStar 72 avg. seek: 5.3 ms => 1.7 ms

  50. Fallacy: Use Data Sheet Transfer Rate • Manufacturers quote the speed of the data rate off the surface of the disk • Sectors contain an error detection and correction field (can be 20% of sector size) plus sector number as well as data • There are gaps between sectors on track • Rule of Thumb: disks deliver about 3/4 of internal media rate (1.3X slower) for data • For example, UlstraStar 72 quotes 50 to 29 MB/s internal media rate • => Expect 37 to 22 MB/s user data rate

More Related