Introduction to Hardware/Architecture

David A. Patterson • EECS, University of California, Berkeley, CA 94720-1776 • http://cs.berkeley.edu/~patterson/talks • patterson@cs.berkeley.edu

Technology Trends: Microprocessor Capacity • Moore’s Law: 2X transistors/chip every 1.5 years • Transistors per chip: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million

Technology Trends: Processor Performance • Processor performance increase: 1.54X/yr • This increase is often mistakenly referred to as Moore’s Law, which describes transistors/chip

5 components of any Computer • Processor (active): Control (“brain”) and Datapath (“brawn”) • Memory (passive): where programs, data live when running • Devices: Input (Keyboard, Mouse), Output (Display, Printer), and both (Disk, Network)

Computer Technology=>Dramatic Change • Processor • 2X in speed every 1.5 years; 1000X performance in last 15 years • Memory • DRAM capacity: 2x / 1.5 years; 1000X size in last 15 years • Cost per bit: improves about 25% per year • Disk • capacity: > 2X in size every 1.5 years • Cost per bit: improves about 60% per year • 120X size in last decade • State-of-the-art PC “when you graduate” (1997-2001) • Processor clock speed: 1500 MegaHertz (1.5 GigaHertz) • Memory capacity: 500 MegaByte (0.5 GigaBytes) • Disk capacity: 100 GigaBytes (0.1 TeraBytes) • New units! Mega => Giga, Giga => Tera
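
The growth figures on this slide follow from compounding. A minimal sketch of the arithmetic, assuming steady exponential growth (the function names are illustrative, not from the talk):

```python
def growth_factor(factor_per_period: float, period_years: float, years: float) -> float:
    """Total improvement after `years` if the quantity multiplies by
    `factor_per_period` every `period_years` (assumes steady growth)."""
    return factor_per_period ** (years / period_years)


def annual_rate(factor_per_period: float, period_years: float) -> float:
    """Equivalent per-year multiplier for the same growth."""
    return factor_per_period ** (1.0 / period_years)


if __name__ == "__main__":
    # 2X every 1.5 years over 15 years: the slide's "1000X in last 15 years"
    print(growth_factor(2, 1.5, 15))
    # 2X every 1.5 years over a decade: compare with "120X size in last decade"
    print(growth_factor(2, 1.5, 10))
    # 2X every 1.5 years as a yearly multiplier: compare with 1.54X/yr
    print(annual_rate(2, 1.5))
```

Doubling every 1.5 years gives ten doublings in 15 years, which is where the roughly 1000X figure comes from.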

Integrated Circuit Costs • Die cost = Wafer cost / (Dies per wafer * Die yield) • Die cost goes roughly with the cube of the die area: a larger die means fewer dies per wafer, and yield worsens with die area

Die Yield (1993 data)
• Raw dies per wafer, by die area (mm2):
  wafer diameter    100   144   196   256   324   400
  6”/15cm           139    90    62    44    32    23
  8”/20cm           265   177   124    90    68    52
  10”/25cm          431   290   206   153   116    90
  die yield         23%   19%   16%   12%   11%   10%
• Typical CMOS process: α=2, wafer yield=90%, defect density=2/cm2, 4 test sites/wafer
• Good dies per wafer (before testing!), by die area (mm2):
  wafer diameter    100   144   196   256   324   400
  6”/15cm            31    16     9     5     3     2
  8”/20cm            59    32    19    11     7     5
  10”/25cm           96    53    32    20    13     9
• Typical cost of an 8”, 4 metal layers, 0.5um CMOS wafer: ~$2000

1993 Real World Examples
  Chip          Metal   Line width  Wafer  Defects  Area   Dies/  Yield  Die
                layers  (um)        cost   /cm2     (mm2)  wafer         cost
  386DX         2       0.90        $900   1.0       43    360    71%    $4
  486DX2        3       0.80        $1200  1.0       81    181    54%    $12
  PowerPC 601   4       0.80        $1700  1.3      121    115    28%    $53
  HP PA 7100    3       0.80        $1300  1.0      196     66    27%    $73
  DEC Alpha     3       0.70        $1500  1.2      234     53    19%    $149
  SuperSPARC    3       0.70        $1700  1.6      256     48    13%    $272
  Pentium       3       0.80        $1500  1.5      296     40     9%    $417
From “Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
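
The die cost column can be checked against the relation Die cost = Wafer cost / (Dies per wafer * Die yield). The sketch below applies it to the wafer cost, dies/wafer, and yield figures from the 1993 examples above; the function and variable names are illustrative only.

```python
def die_cost(wafer_cost: float, dies_per_wafer: int, die_yield: float) -> float:
    """Cost of one good die: wafer cost spread over the dies that work."""
    return wafer_cost / (dies_per_wafer * die_yield)


# (chip, wafer cost in $, dies per wafer, die yield, listed die cost in $)
CHIPS_1993 = [
    ("386DX", 900, 360, 0.71, 4),
    ("486DX2", 1200, 181, 0.54, 12),
    ("PowerPC 601", 1700, 115, 0.28, 53),
    ("HP PA 7100", 1300, 66, 0.27, 73),
    ("DEC Alpha", 1500, 53, 0.19, 149),
    ("SuperSPARC", 1700, 48, 0.13, 272),
    ("Pentium", 1500, 40, 0.09, 417),
]

if __name__ == "__main__":
    for chip, wafer_cost, dies, yield_, listed in CHIPS_1993:
        computed = die_cost(wafer_cost, dies, yield_)
        print(f"{chip:12s} computed ${computed:7.2f}   listed ${listed}")
```

Rounded to the nearest dollar, the computed values agree with the listed die costs, which shows how quickly cost climbs as dies per wafer and yield both fall with larger die area.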

Processor Trends/ History • History of innovations to 2X / 1.5 yr • Pipelining (helps seconds / clock, or clock rate) • Out-of-Order Execution (helps clocks / instruction) • Superscalar (helps clocks / instruction)
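
The terms “seconds / clock” and “clocks / instruction” are factors of the standard CPU time relation: time = instructions * clocks/instruction * seconds/clock. A minimal sketch with made-up numbers (the instruction count, CPI, and clock period below are hypothetical, chosen only to show how each innovation moves one factor):

```python
def cpu_time(instructions: float, clocks_per_instruction: float,
             seconds_per_clock: float) -> float:
    """Execution time as the product of the three factors."""
    return instructions * clocks_per_instruction * seconds_per_clock


if __name__ == "__main__":
    # Hypothetical baseline: 1e9 instructions, 2 clocks each, 10 ns clock
    base = cpu_time(1e9, 2.0, 10e-9)
    # Deeper pipelining: shorter clock period (seconds / clock)
    faster_clock = cpu_time(1e9, 2.0, 5e-9)
    # Superscalar or out-of-order: fewer clocks per instruction
    lower_cpi = cpu_time(1e9, 1.0, 10e-9)
    print(base, base / faster_clock, base / lower_cpi)
```

Halving either factor halves the run time in this model; real designs trade one factor against the other.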

Pipelining is Natural! • Laundry Example • Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, fold, and put away • Washer takes 30 minutes • Dryer takes 30 minutes • “Folder” takes 30 minutes • “Stasher” takes 30 minutes to put clothes into drawers

Sequential Laundry • [Figure: timeline from 6 PM to 2 AM; tasks A, B, C, D in task order, each with four 30-minute stages run back to back] • Sequential laundry takes 8 hours for 4 loads

Pipelined Laundry: Start work ASAP • [Figure: timeline from 6 PM to 2 AM; tasks A, B, C, D in task order, overlapped across seven 30-minute slots] • Pipelined laundry takes 3.5 hours for 4 loads!
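
The 8-hour and 3.5-hour totals can be reproduced with a few lines of arithmetic. This sketch assumes every stage takes the same time and that a new load can enter the pipeline as soon as the first stage is free; the function names are illustrative.

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """First load takes `stages` slots; each later load finishes one slot after."""
    return (stages + loads - 1) * stage_minutes


if __name__ == "__main__":
    # 4 loads; washer, dryer, folder, stasher; 30 minutes each
    seq = sequential_minutes(4, 4, 30)
    pipe = pipelined_minutes(4, 4, 30)
    print(seq / 60, pipe / 60, seq / pipe)
```

With only 4 loads the speedup is about 2.3X; as the number of loads grows it approaches the number of stages, since the time to fill and drain the pipeline matters less.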

Pipeline Hazard: Stall • [Figure: timeline from 6 PM to 2 AM; tasks A through F in task order, 30-minute stages, with a bubble in the pipeline] • A depends on D; stall since folder tied up

Out-of-Order Laundry: Don’t Wait • [Figure: timeline from 6 PM to 2 AM; tasks A through F in task order, 30-minute stages, with a bubble] • A depends on D; rest continue; need more resources to allow out-of-order

Superscalar Laundry: Parallel per stage • [Figure: timeline from 6 PM to 2 AM; 30-minute stages; tasks A (light clothing), B (dark clothing), C (very dirty clothing), D (light clothing), E (dark clothing), F (very dirty clothing)] • More resources, HW match mix of parallel tasks?

Superscalar Laundry: Mismatch Mix • [Figure: timeline from 6 PM to 2 AM; 30-minute stages; tasks A, B, C, D, three loads of light clothing and one of dark clothing] • Task mix underutilizes extra resources

State of the Art: Alpha 21264 • 15M transistors • Two 64KB caches on chip; 16MB L2 cache off chip • Clock <1.7 nsec, or >600 MHz • 90 watts • Superscalar: fetches up to 6 instructions/clock cycle, retires up to 4 instructions/clock cycle • Out-of-order execution

Other example: Sony Playstation 2 • Emotion Engine: 6.2 GFLOPS, 75 million polygons per second (Microprocessor Report, 13:5) • Superscalar MIPS core + vector coprocessor + graphics/DRAM • Claim: “Toy Story” realism brought to games

The Goal: Illusion of large, fast, cheap memory • Fact: Large memories are slow, fast memories are small • How do we create a memory that is large, cheap and fast (most of the time)? • Hierarchy of Levels • Similar to Principle of Abstraction: hide details of multiple levels

Hierarchy Analogy: Term Paper • Working on paper in library at a desk • Option 1: Every time need a book • Leave desk to go to shelves (or stacks) • Find the book • Bring one book back to desk • Read section interested in • When done with section, leave desk and go to shelves carrying book • Put the book back on shelf • Return to desk to work • Next time need a book, go to first step

Hierarchy Analogy: Library • Option 2: Every time need a book • Leave some books on desk after fetching them • Only go to shelves when need a new book • When go to shelves, bring back related books in case you need them; sometimes you’ll need to return books not used recently to make space for new books on desk • Return to desk to work • When done, replace books on shelves, carrying as many as you can per trip • Illusion: whole library on your desktop • Buzzword “cache” from French for hidden treasure

Why Hierarchy Works: Natural Locality • The Principle of Locality: programs access a relatively small portion of the address space at any instant of time • [Figure: probability of reference plotted over the address space, 0 to 2^n - 1] • What programming constructs lead to the Principle of Locality?

Memory Hierarchy: How Does it Work? • Temporal Locality (Locality in Time): => Keep most recently accessed data items closer to the processor • Library Analogy: Recently read books are kept on desk • Block is unit of transfer (like book) • Spatial Locality (Locality in Space): => Move blocks consisting of contiguous words to the upper levels • Library Analogy: Bring back nearby books on shelves when you fetch a book; hope that you might need them later for your paper
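
Both kinds of locality can be seen in a toy simulation. The sketch below is not from the talk: it models a small fully associative cache with least-recently-used replacement (an assumed policy) and counts hits for a few access patterns. Loops give temporal locality and array scans give spatial locality.

```python
from collections import OrderedDict


def hit_rate(addresses, num_blocks: int, block_size: int) -> float:
    """Fraction of accesses found in a fully associative LRU cache that
    holds `num_blocks` blocks of `block_size` contiguous words each."""
    cache = OrderedDict()  # block number -> None, least recently used first
    hits = 0
    for addr in addresses:
        block = addr // block_size      # block is the unit of transfer
        if block in cache:
            hits += 1
            cache.move_to_end(block)    # mark as most recently used
        else:
            cache[block] = None         # fetch the whole block on a miss
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return hits / len(addresses)


if __name__ == "__main__":
    scan = range(1024)            # walk an array once
    loop = list(range(16)) * 100  # revisit the same 16 words 100 times
    print(hit_rate(scan, num_blocks=4, block_size=8))  # spatial locality
    print(hit_rate(scan, num_blocks=4, block_size=1))  # no block to exploit
    print(hit_rate(loop, num_blocks=4, block_size=8))  # temporal locality
```

In the array scan each miss brings in seven neighboring words that are used next, so larger blocks help; in the loop, almost every access hits because the working set stays in the cache.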

Memory Hierarchy Pyramid • [Figure: pyramid with the Central Processor Unit (CPU) at the top; levels in memory hierarchy run from Level 1 (“Upper”) through Level 2, Level 3, . . . to Level n (“Lower”); width shows size of memory at each level] • Increasing distance from CPU, decreasing cost / MB • Data cannot be in level i unless also in level i+1

Big Idea of Memory Hierarchy • Temporal locality: keep recently accessed data items closer to processor • Spatial locality: moving contiguous words in memory to upper levels of hierarchy • Uses smaller and faster memory technologies close to the processor • Fast hit time in highest level of hierarchy • Cheap, slow memory furthest from processor • If hit rate is high enough, hierarchy has access time close to the highest (and fastest) level and size equal to the lowest (and largest) level
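
The last point can be made concrete with the usual two-level average access time model: average time = hit time + (1 - hit rate) * miss penalty. The timings below are hypothetical, picked only to show the shape of the curve.

```python
def average_access_time(hit_time: float, miss_penalty: float,
                        hit_rate: float) -> float:
    """Average access time for a two-level hierarchy (same units as inputs)."""
    return hit_time + (1.0 - hit_rate) * miss_penalty


if __name__ == "__main__":
    # Hypothetical: 1 ns hit in the upper level, 100 ns extra on a miss
    for rate in (0.50, 0.90, 0.99, 0.999):
        print(rate, average_access_time(1.0, 100.0, rate))
```

Under these assumptions the average falls from about 51 ns at a 50% hit rate to about 2 ns at 99%, which is the sense in which a high hit rate makes the hierarchy look nearly as fast as its top level.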

Disk Description / History • [Figure: disk drive with labels Platter, Track, Sector, Cylinder, Arm, Head, Track Buffer, Embed. Proc. (ECC, SCSI)] • 1973: 1.7 Mbit/sq. in., 140 MBytes • 1979: 7.7 Mbit/sq. in., 2,300 MBytes • source: New York Times, 2/23/98, page C3, “Makers of disk drives crowd even more data into even smaller spaces”

Disk History • 1989: 63 Mbit/sq. in., 60,000 MBytes • 1997: 1450 Mbit/sq. in., 2300 MBytes (2.5” diameter) • 1997: 3090 Mbit/sq. in., 8100 MBytes (3.5” diameter) • 2000: 10,100 Mbit/sq. in., 25,000 MBytes • 2000: 11,000 Mbit/sq. in., 73,400 MBytes • source: N.Y. Times, 2/23/98, page C3

State of the Art: Ultrastar 72ZX • 73.4 GB, 3.5 inch disk • 2¢/MB • 16 MB track buffer • 11 platters, 22 surfaces • 15,110 cylinders • 7 Gbit/sq. in. areal density • 17 watts (idle) • 0.1 ms controller time • 5.3 ms avg. seek (seek 1 track => 0.6 ms) • 3 ms = 1/2 rotation • 37 to 22 MB/s to media • Latency = Queuing Time + Controller Time + Seek Time + Rotation Time + Size / Bandwidth (the first four terms are per access; Size / Bandwidth is per byte) • [Figure: disk drive with labels Platter, Track, Sector, Cylinder, Arm, Head, Track Buffer, Embed. Proc.] • source: www.ibm.com; www.pricewatch.com; 2/14/00
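
Plugging the Ultrastar 72ZX figures into the latency formula gives a feel for the per-access and per-byte terms. In this sketch the 64 KB request size, the zero queuing time, and the reading of MB as 10^6 bytes are all assumptions, not values from the slide.

```python
def disk_latency_ms(size_bytes: float, bandwidth_bytes_per_s: float,
                    queuing_ms: float = 0.0, controller_ms: float = 0.1,
                    seek_ms: float = 5.3, rotation_ms: float = 3.0) -> float:
    """Latency = queuing + controller + seek + rotation + size / bandwidth.
    Defaults are the Ultrastar 72ZX figures; queuing is assumed to be zero."""
    transfer_ms = size_bytes / bandwidth_bytes_per_s * 1000.0
    return queuing_ms + controller_ms + seek_ms + rotation_ms + transfer_ms


if __name__ == "__main__":
    request = 64 * 1024  # hypothetical 64 KB request
    MB = 10**6           # assumption: MB/s means 10^6 bytes per second
    print(disk_latency_ms(request, 37 * MB))  # fastest media rate
    print(disk_latency_ms(request, 22 * MB))  # slowest media rate
```

For a request of this size the fixed per-access terms (8.4 ms) dominate the transfer time, so seek and rotation matter more than media bandwidth.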

A glimpse into the future? • IBM microdrive for digital cameras • 340 Mbytes • Disk target in 5-7 years?

Questions? Contact us if you’re interested: • email: patterson@cs.berkeley.edu • http://iram.cs.berkeley.edu/
