14:332:331 Computer Architecture and Assembly Language Fall 2003 Week 12 Buses and I/O system

[Adapted from Dave Patterson’s UCB CS152 slides and Mary Jane Irwin’s PSU CSE331 slides]

Heads Up • This week’s material • Buses: Connecting I/O devices • Reading assignment – PH 8.4 • Memory hierarchies • Reading assignment – PH 7.1 and B.5 • Reminders • Next week’s material • Basics of caches • Reading assignment – PH 7.2

Review: Major Components of a Computer • Processor: Control, Datapath • Memory: Cache, Main Memory, Secondary Memory (Disk) • Devices: Input, Output

Input and Output Devices • I/O devices are incredibly diverse wrt • Behavior • Partner • Data rate

Magnetic Disk • Purpose • Long term, nonvolatile storage • Lowest level in the memory hierarchy • slow, large, inexpensive • General structure • A rotating platter coated with a magnetic surface • Use a moveable read/write head to access the disk • Advantages of hard disks over floppy disks • Platters are more rigid (metal or glass) so they can be larger • Higher density because it can be controlled more precisely • Higher data rate because it spins faster • Can incorporate more than one platter

Organization of a Magnetic Disk • Typical numbers (depending on the disk size) • 1 to 15 (2 surface) platters per disk with 1” to 8” diameter • 1,000 to 5,000 tracks per surface • 63 to 256 sectors per track • A sector is the smallest unit that can be read/written (typically 512 to 1,024 B) • Traditionally all tracks have the same number of sectors • Newer disks with smart controllers can record more sectors on the outer tracks (constant bit density) [Figure: platters, tracks, and sectors]
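The typical numbers above imply a capacity range. A minimal sketch of that arithmetic, assuming the traditional layout in which every track holds the same number of sectors; the function name and the two sample configurations are illustrative, picked from the ends of the ranges on the slide:

```python
def disk_capacity_bytes(platters, tracks_per_surface, sectors_per_track,
                        bytes_per_sector, surfaces_per_platter=2):
    """Capacity of a disk whose tracks all hold the same number of sectors."""
    surfaces = platters * surfaces_per_platter
    return surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector


# Low end of every range on the slide.
small = disk_capacity_bytes(1, 1_000, 63, 512)
# High end of every range on the slide.
large = disk_capacity_bytes(15, 5_000, 256, 1_024)

print(f"small: {small / 1e6:.1f} MB")
print(f"large: {large / 1e9:.1f} GB")
```

A disk that records more sectors on the outer tracks would need a per-track sum instead of a single product.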

Magnetic Disk Characteristics • Cylinder: all the tracks under the heads at a given point on all surfaces • Reading or writing data is a three-stage process, plus controller overhead: • Seek time: position the arm over the proper track (6 to 14 ms avg.) • due to locality of disk references the actual average seek time may be only 25% to 33% of the advertised number • Rotational latency: wait for the desired sector to rotate under the read/write head (½ of 1/RPM) • Transfer time: transfer a block of bits (sector) under the read/write head (2 to 20 MB/sec typical) • Controller time: the overhead the disk controller imposes in performing a disk I/O access (typically < 2 ms) [Figure: Platter, Track, Sector, Cylinder, Head]
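The components above add up to the time for one disk access. A minimal sketch, assuming 1 MB = 10^6 bytes; the function names and the sample drive (10 ms seek, 7200 RPM, 512 B sector, 10 MB/sec, 1 ms controller time) are hypothetical values chosen inside the ranges quoted on the slide:

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half of one rotation, in milliseconds."""
    ms_per_rotation = 60_000.0 / rpm
    return 0.5 * ms_per_rotation


def transfer_time_ms(num_bytes, rate_mb_per_s):
    """Time to move num_bytes under the head at the given transfer rate."""
    return num_bytes / (rate_mb_per_s * 1_000_000) * 1_000.0


def disk_access_time_ms(seek_ms, rpm, num_bytes, rate_mb_per_s, controller_ms):
    """Seek + rotational latency + transfer + controller overhead."""
    return (seek_ms
            + rotational_latency_ms(rpm)
            + transfer_time_ms(num_bytes, rate_mb_per_s)
            + controller_ms)


total = disk_access_time_ms(seek_ms=10.0, rpm=7200, num_bytes=512,
                            rate_mb_per_s=10.0, controller_ms=1.0)
print(f"access time: {total:.2f} ms")
```

With these numbers the seek and rotational terms dominate; the transfer of a single sector is a small fraction of a millisecond.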

Magnetic Disk Examples

I/O System Interconnect Issues • A bus is a shared communication link (a set of wires used to connect multiple subsystems) • Performance • Expandability • Resilience in the face of failure – fault tolerance [Figure: a bus connecting Processor, Receiver, Main Memory, and Keyboard]

Performance Measures • Latency (execution time, response time) is the total time from the start to finish of one instruction or action • usually used to measure processor performance • Throughput – total amount of work done in a given amount of time • aka execution bandwidth • the number of operations performed per second • Bandwidth – amount of information communicated across an interconnect (e.g., a bus) per unit time • the bit width of the operation * rate of the operation • usually used to measure I/O performance
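The bandwidth definition above (bit width of the operation times the rate of the operation) can be sketched as a one-line calculation. The function name and the sample bus (32 bits wide, 33 million transfers per second) are illustrative assumptions, not figures from the slides:

```python
def bandwidth_bytes_per_s(width_bits, transfers_per_s):
    """Bandwidth = width of each transfer * number of transfers per second."""
    return width_bits / 8 * transfers_per_s


# Hypothetical bus: 32 data lines, one transfer on each of 33 million clocks/s.
bw = bandwidth_bytes_per_s(width_bits=32, transfers_per_s=33_000_000)
print(f"{bw / 1e6:.0f} MB/s")
```

This is a peak figure: it ignores arbitration and address cycles, so sustained bandwidth on a real bus would be lower.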

I/O System Expandability • Usually have more than one I/O device in the system • each I/O device is controlled by an I/O Controller [Figure: Processor and Cache, Main Memory, and I/O Controllers (Terminal, Disk, Disk, Network) on a Memory - I/O Bus, with interrupt signals to the Processor]

Bus Characteristics • Control lines • Signal requests and acknowledgments • Indicate what type of information is on the data lines • Data lines • Data, complex commands, and addresses • Bus transaction consists of • Sending the address • Receiving (or sending) the data [Figure: Control Lines and Data Lines]

Output (Read) Bus Transaction • Defined by what they do to memory • read = output: transfers data from memory (read) to I/O device (write) • Step 1: Processor sends read request and read address to memory • Step 2: Memory accesses data • Step 3: Memory transfers data to disk [Figure, repeated for each step: Processor and Main Memory connected by Control and Data lines]

Input (Write) Bus Transaction • Defined by what they do to memory • write = input: transfers data from I/O device (read) to memory (write) • Step 1: Processor sends write request and write address to memory • Step 2: Disk transfers data to memory [Figure, repeated for each step: Processor and Main Memory connected by Control and Data lines]

Advantages and Disadvantages of Buses • Advantages • Versatility: • New devices can be added easily • Peripherals can be moved between computer systems that use the same bus standard • Low Cost: • A single set of wires is shared in multiple ways • Disadvantages • It creates a communication bottleneck • The bus bandwidth limits the maximum I/O throughput • The maximum bus speed is largely limited by • The length of the bus • The number of devices on the bus • It needs to support a range of devices with widely varying latencies and data transfer rates

Types of Buses • Processor-Memory Bus (proprietary) • Short and high speed • Matched to the memory system to maximize the memory-processor bandwidth • Optimized for cache block transfers • I/O Bus (industry standard, e.g., SCSI, USB, ISA, IDE) • Usually is lengthy and slower • Needs to accommodate a wide range of I/O devices • Connects to the processor-memory bus or backplane bus • Backplane Bus (industry standard, e.g., PCI) • The backplane is an interconnection structure within the chassis • Used as an intermediary bus connecting I/O busses to the processor-memory bus

A Two Bus System • I/O buses tap into the processor-memory bus via Bus Adaptors (that do speed matching between buses) • Processor-memory bus: mainly for processor-memory traffic • I/O busses: provide expansion slots for I/O devices [Figure: Processor and Memory on the Processor-Memory Bus, with three Bus Adaptors, each leading to an I/O Bus]

A Three Bus System • A small number of Backplane Buses tap into the Processor-Memory Bus • Processor-Memory Bus is used for processor-memory traffic • I/O buses are connected to the Backplane Bus • Advantage: loading on the Processor-Memory Bus is greatly reduced [Figure: Processor and Memory on the Processor-Memory Bus; a Bus Adaptor to the Backplane Bus; Bus Adaptors from the Backplane Bus to two I/O Buses]

I/O System Example (Apple Mac 7200) • Typical of midrange to high-end desktop system in 1997 [Figure components: Processor, Cache, Processor-Memory Bus, PCI Interface/Memory Controller, Main Memory, PCI, I/O Controllers, SCSI bus, Audio I/O, Serial ports, CDRom, Disk, Tape, Graphic Terminal, Network]

Example: Pentium System Organization Processor-Memory Bus Memory controller (“Northbridge”) PCI Bus I/O Busses http://developer.intel.com/design/chipsets/850/animate.htm?iid=PCG+devside&

A Bus Transaction • A bus transaction includes three parts: • Gaining access to the bus - arbitration • Issuing the command (and address) - request • Transferring the data - action • Gaining access to the bus • How is the bus reserved by a device that wishes to use it? • Chaos is avoided by a master-slave arrangement • The bus master initiates and controls all bus requests • In the simplest system: • The processor is the only bus master • Major drawback - the processor must be involved in every bus transaction [Figure: Bus Master and Bus Slave; Control: master initiates requests; Data can go either way]

Single Master Bus Transaction • All bus requests are controlled by the processor • it initiates the bus cycle on behalf of the requesting device • Step 1: Disk wants to use the bus so it generates a bus request to processor • Step 2: Processor responds and generates appropriate control signals • Step 3: Processor gives slave (disk) permission to use the bus [Figure, repeated for each step: Processor and Memory connected by Control and Data lines]

Multiple Potential Bus Masters: Arbitration • Bus arbitration scheme: • A bus master wanting to use the bus asserts the bus request • A bus master cannot use the bus until its request is granted • A bus master must release the bus after its use • Bus arbitration schemes usually try to balance two factors: • Bus priority - the highest priority device should be serviced first • Fairness - Even the lowest priority device should never be completely locked out from using the bus • Bus arbitration schemes can be divided into four broad classes • Daisy chain arbitration: all devices share 1 request line • Centralized, parallel arbitration: multiple request and grant lines • Distributed arbitration by self-selection: each device wanting the bus places a code indicating its identity on the bus • Distributed arbitration by collision detection: Ethernet uses this
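The tension between bus priority and fairness can be sketched with two toy arbiters. This is an illustration under stated assumptions, not a model of any real bus: `fixed_priority_grant` and `RoundRobinArbiter` are hypothetical names, and round-robin is just one common way to get fairness; the slides do not name it.

```python
def fixed_priority_grant(requests):
    """Grant the lowest-numbered requester (index 0 = highest priority).

    requests is a list of booleans, one per potential bus master.
    Returns the granted index, or None if nobody is requesting.
    """
    for device, wants_bus in enumerate(requests):
        if wants_bus:
            return device
    return None


class RoundRobinArbiter:
    """Rotate priority so the most recently granted device is searched last."""

    def __init__(self, num_devices):
        self.num_devices = num_devices
        self.last_granted = num_devices - 1  # so device 0 is tried first

    def grant(self, requests):
        for offset in range(1, self.num_devices + 1):
            device = (self.last_granted + offset) % self.num_devices
            if requests[device]:
                self.last_granted = device
                return device
        return None


# All three devices request the bus on every arbitration round.
always = [True, True, True]
print([fixed_priority_grant(always) for _ in range(3)])  # device 0 every time
rr = RoundRobinArbiter(3)
print([rr.grant(always) for _ in range(3)])              # each device in turn
```

Under constant load the fixed-priority scheme starves devices 1 and 2, which is exactly the lock-out the fairness criterion is meant to prevent.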

Centralized Parallel Arbitration • Used in essentially all backplane and high-speed I/O busses [Figure: Device 1, Device 2, ... Device N, each with its own Req line to and Grant line (Grant1, Grant2, ... GrantN) from the Bus Arbiter, all sharing the Control and Data lines]

Synchronous and Asynchronous Buses • Synchronous Bus • Includes a clock in the control lines • A fixed protocol for communication that is relative to the clock • Advantage: involves very little logic and can run very fast • Disadvantages: • Every device on the bus must run at the same clock rate • To avoid clock skew, they cannot be long if they are fast • Asynchronous Bus • It is not clocked, so requires handshaking protocol (req, ack) • Implemented with additional control lines • Advantages: • Can accommodate a wide range of devices • Can be lengthened without worrying about clock skew or synchronization problems • Disadvantage: slow(er)

Asynchronous Handshaking Protocol • Output (read) data from memory to an I/O device. • I/O device signals a request by raising ReadReq and putting the addr on the data lines • 1. Memory sees ReadReq, reads addr from data lines, and raises Ack • 2. I/O device sees Ack and releases the ReadReq and data lines • 3. Memory sees ReadReq go low and drops Ack • 4. When memory has data ready, it places it on data lines and raises DataRdy • 5. I/O device sees DataRdy, reads the data from data lines, and raises Ack • 6. Memory sees Ack, releases the data lines, and drops DataRdy • 7. I/O device sees DataRdy go low and drops Ack [Figure: timing diagram of ReadReq, Data (addr, then data), Ack, and DataRdy, with transitions numbered 1 to 7]
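The handshake can be traced in software by stepping through the signal changes in order. This is a sequential sketch only: real hardware runs the two sides concurrently, each waiting on the other's signal. The function name, the dictionary-based memory, and the trace format are assumptions made for illustration.

```python
def handshake_read(memory, addr):
    """Walk through the asynchronous read handshake, one signal change at a time.

    memory is a dict mapping addresses to values. Returns the value the I/O
    device read and a trace of (step, signal snapshot) pairs.
    """
    sig = {"ReadReq": 0, "Ack": 0, "DataRdy": 0, "Data": None}
    trace = []

    def record(step):
        trace.append((step, dict(sig)))

    # Request: I/O device raises ReadReq and drives the address on the data lines.
    sig["ReadReq"], sig["Data"] = 1, addr
    record(0)
    # 1. Memory sees ReadReq, latches the address, raises Ack.
    latched_addr = sig["Data"]
    sig["Ack"] = 1
    record(1)
    # 2. I/O device sees Ack, releases ReadReq and the data lines.
    sig["ReadReq"], sig["Data"] = 0, None
    record(2)
    # 3. Memory sees ReadReq low, drops Ack.
    sig["Ack"] = 0
    record(3)
    # 4. Memory has the data ready: drives it and raises DataRdy.
    sig["Data"], sig["DataRdy"] = memory[latched_addr], 1
    record(4)
    # 5. I/O device sees DataRdy, reads the data, raises Ack.
    value = sig["Data"]
    sig["Ack"] = 1
    record(5)
    # 6. Memory sees Ack, releases the data lines, drops DataRdy.
    sig["Data"], sig["DataRdy"] = None, 0
    record(6)
    # 7. I/O device sees DataRdy low, drops Ack; the bus is idle again.
    sig["Ack"] = 0
    record(7)
    return value, trace


value, trace = handshake_read({0x10: 42}, 0x10)
for step, snapshot in trace:
    print(step, snapshot)
```

Every step changes state only in response to a signal the other party set in the previous step, which is why no shared clock is needed.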

Key Characteristics of Two Bus Standards

Review: Major Components of a Computer • Processor: Control, Datapath • Memory • Devices: Input, Output

A Typical Memory Hierarchy • By taking advantage of the principle of locality: • Present the user with as much memory as is available in the cheapest technology. • Provide access at the speed offered by the fastest technology. [Figure: On-Chip Components (Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache, eDRAM), Second Level Cache (SRAM), Main Memory (DRAM), Secondary Memory (Disk); from nearest to farthest from the processor: Speed (ns): .1’s, 1’s, 10’s, 100’s, 1,000’s; Size (bytes): 100’s, K’s, 10K’s, M’s, T’s; Cost: highest to lowest]

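Characteristics of the Memory Hierarchy • Levels, in order of increasing distance from the processor in access time and increasing (relative) size: Processor, L1$, L2$, Main Memory (MM), Secondary Memory (SM) • Transfer units between adjacent levels: 4-8 bytes (word) between Processor and L1$; 8-32 bytes (block) between L1$ and L2$; 1 block between L2$ and Main Memory; 1,023+ bytes (disk sector = page) between Main Memory and Secondary Memory • Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM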

Memory Hierarchy Technologies • Random Access • “Random” is good: access time is the same for all locations • DRAM: Dynamic Random Access Memory • High density (1 transistor cells), low power, cheap, slow • Dynamic: need to be “refreshed” regularly (~ every 8 ms) • SRAM: Static Random Access Memory • Low density (6 transistor cells), high power, expensive, fast • Static: content will last “forever” (until power turned off) • Size ratio, DRAM/SRAM: 4 to 8 • Cost and cycle time ratio, SRAM/DRAM: 8 to 16 • “Not-so-random” Access Technology • Access time varies from location to location and from time to time (e.g., Disk, CDROM)

Classical SRAM Organization (~Square) • Each intersection of a word (row) select line and the bit (data) lines represents a 6-T SRAM cell • One memory row holds a block of data, so the column address selects the requested word from that block [Figure: row address into the row decoder, RAM Cell Array, column address into the Column Selector & I/O Circuits, data word out]

Classical DRAM Organization (~Square Planes) • Each intersection of a word (row) select line and the bit (data) lines represents a 1-T DRAM cell • The column address selects the requested bit from the row in each plane [Figure: row address into the row decoder, a stack of RAM Cell Array planes, column address into the Column Selector & I/O Circuits, one data bit per plane forming the data word]

RAM Memory Definitions • Caches use SRAM for speed • Main Memory is DRAM for density • Addresses divided into 2 halves (row and column) • RAS or Row Access Strobe triggering row decoder • CAS or Column Access Strobe triggering column selector • Performance of Main Memory DRAMs • Latency: Time to access one word • Access Time: time between request and when word arrives • Cycle Time: time between requests • Usually cycle time > access time • Bandwidth: How much data can be supplied per unit time • width of the data channel * the rate at which it can be used
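Splitting an address into its row and column halves amounts to a shift and a mask. A minimal sketch, assuming the upper half is the row (sent with RAS) and the lower half is the column (sent with CAS); the slides do not say which half is which, and `split_address` is a hypothetical helper name:

```python
def split_address(addr, col_bits):
    """Split a DRAM cell address into (row, column).

    Assumes the upper bits select the row and the low col_bits bits select
    the column.
    """
    row = addr >> col_bits
    col = addr & ((1 << col_bits) - 1)
    return row, col


# An 8-bit address for a 16 x 16 array: 4 row bits, 4 column bits.
row, col = split_address(0b1011_0110, col_bits=4)
print(f"row={row:04b} col={col:04b}")
```

Because the two halves are sent one after the other, a square N x N array needs only half as many address pins as a flat address would.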

Classical DRAM Operation • DRAM Organization: • N rows x N column x M-bit • Read or Write M-bit at a time • Each M-bit access requires a RAS / CAS cycle [Figure: DRAM array of N rows x N cols addressed by Row Address and Column Address, with M-bit Output; timing: within each Cycle Time, the 1st and the 2nd M-bit Access each supply a Row Address under RAS and then a Col Address under CAS]

Ways to Improve DRAM Performance • Memory interleaving • Fast Page Mode DRAMs – FPM DRAMs • www.usa.samsungsemi.com/products/newsummary/asyncdram/K4F661612D.htm • Extended Data Out DRAMs – EDO DRAMs • www.chips.ibm.com/products/memory/88H2011/88H2011.pdf • Synchronous DRAMS – SDRAMS • www.usa.samsungsemi.com/products/newsummary/sdramcomp/K4S641632D.htm • Rambus DRAMS • www.rambus.com/developer/quickfind_documents.html • www.usa.samsungsemi.com/products/newsummary/rambuscomp/K4R271669B.htm • Double Data Rate DRAMs – DDR DRAMS • www.usa.samsungsemi.com/products/newsummary/ddrsyncdram/K4D62323HA.htm • . . .

Increasing Bandwidth - Interleaving • Access pattern without Interleaving (CPU and one Memory): Start Access for D1, D1 available after the Access Time, Start Access for D2 only after the full Cycle Time, D2 available • Access pattern with 4-way Interleaving (CPU and Memory Banks 0 to 3): Access Bank 0, Access Bank 1, Access Bank 2, Access Bank 3, then we can Access Bank 0 again

Problems with Interleaving • How many banks? • Ideally, the number of banks ≥ number of clocks we have to wait to access the next word in the bank • Only works for sequential accesses (i.e., first word requested in first bank, second word requested in second bank, etc.) • Increasing DRAM sizes => fewer chips => harder to have banks • Growth bits/chip DRAM : 50%-60%/yr • Can only be used for very large memory systems (e.g., those encountered in supercomputer systems)
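The rule of thumb for the number of banks can be checked with a small timing model. This is a sketch under simplifying assumptions: word i lives in bank i mod n_banks, at most one access can be issued per clock, a bank cannot start a new access until its cycle time has elapsed, and bus transfer time is ignored. The function name and the sample timings (access time 3 clocks, cycle time 4 clocks) are hypothetical.

```python
def sequential_read_clocks(n_words, n_banks, access_clocks, cycle_clocks):
    """Clock at which the last of n_words sequential words becomes available."""
    bank_free_at = [0] * n_banks  # earliest clock each bank can start again
    next_issue = 0                # earliest clock the next access can be issued
    finish = 0
    for word in range(n_words):
        bank = word % n_banks
        start = max(next_issue, bank_free_at[bank])
        bank_free_at[bank] = start + cycle_clocks
        finish = start + access_clocks
        next_issue = start + 1
    return finish


for banks in (1, 2, 4):
    clocks = sequential_read_clocks(n_words=8, n_banks=banks,
                                    access_clocks=3, cycle_clocks=4)
    print(f"{banks} bank(s): {clocks} clocks")
```

With a cycle time of 4 clocks, four banks are enough to issue one access per clock; two banks still stall, and more than four would not help in this model.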

Fast Page Mode DRAM Operation • Fast Page Mode DRAM • N x M “SRAM” to save a row • After a row is read into the SRAM “register” • Only CAS is needed to access other M-bit blocks on that row • RAS remains asserted while CAS is toggled [Figure: DRAM array of N rows x N cols addressed by Row Address and Column Address, feeding an N x M “SRAM” with M-bit Output; timing: one Row Address under RAS, then four Col Addresses under CAS for the 1st to 4th M-bit accesses]

Why Care About the Memory Hierarchy? • Processor-DRAM Memory Gap • µProc: 60%/year (2X/1.5yr), CPU “Moore’s Law” • DRAM: 9%/year (2X/10yrs) • Processor-Memory Performance Gap: (grows 50% / year) [Figure: Performance (log scale, 1 to 1000) versus Time (1980 to 2000) for CPU and DRAM]
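The gap figure follows from the two growth rates quoted on the slide. A short sketch of the arithmetic; the function names are illustrative:

```python
def gap_growth_per_year(cpu_rate, dram_rate):
    """Yearly growth of the CPU/DRAM performance ratio.

    Rates are fractions, e.g. 0.60 for 60%/year.
    """
    return (1 + cpu_rate) / (1 + dram_rate) - 1


def gap_after_years(cpu_rate, dram_rate, years):
    """Factor by which the CPU/DRAM performance ratio has grown."""
    return (1 + gap_growth_per_year(cpu_rate, dram_rate)) ** years


yearly = gap_growth_per_year(cpu_rate=0.60, dram_rate=0.09)
print(f"gap grows about {yearly:.0%} per year")
print(f"after 10 years the gap is about {gap_after_years(0.60, 0.09, 10):.0f}x")
```

The exact ratio of the two rates comes out slightly under the slide's rounded figure of 50% per year.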

Memory Hierarchy: Goals • Fact: Large memories are slow, fast memories are small • How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)? By taking advantage of • The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time. [Figure: probability of reference plotted over the address space, 0 to 2^n - 1]

Memory Hierarchy: Why Does it Work? • Temporal Locality (Locality in Time): => Keep most recently accessed data items closer to the processor • Spatial Locality (Locality in Space): => Move blocks consisting of contiguous words to the upper levels [Figure: Upper Level Memory holding Blk X, Lower Level Memory holding Blk Y; data moves to and from the Processor through the upper level]

Memory Hierarchy: Terminology • Hit: data appears in some block in the upper level (Block X) • Hit Rate: the fraction of memory accesses found in the upper level • Hit Time: Time to access the upper level which consists of RAM access time + Time to determine hit/miss • Miss: data needs to be retrieved from a block in the lower level (Block Y) • Miss Rate = 1 - (Hit Rate) • Miss Penalty: Time to replace a block in the upper level + Time to deliver the block to the processor • Hit Time << Miss Penalty [Figure: Upper Level Memory holding Blk X, Lower Level Memory holding Blk Y; data moves to and from the Processor through the upper level]
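These terms are commonly combined into an average memory access time, hit time + miss rate * miss penalty. That formula is the standard textbook one and is not stated on the slide; the function name and the sample numbers (1-cycle hit, 95% hit rate, 100-cycle miss penalty) are hypothetical.

```python
def average_access_time(hit_time, hit_rate, miss_penalty):
    """Average access time = hit time + miss rate * miss penalty."""
    miss_rate = 1 - hit_rate
    return hit_time + miss_rate * miss_penalty


amat = average_access_time(hit_time=1, hit_rate=0.95, miss_penalty=100)
print(f"average access time: {amat:.1f} cycles")
```

Because Hit Time << Miss Penalty, even a 5% miss rate makes the average several times the hit time, which is why small improvements in hit rate matter.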

How is the Hierarchy Managed? • registers <-> memory • by compiler (programmer?) • cache <-> main memory • by the hardware • main memory <-> disks • by the hardware and operating system (virtual memory) • by the programmer (files)

Summary • DRAM is slow but cheap and dense • Good choice for presenting the user with a BIG memory system • SRAM is fast but expensive and not very dense • Good choice for providing the user FAST access time • Two different types of locality • Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon. • Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon. • By taking advantage of the principle of locality: • Present the user with as much memory as is available in the cheapest technology. • Provide access at the speed offered by the fastest technology.
