Explore the performance analysis of synchronous and asynchronous buses for storage systems, comparing bandwidth, latency, and transaction efficiency. Learn how bus design impacts data transfer speeds.
Chapter 8: Part II Storage, Network and Other Peripherals
Performance Analysis: Sync. vs. Async. • Synchronous bus: clock cycle time = 50 ns, each bus transaction takes one clock cycle • Asynchronous bus: 40 ns per handshake • Data portion = 32 bits • Question: Find the bandwidth of each bus when performing one-word reads from a 200 ns memory.
Sync. vs. Async. Buses (I) • For the synchronous bus: • Send the address to memory:50 ns • Read the memory: 200 ns • Send the data to the device: 50 ns • Total time= 300 ns, bandwidth=4bytes/300ns=13.3 MB/sec
Sync. vs. Async. Buses (II) • For the asynchronous bus: • Step 1: 40 ns • Steps 2, 3, 4: max(3 x 40 ns, 200 ns) = 200 ns • Steps 5, 6, 7: 3 x 40 ns = 120 ns • Total time = 360 ns, maximum bandwidth = 4 bytes/360 ns = 11.1 MB/sec
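The two calculations above can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the numbers are taken directly from the slides (50 ns clock, 40 ns handshake, 200 ns memory, 32-bit data).

```python
WORD_BYTES = 4      # 32-bit data portion
MEMORY_NS = 200     # memory read time from the problem statement

def sync_read_ns(clock_ns=50, memory_ns=MEMORY_NS):
    # One cycle to send the address, the memory read, one cycle to return data.
    return clock_ns + memory_ns + clock_ns

def async_read_ns(handshake_ns=40, memory_ns=MEMORY_NS):
    # Step 1, then steps 2-4 overlapped with the memory read, then steps 5-7.
    return handshake_ns + max(3 * handshake_ns, memory_ns) + 3 * handshake_ns

def bandwidth_mb_per_s(bytes_moved, time_ns):
    # bytes/ns * 1e9 ns/s / 1e6 bytes/MB = bytes/ns * 1000
    return bytes_moved / time_ns * 1000

sync_bw = bandwidth_mb_per_s(WORD_BYTES, sync_read_ns())
async_bw = bandwidth_mb_per_s(WORD_BYTES, async_read_ns())
print(f"synchronous:  {sync_read_ns()} ns per read, {sync_bw:.1f} MB/s")
print(f"asynchronous: {async_read_ns()} ns per read, {async_bw:.1f} MB/s")
```

Changing `memory_ns` shows how the comparison shifts: with a faster memory the three overlapped handshakes, not the memory, bound steps 2 to 4 of the asynchronous protocol.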
Increasing Bus Bandwidth • Data bus width • Separate versus multiplexed address and data lines • Block transfers
Performance Analysis of Two Bus Schemes • Given a system with • a memory and bus system supporting block access of 4 to 16 words • a 64-bit synchronous bus clocked at 200MHz, with each 64-bit transfer taking 1 clock cycle, and 1 clock cycle to send an address to memory • two clock cycles needed between each bus operation • memory access for first 4 words takes 200ns, each additional set of 4 words requires 20ns
Question • Find the sustained bandwidth and latency for a read of 256 words for transfers using 4-word blocks and 16-word blocks. • Find the effective number of bus transactions for each case.
4-Word Block Transfer • 1 clock cycle to send address to memory • 200ns/(5ns/cycle) = 40 cycles to read memory • 2 cycles to send data from memory (four 32-bit words, two per 64-bit transfer) • 2 idle cycles • Total = 45 cycles • 256 words require 256/4 = 64 transactions, so 45x64 = 2880 cycles
4-Word Block Transfer • Latency = 2880 cycles x 5ns/cycle = 14400 ns • Bus transaction rate = 64 transactions/14400ns = 4.44M transactions/s • Bandwidth = (256x4 bytes)/14400ns = 71.11 MB/s
16-Word Block Transfer • 1 clock cycle to send address to memory • 40 cycles to read first 4 words from memory • 2 cycles to send data, during which the read of the next 4 words is started. • 2 idle cycles between transfers, during which the read of the next 4 words is completed. • The last two steps are repeated 3 more times (4 times in all) to read a total of 16 words.
16-Word Block Transfer • Total cycles required = 1 + 40 + 4x(2+2) = 57 cycles • 256/16 = 16 transactions are required • Total number of cycles required for 256 words = 16x57 = 912 cycles, latency = 912 x 5ns = 4560 ns • Bus transaction rate = 16 transactions/4560ns = 3.51M transactions/s • Bandwidth = (256x4 bytes)/4560ns = 224.56 MB/s
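Both block sizes follow the same cycle accounting, so one function can reproduce the two results. The sketch below uses hypothetical names and assumes, as the slides do, that a word is 4 bytes and that the 20 ns read of each later 4-word group is hidden behind the 2 transfer cycles and 2 idle cycles of the previous group.

```python
CYCLE_NS = 5          # 200 MHz bus clock
WORD_BYTES = 4        # 32-bit words, as used in the bandwidth figures
TOTAL_WORDS = 256

def cycles_per_transaction(block_words):
    address_cycles = 1
    first_read_cycles = 200 // CYCLE_NS   # 40 cycles for the first 4 words
    groups = block_words // 4             # 4-word groups in one block
    # Each group costs 2 transfer cycles plus 2 idle cycles.
    return address_cycles + first_read_cycles + groups * (2 + 2)

def analyze(block_words):
    transactions = TOTAL_WORDS // block_words
    cycles = transactions * cycles_per_transaction(block_words)
    latency_ns = cycles * CYCLE_NS
    transactions_per_s = transactions / latency_ns * 1e9
    bandwidth_mb_s = TOTAL_WORDS * WORD_BYTES / latency_ns * 1000
    return cycles, latency_ns, transactions_per_s, bandwidth_mb_s

for block in (4, 16):
    cycles, latency, rate, bw = analyze(block)
    print(f"{block:2d}-word blocks: {cycles} cycles, {latency} ns, "
          f"{rate / 1e6:.2f}M transactions/s, {bw:.2f} MB/s")
```

The larger block pays the 40-cycle initial read only 16 times instead of 64, which is where most of the threefold bandwidth gain comes from.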
Bus Arbitration • Daisy chain arbitration (not very fair) • Centralized arbitration (requires an arbiter), e.g., PCI • Self selection, e.g., NuBus used in Macintosh • Collision detection, e.g., Ethernet
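To see why daisy chain arbitration is described as not very fair, consider a toy model in which the grant signal passes down the chain and is taken by the first requesting device. The function name and the three-device setup below are illustrative assumptions, not part of the slides.

```python
def daisy_chain_grant(requests):
    """Return the position of the device that wins the bus, or None.

    `requests` is ordered from the device nearest the arbiter (highest
    priority) to the farthest; the first requester intercepts the grant.
    """
    for position, requesting in enumerate(requests):
        if requesting:
            return position
    return None

# Devices 0 and 2 request the bus on every round; device 2 never wins.
grants = [daisy_chain_grant([True, False, True]) for _ in range(5)]
print(grants)
```

A device far down the chain is served only when every device ahead of it is idle, so a busy high-priority device can starve it.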
Bus Standards • PCI (a general-purpose backplane bus) • SCSI (Small Computer System Interface) • IEEE 1394 (FireWire) • USB 2.0
Interfacing I/O Devices • How is a user I/O request transformed into a device command and communicated to the device? • How is data actually transferred to or from a memory location? • What is the role of the operating system?
Role of the OS • The OS plays a major role in handling I/O, in that: • the I/O system is shared by multiple programs using the processor • the I/O system often uses interrupts (which cause a transfer to supervisor mode) • low-level control of I/O is complex
Communications between OS and I/O Devices • The OS must be able to give commands to I/O devices. • An I/O device must be able to notify the OS when an operation has completed or an error has occurred. • Data must be transferred between memory and an I/O device.
Giving Commands to I/O • To give a command, the processor must be able to address the device and to supply command words: • memory-mapped I/O: portions of the address space are assigned to I/O devices • special I/O: dedicated I/O instructions in the processor.
Communicating with the Processor • Polling • Interrupts • DMA
Polling • Polling: processor periodically checks the status of I/O. • Overhead of polling in an I/O system • Example 1: mouse • Example 2: floppy disk • Example 3: hard disk
Mouse • Assume the number of clock cycles for a polling operation, including transferring to the polling routine, accessing the device, and restarting the user program, is 400, with a 500 MHz clock. • The mouse must be polled 30 times a second to ensure that no user movement is missed. • Fraction of CPU time = 30x400/(500x10^6) = 0.0024%, or roughly 0.002%
Floppy Disk • The floppy disk transfers data to the processor in 16-bit units and has a data rate of 50KB/s. • Polling rate = (50KB/s)/(2 bytes/poll) = 25K polls/s • Fraction of CPU time = 25Kx400/(500x10^6) = 2%
Hard Disk • Transfer in 4-word blocks • Transfer rate: 4MB/s • Polling rate = (4MB/s)/(4x4 bytes/poll) = 250K polls/s • Fraction of CPU time = 250Kx400/(500x10^6) = 20%
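The three polling examples differ only in the required polling rate, as the following sketch makes explicit. The helper name is hypothetical, and KB and MB are taken as 10^3 and 10^6 bytes, which is what the slide arithmetic implies.

```python
CLOCK_HZ = 500e6      # 500 MHz processor clock
POLL_CYCLES = 400     # cost of one polling operation

def polling_fraction(polls_per_second):
    # Fraction of all processor cycles spent polling.
    return polls_per_second * POLL_CYCLES / CLOCK_HZ

mouse = polling_fraction(30)              # 30 polls per second
floppy = polling_fraction(50e3 / 2)       # 50 KB/s in 2-byte units
hard_disk = polling_fraction(4e6 / 16)    # 4 MB/s in 16-byte blocks

for name, fraction in (("mouse", mouse), ("floppy", floppy),
                       ("hard disk", hard_disk)):
    print(f"{name}: {fraction * 100:.4f}% of CPU time")
```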
Overhead of Polling • Polling can be restricted to periods when the device is active, which reduces the overhead. • However, the overhead is still significant, which motivates another design called interrupt-driven I/O.
Overhead of Interrupt-Driven I/O • Assume the overhead for each transfer, including the interrupt, is 500 cycles. • Cycles per second for the hard disk (250K transfers/s) = 250Kx500 = 125x10^6 cycles • Fraction of processor consumed = 125x10^6/(500x10^6) = 25% • Assuming the disk is transferring data 5% of the time, fraction of CPU on average = 25%x5% = 1.25%
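The same arithmetic in Python, as a short sketch with hypothetical variable names. The key difference from polling is that the processor pays the overhead only while the disk is actually transferring.

```python
CLOCK_HZ = 500e6           # 500 MHz processor clock
INTERRUPT_CYCLES = 500     # overhead per transfer, including the interrupt
TRANSFERS_PER_S = 4e6 / 16 # hard disk: 4 MB/s in 16-byte blocks
ACTIVE_FRACTION = 0.05     # disk transfers data 5% of the time

# Overhead while the disk is transferring.
busy_fraction = TRANSFERS_PER_S * INTERRUPT_CYCLES / CLOCK_HZ
# Average overhead, since no interrupts occur while the disk is idle.
average_fraction = busy_fraction * ACTIVE_FRACTION

print(f"while transferring: {busy_fraction:.2%}")
print(f"on average:         {average_fraction:.2%}")
```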
Direct Memory Access (DMA) • If the disk is transferring data most of the time, the overhead for interrupt-driven I/O is still high. • For a high-bandwidth device, let the device controller transfer data directly to or from memory without involving the processor; this is known as direct memory access. • An interrupt is used to signal the completion of an I/O transfer or an error. • Note: How does it affect cache design?
Overhead of I/O Using DMA • Assume the initial setup of a DMA transfer takes 1000 cycles, handling of the interrupt at DMA completion takes 500 cycles, and the average transfer from disk is 8KB • Each DMA transfer takes 8KB/(4MB/s) = 2x10^-3s • If the disk is constantly transferring data, it requires: (1000+500)/(2x10^-3) = 750x10^3 cycles per second • Fraction of CPU time = 750x10^3/(500x10^6) = 0.15%
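A quick check of the DMA figures, again as a sketch with hypothetical names and decimal units (8 KB = 8000 bytes, 4 MB/s = 4x10^6 bytes/s) to match the slide arithmetic.

```python
CLOCK_HZ = 500e6          # 500 MHz processor clock
SETUP_CYCLES = 1000       # processor work to start a DMA transfer
COMPLETION_CYCLES = 500   # processor work to handle the completion interrupt
TRANSFER_BYTES = 8e3      # average transfer size
DISK_BYTES_PER_S = 4e6    # disk transfer rate

seconds_per_transfer = TRANSFER_BYTES / DISK_BYTES_PER_S
# Processor cycles consumed per second if the disk never stops transferring.
cycles_per_second = (SETUP_CYCLES + COMPLETION_CYCLES) / seconds_per_transfer
cpu_fraction = cycles_per_second / CLOCK_HZ

print(f"{seconds_per_transfer * 1e3:.1f} ms per transfer")
print(f"{cpu_fraction:.2%} of CPU time")
```

Because the processor is involved once per 8 KB rather than once per 16 bytes, the overhead stays small even with the disk busy all the time.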
I/O System Design • Latency constraints: ensuring the latency to complete an I/O operation is bounded. • Bandwidth constraints • Performance analysis techniques: queuing theory, simulation, analysis
I/O System Design - Example • CPU: 3 BIPS, average 100,000 instructions in the OS per I/O operation • Backplane bus transfer rate: 1000 MB/s • SCSI Ultra320 controller with transfer rate = 320 MB/s, accommodating up to 7 disks • Disk bandwidth = 75 MB/s, seek + rotational latency = 6 ms • Workload: 64-KB reads, user program needs 200,000 instructions per I/O
Example • Find • the maximum sustainable I/O rate • the number of disks and SCSI controller required.
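The slides pose the question without a worked answer. One way to approach it is to find the I/O rate each component can sustain and take the smallest, then size the disks and controllers to match. The sketch below does this under stated assumptions that are not in the slides: decimal units (64 KB = 64,000 bytes), every read pays the full 6 ms seek plus rotational latency, and disks can be driven at 100% utilization with no queuing delay. All names are hypothetical.

```python
import math

CPU_INSTR_PER_S = 3e9                  # 3 BIPS
INSTR_PER_IO = 200_000 + 100_000       # user program + OS instructions
BUS_BYTES_PER_S = 1000e6               # backplane bus
SCSI_BYTES_PER_S = 320e6               # per controller
DISKS_PER_CONTROLLER = 7
DISK_BYTES_PER_S = 75e6
DISK_LATENCY_S = 6e-3                  # seek + rotational latency
IO_BYTES = 64e3                        # assumed: 64 KB = 64,000 bytes

cpu_io_rate = CPU_INSTR_PER_S / INSTR_PER_IO
bus_io_rate = BUS_BYTES_PER_S / IO_BYTES
max_io_rate = min(cpu_io_rate, bus_io_rate)   # the slower one is the bottleneck

# Assumed: each read costs the full latency plus the transfer time.
disk_time_s = DISK_LATENCY_S + IO_BYTES / DISK_BYTES_PER_S
disk_io_rate = 1 / disk_time_s
disks_needed = math.ceil(max_io_rate / disk_io_rate)

# Check that 7 disks cannot saturate one SCSI controller before dividing.
scsi_load = DISKS_PER_CONTROLLER * disk_io_rate * IO_BYTES
assert scsi_load < SCSI_BYTES_PER_S
controllers_needed = math.ceil(disks_needed / DISKS_PER_CONTROLLER)

print(f"max sustainable rate: {max_io_rate:.0f} I/Os per second")
print(f"disks: {disks_needed}, SCSI controllers: {controllers_needed}")
```

Under these assumptions the processor, not the bus, limits the I/O rate, and the disk count is dominated by the 6 ms latency rather than by disk bandwidth.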