System Organization Dr. Qiang Lin
Why Study System Organization
• Writing even a modest 80x86 assembly language program requires considerable familiarity with the 80x86 family
• Writing good assembly language programs requires a strong knowledge of the underlying hardware
• Unfortunately, the underlying hardware is not consistent
• Techniques crucial for 8088 programs may not be useful on 80486 systems
• Programming techniques that provide big performance boosts on an 80486 chip may not help at all on an 80286
• Fortunately, some programming techniques work well whatever microprocessor you are using
Basic System Components
• John Von Neumann, a pioneer in computer design, is credited with the architecture of most computers in use today, including the 80x86 family. A typical Von Neumann Architecture (VNA) system has three major components:
• CPU, memory, and I/O
• How a system designer combines them impacts system performance
System Bus
• The system bus connects the various components of a VNA machine
• The 80x86 family system bus consists of 3 major buses: the address bus, the data bus, and the control bus
• They vary from processor to processor, e.g., the data bus may have a different implementation on the 80386 than on the 8088, but both carry data between the processor, I/O, and memory
• A typical 80x86 system uses standard TTL logic levels
• Each wire on a bus uses a standard voltage level to represent zero and one
• We specify only zero and one, rather than electrical levels, because the levels vary on different processors, especially laptops
Data Bus
• Shuffles data between the various components of a computer system. The data bus size varies widely across the 80x86 family
• On 80x86s, the data bus contains 8, 16, 32, or 64 lines (bits)
• 8088: 8 bits; 8086 and 80286: 16 bits
• 80386, 80486, Pentium, Pentium II, Pentium III and Pentium 4: 32 bits
• Itanium: 64 bits
• An 8-bit data bus does not limit the processor to 8-bit data types, but the processor can only access one byte of data per memory cycle
• Processors with a 16-bit bus are naturally faster than processors with an 8-bit bus. Likewise, processors with a 32-bit bus are faster than those with a 16- or 8-bit data bus
Address Bus
• Question: which memory location or I/O device?
• To differentiate memory locations and I/O devices, a unique address is assigned to each memory element and I/O device
• When software wants to access a particular memory location or I/O device, it places the corresponding address on the address bus
• Circuitry associated with the memory or I/O device recognizes this address and instructs the memory or I/O device to read data from, or place data on, the data bus
• In either case, all other memory locations ignore the request
• With n address lines, the processor can provide 2^n unique addresses
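The 2^n relationship above can be checked with a short sketch (Python used purely for illustration; the function name is our own, and the bus widths are those listed in the table on the next slide):

```python
# Sketch: an n-line address bus can generate 2**n unique addresses.
# Bus widths below match the 80x86 parts discussed in the text.

def addressable_bytes(address_lines: int) -> int:
    """An n-bit address bus selects one of 2**n byte locations."""
    return 2 ** address_lines

for cpu, lines in [("8088", 20), ("80286", 24), ("80386dx", 32), ("Itanium", 64)]:
    print(f"{cpu}: {addressable_bytes(lines):,} bytes")
```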
80x86 Family Address Bus Sizes

Processor   Address Bus Size   Max Addressable Memory        In English!
8088        20                 1,048,576                     One Megabyte
8086        20                 1,048,576                     One Megabyte
80188       20                 1,048,576                     One Megabyte
80186       20                 1,048,576                     One Megabyte
80286       24                 16,777,216                    Sixteen Megabytes
80386sx     24                 16,777,216                    Sixteen Megabytes
80386dx     32                 4,294,967,296                 Four Gigabytes
80486       32                 4,294,967,296                 Four Gigabytes
Pentium     32                 4,294,967,296                 Four Gigabytes
Itanium     64                 18,446,744,073,709,551,616    Sixteen Exabytes
Control Bus
• A collection of signals that control how the processor communicates with the rest of the system
• The exact makeup of the control bus varies across the 80x86 family; common lines are:
• read and write: control the direction of data on the data bus. When the read line is low (logic zero), the CPU reads data from memory; when the write line is low, the system transfers data from the CPU to memory
• byte enable: allows 16-, 32-, and 64-bit CPUs to deal with smaller chunks of data
• Two distinct address spaces: one for memory and one for I/O; the I/O address bus on all 80x86 CPUs is 16 bits, but both spaces share the same address bus for I/O and memory addresses
• Others: system timing, wait states, etc.
Memory Subsystem
• Question: what exactly is a memory location?
• 80x86s support segmented, byte-addressable memory, so the basic memory unit is a byte located within some segment
• A word uses 2 consecutive memory addresses, or 2 bytes. Similarly, a double word uses 4 consecutive memory locations
• Examples: memory locations with 8-bit, 16-bit and 32-bit data buses, respectively (figure not reproduced)
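As a concrete sketch of how a word and a double word occupy consecutive byte addresses (Python for illustration; note an 80x86 detail the slide does not mention: these CPUs are little-endian, so the byte at the lowest address is the least significant):

```python
# Sketch: a word spans 2 consecutive byte addresses, a double word 4.
# 80x86 CPUs store multi-byte values little-endian: low byte first.

def read_word(memory: bytes, addr: int) -> int:
    # A word occupies addresses addr and addr + 1.
    return memory[addr] | (memory[addr + 1] << 8)

def read_dword(memory: bytes, addr: int) -> int:
    # A double word occupies addresses addr .. addr + 3.
    return read_word(memory, addr) | (read_word(memory, addr + 2) << 16)

mem = bytes([0x78, 0x56, 0x34, 0x12])
print(hex(read_word(mem, 0)))    # 0x5678
print(hex(read_dword(mem, 0)))   # 0x12345678
```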
I/O Subsystem
• The 80x86 family provides a 16-bit I/O address bus
• Other than separate control lines and a smaller bus, I/O addressing behaves exactly like memory addressing
• Memory and I/O devices both share the same data bus
• Three limitations:
• Requires special instructions to access I/O devices
• IBM PC designers used the "best" I/O locations for their own purposes, forcing third-party developers to use less accessible locations
• Can address no more than 65,536 I/O addresses
System Timing and Clock
• On Von Neumann machines, such as the 80x86s, most operations are serialized
• The computer executes commands in a prescribed order
• To ensure all operations occur at just the right moment, 80x86 CPUs use an alternating signal called the system clock
• At the most basic level, the system clock handles all synchronization within a computer system
System Clock (Continued)
• The frequency with which the system clock alternates between zero and one is the system clock frequency
• The time for the system clock to switch from zero to one and back to zero is the clock period, also called a clock cycle
• Most CPUs start an operation on the falling or rising edge of the clock
• Since all CPU operations are synchronized around the clock, the CPU cannot run tasks faster than the clock
• Many operations take multiple clock cycles to complete, so the CPU often performs operations at a significantly lower rate
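The clock period is simply the reciprocal of the clock frequency; a minimal sketch (the function name is our own):

```python
# Sketch: the clock period is the reciprocal of the clock frequency.

def clock_period_ns(frequency_hz: float) -> float:
    return 1e9 / frequency_hz          # nanoseconds per cycle

print(clock_period_ns(20e6))   # a 20 MHz clock has a 50 ns period
print(clock_period_ns(50e6))   # a 50 MHz clock has a 20 ns period
```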
Memory Access and System Clock
• Memory access is an operation synchronized around the system clock
• 80x86 processors take from one to four clock cycles to access a memory location
• The 8088 and 8086 CPUs require four clock cycles to access memory; the 80486 requires only one
• Memory access time is the time between a memory operation request (read or write) and the operation's completion
• Examples of read and write cycles (figure not reproduced)
Wait States
• A wait state is nothing more than an extra clock cycle that gives some device time to complete an operation
• Example: on a 50 MHz 80486 the clock period is 20 ns; if the system loses 10 ns to buffering and decoding, memory must respond in less than 10 ns
• If cost-effective memory won't work with a fast CPU, the answer is a wait state. For example, a 20 MHz processor has a memory cycle time of 50 ns; if it loses 10 ns to buffering and decoding, it needs 40 ns memory
• If you can only afford 80 ns memory, adding one wait state extends the memory cycle to 100 ns (two clock cycles) and solves the problem: subtracting 10 ns for decoding and buffering leaves 90 ns, so the 80 ns memory responds well before the CPU requires the data
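The wait-state arithmetic above can be captured in a small sketch (Python for illustration; the helper name and the rounding-up of fractional states are our assumptions):

```python
# Sketch of the wait-state arithmetic: given a CPU clock, a decode/buffer
# overhead, and a memory part's access time, how many extra clock cycles
# (wait states) must be inserted?
import math

def wait_states(clock_mhz: float, overhead_ns: float, memory_ns: float) -> int:
    cycle_ns = 1000.0 / clock_mhz       # one clock period in ns
    budget = cycle_ns - overhead_ns     # time left for memory in one cycle
    if memory_ns <= budget:
        return 0                        # memory is fast enough as-is
    # Each wait state adds one full clock period to the budget.
    return math.ceil((memory_ns - budget) / cycle_ns)

# 20 MHz CPU, 10 ns lost to buffering/decoding, 80 ns memory -> 1 wait state
print(wait_states(20, 10, 80))
```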
Cache Memory
• A typical program tends to access the same memory locations repeatedly and to access adjacent memory locations; these patterns are known as temporal locality of reference and spatial locality of reference, respectively
• Generally, a typical program might use only 10-20% of the memory allotted to it
• Cache memory sits between the CPU and main memory
• It is a small amount of very fast (zero wait state) memory
• Bytes appearing within a cache do not have fixed addresses. Instead, cache memory can reassign the address of a data object, which allows it to keep recently accessed values in the cache. Addresses the CPU has never accessed, or hasn't accessed in some time, remain in main (slow) memory. Since most memory accesses are to recently accessed variables, the data generally appears in cache memory
• A cache hit occurs whenever the CPU accesses the cache and finds the data there. A cache miss occurs when the CPU accesses the cache and the data is not there
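A toy model of the hit/miss behavior described above (Python; real 80x86 caches work on cache lines with set associativity, so this least-recently-used sketch captures only the idea, and all names are our own):

```python
# Minimal sketch of a cache between the CPU and main memory: recently
# used addresses stay in a small fast store; when full, the least
# recently used entry is evicted back toward main memory.
from collections import OrderedDict

class Cache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()       # address -> value, oldest first
        self.hits = self.misses = 0

    def read(self, address: int, main_memory: dict) -> int:
        if address in self.store:
            self.hits += 1
            self.store.move_to_end(address)      # mark most recently used
        else:
            self.misses += 1
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)   # evict least recently used
            self.store[address] = main_memory[address]
        return self.store[address]

ram = {a: a * 2 for a in range(100)}
cache = Cache(capacity=2)
for a in [5, 5, 5, 9, 5, 7, 9]:    # repeated accesses mostly hit
    cache.read(a, ram)
print(cache.hits, cache.misses)    # temporal locality pays off
```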
Stack Memory
• A special type of data structure in which items are removed in the reverse order from that in which they are added
• The most recently added item is the first one removed, also called last-in, first-out (LIFO)
• Adding an item to a stack is called a push
• Removing an item from a stack is called a pop
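A sketch of how the 8086 hardware stack realizes LIFO order (the slide does not give these details, but on the 8086 PUSH decrements SP by 2 and stores a word at SS:SP, and POP does the reverse; little-endian byte order assumed, and the memory array here is a stand-in for the stack segment):

```python
# Sketch of the 8086 hardware stack: PUSH decrements SP by 2 and stores
# a word at SS:SP; POP reads the word back and increments SP by 2.

memory = bytearray(0x10000)   # one 64 KB stack segment
sp = 0x0100                   # stack pointer (offset within SS)

def push(word: int):
    global sp
    sp -= 2
    memory[sp] = word & 0xFF               # low byte at the lower address
    memory[sp + 1] = (word >> 8) & 0xFF    # high byte above it

def pop() -> int:
    global sp
    word = memory[sp] | (memory[sp + 1] << 8)
    sp += 2
    return word

push(0x1234)
push(0xBEEF)
print(hex(pop()))   # 0xbeef — last in, first out
print(hex(pop()))   # 0x1234
```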
8086 CPU Registers
• Special memory locations implemented on the CPU chip
• Registers on the 8086:
• Four segment registers that partition memory into 64 Kbyte segments, so the CPU can address a maximum of 256 Kbytes at a time:
• CS (code segment) for the program
• SS (stack segment) for stack memory
• DS (data segment) for data
• ES (extra segment) for extra data
• Four general-purpose registers:
• AX, accumulator register, for word multiply, divide and I/O
• BX, base register, for translation
• CX, count register, for string operations and loops
• DX, data register, for word multiply, divide and indirect I/O
8086 CPU Registers (Continued)
• One instruction pointer register, IP, to identify the location of the next instruction to be fetched
• Two pointer registers to store offset addresses into SS:
• SP, stack pointer
• BP, base pointer
• Two index registers to store offset addresses into DS or ES:
• SI, source index
• DI, destination index
• One status register, also known as the flags register, holding:
• six status bits: CF for carry, PF for parity, AF for auxiliary carry, ZF for zero, SF for sign, OF for overflow
• three control flags: IF for interrupt enable, DF for direction, and TF for trap
Segmented Address Space
• Segmentation, Pros:
• Provides a powerful memory management mechanism
• Allows programmers to partition their programs into modules that operate independently of one another
• Allows two processes to easily share data, and extends CPU addressability
• Segmentation, Cons:
• Limited addressability within a segment
• Harder to run programs with huge memory requirements
8086 Instruction Set
• Data-Transfer: move data either between internal registers or between an internal register and a storage location in memory
• Arithmetic: instructions for addition, subtraction, multiplication, and division
• Logic: the logic operations AND, OR, XOR and NOT
• Shift: logical shift and arithmetic shift
• Rotate: rotate left, rotate right, rotate left through carry, and rotate right through carry
8086 Physical Address Calculation
• Segment base address (SBA): identifies the starting location of the segment in memory
• Effective address (EA): represents the offset of the operand from the beginning of the segment
• To reference an operand in memory, the 8086 must calculate the physical address (PA):
• PA = SBA:EA, where SBA comes from {CS, SS, DS, ES}; the 8086 forms the 20-bit physical address as SBA × 16 + EA
• EA can be any combination, or all, of: Base {BX, BP} + Index {SI, DI} + Displacement {8-bit, 16-bit}
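On the 8086 the SBA:EA pair becomes a 20-bit physical address by shifting the 16-bit segment value left 4 bits (multiplying by 16) and adding the 16-bit offset; a minimal sketch (function name is our own):

```python
# Sketch: 8086 physical address = segment * 16 + offset, truncated to
# the chip's 20 address lines (so the sum wraps at one megabyte).

def physical_address(segment: int, offset: int) -> int:
    return ((segment << 4) + offset) & 0xFFFFF

print(hex(physical_address(0x1234, 0x0010)))   # DS=1234h, EA=0010h -> 0x12350
print(hex(physical_address(0xFFFF, 0x0010)))   # wraps past 1 MB -> 0x0
```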
8086 Addressing Modes
• Supports six addressing modes:
• immediate addressing: encodes the operand's data directly in the instruction, without involving memory or other registers
• direct addressing: encodes the operand's effective address directly in the instruction
• register indirect addressing: PA = {CS, DS, SS, ES}:{BX, BP, SI, DI}
• based addressing: EA = {BX, BP} + displacement
• indexed addressing: similar to based addressing but uses an index register, with the displacement value acting as a pointer to the starting point of an array in memory
• based-indexed addressing, for two-dimensional arrays: PA = {CS, DS, SS, ES}:{BX, BP} + {SI, DI} + displacement
• The base register holds the m coordinate and the index register the n coordinate
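The based-indexed mode's effective-address arithmetic is just a 16-bit sum of base, index, and displacement; a sketch (register values below are arbitrary illustrations, not from the slides):

```python
# Sketch: based-indexed effective address on the 8086,
# EA = base (BX or BP) + index (SI or DI) + displacement,
# with the result wrapping to 16 bits like any 8086 offset.

def based_indexed_ea(base: int, index: int, displacement: int) -> int:
    return (base + index + displacement) & 0xFFFF

bx, si = 0x1000, 0x0004        # e.g., row base of an array, element offset
print(hex(based_indexed_ea(bx, si, 0x02)))   # 0x1006
```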
Step-by-Step Instruction Execution
• x86 CPUs may not complete execution of an instruction in a single clock cycle
• The CPU often executes several steps for each instruction
• For example, for the instruction mov reg, reg/memory/constant, the CPU:
• Fetches the instruction byte from memory
• Updates the IP register to point at the next byte
• Decodes the instruction to see what it does
• If required, fetches a 16-bit instruction operand from memory
• If required, updates IP to point beyond the operand
• Computes the address of the operand, if required
• Fetches the operand
• Stores the fetched value into the destination register
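The steps above can be sketched as a toy fetch/decode/execute loop. The opcodes and the 2-byte "mov reg, constant" encoding below are invented for illustration and are not real 8086 encodings:

```python
# Toy sketch of the fetch/decode/execute steps listed above, for an
# invented 2-byte instruction "mov reg, constant": the opcode byte
# selects the destination register, the next byte is an 8-bit operand.

registers = {"ax": 0, "bx": 0}
MOV_AX, MOV_BX = 0x01, 0x02                       # invented opcodes

def step(code: bytes, ip: int) -> int:
    opcode = code[ip]                             # 1. fetch opcode byte
    ip += 1                                       # 2. update IP past opcode
    dest = {MOV_AX: "ax", MOV_BX: "bx"}[opcode]   # 3. decode
    operand = code[ip]                            # 4. fetch operand
    ip += 1                                       # 5. update IP past operand
    registers[dest] = operand                     # 6-8. store into destination
    return ip

program = bytes([MOV_AX, 0x2A, MOV_BX, 0x07])
ip = 0
while ip < len(program):
    ip = step(program, ip)
print(registers)   # {'ax': 42, 'bx': 7}
```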
Differences Between 80x86 Processors
• 8086 - the slowest; no instruction can complete in one clock cycle
• The above mov instruction takes between 5 and 12 clock cycles to execute, depending upon the operands
• The longer the instruction, the more time it takes
• Instructions without memory access execute faster
• Instructions using complex addressing modes run slower
• 80286 - the major improvement is the prefetch queue
• Whenever the CPU is not using the Bus Interface Unit (BIU), the BIU can fetch additional bytes from the instruction stream
• Whenever the CPU needs an instruction or operand byte, it grabs the next available byte from the prefetch queue
Differences Between 80x86 Processors (Cont)
• 80486 - incorporates a pipeline to improve performance
• With a six-stage pipeline it can overlap the execution of six separate instructions (figure not reproduced)
• Once the CPU fills the pipeline, it completes one instruction on each cycle
• This is true even if there are complex addressing modes to be computed, memory operands to fetch, or other operations which consume cycles on a non-pipelined processor
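The claim that a full pipeline completes one instruction per cycle gives a simple cycle count: the first instruction takes one cycle per stage, and each later instruction finishes one cycle after its predecessor. A sketch (function name is our own; stalls and flushes, covered on the next slides, are ignored here):

```python
# Sketch: n instructions through a k-stage pipeline, with no stalls or
# flushes, take k + (n - 1) cycles.

def pipeline_cycles(n_instructions: int, stages: int = 6) -> int:
    return stages + (n_instructions - 1)

print(pipeline_cycles(1))     # 6 cycles to fill the pipeline once
print(pipeline_cycles(100))   # 105 cycles for 100 instructions
```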
Stalls In a Pipeline
• Unfortunately, the previous scenario is too simplistic
• Two problems, bus contention among instructions and non-sequential execution, may increase average execution time in a pipeline
• For example, if one instruction needs to store data in memory while another reads data from memory, contention for the address and data bus may develop, since the CPU will be trying to fetch data and write data in memory simultaneously
• One simplistic way to handle bus contention is through a pipeline stall
• The CPU, when faced with contention for the bus, gives priority to the instruction furthest along in the pipeline
• The CPU suspends fetching opcodes until the current instruction fetches (or stores) its operand
• This causes the new instruction in the pipeline to take two cycles to execute
Flush a Pipeline
• What happens when an instruction modifies the IP register?
• By the time the instruction jmp 1000 completes execution, the CPU has already started five other instructions and is only one clock cycle away from completing the first
• Obviously, the CPU must not execute those instructions or it will compute improper results
• The only reasonable solution is to flush the entire pipeline and begin fetching opcodes anew!
• Doing so causes a severe execution-time penalty, since it will take 6 clock cycles before the next instruction completes execution
• This is why a pipeline should not be long, and why control transfer instructions should be kept to a minimum in a pipelined system
Prefetch Queue of 80486
• Bus contention can be reduced by prefetching
• The prefetch queue simply acts as a buffer between the instruction stream in memory and the opcode fetching circuitry
• But the 80486 cannot take advantage of the prefetch queue as well as the 80286 can, since it constantly accesses memory
• Only when the CPU is not accessing memory can the BIU fetch additional opcode bytes for the prefetch queue. Thus, the 80486 prefetch queue can rarely use any "dead" bus cycles to fetch additional opcode bytes
Cache of 80486
• Bus contention can also be reduced by a cache
• Be aware of how the cache operates to write the best programs
• On-chip instruction caches are generally quite small, so the shorter the instructions, the more of them will fit in the cache. The more instructions in the cache, the less often bus contention will occur
• Using registers to hold temporary results places less strain on the data cache, so it does not need to flush data to memory or retrieve data from memory as often. Use the registers wherever possible!
I/O (Input/Output)
• I/O-mapped I/O
• The CPU uses special instructions to transfer data between the computer system and the outside world
• Example: a printer uses three ports: a read/write port, an input port and an output port
• Memory-mapped I/O
• The CPU uses special memory locations in its normal address space to communicate with real-world devices
I/O (continued)
• Direct Memory Access (DMA)
• To input ten bytes (words) from an input port and store them into memory, the CPU must read each value and store it into memory
• For very high-speed I/O devices, the CPU may be too slow when processing data one byte (word) at a time
• A special form of memory-mapped I/O where a peripheral device reads and writes memory without going through the CPU is known as direct memory access
• It allows an I/O operation to proceed in parallel with other CPU operations, increasing overall system speed
Polled I/O and Interrupt
• I/O devices may not accept data at an arbitrary rate, e.g.:
• A Pentium PC can send several million characters per second to a printer, but the printer cannot print that many in a second
• A keyboard cannot provide several million keystrokes per second
• The CPU needs some way to coordinate data transfer between the computer system and its peripheral devices
• Polled I/O: the CPU constantly tests a status I/O port to see if data is available to receive or if the device is ready to send; inherently inefficient
• Interrupt: an external hardware event causes the CPU to stop the current instruction sequence and call a special interrupt service routine
• The current CPU context must be saved so it can be restored later
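The busy-waiting that makes polled I/O inefficient can be sketched as below (the status port is simulated with a generator of our own invention; a real driver would read an actual I/O port):

```python
# Sketch of polled I/O: the CPU spins on a status port until the device
# reports ready. The "port" here is simulated; it reads 0 (busy) three
# times, then 1 (ready) forever.
import itertools

_status_values = itertools.chain([0, 0, 0], itertools.repeat(1))

def read_status_port() -> int:
    return next(_status_values)

polls = 0
while read_status_port() == 0:   # busy-wait: inherently inefficient
    polls += 1
print(f"device ready after {polls} polls")
```

Each wasted poll is a cycle the CPU could have spent on useful work, which is exactly what the interrupt mechanism on this slide avoids.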