MIPS 64-bit Processors Project – MSCS 521 Computer Architecture MANAN SHAH (Block Diagram and its detailed explanation, Instruction set) CHINTAN SHIHORA (Overview, Features, Intro to 64-bit processing, pipelining information, Pros & Cons) [Marist College, Spring 2008] Guided By: Prof. James Ten Eyck
Overview • The MIPS Instruction Set Architecture has evolved over time from the original MIPS 1 ISA, through the MIPS 5 ISA, to the current MIPS32 and MIPS64 Architectures. All extensions have been backward compatible with previous versions of the Instruction Set Architecture. • The MIPS 3 level of the Instruction Set Architecture added 64-bit integers and addresses to the instruction set, while the MIPS 4 and MIPS 5 levels added improved floating-point operations, as well as a set of instructions intended to improve the efficiency of generated code and of data movement.
Overview (cont.) • The MIPS64 Architecture is based on the MIPS 5 ISA and is backward compatible with the MIPS32 Architecture. Both the MIPS32 and MIPS64 Architectures bring the privileged environment into the Architecture definition to address the needs of operating systems and other kernel software. The MIPS64 Architecture is intended to address the need for a high-performance but cost-sensitive MIPS instruction set. • It includes facilities for adding MIPS Application Specific Extensions (ASEs), User Defined Instructions (UDIs), and custom coprocessors to address specific needs.
What does 64-bit refer to? It refers to the number of bits that can be processed or transmitted in parallel. For a microprocessor, it indicates the width of the registers, the special high-speed storage areas within the CPU. A 64-bit processor is therefore one whose registers store 64-bit numbers. Compared with a 32-bit design, each register holds twice as many bits, so a single integer operation can work on twice as much data, although this does not by itself double overall performance.
Need for a 64-bit processor A 64-bit processor is needed for applications that address large amounts of data and memory, such as high-performance servers, database management systems, CAD tools, and digital content creation tools. One reason to choose a 64-bit processor is its enlarged address space. A 32-bit chip can address at most 2^32 bytes (4 GB) of memory, and often only 2 GB of that is available to a single process. A 4 GB limit can be a severe problem for server machines and machines running large databases. A 64-bit chip has none of these constraints, because a 64-bit address space of 2^64 bytes (16 exabytes) is, for practical purposes, unlimited.
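
The limits quoted above follow directly from the address width. A minimal Python sketch of the arithmetic (the function name `addressable_bytes` is illustrative and not part of any MIPS specification):

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** address_bits

GIB = 2 ** 30  # bytes in one binary gigabyte

# A 32-bit address reaches 4 GB; a 64-bit address reaches 2**34 GB (16 exabytes).
print(addressable_bytes(32) // GIB)  # 4
print(addressable_bytes(64) // GIB)  # 17179869184
```

Real processors usually implement fewer than 64 physical address bits, so the second figure is an architectural ceiling, not the amount of RAM a given chip can drive.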
FEATURES OF MIPS 64-bit processors
Features: • 64-bit virtual addresses. • 64-bit program counter. • Flat address space with single code, data, and stack space. • Dual-issue 64-bit superscalar architecture. • High-performance 64-bit integer unit. • High-throughput, fully pipelined 64-bit floating-point unit. • High-performance SysAD interface.
Features: (cont.) • 32-bit or 64-bit multiplexed system address/data bus for optimum price/performance. • Available with 32-bit or 64-bit external bus interface. • Supports fractional clock ratios. • JTAG boundary scan. • Integrated primary caches: • 32 KB instruction cache and 32 KB data cache, each 2-way set associative. • Virtually indexed & physically tagged.
Features: (cont.) • Write-back and write-through on a per-page basis. • Index address modes (register + register). • Pipeline restart on first doubleword for data cache misses. • 64-bit MIPS instruction set architecture. • Floating-point multiply-add instruction increases performance in signal processing and graphics applications. • Conditional moves to reduce branch frequency.
INSTRUCTION SET FOR MIPS 64-bit processors
BLOCK DIAGRAM FOR MIPS 64-bit processors
Block Diagram: It supports four floating-point multiply-add/subtract instructions which allow two separate floating-point computations to be performed with one instruction. The four instructions are : 1. Multiply-add (MADD) 2. Multiply-subtract (MSUB) 3. Negative Multiply-add (NMADD) 4. Negative Multiply-Subtract (NMSUB)
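
The arithmetic performed by the four instructions can be sketched in Python. This is an illustrative model only: the operand names `fr`, `fs`, and `ft` follow the usual MIPS assembly convention (product of `fs` and `ft` combined with `fr`), and hardware details such as rounding and exception behavior are not modeled.

```python
def madd(fr: float, fs: float, ft: float) -> float:
    """Multiply-add: (fs * ft) + fr."""
    return (fs * ft) + fr

def msub(fr: float, fs: float, ft: float) -> float:
    """Multiply-subtract: (fs * ft) - fr."""
    return (fs * ft) - fr

def nmadd(fr: float, fs: float, ft: float) -> float:
    """Negative multiply-add: -((fs * ft) + fr)."""
    return -((fs * ft) + fr)

def nmsub(fr: float, fs: float, ft: float) -> float:
    """Negative multiply-subtract: -((fs * ft) - fr)."""
    return -((fs * ft) - fr)

print(madd(1.0, 2.0, 3.0))   # 7.0
print(nmsub(1.0, 2.0, 3.0))  # -5.0
```

Each function combines a multiplication and an addition or subtraction, which is the sense in which one instruction performs two floating-point computations.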
Detailed Explanation (For Block Diagram) Index: 1) Large On-chip Caches 2) Dual Entry TLB 3) Write Buffer 4) Pipelining 5) Dual-Issue Mechanism 6) Dedicated Integer and FP ALUs 7) Separate FP Execution Units 8) Scalable for Multiple Processors 9) Secondary Cache Support 10) Multiple Cache Sizes 11) Simultaneous Access 12) Flexible Clocking Mechanism 13) On-chip Clock Multiplication Circuitry
Large On-chip Caches: (Detailed explanation- Block diagram) • The MIPS 64-bit processor contains separate 32 KB data and instruction caches. • Each cache is 2-way set associative, which helps to increase the hit rate over a direct-mapped implementation. • Cache lines may be classified as write-through or write-back on a per-page basis. • Both caches are virtually indexed and physically tagged. • a) A virtually indexed cache allows the cache access to begin as soon as the virtual address is generated, instead of waiting for the virtual-to-physical translation. The cache is accessed at the same time as the address translation is performed. The physical address is then compared against the corresponding instruction or data cache tag. If the tags match, the data retrieved from the cache is used. If they do not match, meaning that the requested address does not reside in the cache, the data is discarded and a cache miss is generated.
Large On-chip Caches: (cont.) (Detailed explanation- Block diagram) • b) A physically tagged cache allows for coherency between the primary and secondary caches in a system. • Having large primary caches allows more of the application to be executed on-chip, reducing accesses to the slower secondary cache and main memory. This in turn reduces bus utilization and allows the application to run faster, since fewer off-chip accesses are required.
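
The virtually indexed, physically tagged lookup can be sketched as follows. This is a simplified model under stated assumptions: the 32-byte line size is assumed (the slides do not give one), the whole physical line address is stored as the tag for simplicity, and the class name `VIPTCache` is hypothetical. Address translation is left to the caller, which supplies both addresses.

```python
LINE_BYTES = 32            # assumption: line size is not stated in the slides
WAYS = 2                   # 2-way set associative
CACHE_BYTES = 32 * 1024    # 32 KB
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # 512 sets

class VIPTCache:
    """Toy tag store for a virtually indexed, physically tagged cache."""

    def __init__(self):
        # One tag slot per way in every set; None means the way is empty.
        self.tags = [[None] * WAYS for _ in range(SETS)]

    def set_index(self, vaddr: int) -> int:
        # The set is chosen from the VIRTUAL address, so the access can
        # start before translation has finished.
        return (vaddr // LINE_BYTES) % SETS

    def lookup(self, vaddr: int, paddr: int) -> bool:
        # The stored tags are compared against the PHYSICAL address.
        return (paddr // LINE_BYTES) in self.tags[self.set_index(vaddr)]

    def fill(self, vaddr: int, paddr: int, way: int) -> None:
        self.tags[self.set_index(vaddr)][way] = paddr // LINE_BYTES

cache = VIPTCache()
print(cache.lookup(0x1234, 0x9234))   # False: cold miss
cache.fill(0x1234, 0x9234, way=0)
print(cache.lookup(0x1234, 0x9234))   # True: same set, matching physical tag
```

A lookup that lands in the same set but carries a different physical address still misses, which is the "compare is not valid" case described above.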
Dual Entry TLB: (Detailed explanation- Block diagram) • The TLB of the MIPS 64-bit processor contains 48 dual entries. This implementation is equivalent to a 96-entry TLB. • Each virtual page number entry maps to two physical frame numbers (PFNs), one even and one odd. • The lowest bit of the Virtual Page Number determines whether the even or the odd PFN is used. • The TLB is fully associative.
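
A minimal sketch of the even/odd selection, assuming 4 KB pages (the slides do not state a page size) and modeling the fully associative search with a dictionary. The class and method names are hypothetical, and replacement policy is not modeled.

```python
PAGE_BITS = 12  # assumption: 4 KB pages

class DualEntryTLB:
    """Toy model: each entry covers a pair of adjacent virtual pages."""
    CAPACITY = 48  # 48 dual entries, equivalent to 96 page mappings

    def __init__(self):
        # Key: virtual page number without its lowest bit.
        # Value: (even_pfn, odd_pfn).
        self.entries = {}

    def insert(self, vpn_pair: int, even_pfn: int, odd_pfn: int) -> None:
        if vpn_pair not in self.entries and len(self.entries) >= self.CAPACITY:
            raise RuntimeError("TLB full (replacement is not modeled)")
        self.entries[vpn_pair] = (even_pfn, odd_pfn)

    def translate(self, vaddr: int):
        vpn = vaddr >> PAGE_BITS
        pair = self.entries.get(vpn >> 1)
        if pair is None:
            return None                      # TLB miss
        pfn = pair[vpn & 1]                  # lowest VPN bit picks even or odd
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        return (pfn << PAGE_BITS) | offset

tlb = DualEntryTLB()
tlb.insert(0x10, even_pfn=0xAAA, odd_pfn=0xBBB)
print(hex(tlb.translate(0x20123)))  # 0xaaa123 (even page)
print(hex(tlb.translate(0x21123)))  # 0xbbb123 (odd page)
```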
Write Buffer: (Detailed explanation- Block diagram) • Writes to external memory pass through the write buffer. • The write buffer holds up to four 64-bit address and data pairs, or one cache line to be written out. • Since data cache writebacks are typically performed on a line basis, an entire line can be written to the buffer, allowing the CPU to resume normal execution. • Without a write buffer, the CPU would have to write a single 64-bit doubleword, then wait until the memory operation completes before writing another.
Write Buffer: (cont.) (Detailed explanation- Block diagram) • The write buffer allows the CPU to write data into the buffer without accessing the system bus. • For uncached write cycles, the write buffer can significantly increase performance by allowing the pipelining of multiple writes. • With cacheable write cycles, the buffer allows the CPU to write data to the buffer and immediately begin processing the next write data. • Without the buffer, the CPU would output the write data, then be forced to wait until the uncached write operation has completed before processing the next write.
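
The decoupling described above can be sketched as a four-entry FIFO. This is a behavioral illustration only; the class and method names are hypothetical and bus timing is not modeled.

```python
from collections import deque

class WriteBuffer:
    """Toy four-entry write buffer holding (address, data) pairs."""
    DEPTH = 4

    def __init__(self):
        self.pending = deque()

    def cpu_write(self, addr: int, data: int) -> bool:
        """CPU side: returns True if the write was accepted without stalling."""
        if len(self.pending) >= self.DEPTH:
            return False          # buffer full, the CPU would have to stall
        self.pending.append((addr, data))
        return True

    def drain_one(self):
        """Bus side: retire the oldest pending write, or None if empty."""
        return self.pending.popleft() if self.pending else None

buf = WriteBuffer()
accepted = [buf.cpu_write(0x1000 + 8 * i, i) for i in range(5)]
print(accepted)           # [True, True, True, True, False]
print(buf.drain_one())    # (4096, 0): the oldest write goes out first
```

The CPU only waits when all four slots are occupied; otherwise it continues while the bus side drains the queue in order.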
Pipelined Writes: (Detailed explanation- Block diagram) • Write cycles can be performed back-to-back without any dead clocks between cycles. • In the original R4000 architecture there is a two clock delay between the generation of back-to-back addresses. This results in two dead clocks between back-to-back cycles. • The pipelined write protocol also uses the write buffer to allow pipelining of write cycles. • In the MIPS 64 bit processor, performance is significantly increased by eliminating the two null cycles between each write cycle.
Pipelining: (Detailed explanation- Block diagram) • A pipeline is divided into four stages: • Fetch • Arithmetic operation • Memory access • Write back [Figures: a non-pipelined execution; a pipelined execution]
Pipelining (cont.) • In the non-pipelined example shown in the figure, each stage takes one processor clock cycle to complete. • Thus it takes four clock cycles (ignoring delays or stalls) for an instruction to complete. In this example, the execution rate is one instruction every four clock cycles. • Because the next instruction cannot be fetched until the current one completes, only one stage is active at any time.
Parallel Pipelining • Instead of waiting for an instruction to complete before the next instruction can be fetched, a new instruction is fetched each clock cycle. • There are four stages in the pipeline, so four instructions can be in execution simultaneously, one at each stage. • Instructions in the pipelined figure are executed at a rate four times that of the non-pipelined example in the previous figure.
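
The factor of four can be checked with a short calculation. A minimal sketch, assuming an ideal four-stage pipeline with one clock per stage and no stalls (the function names are illustrative):

```python
STAGES = 4  # fetch, arithmetic operation, memory access, write back

def cycles_non_pipelined(n: int) -> int:
    """Each instruction runs all four stages before the next one starts."""
    return n * STAGES

def cycles_pipelined(n: int) -> int:
    """The first instruction takes four cycles, then one completes per cycle."""
    return STAGES + (n - 1) if n > 0 else 0

n = 100
print(cycles_non_pipelined(n))  # 400
print(cycles_pipelined(n))      # 103
```

For long instruction sequences the ratio of the two counts approaches four, which is the steady-state rate quoted above; the first few cycles are spent filling the pipeline.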
SuperPipeline • The figure shows a superpipelined architecture. • Each stage is designed to take only a fraction of an external clock cycle, in this case half a clock. • Therefore more than one instruction can be completed each cycle.
SuperScalar Pipeline • A superscalar architecture also allows more than one instruction to be completed each clock cycle.
How Pipelining Works: • The processor fetches and decodes four instructions per cycle and then appends each of them to one of the three instruction queues. • Each queue determines the execution order based on the availability of the required functional units (FUs). • Although instructions are initially fetched and decoded in order, the processor can have up to 32 instructions in various stages of execution.
How Pipelining Works: (cont.) • Initially, instructions proceed through the instruction fetch pipeline, which consists of fetch, decode, and issue stages: • in the fetch stage, four instructions are fetched and aligned. • in the decode stage, the instructions are decoded, register renaming is performed, and branch instructions are predicted. • in the issue stage (first half), the instructions are written to one of three 16-entry instruction queues, and the availability of the operands is determined. • (the second half is on the next slide)
How Pipelining Works: (cont.) • Depending on its type, the instruction proceeds to one of the five execution pipelines. • There are two integer pipelines, two floating-point pipelines, and one load/store pipeline. • Each of these pipelines begins when a queue issues an instruction and continues as follows: • in the issue stage (second half), the processor reads operands from the register files, • execution then begins and takes • a) one stage in the case of the integer pipelines • b) two stages in the case of the load/store pipeline • c) three stages in the case of the floating-point pipelines
Floating-point Co-processor: • Performance is gained on floating-point codes by allowing the integer unit to execute the necessary loads and stores of floating-point values, as well as index register updates and branching. • The issue logic allows the dual issue of an integer instruction and a floating-point instruction.
Dual Issue Mechanism: (Detailed explanation- Block diagram) • The dual-issue mechanism implemented in the 64-bit MIPS processor allows a floating-point ALU instruction to be issued simultaneously with any other instruction type. • Whenever a floating-point ALU instruction is fetched with any non-FP-ALU instruction, both instructions can be issued in the same cycle. • Load and store instructions in one pipeline usually provide enough data bandwidth to permit a new instruction to be issued every cycle for a fixed period. • Well-structured code can take full advantage of this pipeline structure.
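
The pairing rule can be illustrated with a small issue-cycle counter. This is a simplified sketch: it pairs adjacent instructions greedily and in order, ignores data dependencies, fetch alignment, and stalls, and the instruction class labels (`"fp_alu"`, `"int"`, `"load"`) are made up for the example.

```python
def issue_cycles(stream: list) -> int:
    """Count issue cycles when one FP-ALU instruction may pair with
    one non-FP-ALU instruction in the same cycle."""
    cycles = 0
    i = 0
    while i < len(stream):
        has_next = i + 1 < len(stream)
        # A pair dual-issues only if exactly one of the two is an FP-ALU op.
        if has_next and (stream[i] == "fp_alu") != (stream[i + 1] == "fp_alu"):
            i += 2
        else:
            i += 1
        cycles += 1
    return cycles

print(issue_cycles(["fp_alu", "int", "fp_alu", "load"]))  # 2
print(issue_cycles(["int", "int"]))                       # 2
```

Interleaving floating-point ALU operations with loads, stores, and integer instructions is what lets well-structured code approach two instructions per cycle in this model.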
Dedicated Integer & FP ALU: (Detailed explanation- Block diagram) • Separate integer and FP ALUs allow instructions of both types to be performed simultaneously. • Integer instructions are not stalled while long-latency floating-point operations are being executed. • Use: running CAD-type applications, which perform both fixed-point and floating-point calculations.
Scalable for Multiple Processors: (Detailed explanation- Block diagram) • The 64-bit MIPS processor incorporates 8 external signals. • These signals allow for arbitration and data coherency between processors. • Therefore, symmetric multiprocessing systems implementing the full Modified, Exclusive, Shared, Invalid (MESI) cache consistency protocol in both primary and secondary caches are supported, as well as other styles of multiprocessing.
Separate FP Execution Units: (Detailed explanation- Block diagram) • In addition to the dual-issue mechanism, the 64 bit MIPS processor also contains separate acceleration hardware for most floating-point ALU instructions. • This allows long-latency operations such as divide and square-root to be performed in a dedicated unit, thereby allowing other shorter-latency operations such as MADD and subtract to be overlapped while the divide or square-root operation is in progress.
Secondary Cache Support: (Detailed explanation- Block diagram) • The 64-bit MIPS processor contains a dedicated secondary cache interface. • These signals provide an efficient interface between the processor, the secondary cache, and the secondary cache tag RAM. • All RAM interface signals, such as data and chip enables, output enable, address match, cache valid, line index, and word index, are provided by the processor. • The secondary cache also supports multiple cache sizes and both the write-through and write-back data transfer protocols. • Data transfers to the secondary cache share the 64-bit system bus.
Multiple Cache Sizes: (Detailed explanation- Block diagram) • The secondary cache can be configured as 512 KB, 1 MB, or 2 MB, allowing large applications to run within the secondary cache and reducing the number of accesses to slower main memory. • The secondary cache is accessed through the system bus. • Uncached bus cycles are not evaluated by the secondary cache control logic as they travel to the external agent. • Uncached operations such as video screen updates can be passed directly to the system logic responsible for routing the data to the screen without any delays from the secondary cache logic.
Simultaneous Access: (Detailed explanation- Block diagram) • To maximize data throughput, a main memory access can be initiated while the secondary cache tag is being compared. • If the requested address is found in the secondary cache, the memory access is aborted. If it is not found, the main memory access is already under way, so the data can be retrieved more quickly.
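
The benefit on a secondary cache miss can be expressed as a simple latency comparison. The cycle counts used below are hypothetical placeholders chosen for illustration, not figures from any datasheet, and the function names are likewise illustrative.

```python
def miss_latency_sequential(tag_compare: int, memory: int) -> int:
    """Memory access starts only after the tag compare reports a miss."""
    return tag_compare + memory

def miss_latency_overlapped(tag_compare: int, memory: int) -> int:
    """Memory access starts at the same time as the tag compare."""
    return max(tag_compare, memory)

# Hypothetical costs in clock cycles.
TAG_COMPARE, MEMORY = 2, 20
print(miss_latency_sequential(TAG_COMPARE, MEMORY))  # 22
print(miss_latency_overlapped(TAG_COMPARE, MEMORY))  # 20
```

On a hit the speculative memory access is simply aborted, so the overlap hides the tag-compare time on misses without adding latency on hits in this simplified model.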
Flexible Clocking Mechanism: (Detailed explanation- Block diagram) • The clocking mechanism in the 64 bit MIPS processor offers a number of pipeline frequencies based on the frequency of the input clock. • Single External Clock Signal • A single clock signal is used for the system interface, as opposed to three. The processor eliminates the Rclock, Tclock, and MasterOut clock signals that existed in the previous processors. • Having only one clock simplifies system design, as well as reducing the circuit complexity of the internal clock mechanism.
On Chip Clock Multiplication Circuitry: (Detailed explanation- Block diagram) • The 64 bit processor includes on-chip clock frequency multiplication circuitry to support 200-MHz internal operation from an external 50-MHz clock. • The processor has the option of operating internally at 2, 3, or 4 times the frequency of the external clock. • Maximum bus speed of the system interface is 100 MHz.
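
The relationship between the external and internal clocks is a plain multiplication. A minimal sketch, with an illustrative function name and only the multipliers listed above:

```python
ALLOWED_MULTIPLIERS = (2, 3, 4)

def internal_clock_mhz(external_mhz: float, multiplier: int) -> float:
    """Internal pipeline frequency for a given external clock and multiplier."""
    if multiplier not in ALLOWED_MULTIPLIERS:
        raise ValueError("multiplier must be 2, 3, or 4")
    return external_mhz * multiplier

for m in ALLOWED_MULTIPLIERS:
    print(m, internal_clock_mhz(50, m))
# 2 100
# 3 150
# 4 200
```

With a 50 MHz external clock, the 4x setting gives the 200 MHz internal operation mentioned above.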
Advantages: It can handle more memory and larger files. A 64-bit architecture allows systems to address 1 terabyte (about 1000 GB) of memory or more. 64-bit machines can also offer faster I/O to devices such as hard disk drives and video cards. These features can greatly increase system performance.
Disadvantages: The same data can occupy more space in memory, because pointers and some other data types are wider. This increases the memory requirements of a given process and can reduce the efficiency of processor cache utilization. 64-bit systems sometimes lack equivalents of software written for 32-bit architectures. The most severe problem is incompatible device drivers: although most software can run in a 32-bit compatibility mode, it is usually impossible to run a driver in that mode.