Welcome to the Presentation • Pang Kee Yeoh • Indraneel Mitra • Majid Jameel
Presentation Overview • Features of Itanium • Future of Itanium • Competition for Itanium
Intel Itanium Architecture • Itanium is a new processor family and architecture, designed by Intel and HP with the future of high-end servers and workstations in mind.
Features of Itanium • 64-bit addressing • EPIC (Explicit Parallel Instruction Computing) • Wide Parallel Execution core • Predication • FPU, ALU and Rotating registers • Large fast Cache • High Clock Speed • Scalability • Error Handling • Fast Bus Architecture
Itanium Specifications • Physical Characteristics • 25.4M transistors • 0.18-micron CMOS process • 6 metal layers • C4 (flip-chip) assembly technology • 1012-pad organic land grid array • 733MHz and 800MHz initial release clock speeds
Itanium Specifications Cont… • Instruction Dispersal • 2 bundle dispersal windows • 3 instructions per bundle • 9 function unit slots • 2 integer slots • 2 floating point slots • 2 memory slots • 3 branch slots • Maximum of 6 instructions issued each cycle
Itanium Specifications Cont… • Floating Point Units • 2 extended and double precision FMACs (Floating-point Multiply-Accumulate units) • 4 double or single precision operations per clock maximum • 3.2 GFLOPS of peak double precision floating point performance at 800MHz • 2 additional single precision FMACs • 4 single precision operations per clock maximum • 6.4 GFLOPS of peak single precision floating point performance total at 800MHz
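The peak figures on this slide follow directly from operations per clock multiplied by clock speed. A minimal sketch of that arithmetic, using only the numbers quoted above (the function and variable names are illustrative, not from the presentation):

```python
def peak_gflops(ops_per_clock: int, clock_mhz: float) -> float:
    """Peak floating-point throughput in GFLOPS for a given issue rate and clock."""
    return ops_per_clock * clock_mhz / 1000.0


CLOCK_MHZ = 800

# From the slide: the two main FMACs give 4 operations per clock.
main_ops_per_clock = 4
# From the slide: the two additional single-precision FMACs give 4 more.
extra_sp_ops_per_clock = 4

dp_peak = peak_gflops(main_ops_per_clock, CLOCK_MHZ)
sp_peak = peak_gflops(main_ops_per_clock + extra_sp_ops_per_clock, CLOCK_MHZ)

print(f"Peak double precision: {dp_peak} GFLOPS")
print(f"Peak single precision: {sp_peak} GFLOPS")
```

These are theoretical maxima; sustained throughput depends on how well the compiler keeps all the units busy.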
Itanium Specifications Cont… • Integer and Branch Units • 4 single cycle integer ALUs • 4 MMX units • 3 branch units
Itanium Specifications Cont… • Level 3 Cache • Off-die in two or four chips • 2MB or 4MB • Runs at core clock • 4-way set associative • Up to 294.8 million transistors • 128-bit bus • 21+ cycle latency
Itanium Specifications Cont… • Level 2 Cache • On-die • 96k of full-speed cache • 6-way set associative • 256-bit bus • 6-cycle + latency
Itanium Specifications Cont… • Level 1 Cache • On-die • 16k instruction cache • 4-way set associative • 16k integer only data cache • 2-cycle + latency
Itanium Specifications Cont… • x86 Compatibility • Hardware decoder turns x86 instructions into EPIC instructions • Dynamic scheduler optimizes x86 for EPIC micro-architecture • Shared cache • Shared execution core
64-bit addressing • EPIC processors are capable of addressing a 64-bit memory space. In comparison, 32-bit x86 processors access a relatively small 32-bit address space, or up to 4GB of memory. • A 32-bit memory space may be a limiting factor to performance. A 64-bit space gives the Itanium the memory addressing ability needed to meet current and foreseeable future high-end processing needs.
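The 4GB limit is simply the number of distinct byte addresses a 32-bit pointer can hold. A short sketch of the arithmetic (the helper name is illustrative):

```python
GIB = 2 ** 30  # bytes in one binary gigabyte


def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a flat address of the given width."""
    return 2 ** address_bits


bytes_32 = addressable_bytes(32)
bytes_64 = addressable_bytes(64)

print(f"32-bit: {bytes_32 // GIB} GB")
print(f"64-bit: {bytes_64 // GIB} GB")
print(f"A 64-bit space is {bytes_64 // bytes_32} times larger")
```

This is the size of the architectural address space; how much physical memory a given system can actually hold is decided by the chipset.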
64-bit addressing cont… • Through bank switching, x86 processors, such as the Intel Pentium III Xeon and the AMD Athlon, can address more than 4GB of memory. Unfortunately, there is hardware and software overhead to bank switching that harms performance and increases complexity.
64-bit addressing cont… • The first generation of Itanium systems, using the 460GX chipset, will be expandable with up to 64GB of memory. Generations beyond that will be able to take more memory. Higher end Itanium systems designed by the likes of SGI, IBM and HP should eventually be able to take far more than 64GB. • While it may be hard to imagine 4GB or even 64GB of memory being a bottleneck to performance, when one considers SGI has mentioned plans to eventually build machines using 512 Itanium processors accessing more than a terabyte of data in main memory, 64GB of memory, let alone 4GB, begins to look rather small.
EPIC • A new computer architecture standard introduced by Intel with its new Itanium architecture • Previously, computer architectures consisted only of RISC, CISC and VLIW • EPIC uses complex instructions in addition to basic instructions. A complex instruction includes information on how to run it in parallel with other instructions. • The compiler puts EPIC instructions together in groups of three called bundles.
EPIC cont… • A bundle is a three-instruction-wide word, which improves instruction-level parallelism. Each bundle contains three instructions and a template field, which are set during code generation by a compiler or the assembler. • Bundles are then sent to the CPU. • In the CPU, the instructions from bundles are put together into instruction groups. • An instruction group is a set of instructions which do not have “read after write or write after write dependencies between them and may execute in parallel.” This means that the instructions in a group do not affect each other through the data they are working on, so they can run together without getting in each other's way.
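To make the bundle layout concrete, here is a small sketch that packs three instruction slots and a template field into one word. The field widths (a 5-bit template and three 41-bit slots, 128 bits in total) come from the published IA-64 bundle format rather than from these slides, and the function names are hypothetical:

```python
TEMPLATE_BITS = 5   # assumption: width taken from the IA-64 bundle format
SLOT_BITS = 41      # assumption: width taken from the IA-64 bundle format
SLOTS_PER_BUNDLE = 3


def pack_bundle(template: int, slots: list) -> int:
    """Pack a template and three instruction encodings into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in the template field")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template  # the template occupies the lowest bits
    for i, insn in enumerate(slots):
        if not 0 <= insn < (1 << SLOT_BITS):
            raise ValueError(f"instruction {i} does not fit in a slot")
        bundle |= insn << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int):
    """Split a packed bundle back into its template and three slots."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


packed = pack_bundle(0x10, [0x123, 0x456, 0x789])
print(unpack_bundle(packed))
```

The template values and instruction encodings used here are arbitrary placeholders; in a real bundle the template tells the hardware which kind of unit each slot needs and where instruction groups end.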
EPIC cont… • In any given clock cycle, the processor executes as many instructions from one instruction group as its resources allow. • An instruction group must contain at least one instruction, but the number of instructions in an instruction group is not limited. • An instruction group ends either statically at a cycle break or dynamically at run time when a branch is taken.
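The rule that an instruction group contains no read-after-write or write-after-write dependencies can be sketched as a simple greedy splitter. This is an illustrative model only, not how a real IA-64 compiler schedules code; the `Insn` type and the register names are hypothetical:

```python
from typing import List, NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    name: str
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def split_into_groups(insns: List[Insn]) -> List[List[Insn]]:
    """Start a new group whenever an instruction would create a RAW or WAW
    dependency on a register already written in the current group."""
    groups: List[List[Insn]] = []
    current: List[Insn] = []
    written = set()
    for insn in insns:
        raw = any(src in written for src in insn.srcs)
        waw = insn.dest in written
        if current and (raw or waw):
            groups.append(current)   # this is where a cycle break would go
            current, written = [], set()
        current.append(insn)
        if insn.dest is not None:
            written.add(insn.dest)
    if current:
        groups.append(current)
    return groups


program = [
    Insn("add", "r1", ("r2", "r3")),
    Insn("sub", "r4", ("r5", "r6")),
    Insn("mul", "r7", ("r1", "r4")),   # reads r1 and r4: must start a new group
    Insn("add", "r1", ("r8", "r9")),   # only write-after-read on r1: may stay
]

for group in split_into_groups(program):
    print([insn.name for insn in group])
```

Note that a write-after-read relationship, as in the last instruction, does not force a break under the rule quoted above.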
EPIC cont… • In addition to grouping operations into instructions, the compiler handles several other important tasks that improve efficiency, parallelism and speed. • CISC puts most of the burden of scheduling instructions onto the CPU hardware. RISC gives some of this responsibility to the compiler. VLIW gives even more importance to the compiler. • EPIC improves on previous technology by adding branch hints, register stack and rotation, data and control speculation and memory hints. It also uses predication.
Predication • It is a compiling technique that optimises or removes branching code by reworking it so that much of the code runs in parallel. • It minimises the time it takes to run if – then – else situations and uses processor width to run both the ‘then’ and ‘else’ in parallel. • When the ‘if’ condition is determined, the result of the incorrect branch is discarded. • By removing branches and making code more parallel, predication reduces the number of cycles it takes to complete a task while making use of a wide processor.
Predication According to Jerry Huck of HP: • “Imagine that you are walking into the bank. You will make either a deposit or a withdrawal. The teller may predict you will make a withdrawal as they know you usually do, so they fill out a withdrawal form as you get in line. If you get to the front and make a withdrawal, all is well, but if you are there to make a deposit, the teller then has to fill out the deposit slip and the time it takes to complete the transaction increases. • With predication, the teller is ambidextrous and, when you get in line, they fill out both a withdrawal and a deposit slip, so that when you get to the front, no matter what task you intend on doing, the process will run without a hitch.”
Predication cont… • In the metaphor, predication is the teller's knowledge that they should fill out both the deposit and withdrawal forms before they know exactly what you want. The teller's ambidexterity, the ability to fill out both forms at once, is akin to the ability of an EPIC processor to run instructions in parallel. Predication removes the penalty of if – then – else and allows the if – then – else process to run in as few steps as possible. • A side benefit of predication is that removing branches causes fewer branch mispredictions. A branch misprediction requires the pipeline to be flushed, which is a very cycle-expensive procedure, so predication reduces wasted processor time.
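The idea of running both arms and keeping only the correct result can be sketched in ordinary code. The example below is an illustration of the concept, not Itanium code: the first function uses a branch, the second computes both arms and lets a pair of complementary predicates decide which result survives. The function names are hypothetical:

```python
def branchy_abs_diff(a: int, b: int) -> int:
    """Conventional version: the processor must guess which arm will run."""
    if a > b:
        return a - b
    else:
        return b - a


def predicated_abs_diff(a: int, b: int) -> int:
    """Branch-free version: both arms are computed, predicates select one."""
    p_then = a > b          # one compare produces the predicate...
    p_else = not p_then     # ...and its complement
    then_result = a - b     # 'then' arm, guarded by p_then
    else_result = b - a     # 'else' arm, guarded by p_else
    # Only the arm whose predicate is true contributes to the final value.
    return p_then * then_result + p_else * else_result


for a, b in [(7, 3), (3, 7), (5, 5)]:
    print(a, b, branchy_abs_diff(a, b), predicated_abs_diff(a, b))
```

On a wide machine the two arms can occupy otherwise idle execution slots, so the extra work costs little while the branch, and any misprediction penalty that comes with it, disappears.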
Wide Parallel Execution core • Itanium processors are very wide. • They are intended to run multiple instructions and operations in parallel. • Itanium processors will also be deep, with a ten-stage pipeline. • The first-generation Itanium processor will be able to issue six EPIC instructions in parallel every clock cycle. • The six-issue (two-bundle) scheduler disperses instructions into two integer slots, two floating-point slots, two memory slots and three branch slots, giving a total of nine dispersal slots.
Wide Parallel Execution core cont… • The number of slots of each type limits how many instructions of that type can be assigned in a single clock cycle. If instructions cannot be executed because too many slots of one type are filled, they are delayed until the next cycle. • This means that proper compiler design is crucial to how well the Itanium performs. • Backing up the Itanium's six-issue scheduler are eleven execution units: four integer, two floating-point, three branch and two load/store units.
Wide Parallel Execution core cont… • This helps support the various EPIC instructions that can launch more than one operation in a single instruction, such as SIMD floating-point operations. • Combined with the EPIC instruction set, the Itanium can execute up to 20 operations in a single cycle when doing some floating-point-intensive tasks.
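The effect of the per-type slot limits described above can be sketched with a toy in-order dispersal model. It uses the slot counts from the specification slide (2 integer, 2 floating-point, 2 memory, 3 branch, at most 6 instructions per cycle) and deliberately ignores bundle templates, dependencies and the real dispersal rules; the names are hypothetical:

```python
SLOTS_PER_CYCLE = {"I": 2, "F": 2, "M": 2, "B": 3}  # from the specification slide
MAX_ISSUE = 6                                        # six instructions per cycle


def disperse(insn_types):
    """Assign instructions to cycles in order, starting a new cycle whenever
    the issue width or the slots of the needed type are exhausted."""
    cycles = []
    current = []
    used = dict.fromkeys(SLOTS_PER_CYCLE, 0)
    for kind in insn_types:
        if len(current) == MAX_ISSUE or used[kind] == SLOTS_PER_CYCLE[kind]:
            cycles.append(current)
            current = []
            used = dict.fromkeys(SLOTS_PER_CYCLE, 0)
        current.append(kind)
        used[kind] += 1
    if current:
        cycles.append(current)
    return cycles


well_mixed = ["M", "I", "F", "M", "I", "B"]
integer_heavy = ["I", "I", "I", "F", "M", "B"]

print(disperse(well_mixed))
print(disperse(integer_heavy))
```

The well-mixed sequence fits in one cycle, while the integer-heavy one spills into a second cycle because only two integer slots are available. This is why the instruction mix chosen by the compiler matters so much.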
FPU, ALU and Rotating Registers • FPU • The Itanium contains 4 pipelined FMAC (Floating-point Multiply-Accumulate) units in total. Two are extended- and double-precision FMACs, which give 3.2 GFLOPS of peak double-precision performance at 800MHz. There are an additional two FMACs tuned for 3D applications. They are each capable of processing up to two single-precision floating-point operations per clock. That yields another 3.2 GFLOPS of single-precision processing power. All together, the Itanium has a theoretical max of 6.4 GFLOPS of single-precision floating-point processing power.
FPU, ALU and Rotating Registers cont… • ALU • There are four pipelined ALUs (Arithmetic Logic Unit) in the original Itanium. Each can process one integer calculation per cycle. They can also process MMX type instructions. While the Itanium has the potential to be a massive floating-point powerhouse, its integer performance also has tremendous potential.
FPU, ALU and Rotating Registers cont… • Plentiful Registers • The Itanium will come with 128 floating point and 128 integer registers. When processing up to 20 operations in a single clock, the registers give plenty of room for data inside the processor. This reduces the chances of the execution of an instruction being delayed because data could not be held locally. This is especially important since the Itanium can process up to eight floating-point operations in a single clock. With the possibility of eight operations running in a single clock, having too few registers could be a serious bottleneck.
FPU, ALU and Rotating Registers cont… • The registers also have the ability to rotate. Rotating registers allows the processor to perform an operation on multiple software accessible registers in turn. • This increases CPU pipeline utilization and efficiency when dealing with streams of data to process.
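Register rotation can be pictured as renaming: the program keeps using the same register names, but a rotating base changes which physical register each name refers to. The sketch below is a simplified model under that assumption (the class, its size and its method names are hypothetical); after each rotation, a value written under name 0 becomes visible under name 1, then 2, and so on:

```python
class RotatingRegisters:
    """Toy model of a rotating register file with a rotating base (rrb)."""

    def __init__(self, size: int):
        self.physical = [0] * size
        self.rrb = 0  # rotating register base

    def _index(self, logical: int) -> int:
        # Map a software-visible register name to a physical register.
        return (logical + self.rrb) % len(self.physical)

    def read(self, logical: int) -> int:
        return self.physical[self._index(logical)]

    def write(self, logical: int, value: int) -> None:
        self.physical[self._index(logical)] = value

    def rotate(self) -> None:
        # Decrementing the base shifts every value up by one name.
        self.rrb = (self.rrb - 1) % len(self.physical)


regs = RotatingRegisters(4)
for item in [10, 20, 30]:
    regs.write(0, item)   # each loop iteration loads into the same name
    regs.rotate()         # older values move on to names 1, 2, ...

print([regs.read(i) for i in range(4)])
```

Because every iteration writes to the same name without overwriting earlier values, several iterations of a loop can be in flight at once, which is what keeps the pipeline busy on streams of data.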
Large Fast Cache • When a processor is waiting for data or instructions, time is wasted. The longer it takes for data and instructions to get to the CPU, the worse it gets. When data and instructions are in cache, the processor can grab them much quicker than when having to go to slow main memory. Not only is cache latency much lower than DRAM latency, the bandwidth is much higher.
Large Fast Cache cont… • There are some tricky programming techniques in use out there to keep often-used data and instructions in cache, and they are not the kind of techniques you learn in your high school BASIC course. • Still, the easiest way to keep data and instructions in cache is to have a lot of cache to keep them in. Intel knew that when they designed the Itanium.
Large Fast Cache cont… • The Itanium has three levels of cache. L1 and L2 are on-die while L3 is on the cartridge. According to Intel, the L3 cache weighs in at 2MB or 4MB of four-way set associative cache on two or four 1MB chips. • IDC reports that the L2 cache is 96KB in size, and the L1 cache, which does not deal with floating-point data, has a 16KB integer data cache and a 16KB instruction cache.
Large Fast Cache cont… • The (4MB) level three cache, with its 294.8 million transistors, runs at the full processor speed, giving 12.8GBps of bandwidth at 800MHz. • With 2MB or 4MB of L3 cache on the Itanium, the chances of the required data and instructions being in cache are quite good, so bus traffic can be reduced and performance increases. With six pipelines hungry for instructions and data, the Itanium needs all the cache it can get.
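The 12.8GBps figure is consistent with the 128-bit L3 bus listed in the specifications running at the 800MHz core clock. A quick check of that arithmetic, assuming one transfer per clock (the function name is illustrative):

```python
def peak_bandwidth_gbps(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in GB/s, assuming one transfer per clock cycle."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz / 1000


l3_bandwidth = peak_bandwidth_gbps(bus_width_bits=128, clock_mhz=800)
print(f"L3 cache bus: {l3_bandwidth} GB/s")
```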
Large Fast Cache cont… • To make caching even more effective, Intel uses data speculation and cache hints. Data speculation means fetching data early, before it is known whether the data will be needed or whether it will be changed before it is needed. If the data turns out to be needed and has not changed, the CPU does not have to take the latency hit of fetching it at that point. • The processor, with the help of compiled instructions, looks ahead, anticipates what info it may need, and then brings it to cache or into the processor. This helps hide memory latency. Cache hints are two-bit markers for memory loads set by the compiler that help the CPU find data in cache. This improves the speed of retrieving data from cache.
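Data speculation as described above amounts to "load early, check later". The following is a rough sketch of that pattern, loosely modelled on the idea of an early load that is tracked until a later check; the class and method names are hypothetical and this is not a description of the actual hardware mechanism:

```python
class SpeculativeLoader:
    """Toy model: remember which addresses were loaded early and forget them
    when a store touches the same address in the meantime."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.tracked = set()   # addresses with a still-valid early load

    def early_load(self, addr):
        self.tracked.add(addr)
        return self.memory[addr]

    def store(self, addr, value):
        self.memory[addr] = value
        self.tracked.discard(addr)   # the early-loaded value is now stale

    def check(self, addr, speculative_value):
        if addr in self.tracked:
            return speculative_value  # speculation held: no extra latency
        return self.memory[addr]      # speculation failed: load again


memory = {0x10: 1, 0x20: 2}
cpu = SpeculativeLoader(memory)

value = cpu.early_load(0x10)       # hoisted above the store below
cpu.store(0x20, 5)                 # different address: no conflict
print(cpu.check(0x10, value))      # early value is still good

value = cpu.early_load(0x20)
cpu.store(0x20, 9)                 # same address: early value is stale
print(cpu.check(0x20, value))      # check falls back to a fresh load
```

In the common case the check succeeds and the load latency has already been paid in the background; only when the data really did change is the slower path taken.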
Clock Speed • The first generation of Itanium processors will come in the first half of 2001 at 733MHz and 800MHz. The first generation's clock speed may not be particularly quick, but Intel has several generations ahead of the Itanium already in the works that should increase performance. • Intel claims they have plenty of clock headroom in the Itanium design and are aiming for a greater than 1GHz clock speed with their second generation Itanium processor, McKinley, which will have the L3 cache on-die.
Scalability • The Itanium was not designed for small systems; it is intended for 1 to 4000 processor workstations and servers. • There are several Itanium features designed to help with hardware scalability: a full-CPU-speed Level 3 cache bus, a large L3 cache, deferred-transaction support and flexible page sizes.
Scalability cont… • The full-CPU-speed Level 3 bus provides quick communication between CPUs. The large L3 cache reduces inter-CPU bus traffic by keeping data close to the CPU that needs it. • Deferred-transaction support can stop one CPU from getting in the way of another. Flexible page sizes, from 4KB to 256MB, give the Itanium family the flexibility to access small amounts of memory in small chunks and massive amounts of memory in massive chunks without the overhead of smaller page sizes.
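The overhead of small pages is easy to quantify: the smaller the page, the more pages the system has to track to cover the same memory. A short sketch using the 4KB and 256MB page sizes quoted above and the 64GB system size mentioned earlier in the presentation (the helper name is illustrative):

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def pages_needed(memory_bytes: int, page_bytes: int) -> int:
    """Number of pages required to cover the given amount of memory."""
    return -(-memory_bytes // page_bytes)  # ceiling division


small = pages_needed(64 * GB, 4 * KB)
large = pages_needed(64 * GB, 256 * MB)

print(f"64GB in 4KB pages:   {small:,} pages")
print(f"64GB in 256MB pages: {large:,} pages")
```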
Scalability cont… • The first generation Itanium chipset, the 460GX, will support up to four processors, and OEMs will be able to build eight-way and larger systems. • Successive generations of chipsets should be successively more scalable. Third party solutions should also increase scalability.
Error Handling • The Itanium will have extensive error handling capabilities. It features ECC and parity error checking on most processor caches and busses. • If a machine error occurs and a piece of data becomes corrupted, the ECC or parity checking will allow the machine to recognize the error, fix it if possible, or flag it as corrupted. • The processor also has the capability to kill an application or thread that has experienced a machine error without having to reboot.
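As a rough illustration of how parity checking recognizes corrupted data, the sketch below stores one extra bit per word and recomputes it on read. It shows simple even parity only; the function names are hypothetical and nothing here describes the Itanium's actual circuitry:

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def is_corrupted(word: int, stored_parity: int) -> bool:
    """Recompute parity on read and compare with the stored bit."""
    return parity_bit(word) != stored_parity


original = 0b10110010
stored = parity_bit(original)

one_bit_flip = original ^ 0b00000100
two_bit_flip = original ^ 0b00000110

print(is_corrupted(original, stored))
print(is_corrupted(one_bit_flip, stored))
print(is_corrupted(two_bit_flip, stored))
```

Parity catches a single flipped bit but cannot say which bit it was, and it misses a double flip entirely. Locating and fixing the bad bit is the job of ECC, which stores several check bits per word.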
Error Handling cont… • Chipset, OS, and system designers, which will include the likes of HP, IBM, Compaq, SGI, Microsoft and Intel, will bring out their own error handling and reliability processes that should further enhance Itanium-based server uptime to 99.9% and beyond.
Fast Bus Architecture • A major link in the food delivery system for the Itanium is the system bus. The Itanium will use a 2.1GBps multi-drop system bus to keep it well fed with data and instructions. We expect it will have a 128-bit 133MHz bus. • The memory subsystem and I/O will be determined by the chipset used. First generation systems should use dual-memory ported SDRAM giving 4.2GBps of memory bandwidth. Later generations will have the option to use DDR SDRAM or RDRAM.
Fast Bus Architecture cont… • Eventually, Intel plans on moving server platforms to DDR II. 64-bit, 66MHz PCI and AGP Pro (4x) should be common on Itanium motherboards, and support will be included in Intel's 460GX chipset.
Future • According to Intel, the EPIC architecture was designed with about 25 years of headroom for future development in mind. • McKinley will follow the original Itanium and will integrate its L3 cache onto the CPU die. McKinley will arrive in the first half of 2002. Madison may also arrive in 2002 on a .13-micron process. Deerfield will arrive not long after, also on a .13 process, at a lower price and performance level but with more performance for the dollar than Madison.
Future cont… • Furthermore it will offer larger amounts of L3 cache. • Deerfield will be positioned as a value part alongside Madison, much as the Celeron is positioned against the P3 today. It might be the CPU targeting consumer desktops.
Competition • Sun UltraSPARC • IBM PowerPC • Compaq’s Alpha • AMD’s Sledgehammer
Competition • Sun UltraSPARC • In 1995, 8 years after the first SPARCstation was introduced, Sun went 64-bit with the introduction of the UltraSPARC 1 RISC processor. The first model ran at 143MHz and had 128-bit datapaths. • In 1996, it became the first 64-bit CPU to incorporate multimedia extensions to handle complex 2D/3D graphics.
Competition cont… • In 1997, the UltraSPARC 2 was released at 250MHz, while the UltraSPARC 3 (with new 256-bit datapaths) was released in the second quarter of 2000. Other plans include an UltraSPARC 4, which will be pumped up to 1GHz, and an UltraSPARC 5, which will run at 1.5GHz.