AMD K7 Processor Architecture CMPE 511 prepared by Özsun S. Sönmez
Introduction • AMD K7 is the first 7th-generation PC CPU. The first six generations were the 8086, 80286, 80386, 80486, Pentium (AMD K5/K6) and Pentium II (AMD K6-2/K6-3). It is designed to operate above 500MHz. • AMD K7, also known as AMD Athlon, was introduced in the first half of 1999, and its architecture forms the basis for the subsequent Athlon XP versions until the release of K8 (AMD Hammer). • Its competitor, the Intel Pentium III, was released in the same year; the two processors are compared wherever possible throughout the presentation.
Main Features • Out-of-order, 3-way superscalar x86 microprocessor • 9 independent execution pipelines, with a 10-stage integer pipeline and a 15-stage FP pipeline: - 3 Integer Execution Units - 3 Address Calculation Units - 3 Floating Point Execution Units • 64kB instruction and 64kB data L1 caches • Integrated L2 cache controller supporting up to 8MB • Extended 3DNow! instructions
Main Features • K7 uses the Digital™ Alpha™ EV6 system bus interface. This is probably the most important architectural difference from previous generations. EV6 provides: - Data transfer on both rising and falling clock edges, doubling the effective bus speed - Scalability beyond a 200MHz clock (beyond 400MHz effective bus speed) - The highest bandwidth of its time: Athlon at 100MHz (x2) reaches 1.60 GB/s; PIII at 133MHz reaches about 1.06 GB/s - 72-bit data bus (64 data + 8 ECC) - Independent address bus able to address 8 terabytes - Independent snoop bus
Main Features – EV6 cont. - Low-voltage signaling for low-cost motherboard implementations: motherboards with GeForce, Dolby and Ethernet are available below $80. - Point-to-point topology with clock forwarding for scalable multiprocessing.
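The peak bandwidth figures for the two buses follow from clock rate, transfers per clock and data width. The sketch below is only that arithmetic, assuming 1 GB = 10^9 bytes and counting the 64 data bits but not the 8 ECC bits; the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(clock_mhz, transfers_per_clock, data_bits):
    """Peak bus bandwidth in GB/s, taking 1 GB as 10**9 bytes (assumption)."""
    bytes_per_transfer = data_bits // 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


# Athlon EV6 bus: 100 MHz clock, data on both clock edges, 64 data bits.
athlon_bus = peak_bandwidth_gb_s(100, 2, 64)

# Pentium III bus: 133 MHz clock, one transfer per clock, 64 data bits.
piii_bus = peak_bandwidth_gb_s(133, 1, 64)

print(f"Athlon: {athlon_bus:.2f} GB/s, PIII: {piii_bus:.2f} GB/s")
```

Under these assumptions the double-pumped 100MHz bus comes out at 1.60 GB/s and the 133MHz bus at roughly 1.06 GB/s.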
Cache Architecture • Separate L1 instruction and data caches • Both are 64kB, 64-bit, 2-way set associative and dual ported. The instruction cache has a 24-entry L1 TLB (32-entry for the data cache), and each has a 256-entry L2 TLB. • The instruction cache (IC) stores predecode information to assist the multiple instruction decoders. • The L2 cache controller can interface to up to 8MB of industry-standard SDR or DDR SRAM and provides a full tag for a 512kB cache or a partial tag for larger caches. The interface is 72 bits wide (64 data + 8 ECC).
Cache Competition • AMD Athlon (1999): - L1: 2x64kB, 64-bit, 2-way, 3-cycle latency, 64-byte lines - L2: 512kB, 64-bit, 2-way, 18-cycle latency, off-chip, 64-byte lines • Intel PIII Katmai (1999): - L1: 2x16kB, 64-bit, 4-way, 3-cycle latency, 32-byte lines - L2: 512kB, 64-bit, 2-way, 21-cycle latency, off-chip, 32-byte lines • Intel PIII Coppermine (1999): L2 changed to 256kB, 256-bit, 8-way, 4-cycle latency, on-chip • AMD Athlon Thunderbird (2000): L2 changed to 256kB, 64-bit, 16-way, 7-cycle latency, on-chip, with an exclusive cache structure, meaning that the L1 and L2 caches hold different data
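The size, associativity and line length quoted for each L1 cache determine how many sets it has and how an address splits into offset and index bits. The sketch below works this out for one 64kB Athlon L1 cache and one 16kB Katmai L1 cache, and adds up the distinct data an exclusive Thunderbird hierarchy can hold. It assumes power-of-two sizes; the function name is illustrative only.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, associativity and line length are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = sets.bit_length() - 1         # log2(sets)
    return sets, offset_bits, index_bits


athlon_l1 = cache_geometry(64 * 1024, ways=2, line_bytes=64)
katmai_l1 = cache_geometry(16 * 1024, ways=4, line_bytes=32)

# Exclusive hierarchy: L1 and L2 never duplicate a line, so capacities add.
thunderbird_unique_kb = 2 * 64 + 256

print("Athlon L1 (sets, offset bits, index bits):", athlon_l1)
print("Katmai L1 (sets, offset bits, index bits):", katmai_l1)
print("Thunderbird distinct cached data:", thunderbird_unique_kb, "kB")
```

For comparison, a hierarchy of the same sizes in which L2 also kept a copy of everything in L1 could hold at most 256kB of distinct data.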
Pipeline Architecture - Decoders - 3-way decoders convert x86 instructions into fixed-length “Macro-Ops” (MOPs) and send them to the Instruction Control Unit (ICU) - The ICU holds 72 entries vs. 20 entries in the PIII, giving superior out-of-order execution performance
Pipeline – Integer Execution Units • 3 Integer Execution Units (IEU), 3 Address Generation Units (AGU) • 15-entry integer scheduler • 24-entry, 32-bit register file with 9 read and 8 write ports
Pipeline - Floating Point Unit • Floating Point Units execute MMX, x87 (FP) and 3DNow! instructions • 36-entry FP scheduler • 88-entry, 90-bit register file with 5 read and 5 write ports • Some stages of the MUL pipeline may be unused during DIV/SQRT iterations. The ICU informs the FP scheduler in such cases so that there is sufficient time to schedule independent MULs in the unused cycles. - DIV by an exact power of two (2^n) or by zero takes 11 cycles
Pipeline – Load/Store Unit • 44-entry Load/Store queue • Data forwarding from stores to dependent loads
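Store-to-load forwarding can be pictured with a toy model: a load first checks the queue of stores that have not yet reached memory, and takes its data from the youngest store to the same address. This is a simplified sketch, not the K7 hardware design. It tracks only stores, matches whole addresses, and the class and method names are hypothetical; only the 44-entry capacity comes from the slide.

```python
from collections import deque


class LoadStoreQueue:
    """Toy model of store-to-load forwarding (illustrative, not K7 logic)."""

    def __init__(self, capacity=44):
        self.capacity = capacity
        self.pending_stores = deque()  # (address, value), oldest first

    def store(self, address, value):
        if len(self.pending_stores) >= self.capacity:
            raise RuntimeError("queue full: the pipeline would stall here")
        self.pending_stores.append((address, value))

    def load(self, address, memory):
        """Return (value, forwarded); search the youngest store first."""
        for store_address, value in reversed(self.pending_stores):
            if store_address == address:
                return value, True
        return memory.get(address, 0), False

    def retire_oldest(self, memory):
        """Commit the oldest pending store to memory."""
        address, value = self.pending_stores.popleft()
        memory[address] = value


memory = {0x200: 5}
lsq = LoadStoreQueue()
lsq.store(0x100, 7)
lsq.store(0x100, 9)
print(lsq.load(0x100, memory))  # served from the youngest pending store
print(lsq.load(0x200, memory))  # no matching store, read from memory
```

The dependent load gets its value without waiting for the store to be written to the cache, which is the benefit the slide refers to.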
Branch Prediction • Dynamic branch prediction logic is composed of: • Branch prediction table (BPT): two-way, 2048 entries (512 for PIII). The BPT stores the information used to predict the direction of conditional branches. • Branch target address table: stores the target addresses of conditional and unconditional branches.
Branch Prediction • Return address stack: 12 entries; optimizes CALL/RET instruction pairs • The BPT is accessed during the Fetch stage and the prediction is made during the Scan stage using the Smith prediction algorithm (2-bit counters) • Misprediction penalty is 10 cycles • Approximate rate of correct branch predictions: - AMD Athlon: 95% - Intel Pentium III: 90-92%
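A 2-bit saturating counter scheme of the kind named above can be sketched as follows. This is a minimal model under stated assumptions: the table is direct-mapped and indexed by the low bits of the branch address, and counters start in the weakly-not-taken state. The real BPT is two-way and its indexing is not described in the slides, so the class and its indexing are hypothetical.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (0-1 predict not taken, 2-3 taken)."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.counters = [1] * entries  # assumed start: weakly not taken

    def _index(self, pc):
        return pc % self.entries  # hypothetical indexing scheme

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch: taken 9 times, then falls through; the loop runs twice.
loop_branch = ([True] * 9 + [False]) * 2
misses = count_mispredictions(TwoBitPredictor(), 0x4000, loop_branch)
penalty_cycles = misses * 10  # 10-cycle misprediction penalty from the slide
print(misses, "mispredictions,", penalty_cycles, "penalty cycles")
```

Because a counter must be wrong twice before its prediction flips, a single loop exit does not disturb the prediction for the next run of the loop.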
3DNow! Technology • 3DNow! is a set of SIMD instructions designed to accelerate FP-intensive multimedia applications. • Instructions operate on two packed single-precision (32-bit) floating-point values simultaneously: Dst[63:32] = Dst[63:32] op Src[63:32]; Dst[31:0] = Dst[31:0] op Src[31:0]
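The two-lane operation above can be emulated in software to make the bit layout concrete. In this sketch a 64-bit register is a Python integer holding two single-precision values, with the low lane in bits 31:0. The helper names are invented for illustration and are not 3DNow! mnemonics, and the arithmetic is done in double precision and rounded to single precision on packing, which is an approximation of the hardware behaviour.

```python
import operator
import struct


def pack2(lo, hi):
    """Pack two floats into a 64-bit integer: lo in bits 31:0, hi in 63:32."""
    return int.from_bytes(struct.pack("<2f", lo, hi), "little")


def unpack2(reg):
    """Return (lo, hi) single-precision values from a 64-bit integer."""
    return struct.unpack("<2f", reg.to_bytes(8, "little"))


def packed_op(dst, src, op):
    """Apply op lane by lane, as in Dst[lane] = Dst[lane] op Src[lane]."""
    d_lo, d_hi = unpack2(dst)
    s_lo, s_hi = unpack2(src)
    return pack2(op(d_lo, s_lo), op(d_hi, s_hi))


dst = pack2(1.0, 2.0)
src = pack2(3.0, 4.0)
result = packed_op(dst, src, operator.add)
print(unpack2(result))
```

Passing `operator.mul` or `operator.sub` instead of `operator.add` gives the corresponding packed multiply or subtract on both lanes.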
3DNow! Technology • After significant code analysis, AMD engineers found two compelling implementation alternatives: - extending MMX with 3DNow! instructions - using wide registers separate from MMX, a 4-operand instruction format and support for multiply-accumulate (MAC). • Anything in between requires significantly greater hardware area or complexity without providing a corresponding performance benefit. • AMD chose the first alternative, which achieves most of the performance benefit with significantly less area and power. Since no additional registers are used, no new state is introduced, which preserves compatibility with existing operating systems. • The second alternative is implemented in the PowerPC G4 under the name AltiVec.
3DNow! Technology • Instead of division and square root, reciprocal and reciprocal square root are implemented in AMD K7, since they occur more often in multimedia applications. • MMX and 3DNow! instructions have a latency of at most 4 cycles (reached only by 3DNow! Add and Mul) and a throughput of 1 per cycle. This is much faster than single-precision FP division (13 cycles) and square root (16 cycles). • Using 2 FP pipelines simultaneously, maximum throughput is 4 FP operations per cycle.
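One common way to use a reciprocal in place of a divide is to start from a low-precision estimate of 1/b, refine it with Newton-Raphson steps, and multiply by the numerator. The sketch below shows that generic technique only; it is not claimed to be the circuit or instruction sequence used in the K7. The caller-supplied estimate stands in for whatever a hardware lookup would provide, and the function names are hypothetical.

```python
def refine_reciprocal(b, estimate, steps=2):
    """Improve an estimate of 1/b using Newton-Raphson: x <- x * (2 - b*x).

    Each step roughly squares the relative error of the estimate.
    """
    x = estimate
    for _ in range(steps):
        x = x * (2.0 - b * x)
    return x


def divide_via_reciprocal(a, b, estimate):
    """Compute a / b as a * (1/b) using only multiplies and subtracts."""
    return a * refine_reciprocal(b, estimate)


# 0.33 plays the role of a coarse table lookup for 1/3 (assumed seed).
quotient = divide_via_reciprocal(6.0, 3.0, estimate=0.33)
print(quotient)
```

The refinement uses only multiplies and subtracts, the kind of operation the slide gives a latency of at most 4 cycles for, rather than a 13-cycle divide.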
Conclusion • As the first 7th-generation CPU, AMD K7 was a major leap forward in CPU history. • It had both performance and cost advantages over the Intel PIII and started the competition that led to today’s AMD Athlon XP and P4 processors.