UltraSPARC III Hari P. Ananthanarayanan Anand S. Rajan
Presentation Outline • Background • Introduction to the UltraSPARC • Instruction Issue Unit • Integer Execute Unit • Floating Point Unit • Memory Subsystem
Introduction • 3rd generation of Sun Microsystems’ 64-bit SPARC V9 architecture • Design targets: • 600 MHz clock • 70 W power dissipation at 1.8 V • 0.25-micron process with 6 metal layers • Transistor count: 12 million (RAM) plus 4 million (logic) • Die size of 360 mm²
A Tour of the UltraSPARC • 14-stage pipeline • Instruction Issue Unit occupies stages A through J • Integer Execute Unit: stages R through D • Data Cache Unit: stages E through W • Floating Point Unit: stages E through D
Design Goals • Minimize latency of the integer execution path, which determines the cycle time: limit each stage to roughly 8 logic gate levels • Minimize performance loss from clock overhead, e.g. the on-chip caches are wave pipelined • Minimize branch misprediction latency through use of a miss queue
Instruction Issue Unit • UltraSPARC III is a static-speculation machine: the compiler lays the speculated path out sequentially, which places fewer demands on fetch • Stage A contains a small 32-byte buffer to support sequential prefetching into the instruction cache • I-cache access takes 2 cycles (stages P and F) and is wave pipelined
Instruction Issue Unit – Contd. • ITLB access and branch prediction are overlapped with the I-cache access • The target address is generated only in stage B and redirected back to stage A if the branch is taken • 20-entry instruction queue and 4-entry miss queue; the latter stores the alternate execution path to mitigate the cost of a misprediction • Stages I and J decode and dispatch instructions; scoreboarding checks for operand dependences (see the sketch below)
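A minimal sketch of the scoreboarding idea used at dispatch: an instruction waits while any of its source registers is still owned by an in-flight producer. The class, register count, and latencies are illustrative, not the actual UltraSPARC III implementation.

```python
# Register-scoreboard sketch: dispatch stalls while a source register is
# still owned by an in-flight producer (illustrative, not the real design).

class Scoreboard:
    def __init__(self, num_regs=32):
        # busy_until[r] = cycle at which register r's pending result is ready
        self.busy_until = [0] * num_regs

    def can_dispatch(self, srcs, cycle):
        """True if all source registers have their values available."""
        return all(self.busy_until[r] <= cycle for r in srcs)

    def dispatch(self, dest, latency, cycle):
        """Mark the destination register busy until its result is produced."""
        self.busy_until[dest] = cycle + latency


sb = Scoreboard()
sb.dispatch(dest=3, latency=2, cycle=0)        # e.g. a 2-cycle load into r3
print(sb.can_dispatch(srcs=[3], cycle=1))      # False: r3 not ready yet
print(sb.can_dispatch(srcs=[3], cycle=2))      # True: result available
```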
Branch Prediction Mechanism • Slightly modified Gshare algorithm with 16K saturating 2-bit counters; the three low-order index bits use PC information only • 8-cycle misprediction penalty, since the pipeline stages must be drained
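A sketch of the modified Gshare scheme described above: a 16K-entry table of 2-bit saturating counters indexed by PC XOR global history, except that the three low-order index bits come from the PC alone. The exact bit positions and history length are assumptions made for illustration.

```python
# Modified-gshare sketch: 16K 2-bit counters, index = PC xor history,
# with the three low-order index bits taken from the PC only.

TABLE_BITS = 14                      # 16K = 2^14 counters
counters = [1] * (1 << TABLE_BITS)   # 2-bit counters, start weakly not-taken
history = 0                          # global branch history register

def index(pc):
    pc_bits = (pc >> 2) & ((1 << TABLE_BITS) - 1)   # drop instruction-alignment bits
    hashed = pc_bits ^ (history & ((1 << TABLE_BITS) - 1))
    # the three low-order index bits use PC information only
    return (hashed & ~0b111) | (pc_bits & 0b111)

def predict(pc):
    return counters[index(pc)] >= 2                 # counter >= 2 predicts taken

def update(pc, taken):
    global history
    i = index(pc)
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)
    history = ((history << 1) | int(taken)) & ((1 << TABLE_BITS) - 1)
```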
Integer Execute Unit • Executes load, store, shift, arithmetic, logical, and branch instructions • 4 integer instructions per cycle: 2 arithmetic/logical/shift, 1 load/store, and 1 branch • The entire datapath uses dynamic precharge circuits; this is the E stage • Future-file technique to handle exceptions: working and architectural register files (WARF)
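A sketch of per-cycle issue-slot selection matching the mix above: up to two arithmetic/logical/shift instructions, one load/store, and one branch are picked in order from the head of the instruction queue. The queue format and the strictly in-order cut-off are assumptions for illustration.

```python
# Issue-slot selection sketch: pick instructions in order until a slot
# class would be oversubscribed (illustrative, not the actual grouping logic).

SLOT_LIMITS = {"alu": 2, "mem": 1, "branch": 1}

def select_group(queue):
    """Return the instructions issued this cycle, stopping at the first one
    whose slot class is already full (in-order issue)."""
    used = {cls: 0 for cls in SLOT_LIMITS}
    group = []
    for instr, cls in queue:
        if used[cls] >= SLOT_LIMITS[cls]:
            break
        used[cls] += 1
        group.append(instr)
    return group

q = [("add", "alu"), ("ld", "mem"), ("sub", "alu"), ("or", "alu"), ("br", "branch")]
print(select_group(q))   # ['add', 'ld', 'sub'] -- a third ALU op would exceed its slots
```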
Integer Execute Unit – Contd. • Integer execution reads its operands from the WRF in the R stage and writes results to it in the C stage • The ARF is copied into the WRF on an exception • Results are committed to the ARF at the end of the pipe • Integer multiply and divide are not pipelined and execute in the ASU; the strategy is to decouple less frequently executed instructions
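A minimal sketch of the working/architectural register file (WARF) idea from the two slides above: results are written early to the working file, committed to the architectural file at the end of the pipe, and the working file is restored from the architectural file when an exception flushes speculative state. Register count and function names are illustrative.

```python
# Future-file (WARF) sketch: WRF holds speculative results, ARF holds
# committed state, and an exception restores the WRF from the ARF.

NUM_REGS = 32
wrf = [0] * NUM_REGS      # working register file: written early (C stage)
arf = [0] * NUM_REGS      # architectural register file: committed state

def execute(dest, value):
    wrf[dest] = value                  # result visible to younger instructions early

def commit(dest):
    arf[dest] = wrf[dest]              # instruction reaches end of pipe without fault

def take_exception():
    wrf[:] = arf                       # discard speculative state, restore from ARF
```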
Floating Point Unit • Executes floating-point and partitioned fixed-point (graphics) instructions • 3 datapaths: • 4-stage multiply • 4-stage add/subtract/compare • Unpipelined divide/square root • The FPU is pushed out by one pipeline stage to keep the integer unit compact (countering the effect of wire delays)
Memory – L1 Data Cache • 64 KB, 4-way set associative, 32-byte lines • 2-cycle access time, wave pipelined • Sum-addressed memory (SAM) combines the address addition with the word-line decode
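For reference, the address breakdown implied by a 64 KB, 4-way, 32-byte-line cache: 64 KB / (4 ways × 32 B) = 512 sets, i.e. 5 offset bits and 9 index bits. A conventional cache adds base + offset first and then decodes the set index; SAM fuses the add into the word-line decoder. The sketch below only shows the conventional two-step breakdown, with an illustrative address.

```python
# Conventional index/tag breakdown for a 64 KB, 4-way, 32 B-line cache.
# (SAM would fold the base+offset add into the decoder instead of doing
# it as a separate step, which this software sketch cannot show.)

LINE_BYTES  = 32
NUM_SETS    = 512
OFFSET_BITS = LINE_BYTES.bit_length() - 1    # 5
INDEX_BITS  = NUM_SETS.bit_length() - 1      # 9

def l1_lookup_fields(base, offset):
    addr = (base + offset) & 0xFFFFFFFFFFFFFFFF     # 64-bit address
    line_off = addr & (LINE_BYTES - 1)
    set_idx  = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag      = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_idx, line_off

print(l1_lookup_fields(0x1000, 0x24))   # (tag, set, offset) for address 0x1024
```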
Memory – Prefetch Cache • 2 KB, 2-way, 64-byte lines • Multi-ported SRAM • Supports streaming data (similar to stream buffers) • Detects striding loads: hardware prefetches are issued independently of software prefetches
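A sketch of the kind of stride detection that drives hardware prefetch: per load PC, remember the last address and stride, and issue a prefetch for addr + stride once the same stride repeats. The table organization and confidence threshold are assumptions, not the actual prefetch-cache design.

```python
# Stride-detector sketch: prefetch ahead once a load PC shows a repeating stride.

last_addr, last_stride, confidence = {}, {}, {}

def observe_load(pc, addr, prefetch):
    if pc in last_addr:
        stride = addr - last_addr[pc]
        if stride != 0 and stride == last_stride.get(pc):
            confidence[pc] = confidence.get(pc, 0) + 1
            if confidence[pc] >= 2:
                prefetch(addr + stride)      # striding load detected: fetch ahead
        else:
            confidence[pc] = 0
        last_stride[pc] = stride
    last_addr[pc] = addr

observe_load(0x40, 0x1000, print)
observe_load(0x40, 0x1040, print)
observe_load(0x40, 0x1080, print)
observe_load(0x40, 0x10C0, print)            # prints the prefetched address
```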
Memory – Write Cache • 2 KB, 4-way, 64-byte lines • Reduces the bandwidth consumed by store traffic • Sole source of on-chip dirty data, which simplifies on-chip cache consistency • Write-validate scheme: loads multiplex byte-by-byte between L2 data and write-cache data
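A sketch of the write-validate idea for loads: each write-cache line keeps a per-byte valid mask, and a load takes each byte from the write cache when its mask bit is set and from the L2 data otherwise. The line width and data layout here are illustrative.

```python
# Write-validate merge sketch: per-byte multiplex between L2 and write-cache data.

def merge_load_bytes(l2_line, wc_line, wc_byte_mask):
    """Return load data, byte-multiplexed between L2 and the write cache."""
    return bytes(
        wc_line[i] if (wc_byte_mask >> i) & 1 else l2_line[i]
        for i in range(len(l2_line))
    )

l2 = bytes(range(8))                     # stale bytes from L2
wc = bytes([0xAA] * 8)                   # freshly stored bytes in the write cache
print(merge_load_bytes(l2, wc, 0b00001100).hex())   # bytes 2-3 come from the write cache
```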
External Memory Interface • L2 cache: direct-mapped, unified data and instruction, 12-cycle access time • The cache controller provides programmable support for 4 MB or 8 MB • On-chip main memory controller • On-chip tags allow an associative L2 cache design without a latency penalty
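A sketch of a direct-mapped L2 lookup whose capacity is programmable between 4 MB and 8 MB, as on this controller: only the index/tag split changes with the configured size. The 64-byte line size here is an assumption made for illustration.

```python
# Programmable-size, direct-mapped L2 lookup sketch (64 B lines assumed).

def l2_fields(addr, cache_bytes, line_bytes=64):
    num_sets    = cache_bytes // line_bytes          # direct-mapped: one line per set
    offset_bits = line_bytes.bit_length() - 1
    index_bits  = num_sets.bit_length() - 1
    set_idx = (addr >> offset_bits) & (num_sets - 1)
    tag     = addr >> (offset_bits + index_bits)
    return tag, set_idx

addr = 0x1234_5678
print(l2_fields(addr, 4 * 1024 * 1024))   # 4 MB configuration
print(l2_fields(addr, 8 * 1024 * 1024))   # 8 MB: one more index bit, one fewer tag bit
```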