PROCESSOR DESIGN Lan Jin Tsinghua University California State University-Fresno • Computer Architecture Arena • Design Requirements and Constraints • Development of IC Technology • High-Performance Processor Architectures • Electronic Design Automation • Embedded Computing • Cellular Computing
Computer Architecture Arena • Markets • Technology • Target applications
Design Requirements and Constraints Requirements • High performance • Low cost • Low power Constraints arising from increasing IC density and switching speed • Power dissipation • Wire-length barrier • Design and verification complexities
Development of IC Technology • Moore’s Law ◊ Computing power becomes half as expensive every 18 to 24 months ◊ No. of transistors per chip doubles every ≈ 18 months • IC predicted by the SIA roadmap for 2005 ◊ 200 M transistors, 0.1 µm feature size ◊ 2.0-3.5 GHz clock frequency ◊ 0.9-1.2 V supply, yet dynamic power could reach 150 W!
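As a rough worked example of the doubling rule above, a minimal sketch in Python; the 1995 starting point of about 5 M transistors and the strict 18-month doubling period are illustrative assumptions, not figures from the slides:

```python
# Minimal sketch of Moore's Law growth: transistors per chip doubling every 18 months.
# The 1995 baseline of 5 M transistors is an illustrative assumption, not a figure
# from the slides; real roadmaps (such as the SIA numbers above) grew more slowly.

def transistors(year, base_year=1995, base_count=5e6, doubling_months=18):
    """Estimate transistors per chip assuming a fixed doubling period."""
    months = (year - base_year) * 12
    return base_count * 2 ** (months / doubling_months)

for y in (1995, 1998, 2001, 2004):
    print(f"{y}: ~{transistors(y) / 1e6:.0f} M transistors")
```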
Development of IC Technology (continued) The IC Design Paradigm • Transistor-component based design • Cell-based and RTL-based design • IP-based design for SOC development Key IC-Design Issues • Switching currents ◊ due to extremely high switching speeds • Optimization ◊ tighter constraints on speed, power, and cost • Asynchronization ◊ global, tightly skewed clocks vs. local self-timed signals • More reuse • Design skills and design automation
Development of IC Technology (continued) Cool-Chip Design Techniques • Dynamic power ∝ clock frequency × transistor switching activity × voltage² • Lowering supply voltage, which requires a lower transistor threshold voltage • Multithreshold voltages to minimize leakage • Speed-adaptive variable-threshold circuits • Software-selectable voltage matching the required speed • Software-controlled clock frequency • Turning units on/off individually and dynamically • A wide variety of sleep modes • Power-monitoring circuit
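A minimal numerical sketch of the dynamic-power relation above, P ≈ α · C · V² · f; the switched capacitance, activity factor, supply voltages, and clock frequency are illustrative assumptions, not figures from the slides:

```python
# Minimal sketch of the dynamic-power relation P ~ alpha * C * V^2 * f.
# All numbers below are illustrative assumptions, not figures from the slides.

def dynamic_power(alpha, c_switched, vdd, freq):
    """Dynamic power in watts: activity factor x switched capacitance (F) x V^2 x f (Hz)."""
    return alpha * c_switched * vdd ** 2 * freq

base = dynamic_power(alpha=0.15, c_switched=50e-9, vdd=1.2, freq=2.0e9)  # ~21.6 W
lowv = dynamic_power(alpha=0.15, c_switched=50e-9, vdd=0.9, freq=2.0e9)  # ~12.2 W
print(f"1.2 V: {base:.1f} W   0.9 V: {lowv:.1f} W   savings: {1 - lowv / base:.0%}")
```

Because voltage enters quadratically, dropping the supply from 1.2 V to 0.9 V alone cuts dynamic power by roughly 44 %, which is why the techniques above chase ever-lower supply and threshold voltages.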
High-Performance Processor Architectures How to use a billion transistors on a chip? • Straightforward approach - add more of: ◊ on-chip multilevel cache and prefetch buffers ◊ hardware contexts and registers ◊ large distributed on-chip DRAM ◊ processors or processing elements • Extending traditional architectures ◊ a higher degree of ILP ◊ new possibilities for prediction and speculation ◊ the ability to overcome memory latencies ◊ on-chip multiprocessing and multithreading • Cooperating distributed system on a chip • Co-designed virtual machine
High-Performance Processor Architectures Proposed New Processor Architectures • Advanced Superscalar: 16 or 32 instr/cycle • Superspeculative Processor ◊ aggressive fine-grained speculation at every step • Simultaneous Multithreaded Processor (SMT) • Trace (multiscalar) Processor ◊ coarse-grained traces on distributed multiple cores • Vector Intelligent RAM processor (V-IRAM) ◊ couples vector execution with large, high-bandwidth DRAM • On-chip Multiprocessor (CMP) ◊ the ability to overcome memory latencies • RAW (configurable) Processor ◊ compiler customizes the h/w to each application
High-Performance Processor Architectures Extending ILP architecture • Deeper pipeline • Increasing use of prediction and speculation • Advanced superscalar processing • Wider instruction window • Highly intelligent optimizing compiler
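To make the prediction-and-speculation bullet above concrete, here is a minimal sketch of a classic 2-bit saturating-counter branch predictor; this is a generic textbook mechanism, not a design described in these slides:

```python
# Minimal sketch of a 2-bit saturating-counter branch predictor (generic textbook
# technique, not a design from these slides). Counter states: 0,1 = predict not taken;
# 2,3 = predict taken. Correct outcomes strengthen the counter, mispredictions weaken it.

class TwoBitPredictor:
    def __init__(self, table_bits=10):
        self.mask = (1 << table_bits) - 1
        self.counters = [1] * (1 << table_bits)   # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# Usage: a hypothetical loop branch at pc=0x40 that is taken 9 times, then falls through.
bp, hits = TwoBitPredictor(), 0
for i in range(10):
    taken = i < 9
    hits += bp.predict(0x40) == taken
    bp.update(0x40, taken)
print(f"correct predictions: {hits}/10")
```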
High-Performance Processor Architectures Trace Processor - metahardware approach • H/w and runtime s/w monitor the pgm's behavior. • Aggressive prediction and speculation techniques recast the pgm into traces. • Multiple PEs exploit trace-level parallelism. • The metahardware can be implemented as various helper engines.
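As a rough illustration of recasting a program into traces, a minimal sketch that cuts a dynamic basic-block stream into fixed-length traces and finds the hottest one; the block stream and trace length are invented for illustration, and real trace processors build traces in hardware from fetch and branch history:

```python
# Minimal sketch of grouping a dynamic basic-block stream into traces (illustrative
# only; real trace processors build traces in hardware from fetch/branch history).
from collections import Counter

def build_traces(block_stream, trace_len=4):
    """Cut the dynamic stream of basic-block ids into consecutive fixed-length traces."""
    return [tuple(block_stream[i:i + trace_len])
            for i in range(0, len(block_stream) - trace_len + 1, trace_len)]

# Hypothetical dynamic execution: a loop body A->B->C->D repeated, with one B->E detour.
stream = ["A", "B", "C", "D"] * 3 + ["A", "B", "E", "D"]
hot = Counter(build_traces(stream)).most_common(1)[0]
print(f"hottest trace {hot[0]} executed {hot[1]} times")
```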
High-Performance Processor Architectures Instr.-Level Distributed Processing (ILDP) • Run distributed PUs at very high clock rate. • Simple logic of small PUs reduces critical paths and overall IC area and wire lengths. • Partition pgm to maximize local comm. • Asynchronous multiclock domains. • Monitored by co-designed virtual machine.
High-Performance Processor Architectures Co-designed Virtual Machine for ILDP • Virtual machine monitor (VMM) is a hidden layer of s/w, codesigned with h/w. • VMM manages ILDP resources through tight interaction between h/w and low-level s/w. • VMM dynamically optimizes executing threads based on instr. dependencies and interinstr. comm.
High-Performance Processor Architectures Clustered Dependence-Based Architecture • Organize PUs into clusters. • Steer dependent instructions to the same cluster. • Within a cluster, further divide into instruction, cache, integer, floating-point processing, etc. • Instructions within a cluster can be issued in order, while instructions in different clusters issue out of order.
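A minimal sketch of the dependence-based steering idea above: send each instruction to the cluster that already holds the producer of one of its source registers, otherwise to the least-loaded cluster; the instruction format and tie-breaking heuristic are illustrative assumptions, not the slides' exact policy:

```python
# Minimal sketch of dependence-based cluster steering (illustrative heuristic, not
# the exact policy from the slides): follow the producer of a source register when
# possible, otherwise pick the least-loaded cluster.

def steer(instructions, n_clusters=4):
    producer = {}                       # dest register -> cluster that will produce it
    load = [0] * n_clusters             # instructions assigned per cluster
    placement = []
    for dest, srcs in instructions:     # each instr: (dest_reg, [src_regs])
        src_clusters = [producer[s] for s in srcs if s in producer]
        cluster = src_clusters[0] if src_clusters else load.index(min(load))
        producer[dest] = cluster
        load[cluster] += 1
        placement.append(cluster)
    return placement

# Hypothetical dependence chain: r3 depends on r1, r4 on r3; r5/r6 are independent.
prog = [("r1", []), ("r3", ["r1"]), ("r4", ["r3"]), ("r5", []), ("r6", ["r5"])]
print(steer(prog))   # -> [0, 0, 0, 1, 1]: the dependent chain stays in one cluster
```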
Electronic Design Automation Why design automation? • EDA increases design productivity. • Time to market or time to prototype is crucial. • EDA reduces design cost, especially for low-volume custom-designed products. Automation Philosophy • Select a Pareto-optimal set from a design space. • The architectural framework and parameter ranges define the design space. • Select lower-level components from a library. • A space walker explores the design space (sketched below, after the PICO slide). • Constructor, simulator, evaluator, …
Electronic Design Automation PICO (Program In, Chip Out) System • An architectural synthesis system that starts from an application written in C. • Emits VHDL for the h/w and compiled s/w code. • Optimality defined by gate count and execution time.
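A minimal sketch of the Pareto-optimal selection a space walker performs, scored on the two PICO-style metrics above, gate count and execution time; the candidate design points and the brute-force enumeration are illustrative assumptions:

```python
# Minimal sketch of Pareto-optimal design-point selection over two PICO-style metrics,
# gate count and execution time (lower is better for both). The candidate design
# points are invented for illustration.

def pareto_front(points):
    """Keep points not dominated by any other (dominated = another point is <= on both metrics)."""
    front = []
    for p in points:
        if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points):
            front.append(p)
    return front

# (gate_count, exec_time_cycles) for hypothetical accelerator configurations
designs = [(50_000, 900), (80_000, 600), (120_000, 650), (60_000, 1200), (150_000, 400)]
print(pareto_front(designs))
# -> [(50000, 900), (80000, 600), (150000, 400)]; the other two points are dominated
```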
Embedded Computing Three Features of Embedded Architecture • Specialization • Customization • Automation and verification Specialized Architectures • SLI or SOC w/ rich diversity of custom designs. • Some GP arch. or OTS parts run special applications optimally, e.g., multimedia vector applications. • Mostly domain- or application-specific systems ◊ relatively small, well-defined workloads ◊ irregular determinant system configuration ◊ minimized logic complexity and die size
Embedded Computing Customization • A level of specialization beyond OTS • better cost/performance than an OTS design • incurs three nonrecurring engineering (NRE) costs ◊ reducing architectural costs by reusing soft IP ◊ reducing physical design costs by reusing IP blocks ◊ reducing mask-set costs by avoiding SLI design Automation and Verification • System-level simulation to start a project. • EDA supports a sharp increase in IC complexity. • EDA reduces design costs. • EDA reduces time to market.
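As a back-of-the-envelope illustration of the NRE-versus-volume trade-off behind customization, a minimal sketch comparing a custom part against an OTS part; every dollar figure is an illustrative assumption, not data from the slides:

```python
# Minimal sketch of amortizing nonrecurring engineering (NRE) cost over volume.
# All dollar figures below are illustrative assumptions, not data from the slides.

def unit_cost(nre, per_unit, volume):
    """Effective cost per chip: NRE amortized over the production volume plus unit cost."""
    return nre / volume + per_unit

for volume in (10_000, 100_000, 1_000_000):
    custom = unit_cost(nre=2_000_000, per_unit=8.0, volume=volume)   # hypothetical custom SOC
    ots = unit_cost(nre=0, per_unit=25.0, volume=volume)             # hypothetical OTS part
    print(f"{volume:>9,} units: custom ${custom:6.2f}/chip vs OTS ${ots:5.2f}/chip")
```

At low volumes the amortized NRE dominates, which is why the IP-reuse and EDA-driven cost reductions listed above matter most for low-volume custom designs.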
Cellular Computing Motivation for the CA Computing Paradigm • simple + vastly parallel + local = cellular computing • a simple cell is the basic processor • parallelism on a much larger scale, measured in orders of magnitude more cells • local connectivity pattern carrying little information
Cellular Computing Cellular (CP) vs. Parallel (PP) Processing • PP: a small number of powerful processors • CP: a vast number of small processing cells • PP: global scheduling and synchronization • CP: each cell computes its next state as a function of its neighbors' values; no one cell has a global view of the entire system • PP: global communication is slow • CP: local communication can be much faster • PP: sequential-to-parallel partitioning is difficult • CP: potential for addressing much larger problems based on local interaction rules • PP: centralized • CP: naturally distributed
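A minimal sketch of the CP model above, in which each cell computes its next state purely from its own value and its two immediate neighbors; the particular local rule (rule 110 of a one-dimensional cellular automaton) is an illustrative choice, not one named in the slides:

```python
# Minimal sketch of cellular computing: each cell's next state depends only on its
# own value and its immediate neighbors (a 1-D cellular automaton with wraparound).
# Rule 110 is an illustrative choice of local rule, not one named in the slides.

RULE = 110  # local update rule encoded as 8 bits, one per (left, self, right) pattern

def step(cells):
    n = len(cells)
    return [
        (RULE >> ((cells[(i - 1) % n] << 2) | (cells[i] << 1) | cells[(i + 1) % n])) & 1
        for i in range(n)
    ]

cells = [0] * 31 + [1] + [0] * 32   # single live cell in the middle
for _ in range(16):
    print("".join(".#"[c] for c in cells))
    cells = step(cells)
```

No cell ever reads global state; the overall pattern emerges entirely from local interaction rules, which is exactly the CP property contrasted with PP above.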
Cellular Computing Specific Application Areas • Image processing • Pattern recognition • NP-complete problems from graph theory, network design, VLSI simulation, pgm optimization, etc. • Random number generation, cryptography, computational physics, chemistry, biology • Environmental modeling, landslide simulation, social behavior, finance • Molecular dynamics, molecular devices, nanoscale calculating machines