Future of the Microprocessors “Billion-Transistor Architectures” IEEE Computer, September 1997
Billion-Transistor Architectures • Future Trends • Hardware trends and physical limits • In its 1994 road map, the Semiconductor Industry Association predicted that by 2010 chips would hold 800 million transistors, with thousands of pins, a 1,000-bit bus, clock speeds over 2 GHz, and 180 W of power dissipation • On-chip wires are becoming much slower relative to logic gates • impossible to maintain one global clock over the entire chip • sending a signal across a billion-transistor chip: as many as 20 cycles • System software • Will hardware alone continue to extract enough parallelism? • Compatibility with legacy software
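The wire-delay bullet above can be sketched as back-of-envelope arithmetic. Every number below (die span, per-mm wire delay) is an illustrative assumption, not a figure from the article; the point is only that the same physical crossing time costs more cycles as the clock speeds up.

```python
# Cross-chip wire delay measured in clock cycles (illustrative numbers).
chip_span_mm = 20            # assumed edge-to-edge distance on the die
wire_delay_ps_per_mm = 50    # assumed delay of a long, repeated wire

crossing_ps = chip_span_mm * wire_delay_ps_per_mm   # 1000 ps = 1 ns

for clock_ghz in (2, 5, 10):
    cycle_ps = 1000 / clock_ghz                     # picoseconds per cycle
    cycles = crossing_ps / cycle_ps
    print(f"{clock_ghz} GHz: {cycles:.0f} cycles to cross the chip")
```

With these assumed numbers the crossing costs 2 cycles at 2 GHz and 10 cycles at 10 GHz; the cycle count grows linearly with clock frequency, which is why one global clock over the whole chip stops being feasible.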
Future workloads • Architectural design is driven by the dominant anticipated workload • multimedia workloads • Design, verification, and testing • complexity: hundreds of engineers per design • validation and testing: 40% to 50% of an Intel chip’s design cost and 6% of its transistors • Economies of scale • Fabrication plants: $2 billion (a factor of ten more than a decade ago) • need larger markets: mass marketing of computer chips
Future Architectures • Advanced superscalar processors • Simultaneous multithreaded processors • Vector IRAM processors • Raw (configurable) processors • Superspeculative processors • Trace (multiscalar) processors • Chip multiprocessors Trends • Wire delays become dominant, forcing HW to be more distributed • System software (compilers) becomes better at exploiting parallelism • Workloads come to contain more exploitable parallelism • Design and validation costs become more limiting
Advanced Superscalar • One Billion Transistors, One Uniprocessor, One Chip • U of Michigan • Billion-transistor processors will be much as they are today • bigger, faster, and wider • Out-of-order fetching, multi-hybrid branch predictors, and trace caches • Large out-of-order-issue instruction window (2,000 instructions), clustered banks of functional units • The current uniprocessor model can provide sufficient performance and use a billion transistors effectively without changing the programming model or discarding software compatibility
One Billion Transistors, One Uniprocessor, One Chip • 60 M transistors for the execution core • 240 M for the trace cache • 48 M for the branch predictor • 32 M for the data cache • 640 M for the L2 cache
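The transistor budget above can be tallied to check that it lands near the one-billion mark. The category names follow the slide; units are millions of transistors.

```python
# Sum the quoted per-structure transistor budget (in millions).
budget_millions = {
    "execution core": 60,
    "trace cache": 240,
    "branch predictor": 48,
    "data cache": 32,
    "L2 cache": 640,
}

total = sum(budget_millions.values())
print(total)  # 1020 million, i.e. just over a billion transistors
```

Note that cache structures (trace, data, and L2) dominate the budget: 912 of the 1,020 million transistors.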
Superspeculative • Superspeculative Microarchitecture for Beyond AD 2000 • CMU • Billion-transistor uniprocessor • Massive speculation at all levels to improve performance • Trace caches and advanced branch prediction • Without this much speculation, future processors will be limited by true dependences • Their investigations found large speedups on code that has traditionally not been amenable to extracting ILP
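The "limited by true dependences" point can be illustrated with a toy value-speculation example: predict a value, run the dependent instruction early, and verify later. The values and the trivial predictor below are contrived for illustration, not taken from the proposal.

```python
# Toy sketch of breaking a true dependence with value speculation.
predicted = 42                      # predictor's guess for a load's value
speculative_result = predicted + 1  # dependent add runs early, in parallel

actual = 42                         # value the load eventually returns
if predicted == actual:
    result = speculative_result     # prediction correct: early work is kept
else:
    result = actual + 1             # misprediction: re-execute dependent op

print(result)  # 43
```

When the prediction holds, the dependent instruction's latency is hidden; when it fails, the machine pays a recovery cost, which is why prediction accuracy governs whether this style of speculation wins.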
Simultaneous Multithreading • Simultaneous Multithreading (SMT) Processor • Wide-issue superscalar + multithreaded processor • multiple instruction issues per cycle • HW support for multiple threads (registers, PCs, and so on) • Exploits all types of parallelism • within a thread • among threads
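A minimal sketch of why SMT fills issue slots: each cycle, a wide core draws ready instructions from several threads instead of one. The per-thread readiness numbers below are invented for illustration.

```python
# Toy model: slots filled per cycle on an 8-wide core, one thread vs. three.
ISSUE_WIDTH = 8
ready_per_cycle = {            # ready instructions per thread, per cycle
    "T0": [3, 2, 4, 1],
    "T1": [2, 5, 1, 3],
    "T2": [4, 1, 2, 2],
}

def issued(threads):
    """Total issue slots filled over 4 cycles for the given thread set."""
    total = 0
    for cycle in range(4):
        available = sum(ready_per_cycle[t][cycle] for t in threads)
        total += min(ISSUE_WIDTH, available)
    return total

print(issued(["T0"]))              # 10 slots filled by one thread
print(issued(["T0", "T1", "T2"]))  # 29 slots filled by three threads
```

A single thread rarely has enough independent work to fill all eight slots, so mixing threads raises utilization without widening the machine.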
Trace • Trace Processors: Moving to Fourth-Generation Microarchitectures • U of Wisconsin-Madison • Multiple, distributed on-chip processor cores • Each core simultaneously executes a different trace • All but one core execute their traces speculatively, using branch prediction to select the traces that follow the one currently executing • Does not require explicit compiler support • Relies heavily on replication, hierarchy, and prediction
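Trace selection can be sketched as chopping the dynamic instruction stream at trace boundaries, each resulting trace going to a different core (all but the first running speculatively). The stream contents, the boundary rule (taken branch or a 4-instruction cap), and the cap size are all invented for illustration.

```python
# Toy trace selection: cut the dynamic stream at taken branches or a cap.
stream = ["add", "ld", "br_taken", "mul", "st", "add", "sub", "br_taken", "ld"]

def select_traces(stream, cap=4):
    """Split a dynamic instruction stream into traces."""
    traces, current = [], []
    for inst in stream:
        current.append(inst)
        if inst.startswith("br") or len(current) == cap:
            traces.append(current)   # trace ends at a branch or at the cap
            current = []
    if current:
        traces.append(current)       # leftover partial trace
    return traces

for i, trace in enumerate(select_traces(stream)):
    print(f"core {i}: {trace}")      # core 0 runs non-speculatively
```

Here the nine-instruction stream yields four traces, so four cores would be busy at once, three of them on predicted future traces.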
Vector IRAM • Vector IRAM • U of California, Berkeley • Intelligent RAM (IRAM) • increases on-chip memory capacity by using DRAM instead of SRAM • The resulting large on-chip memory capacity provides high memory bandwidth • enables cost-effective vector processing
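The vector-processing fit can be sketched simply: one vector instruction names many independent element operations, which the hardware can stream from wide on-chip memory. The pure-Python function below is a stand-in for a vector add, not IRAM code.

```python
# Stand-in for a single vector instruction: element-wise add over a register.
def vadd(a, b):
    """One 'vector add' worth of work: independent per-element operations."""
    return [x + y for x, y in zip(a, b)]

a = list(range(8))   # vector register of 8 elements
b = [10] * 8
print(vadd(a, b))    # [10, 11, 12, 13, 14, 15, 16, 17]
```

Because the element operations carry no dependences on each other, bandwidth, not instruction issue, becomes the limiting resource, which is exactly what on-chip DRAM supplies.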
A Single-Chip Multiprocessor • Single-Chip Multiprocessor • Stanford University • Multiple (four to 16) simple, fast processors on one chip • each processor is tightly coupled to a small, fast, level-one cache • all processors share a larger level-two cache • runs either a single parallel job or independent tasks • Simpler design, faster validation, cleaner functional partitioning, and higher theoretical peak performance • Compilers will have to make code explicitly parallel • Old ISAs will be incompatible with this architecture
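The CMP workload model above (one parallel job, or independent tasks side by side) can be sketched with a thread pool standing in for the four-to-16 on-chip processors. The task contents and sizes are illustrative.

```python
# Threads as a stand-in for CMP cores running independent tasks.
from concurrent.futures import ThreadPoolExecutor

def task(n):
    return sum(range(n))          # an independent unit of work

with ThreadPoolExecutor(max_workers=4) as pool:   # "four cores"
    results = list(pool.map(task, [10, 100, 1000, 10000]))

print(results)  # [45, 4950, 499500, 49995000]
```

The slide's caveat shows up even in this sketch: the work must be expressed as explicitly separate tasks before the cores can help, which is why compilers (or programmers) must make code explicitly parallel.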
Raw Processor • Baring It All to Software: Raw Machines • MIT • The most radical architecture • A highly parallel architecture with hundreds of very simple processors, each coupled to a small portion of the on-chip memory • Each processor, or tile, also contains • a small bank of configurable logic, allowing synthesis of complex operations directly in configurable HW • Relies on compiler efficacy • does not use a traditional instruction set architecture • all units are told explicitly what to do by the compiler • the compiler even schedules most of the intertile communication
Each Raw tile contains instruction memory (IMEM), data memory (DMEM), an arithmetic logic unit (ALU), configurable logic (CL), and a programmable switch with its associated instruction memory (SMEM)
a) Raw processors distribute the register file and memory ports and communicate between ALUs over a switched, point-to-point interconnect b) A superscalar c) Multiprocessors
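The compiler-scheduled intertile communication can be sketched as a static schedule that each tile's switch simply replays. The tiles, register names, and schedule below are all invented for illustration; the point is that routing decisions are fixed at compile time, with no dynamic arbitration.

```python
# Toy Raw-style machine: a compile-time schedule moves words between tiles.
tiles = {0: {"x": 3}, 1: {"y": 4}, 2: {}}

# (cycle, source tile, register, destination tile), fixed by the "compiler"
schedule = [
    (0, 0, "x", 2),
    (1, 1, "y", 2),
]

for cycle, src, reg, dst in sorted(schedule):
    tiles[dst][reg] = tiles[src][reg]    # switch forwards the word

print(tiles[2]["x"] + tiles[2]["y"])     # tile 2 now computes x + y -> 7
```

Because the switches follow precomputed instructions (held in SMEM), communication latency is known statically, which is what lets the compiler overlap computation and routing.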