CSCE 930 Advanced Computer Architecture
Lecture 1: Evaluating Computer Architectures
Dr. Jun Wang

Computer Architecture Trends
• Figure 1.1 (H&P): growth in microprocessor performance of roughly 35% per year
Technology Trends
• Smaller feature sizes – higher speed, higher density
• Density has increased by a factor of 77
Technology Trends
• Larger chips
• Trend is toward more RAM, less logic per chip
• Historically 2x per generation; leveling off?
• McKinley has large on-chip caches => larger wafers to reduce fabrication costs
Moore's Law
• Number of transistors doubles every 18 months (amended to 24 months)
• Combination of both greater density and larger chips
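To see the compounding this implies, here is a minimal sketch (the 24-month doubling period is from the slide; the 10-year horizon is an assumption chosen for the example):

  #include <stdio.h>
  #include <math.h>

  int main(void) {
      double doubling_months = 24.0;  /* amended Moore's-law period */
      double years = 10.0;            /* assumed horizon for the example */
      /* growth factor = 2^(elapsed months / doubling period) */
      double factor = pow(2.0, (years * 12.0) / doubling_months);
      printf("Transistor count grows ~%.0fx in %.0f years\n", factor, years);
      return 0;
  }

With these numbers the factor is 2^5 = 32x per decade, from density alone; larger dies push the per-chip count higher still.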
Tech. Trends, contd.
• More, faster, cheaper transistors have fed an application demand for higher performance
• 1970s – serial, 1-bit integer microprocessors
• 1980s – pipelined 32-bit RISC
  • ISA simplicity allows the processor to fit on a chip
• 1990s – large, superscalar processors, even for CISC
• 2000s – multiprocessors on a chip
Pipelining and Branch Prediction
[Figure: five-stage pipeline – IF, ID, EX, ME, WB – with a latch between each pair of stages, all driven by a common clock]
• Two basic ways of increasing performance:
• Pipelining: overlap the stages (IF, ID, EX, ME, WB) of successive instructions
• Branch Prediction: speculate on the branch outcome to avoid waiting (a predictor sketch follows below)
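To make the branch-prediction idea concrete, here is a minimal sketch of a classic two-bit saturating-counter predictor; the table size, the PC used, and the inline branch history are illustrative assumptions, not something prescribed by the slides:

  #include <stdio.h>

  #define TABLE_SIZE 1024  /* assumed predictor table size */

  /* Two-bit saturating counters: 0,1 = predict not-taken; 2,3 = predict taken */
  static unsigned char counters[TABLE_SIZE];  /* zero-initialized */

  int predict(unsigned long pc) {
      return counters[pc % TABLE_SIZE] >= 2;  /* 1 = predict taken */
  }

  void update(unsigned long pc, int taken) {
      unsigned char *c = &counters[pc % TABLE_SIZE];
      if (taken && *c < 3) (*c)++;
      else if (!taken && *c > 0) (*c)--;
  }

  int main(void) {
      /* A loop branch taken 9 times, then not taken once */
      int correct = 0;
      for (int i = 0; i < 10; i++) {
          int taken = (i < 9);
          correct += (predict(0x400100) == taken);
          update(0x400100, taken);
      }
      printf("correct predictions: %d / 10\n", correct);
      return 0;
  }

After two warm-up mispredictions the counter saturates and the loop branch is predicted correctly until the final exit iteration (7/10 correct here).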
Tech. Trend: memory sizes
• Memories have grown very dense
• Feeding application demand for large, complex software
Tech. Trend: memory speeds
• Main memory speeds have not kept up with processor speeds
Memory Hierarchies
• Gap between processor and memory performance has led to widespread use of memory hierarchies
• 1960s: no caches, no virtual memory
• 1970s: shared I- & D-cache, 32-bit virtual memory
• 1980s: split I- and D-caches
• 1990s: two-level caches, 64-bit virtual memory
• 2000s: multi-level caches, both on- and off-chip
Memory Hierarchies
[Figure: the memory hierarchy – Registers and L1 Cache inside the PROCESSOR; L2 Cache, L3 Cache, and Large/Slow Main Memory in the MEMORY SYSTEM; small/fast at the top, large/slow at the bottom]
I/O a key system component
• I/O has evolved into a major distinguishing feature of computer systems
• 1960s: disk, tape, punch cards, tty; batch processing
• 1970s: character-oriented displays
• 1980s: video displays, audio, increasing disk sizes, beginning networking
• 1990s: 3D graphics; networking a fundamental element; high-quality audio
• 2000s: real-time video, immersion…
I/O Systems
[Figure: I/O hierarchy – processor and DRAM on a memory controller; a local bus interface feeds a high-speed I/O bus (frame buffer, expansion controller, hard drives, monitor, LAN) and a slow-speed I/O bus (floppy, CD-ROM)]
• A hierarchy that divides bandwidth
• Data rates:
  • Memory: 100 MHz, 8 bytes wide – 800 MB/s (peak)
  • PCI: 33 MHz, 4 bytes wide – 132 MB/s (peak)
  • SCSI: "Ultra2" (40 MHz), "Wide" (2 bytes) – 80 MB/s (peak)
Multiprocessors
[Figure: shared-memory multiprocessor – processors (P) with caches (C) connected to memories (M) through an interconnection network]
• Multiprocessors have been available for decades…
• 1960s: small MPs
• 1970s: small MPs
  • Dream of automatic parallelization
• 1980s: small MPs; emergence of servers
  • Dream of automatic parallelization
• 1990s: expanding MPs
  • Very large MPPs failed
  • Dream of automatic parallelization fading
• 2000s: widespread MPs; on-chip multithreading
  • Many applications have independent threads
  • Programmers write applications to be parallel in the first place
Computation Science
• Computation is synthetic
  • Many of the phenomena in the computing field are created by humans rather than occurring naturally in the physical world
• Very different from the natural sciences
  • When one discovers a fact about nature, it is a contribution, no matter how small
  • Creating something new alone does not establish a contribution; anyone can create something new in a synthetic field
  • Rather, one must show that the creation is better
What Does "Better" Mean?
• "Better" can mean many things:
• Solves a problem in less time (faster)
• Solves a larger class of problems (more powerful)
• Uses resources more efficiently (cheaper)
• Is less prone to errors (more reliable)
• Is easier to manage/program (lower human cost)
Amdahl's Law
• Defines the speedup that can be gained by using a special feature
• Speedup due to an enhancement E:

  Speedup(E) = ExTime(without E) / ExTime(with E)
             = Performance(with E) / Performance(without E)

• Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected
Amdahl's Law

  ExTime_new = ExTime_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

  Speedup_overall = ExTime_old / ExTime_new
                  = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
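A minimal numeric sketch of the formula; the 40% fraction and 10x enhancement below are made-up example values:

  #include <stdio.h>

  /* Overall speedup per Amdahl's Law */
  double amdahl(double fraction_enhanced, double speedup_enhanced) {
      return 1.0 / ((1.0 - fraction_enhanced)
                    + fraction_enhanced / speedup_enhanced);
  }

  int main(void) {
      /* Example: enhancement E speeds up 40% of the task by 10x */
      double s = amdahl(0.40, 10.0);
      printf("Overall speedup = %.2f\n", s);  /* 1/(0.6 + 0.04) = 1.5625 */
      return 0;
  }

Note how the unenhanced 60% dominates: even a 10x enhancement yields only about a 1.56x overall speedup.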
The "better" property is not simply an observation
• Rather, the research will postulate that a new idea
  • an architecture, algorithm, protocol, data structure, methodology, language, optimization, model, etc.
• will lead to a "better" result
• Making the connection between the idea and the improvement is as important as quantifying the size of the improvement
• The contribution is the idea, and it is generally a component of a larger computational system
How to Evaluate Architecture Ideas
• Measuring/observing/analyzing real systems
  • Accurate results
  • Need a working system
    • Too expensive to evaluate architecture/system ideas
Analytic models
• Fast & easy analysis of relations, e.g.:

  T_program = NumOfInst x (T_cpu + T_mem x (1 - CacheHitRate))

• Allows extrapolation to ridiculous parameters, e.g. thousands of processors
• Sometimes infeasible to obtain accuracy (e.g. modeling caches)
• To obtain reasonable accuracy, the models may become very complex (e.g. modeling of network contention)
• Queuing theory is a commonly used technique
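A minimal sketch of plugging numbers into this model; all parameter values below are made-up assumptions for illustration:

  #include <stdio.h>

  int main(void) {
      double num_inst = 1e9;     /* instructions executed (assumed) */
      double t_cpu    = 1e-9;    /* CPU time per instruction: 1 ns (assumed) */
      double t_mem    = 100e-9;  /* memory miss penalty: 100 ns (assumed) */
      double hit_rate = 0.98;    /* cache hit rate (assumed) */

      /* T_program = NumOfInst x (T_cpu + T_mem x (1 - CacheHitRate)) */
      double t_program = num_inst * (t_cpu + t_mem * (1.0 - hit_rate));
      printf("Predicted runtime: %.2f s\n", t_program);  /* 1e9 x 3 ns = 3 s */
      return 0;
  }

Even at a 98% hit rate, the miss term (2 ns per instruction) is twice the CPU term, which is exactly the processor-memory gap argument made earlier.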
Simulation
• The most popular method in computer architecture and systems research
• Mimic the architecture/system using software
• Very flexible: nearly unlimited evaluation
• Prototyping of non-existing machines possible
• Evaluation of design options (design-space exploration) cheap & flexible
• Requires some sort of validation
• Can be VERY slow
Tradeoff between accuracy and computational intensity
• Low level of abstraction => slow (e.g. simulating at the level of gates)
• High level of abstraction => fast (e.g. simulating only processor, cache, and memory components)
• The tradeoff may be intensified when modeling parallel architectures, as multiple processors need to be simulated
Three Simulation Techniques
• Profile-based, static modeling
  • Simplest and least costly
  • Uses hardware counters on the chip or instrumented execution (e.g. Pgprof on a Beowulf Linux cluster, SGI perfex, Alpha ATOM)
• Trace-driven simulation
  • A more sophisticated technique (a sketch follows below)
  • How it works (example: modeling memory system performance):
    1. Collect traces generated by ATOM (trace format: instruction address executed, data address accessed)
    2. Build the memory hierarchy model
    3. Feed the trace into the simulation model and analyze the results
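A minimal sketch of the trace-driven idea: a direct-mapped cache model fed by a sequence of data addresses. The cache geometry and the inline trace are illustrative assumptions; a real run would read ATOM-style trace records from a file:

  #include <stdio.h>

  #define LINE_SIZE 32u   /* bytes per cache line (assumed) */
  #define NUM_LINES 256u  /* lines in a direct-mapped cache (assumed) */

  static unsigned long tags[NUM_LINES];
  static int valid[NUM_LINES];

  /* Simulate one access; return 1 on hit, 0 on miss */
  int cache_access(unsigned long addr) {
      unsigned long line = addr / LINE_SIZE;
      unsigned long idx  = line % NUM_LINES;
      unsigned long tag  = line / NUM_LINES;
      if (valid[idx] && tags[idx] == tag) return 1;
      valid[idx] = 1;   /* miss: fill the line */
      tags[idx]  = tag;
      return 0;
  }

  int main(void) {
      /* Stand-in for a trace file: a few hypothetical data addresses */
      unsigned long trace[] = {0x1000, 0x1004, 0x2000, 0x1008, 0x2004, 0x1000};
      unsigned long hits = 0, n = sizeof trace / sizeof trace[0];
      for (unsigned long i = 0; i < n; i++)
          hits += cache_access(trace[i]);
      printf("hit rate: %lu/%lu\n", hits, n);
      return 0;
  }

This is the whole pattern: the model sees only addresses, so it can report hit rates but, as the execution-driven slide below notes, not the timing interaction with the processor pipeline.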
Using Pgprof
1. Compile: pgcc -Mprof=func prg.cc
2. Run the code to produce a profile data file called pgprof.out
3. View the execution profile: pgprof pgprof.out
Using Perfex
• Usage: perfex [-e num] [-y] program [program args]
  • -e num: count only event type num
  • -y: generate a "cost report"
• Example: perfex -e 41 -e 13 -y a.out

  Event #  Event                      Events Counted
  41       Floating point OP retired  25292884493
  13       L2 cache lines loaded      223490870

  Statistics:
  MFLOPS                    29.175907
  Main memory L2 bandwidth  8.249655 MB/s
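As a sanity check on how such numbers relate, a minimal sketch, assuming MFLOPS is simply floating-point ops retired divided by elapsed time; the derived runtime is an inference from the two reported figures, not something perfex prints:

  #include <stdio.h>

  int main(void) {
      double fp_ops = 25292884493.0;  /* event 41: floating point OPs retired */
      double mflops = 29.175907;      /* rate reported by perfex */
      /* elapsed time implied by MFLOPS = fp_ops / (time x 1e6) */
      double seconds = fp_ops / (mflops * 1e6);
      printf("implied runtime: %.1f s\n", seconds);  /* ~866.9 s */
      return 0;
  }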
Execution-driven
• The most accurate and most costly technique
• Trace-driven simulation cannot capture the interaction between the memory system and the processor
• Detailed models of the memory system and the processor pipeline run simultaneously by actually executing the program on top of a simulation framework such as Simics, SimOS, or SimpleScalar
Measuring by Means of Benchmarks
• Micro-benchmarks (e.g. instruction latencies, file system throughput)
• Application benchmarks: general system behavior (e.g. Spec2000 or SPLASH2)
• Only limited evaluation possible (e.g. limited system support for measurement)
• The machine must be available
• Benchmark suites: collections of kernels and real programs, lessening the weakness of any one benchmark by the presence of the others
Summarize Results
• Weighted arithmetic mean execution time:

  WeightedMean = sum over i=1..n of (W_i x T_i)

  • Sum the products of weighting factors and execution times, reflecting the individual frequency of each workload
  • W_i = 1 / (Time_i x sum over j=1..n of (1/Time_j))
• Geometric mean execution time:

  GeometricMean = (product over i=1..n of T_i / N_i)^(1/n)

  • Normalize execution times to a reference machine (N_i) and take the average of the normalized execution times
  • Used by SPEC
• A sketch of both summaries follows below
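A minimal sketch computing both summaries; the three execution times and reference times are made-up values for illustration:

  #include <stdio.h>
  #include <math.h>

  int main(void) {
      /* Hypothetical times (s) on the measured and reference machines */
      double t[]   = {2.0, 4.0, 8.0};
      double ref[] = {1.0, 4.0, 16.0};
      int n = 3;

      /* W_i = 1 / (T_i x sum_j(1/T_j)): weights sum to 1 */
      double inv_sum = 0.0;
      for (int i = 0; i < n; i++) inv_sum += 1.0 / t[i];

      double wmean = 0.0, gmean = 1.0;
      for (int i = 0; i < n; i++) {
          double w = 1.0 / (t[i] * inv_sum);
          wmean += w * t[i];       /* weighted arithmetic mean */
          gmean *= t[i] / ref[i];  /* product of normalized times */
      }
      gmean = pow(gmean, 1.0 / n); /* geometric mean (SPEC-style) */

      printf("weighted mean = %.3f s, geometric mean ratio = %.3f\n",
             wmean, gmean);
      return 0;
  }

With these weights every program contributes equal total time, while the geometric mean is independent of which machine is chosen as the reference, which is why SPEC uses it.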