Herbert G. Mayer, PSU CS Status 6/29/2014. CS 201 Computer Systems Programming Chapter 3 “ Architecture Overview ”. Syllabus. Computing History Evolution of Microprocessor µP Performance Processor Performance Growth Key Architecture Messages Code Sequences for 3 Different Architectures
Herbert G. Mayer, PSU CS Status 6/29/2014 CS 201 Computer Systems Programming Chapter 3 “Architecture Overview”
Syllabus • Computing History • Evolution of Microprocessor µP Performance • Processor Performance Growth • Key Architecture Messages • Code Sequences for 3 Different Architectures • Dependencies, AKA Dependences • Score Board • References
Computing History Long, long before 1940s: 1643 Pascal’s Arithmetic Machine. About 1660 Leibnitz Four Function Calculator. 1710-1750 Punched Cards by Bouchon, Falcon, Jacquard. 1822 Babbage Difference Engine, unfinished; 1st programmer ever in the world was Ada, poet Lord Byron’s daughter, after whom the language Ada was named: Lady Ada Lovelace. 1835 Babbage Analytical Engine, also unfinished. 1890 Hollerith Tabulating Machine to help with the census in the USA.
Computing History Decade of the 1940s: 1939-1942 John Atanasoff built programmable, electronic computer at Iowa State University. 1936-1945 Konrad Zuse’s Z3 and Z4, early electro-mechanical computers based on relays; colleague advised use of “vacuum tubes”. 1946 John von Neumann’s computer design of stored program. 1946 Mauchly and Eckert built ENIAC, modeled after Atanasoff’s ideas, built at University of Pennsylvania: Electronic Numerical Integrator and Computer, 30 ton monster. 1980s John Atanasoff got acknowledgment and patent officially.
Computing History Decade of the 1950s Univac Uniprocessor based on ENIAC, commercially viable, developed by John Mauchly and John Presper Eckert Commercial systems sold by Remington Rand Mark III computer Decade of the 1960s IBM’s 360 family co-developed with GE, Siemens, et al. Transistor replaces vacuum tube Burroughs stack machines, compete with GPR architectures All still von Neumann architectures 1969 ARPANET Cache and VMM developed, first at Manchester University
Computing History Decade of the 1970s Birth of Microprocessor at Intel, see Gordon Moore High-end mainframes, e.g. CDC 6000s, IBM 360 + 370 series Architecture advances: Caches, virtual memories (VMM) ubiquitous, since real memories were expensive Intel 4004, Intel 8080, single-chip microprocessors Programmable controllers Mini-computers, PDP 11, HP 3000 16-bit computer Height of Digital Equipment Corp. (DEC) Birth of personal computers, which DEC misses!
Computing History Decade of the 1980s: Decrease of mini-computer use. 32-bit computing even on minis. Architecture advances: superscalar, faster caches, larger caches. Multitude of supercomputer manufacturers. Compiler complexity: trace-scheduling, VLIW. Workstations common: Apollo, HP, DEC’s Ken Olsen trying to catch up, Intergraph, Ardent, Sun, Three Rivers, Silicon Graphics, etc.
Computing History Decade of the 1990s: Architecture advances: superscalar & pipelined, speculative execution, out-of-order (OoO) execution. Powerful desktops. End of mini-computer and of many super-computer manufacturers. Microprocessors as powerful as early supercomputers. Consolidation of many computer companies into a few larger ones. End of USSR marked the demise of several supercomputer companies.
Processor Performance Growth Moore’s Law, from Webopedia 8/27/2004: “The observation made in 1965 by Gordon Moore, co-founder of Intel, that the number of transistors per square inch on integrated circuits had doubled every year since it was invented. Moore predicted that this trend would continue for the foreseeable future. In subsequent years, the pace slowed down a bit, but data density doubled approximately every 18 months, and this is the current definition of Moore's Law, which Moore himself has blessed. Most experts, including Moore himself, expect Moore's Law to hold for another two decades.” Others coin a more general law, a bit lamely stating that “the circuit density increases predictably over time.”
Processor Performance Growth As of 2014, Moore’s Law has held true since ~1968. Some Intel fellows believe that an end to Moore’s Law will be reached ~2018 due to physical limitations in the process of manufacturing transistors from semi-conductor material. Such phenomenal growth is unknown in any other industry. For example, if doubling of performance could be achieved every 18 months, then by 2001 other industries would have achieved the following: Cars would travel at 2,400,000 mph, and get 600,000 mpg. Air travel from LA to NYC would be at Mach 36,000 and take 0.5 seconds.
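The doubling arithmetic behind such comparisons is easy to check. The following is a minimal sketch, assuming a fixed 18-month doubling period; the function name growth_factor and the sample time spans are illustrative and not taken from the slides.

```python
def growth_factor(years: float, doubling_period_years: float = 1.5) -> float:
    """Multiplicative growth after `years` when capacity doubles every period."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Illustrative time spans only; 15 years means 10 doublings, i.e. a factor of 1024.
    for years in (3, 15, 30):
        print(f"{years:2d} years -> factor {growth_factor(years):,.0f}")
```

Ten doublings already give three orders of magnitude, which is why the comparisons with cars and air travel look so absurd.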
Message 1: Memory is Slow The inner core of the processor, the CPU or the µP, is getting faster at a steady rate. Access to memory is also getting faster over time, but at a slower rate. This rate differential has existed for quite some time, with the strange effect that fast processors have to rely on progressively slower memories, relatively speaking. On an MP server it is not uncommon that a processor has to wait >100 cycles before a memory access completes; that is one single memory access. On a Multi-Processor the bus protocol is more complex due to snooping, backing-off, arbitration, thus the number of cycles to complete a memory access can grow high. IO simply compounds the problem of slow memory access.
Message 1: Memory is Slow Discarding conventional memory altogether, relying only on cache-like memories, is NOT an option for 64-bit architectures, due to the price/size/cost/power if you pursue full memory population with 2^64 bytes. Another way of seeing this: Using solely reasonably-priced cache memories (say at < 10 times the cost of regular memory) is not feasible: the resulting physical address space would be too small, or the price too high. Significant intellectual effort in computer architecture focuses on reducing the performance impact of fast processors accessing slow, virtualized memories. All else, except IO, seems easy compared to this fundamental problem! IO is even slower by further orders of magnitude.
Message 1: Memory is Slow [Figure: processor-memory performance gap over time, 1980 to 2002, performance on a log scale from 1 to 1000. CPU (µProc, “Moore’s Law”) improves 60%/yr., DRAM improves 7%/yr.; the processor-memory performance gap grows 50% / year. Source: David Patterson, UC Berkeley]
Message 2: Events Tend to Cluster A strange thing happens during program execution: Seemingly unrelated events tend to cluster. Memory accesses tend to concentrate a majority of their referenced addresses onto a small domain of the total address space. Even if all of memory is accessed, during some periods of time such clustering is observed. Intuitively, one memory access seems independent of another, but they both happen to fall onto the same page (or working set of pages). We call this phenomenon Locality! Architects exploit locality to speed up memory access via Caches and increase the address range beyond physical memory via Virtual Memory Management. Distinguish spatial from temporal locality.
Message 2: Events Tend to Cluster Similarly, hash functions tend to concentrate a disproportionately large number of keys onto a small number of table entries. An incoming search key (say, a C++ program identifier) is mapped into an index, but the next, completely unrelated key, happens to map onto the same index. In an extreme case, this may render a hash lookup slower than a sequential, linear search. The programmer must watch out for the phenomenon of clustering, as it is undesired in hashing!
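Clustering in a hash table can be made visible with a small experiment. This is a minimal sketch under stated assumptions: the two hash functions, the key list, and the table size of 16 are all hypothetical and chosen only for illustration. A hash that looks at just the first character sends every identifier starting with the same letter to one table entry, so the lookup degenerates into a linear search of that chain.

```python
from collections import Counter


def first_char_hash(key: str, size: int) -> int:
    """Deliberately poor hash: only the first character matters."""
    return ord(key[0]) % size


def poly_hash(key: str, size: int) -> int:
    """Simple polynomial hash over all characters (illustrative only)."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % size
    return h


def longest_chain(keys, hash_fn, size):
    """Length of the fullest table entry, i.e. the worst-case search chain."""
    counts = Counter(hash_fn(k, size) for k in keys)
    return max(counts.values())


# Hypothetical identifiers, all starting with the same letter.
KEYS = ["temp1", "temp2", "temp3", "total", "tmp", "tail", "top", "tcc"]

if __name__ == "__main__":
    print("first-char hash, longest chain:", longest_chain(KEYS, first_char_hash, 16))
    print("polynomial hash, longest chain:", longest_chain(KEYS, poly_hash, 16))
```

With the first-character hash all eight keys land in a single entry; the polynomial hash still produces a collision, but the chains stay short.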
Message 2: Events Tend to Cluster Clustering happens in all diverse modules of the processor architecture. For example, when a data cache is used to speed up memory accesses by having a copy of frequently used data in a faster memory unit, it happens that a small cache suffices to speed up execution, due to Data Locality (spatial and temporal). Data that have been accessed recently will again be accessed in the near future, or at least data that live close by will be accessed in the near future. Thus they happen to reside in the same cache line. Architects do exploit this to speed up execution, while keeping the incremental cost for HW contained. Here clustering is a valuable phenomenon.
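The effect of locality on a small cache can be sketched with a toy simulation. The model below is an assumption made for illustration, not a description of any real processor: a direct-mapped cache with 8 lines of 16 bytes, and three hypothetical address traces.

```python
def hit_rate(addresses, num_lines=8, line_size=16):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    tags = [None] * num_lines          # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // line_size      # which memory block the address belongs to
        index = block % num_lines      # the only cache line that block may occupy
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fetch the block, evict the old one
    return hits / len(addresses)


# Spatial locality: consecutive 4-byte words, four of them per 16-byte line.
sequential = list(range(0, 1024, 4))
# Temporal locality: the same four words touched over and over.
repeated = [0, 4, 8, 12] * 64
# No locality: a 4096-byte stride, so all accesses fight over one cache line.
scattered = [(i * 4096) % 65536 for i in range(256)]

if __name__ == "__main__":
    for name, trace in (("sequential", sequential),
                        ("repeated", repeated),
                        ("scattered", scattered)):
        print(f"{name:10s} hit rate = {hit_rate(trace):.3f}")
```

Even this tiny 128-byte cache serves three out of four sequential accesses, and nearly all repeated ones, while the strided trace gets no benefit at all.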
Message 3: Heat is Bad Clocking a processor fast (e.g. > 3-5 GHz) can increase performance and thus generally “is good” Other performance parameters, such as memory access speed, peripheral access, etc. do not scale with the clock speed. Still, increasing the clock to a higher rate is desirable Comes at the cost of higher current, thus more heat generated in the identical physical geometry (the real-estate) of the silicon processor or also the chipset But the silicon part acts like a heat-conductor, conducting better, as it gets warmer (negative temperature coefficient resistor, or NTC). Since the power-supply is a constant-current source, a lower resistance causes lower voltage, shown as VDroop in the figure below
Message 3: Heat is Bad This in turn means voltage must be increased artificially to sustain the clock rate, creating more heat, ultimately leading to self-destruction of the part. Great efforts are being made to increase the clock speed, requiring more voltage, while at the same time reducing heat generation. Current technologies include sleep-states of the silicon part (processor as well as chip-set), and Turbo Boost mode, to contain heat generation while boosting clock speed just at the right time. It is good that to date silicon manufacturing technologies allow the shrinking of transistors and thus of whole dies. Else CPUs would become larger, more expensive, and above all: hotter.
Message 4: Resource Replication Architects cannot increase clock speed beyond physical limitations One cannot decrease the die size beyond evolving technology Yet speed improvements are desired, and must be achieved This conflict can partly be overcome with replicated resources! But careful! Why careful? Resources could be used for better purpose!
Message 4: Resource Replication Key obstacle to parallel execution is data dependence in the SW under execution. A datum cannot be used, before it has been computed Compiler optimization technology calls this use-def dependence (short for use-before-definition), AKA true dependence, AKA data dependence Goal is to search for program portions that are independent of one another. This can be at multiple levels of focus
Message 4: Resource Replication At the very low level of registers, at the machine level: done by HW; see also score board. At the low level of individual machine instructions: done by HW; see also superscalar architecture. At the medium level of subexpressions in a program: done by compiler; see CSE. At the higher level of several statements written in sequence in a high-level language program: done by optimizing compiler or by programmer. Or at the very high level of different applications, running on the same computer, but with independent data, separate computations, and independent results: done by the user running concurrent programs.
Message 4: Resource Replication Whenever program portions are independent of one another, they can be computed at the same time: in parallel; but will they? Architects provide resources for this parallelism Compilers need to uncover opportunities for parallelism If two actions are independent of one another, they can be computed simultaneously Provided that HW resources exist, that the absence of dependence has been proven, that independent execution paths are scheduled on these replicated HW resources
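How much replicated hardware helps depends on the dependences in the code. The following is a minimal sketch, assuming one-cycle operations, a greedy in-order scheduler, and registers that are each written only once (so only true dependences exist); the function schedule_cycles and both instruction lists are hypothetical.

```python
def schedule_cycles(instrs, units):
    """Return the cycles needed to run `instrs` on `units` identical ALUs.

    Each instruction is (dest, sources) and takes one cycle. An instruction
    starts as soon as its sources are available and a unit is free.
    """
    ready_at = {}   # register -> first cycle in which its value can be read
    busy = {}       # cycle -> number of units already occupied in that cycle
    finish = 0
    for dest, sources in instrs:
        cycle = max((ready_at.get(s, 0) for s in sources), default=0)
        while busy.get(cycle, 0) >= units:
            cycle += 1                      # wait for a free unit
        busy[cycle] = busy.get(cycle, 0) + 1
        ready_at[dest] = cycle + 1
        finish = max(finish, cycle + 1)
    return finish


# Four independent additions versus a chain in which each needs the previous result.
independent = [("t1", ("a", "b")), ("t2", ("c", "d")),
               ("t3", ("e", "f")), ("t4", ("g", "h"))]
chain = [("t1", ("a", "b")), ("t2", ("t1", "c")),
         ("t3", ("t2", "d")), ("t4", ("t3", "e"))]

if __name__ == "__main__":
    for units in (1, 2, 4):
        print(units, "unit(s):",
              "independent =", schedule_cycles(independent, units),
              "chain =", schedule_cycles(chain, units))
```

The independent code speeds up with every added unit, while the dependent chain takes four cycles no matter how many units are provided; the extra resources sit idle.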
The 3 Different Architectures Single Accumulator Architecture Has one implicit register for all/any operations: accumulator Operations frequently require intermediate temps! Code relies heavily on load-store to-from temps Three-Address GPR Architecture Allows complex operations with multiple operands all in one instruction Hence complex opcode bits Stack Machine Architecture Operands are implied on the stack, except load/store Hence all operations are simple, few bits, but all are memory accesses
Code 1 for 3 Different Architectures Example 1: Object Code Sequence Without Optimization. Strict left-to-right translation, no smarts in mapping. Consider non-commutative subtraction and division operators. We’ll use no common subexpression elimination (CSE), and no register reuse. Conventional operator precedence. For Single Accumulator SAA, Three-Address GPR, Stack Architectures. Sample source: d ← ( a + 3 ) * b - ( a + 3 ) / c
Code 1 for 3 Different Architectures Three-address code looks shortest, w.r.t. number of instructions Maybe optical illusion, must also consider number of bits for instructions Must consider number of I-fetches, operand fetches, total number of stores Numerous memory accesses on SAA (Single Accumulator Architecture) due to temporary values held in memory Most memory accesses on SA (Stack Architecture), since everything requires a memory access Three-Address architecture immune to commutativity constraint, since operands may be placed in registers in either order No need for reverse-operation opcodes for Three-Address architecture Decide in Three-Address architecture how to encode operand types
Code 2 for Different Architectures This time we eliminate the common subexpression (CSE). Compiler handles left-to-right order for non-commutative operators on SAA. Better: d ← ( a + 3 ) * b - ( a + 3 ) / c
Code 2 for Different Architectures Single Accumulator Architecture (SAA) optimized still needs temporary storage; uses temp1 for common subexpression; has no other register for temps!! SAA could use negate instruction or reverse subtract. Register-use optimized for Three-Address architecture. Common subexpression optimized on Stack Machine by duplicating (dup) and exchanging (xch). Reduction: 20% for Three-Address, 18% for SAA, only 8% for Stack Machine.
Code 3 for Different Architectures Analyze 2 similar expressions but with increasing operator precedence left-to-right; in the 2nd case precedences are overridden by ( ). One operator sequence associates right-to-left, due to arithmetic precedence. Compiler uses commutativity. The other left-to-right, due to explicit parentheses ( ). Use simple-minded code generation model: no cache, no optimization. Will there be advantages/disadvantages caused by the architecture? Expression 1 is: e ← a + b * c ^ d
Code 3 for Different Architectures • Expression 1 is: e ← a + b * c ^ d
Code 3 for Different Architectures • Expression 2 is: f ← ( ( g + h ) * i ) ^ j
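For the stack machine, the two expressions can be compared with a simple-minded code generator. This is a sketch under stated assumptions, not the code shown on the original slides: expressions are hand-built trees, the mnemonics push, pop, add, mult, and expo are assumed, and code is emitted strictly left-to-right without any optimization.

```python
# Assumed mnemonics for the three binary operators.
OPCODES = {"+": "add", "*": "mult", "^": "expo"}


def emit_stack_code(expr):
    """Emit postfix stack code; expr is a name or a tuple (op, left, right)."""
    if isinstance(expr, str):
        return [f"push {expr}"]
    op, left, right = expr
    return emit_stack_code(left) + emit_stack_code(right) + [OPCODES[op]]


def max_stack_depth(code):
    """Deepest stack reached: a push adds one word, a binary op removes one."""
    depth = deepest = 0
    for instr in code:
        depth += 1 if instr.startswith("push") else -1
        deepest = max(deepest, depth)
    return deepest


# e <- a + b * c ^ d: precedence makes the operators nest to the right.
expr1 = ("+", "a", ("*", "b", ("^", "c", "d")))
# f <- ( ( g + h ) * i ) ^ j: parentheses make the operators nest to the left.
expr2 = ("^", ("*", ("+", "g", "h"), "i"), "j")

if __name__ == "__main__":
    for target, expr in (("e", expr1), ("f", expr2)):
        code = emit_stack_code(expr)
        print("; ".join(code + [f"pop {target}"]),
              "| max depth", max_stack_depth(code))
```

Both sequences have the same number of instructions, but expression 1 piles four operands onto the stack before the first operator runs, while expression 2 never needs more than two words. That difference matters as soon as only a few top-of-stack words are held in fast registers.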
Code For Stack Architecture Stack Machine with no register is inherently slow, due to: Memory Accesses!!! Implement a few top-of-stack elements via HW shadow registers: a Cache. Let us then measure equivalent code sequences with and without consideration for cache. Top-of-stack register tos identifies the last valid word on physical stack. Two shadow registers may hold 0, 1, or 2 true top words. Top of stack cache counter tcc specifies number of shadow registers actually used. Thus tos plus tcc jointly specify true top of stack.
Code For Stack Architecture Timings for push, pushlit, add, pop operations depend on tcc Operations in shadow registers fastest, typically 1 cycle, include register access and the operation itself Generally, memory access adds 2 cycles For stack changes use some defined policy, e.g. keep tcc 50% full Table below refines timings for stack with shadow registers Note: push x into cache with free space requires 2 cycles, which are for the memory fetch: cache adjustment is done at the same time as memory fetch
Code For Stack Architecture Code emission for: a + b * c ^ ( d + e * f ^ g ) Let + and * be commutative, by language rule Architecture here has 2 shadow registers, compiler exploits this Assume initially empty 2-word cache
Code For Stack Architecture
 #   1 Left-to-Right   cycles 1   2 Exploit Cache         cycles 2
 1   push a            2          push f                  2
 2   push b            2          push g                  2
 3   push c            4          expo                    1
 4   push d            4          push e                  2
 5   push e            4          mult                    1
 6   push f            4          push d                  2
 7   push g            4          add                     1
 8   expo              1          push c                  2
 9   mult              3          r_expo = swap + expo    1
10   add               3          push b                  2
11   expo              3          mult                    1
12   mult              3          push a                  2
13   add               3          add                     1
Code For Stack Architecture Blind code emission costs 40 cycles; i.e. not taking advantage of tcc knowledge costs performance. Code emission with shadow register consideration costs 20 cycles. True penalty for memory access is worse in practice. Tremendous speed-up is always possible when fixing a system with severe flaws. Return on investment for 2 registers is twice the original performance. Such strong speedup is an indicator that the starting architecture was poor. Stack Machine can be fast, if purity of top-of-stack access is sacrificed for performance. Note that indexing, looping, indirection, call/return are not addressed here.
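The two totals can be re-derived by adding up the per-instruction cycle counts listed for the two code sequences. The sketch below only sums those numbers; the variable names are illustrative, and the garbled table entry is read here as r_expo (swap followed by expo), which is an assumption.

```python
# (instruction, cycles) pairs for strict left-to-right code emission.
left_to_right = [
    ("push a", 2), ("push b", 2), ("push c", 4), ("push d", 4),
    ("push e", 4), ("push f", 4), ("push g", 4), ("expo", 1),
    ("mult", 3), ("add", 3), ("expo", 3), ("mult", 3), ("add", 3),
]

# (instruction, cycles) pairs for emission that exploits the 2-word cache.
exploit_cache = [
    ("push f", 2), ("push g", 2), ("expo", 1), ("push e", 2),
    ("mult", 1), ("push d", 2), ("add", 1), ("push c", 2),
    ("r_expo", 1), ("push b", 2), ("mult", 1), ("push a", 2), ("add", 1),
]


def total_cycles(code):
    """Sum the cycle counts of a code sequence."""
    return sum(cycles for _, cycles in code)


if __name__ == "__main__":
    blind = total_cycles(left_to_right)
    aware = total_cycles(exploit_cache)
    print("left-to-right:", blind, "cycles")
    print("exploit cache:", aware, "cycles")
    print("speedup:", blind / aware)
```

Both sequences have 13 instructions; the whole gain comes from keeping operands in the shadow registers instead of spilling them to memory.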
Register Dependencies Inter-instruction dependencies, in CS parlance also known as dependences, arise between registers being defined and used. One instruction computes a result into a register (or memory), another instruction needs that result from that same register (or that memory location). Or, one instruction uses a datum; and after its use the same item is then recomputed. Dependences require sequential execution, otherwise the result is unpredictable.
Register Dependencies
True-Dependence, AKA Data Dependence (synonymous!); Read after Write, RAW:
    r3 ← r1 op r2
    r5 ← r3 op r4
Anti-Dependence, not a true dependence, parallelize under right condition; Write after Read, WAR:
    r3 ← r1 op r2
    r1 ← r5 op r4
Output Dependence; Write after Write, WAW, use in between:
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7
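The three kinds of register dependence can be detected mechanically from the destination and source registers of two instructions. This is a minimal sketch, assuming each instruction is written as a pair (dest, sources) and that the first instruction precedes the second in program order; the function name classify is hypothetical, and redefinitions by instructions in between are ignored.

```python
def classify(first, second):
    """Return the dependences of `second` on the earlier instruction `first`.

    Each instruction is (dest, (src1, src2)).
    """
    dest1, sources1 = first
    dest2, sources2 = second
    kinds = []
    if dest1 in sources2:
        kinds.append("true (RAW)")      # second reads what first wrote
    if dest2 in sources1:
        kinds.append("anti (WAR)")      # second overwrites what first read
    if dest1 == dest2:
        kinds.append("output (WAW)")    # both write the same register
    return kinds


if __name__ == "__main__":
    # r3 <- r1 op r2 ; r5 <- r3 op r4
    print(classify(("r3", ("r1", "r2")), ("r5", ("r3", "r4"))))
    # r3 <- r1 op r2 ; r1 <- r5 op r4
    print(classify(("r3", ("r1", "r2")), ("r1", ("r5", "r4"))))
    # r3 <- r1 op r2 ; r3 <- r6 op r7
    print(classify(("r3", ("r1", "r2")), ("r3", ("r6", "r7"))))
```

An empty result means the pair is independent and, given enough hardware resources, may execute in parallel.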
Register Dependencies
Control Dependence:
    if ( condition1 ) {
        r3 = r1 op r2;
    }else{              // see the jump here?
        r5 = r3 op r4;
    } // end if
    write( r3 );
Register Renaming Only data dependence is a real dependence, hence called true dependence. Other dependences are artifacts of insufficient resources, generally insufficient registers. This means: if additional registers were available, then replacing some of these conflicting registers with new ones could make the conflict disappear! Anti- and Output-Dependences are indeed such false dependences.
Register Renaming
Original Code:
    L1: r1 ← r2 op r3
    L2: r4 ← r1 op r5
    L3: r1 ← r3 op r6
    L4: r3 ← r1 op r7
Dependences before: Lx, Ly: which dependence?
Register Renaming
Original Code:            New Code, after adding regs:
    L1: r1 ← r2 op r3         r10 ← r2 op r30     -- r30 instead
    L2: r4 ← r1 op r5         r4 ← r10 op r5      -- r10 instead
    L3: r1 ← r3 op r6         r1 ← r30 op r6
    L4: r3 ← r1 op r7         r3 ← r1 op r7
Dependences before:               Dependences after:
    L1, L2 true-Dep with r1           L1, L2 true-Dep with r10
    L1, L3 output-Dep with r1         L3, L4 true-Dep with r1
    L1, L4 anti-Dep with r3
    L3, L4 true-Dep with r1
    L2, L3 anti-Dep with r1
    L3, L4 anti-Dep with r3
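Renaming can also be done mechanically. The sketch below is a swapped-in, more aggressive scheme than the hand renaming shown above: it gives every destination a fresh register, assuming an unlimited supply of hypothetical registers p0, p1, ... whose names do not clash with existing ones. The helper false_dependences is likewise an illustrative assumption; it lists the anti- and output-dependences between all instruction pairs.

```python
from itertools import count


def rename(instrs, prefix="p"):
    """Give every destination a fresh register; sources follow the latest name."""
    fresh = count()
    current = {}                       # original register -> its newest name
    renamed = []
    for dest, sources in instrs:
        new_sources = tuple(current.get(s, s) for s in sources)
        new_dest = f"{prefix}{next(fresh)}"
        current[dest] = new_dest
        renamed.append((new_dest, new_sources))
    return renamed


def false_dependences(instrs):
    """List (earlier, later, kind, register) for all anti- and output-dependences."""
    found = []
    for i, (dest1, sources1) in enumerate(instrs):
        for j in range(i + 1, len(instrs)):
            dest2, _ = instrs[j]
            if dest2 in sources1:
                found.append((i + 1, j + 1, "anti", dest2))
            if dest1 == dest2:
                found.append((i + 1, j + 1, "output", dest1))
    return found


# The four instructions of the original code, as (dest, sources).
ORIGINAL = [("r1", ("r2", "r3")), ("r4", ("r1", "r5")),
            ("r1", ("r3", "r6")), ("r3", ("r1", "r7"))]

if __name__ == "__main__":
    print("before:", false_dependences(ORIGINAL))
    for dest, sources in rename(ORIGINAL):
        print(f"{dest} <- {sources[0]} op {sources[1]}")
    print("after: ", false_dependences(rename(ORIGINAL)))
```

On the original code the helper reports the same four false dependences as the list above; after renaming none remain, and only the two true dependences (instruction 1 to 2, and 3 to 4) still force an ordering.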