Computer Organization and Architecture (3 Credits/SKS)

Prof. Dr. Bagio Budiardjo, Even Semester (Semester Genap) 2010/2011

About the Course
Course Objectives: After completing this course, students are expected to understand and be able to analyze computer architecture, in particular instruction-set design (e.g. addressing modes) and its influence on performance. Students are also expected to understand the meaning of computer organization, that is, the interconnection of the sub-systems of a computing system: CPU, memory, bus and I/O. Finally, students are expected to understand a more advanced technique in processor design: pipelining.
Key words: architecture, instruction-set design, computer organization, performance, processor design, and pipelining techniques

About the grading scheme
• The scheme is not rigid; whenever possible the grade will combine homework, quizzes, exercises, the mid-test and the final test.
• One possible scheme: Homework 15% (4), Mid-test 40%, Final test 45%.
• Grading the homework: maximum 5 points each, on three levels: Good (5), OK (3) and Bad (2).

The books and supporting materials
• William Stallings's book Computer Organization and Architecture, Seventh Edition, Prentice Hall, 2006, will be used as the main reference for this lecture. A new edition was issued in 2010 but is not yet available in Jakarta.
• The classic book Logic and Computer Design Fundamentals, by M. Morris Mano and Charles R. Kime (Pearson Asia, 2004), is good but puts too much stress on digital logic. We use material from this book to explain the hardware design of computer components, whenever possible.
• Chapters covered: 1, 2, 3, 4, 5, 10, 11 and 13 (Stallings). Additional material on pipelining is taken from another book.

Books and supporting materials - continued
• There will be no handouts (unless very important).
• Lecture notes are distributed on memory stick/CD; the SAP can be downloaded from SIAK-NG.
• Students are encouraged to read books/papers in this field of study.
Schedule of class
• At the scheduled time and place (K-102), for about 120 minutes
• Lectures will be given mainly using an LCD projector

About the “course direction”
Why do we study Computer Architecture?
History: A course under this name was taught in many universities long before microprocessors existed. Years ago, people studied mainframe architectures: IBM S/370, CDC Cyber, CRAY, Amdahl, etc. Since microprocessors emerged, the course has changed slightly to cover more advanced topics: computer design and performance issues.

About the “course direction”
[Figure: map of related courses, with these labels]
• Computer Organization & Architecture (OAK): analyzing processor design, with emphasis on how to obtain better processing speed (cost effectiveness)
• Processors Architecture & Design: analyzing and implementing computer systems to achieve the best processing speed (cost effectiveness)
• Micro & Embedded: Microprocessors (application of µproc) and Embedded Systems (embedding µproc-based intelligence into a new system/device)
• Parallel & Distributed Computing Systems: organizing processors/computing systems to obtain better speed-up with a different processing paradigm

About the “course direction” - continued
This course is aimed at:
1. Explaining the phenomena of computer architecture and computer design: knowing the basic instruction cycle and its implication for processing speed
2. Studying the “key” problems:
   a. CPU-memory bottleneck
   b. CPU-I/O device problems
3. Studying how “performance” could be improved. Example: CPU-memory: cache memory
4. How could we improve execution speed with other techniques? Example: pipelining

Reasons for studying Computer Architecture (Stallings's arguments)
• Able to select a “proper” computer system for a particular environment (cost and effectiveness)
• Able to analyze a processor “embedded” in an environment, for example the use of a processor in an automobile, and able to use the proper tools for that analysis
• Able to choose proper software for a particular computer system

View of a Computer System

CPU: Central Processing Unit - Processor Organization: Another view
[Figure: processor organization. Components: Control Unit, MMU (Memory Management Unit), IR, PC, MAR, MBR, cache memory, registers R1, R2, R3, ALU1, ALU2, ADDER, ALU3, FPU (Floating Point Unit), all connected by a BUS, with a path to/from memory.]
Issues: clock speed, gating signal

Implementation in CHIP

Frequently Asked Question
What is the role of the CPU clock? What is the difference between a P IV/2.4 G and a P IV/3.0 G (CPU clock speeds of 2.4 GHz and 3.0 GHz)?
Consider a CPU instruction: AR R1, R2 (add register: add the content of R1 and the content of R2, and place the result in R1)

Execution steps of AR R1,R2
The “possible” micro-execution steps are:
a. ALU1 ← [R1] {content of R1 is moved to ALU1}
b. ALU2 ← [R2] {content of R2 is moved to ALU2}
c. ADD {ALU3 = content of ALU1 + content of ALU2}
d. R1 ← [ALU3] {result of the addition is moved to R1}
If each micro-step is executed in “one” clock cycle, then this AR instruction needs 4 clock cycles. For the time being, we ignore the fetch cycle.
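The four micro-steps above can be sketched as a tiny simulation. This is only an illustration of the counting argument: the dictionary-based register model, the function name `run_ar_single_bus` and the sample register values are assumptions, not part of the lecture material.

```python
def run_ar_single_bus(regs):
    """Execute AR R1,R2 on a single bus: one micro-step per clock cycle."""
    micro_steps = [
        ("ALU1 <- [R1]", lambda r: r.update(ALU1=r["R1"])),
        ("ALU2 <- [R2]", lambda r: r.update(ALU2=r["R2"])),
        ("ADD",          lambda r: r.update(ALU3=r["ALU1"] + r["ALU2"])),
        ("R1 <- [ALU3]", lambda r: r.update(R1=r["ALU3"])),
    ]
    clocks = 0
    for _name, step in micro_steps:
        step(regs)      # only one transfer/operation can happen per clock
        clocks += 1
    return clocks

# Sample values (made up for illustration).
regs = {"R1": 7, "R2": 5, "ALU1": 0, "ALU2": 0, "ALU3": 0}
clocks = run_ar_single_bus(regs)
print(clocks, regs["R1"])
```

With these sample values the instruction takes 4 clocks and leaves the sum in R1, while R2 is unchanged.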

Question: How do we fetch the instruction (from memory)?
• The procedure that brings an instruction from memory to the CPU (IR) is called the instruction fetch
• PC always holds the address of the (next) instruction in memory
• PC transfers the address to MAR, and memory is READ
• PC is usually incremented by 1 (to point to the next instruction)
• The instruction is placed by memory in MBR
• The content of MBR is transferred to IR (the instruction is fetched, ready to be executed)

Question: How do we fetch the instruction (from memory)? - continued
• Or, in register transfer language, we could express the fetch cycle as:
1. MAR ← [PC]
2. READ (memory) and wait for completion
3. IR ← [MBR]
In terms of the CPU clock, these steps may take up to 50 CPU clocks, depending on the memory clock speed.
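The fetch cycle above can be sketched in a few lines. This is a minimal model under stated assumptions: memory is a Python dictionary, the instruction is stored as a string, and the addresses 100 and 101 are invented for the example. Memory latency (the "wait for completion" part) is not modeled.

```python
def fetch(cpu, memory):
    """One instruction fetch, following the register-transfer steps."""
    cpu["MAR"] = cpu["PC"]             # 1. MAR <- [PC]
    cpu["MBR"] = memory[cpu["MAR"]]    # 2. READ: memory places the word in MBR
    cpu["PC"] += 1                     #    PC now points to the next instruction
    cpu["IR"] = cpu["MBR"]             # 3. IR <- [MBR]
    return cpu["IR"]

# Hypothetical memory contents and start address.
memory = {100: "AR R1,R2", 101: "MOV R3,R1"}
cpu = {"PC": 100, "MAR": 0, "MBR": None, "IR": None}

instruction = fetch(cpu, memory)
print(instruction, cpu["PC"])
```

After one call, IR holds the instruction from address 100 and PC has advanced to 101.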

Processor Organization - continued.1
[Figure: single-bus processor (Control Unit, IR, PC, MAR, MBR, R1, R2, R3, ALU1, ALU2, ADDER, ALU3, path to/from memory, BUS), highlighting micro-step ALU1 ← [R1]. Legend: path/unit not active.]

Processor Organization - continued.2
[Figure: same single-bus processor, highlighting micro-step ALU2 ← [R2]. Legend: path/component not active.]

Processor Organization - continued.3
[Figure: same single-bus processor, highlighting micro-step ADD. Legend: path/component not active.]

Processor Organization - continued.4
[Figure: same single-bus processor, highlighting micro-step R1 ← [ALU3]. Legend: path/component not active.]

Analysis of Instruction Cycle
• With a single bus it is slow, since only one transfer can be executed in each “clock”
• Is there any other way to “improve” the speed?
• A dual-bus processor may be faster
• But it adds processor cost

Dual processor-bus: A way to improve speed
[Figure: dual-bus processor. Buses 1 and 2 connect R1, R2, R3, ALU1, ALU2, ADDER, ALU3 and the other components (Control Unit, IR, PC, MAR, MBR).]
1. ALU1 ← [R1] (bus1), ALU2 ← [R2] (bus2)
2. ADD
3. R1 ← [ALU3] (bus1)
Only 3 clock cycles needed instead of 4: 25% fewer cycles.
How about this:
1. ALU1 ← [R1] (bus1), ALU2 ← [R2] (bus2), ADD
2. R1 ← [ALU3] (bus1)
Only 2 clock cycles needed instead of 4: 50% fewer cycles.
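The 25% and 50% figures above compare cycle counts against the 4-cycle single-bus case. A short calculation makes the distinction between "fraction of cycles saved" and "speed-up factor" explicit; the function names are chosen here for illustration only.

```python
def cycle_reduction(baseline_cycles, new_cycles):
    """Fraction of clock cycles saved relative to the baseline."""
    return (baseline_cycles - new_cycles) / baseline_cycles

def speedup(baseline_cycles, new_cycles):
    """How many times faster the new design is at the same clock rate."""
    return baseline_cycles / new_cycles

SINGLE_BUS_CYCLES = 4   # from the AR R1,R2 analysis with one bus

for dual_bus_cycles in (3, 2):
    print(dual_bus_cycles,
          cycle_reduction(SINGLE_BUS_CYCLES, dual_bus_cycles),
          round(speedup(SINGLE_BUS_CYCLES, dual_bus_cycles), 2))
```

So the 3-cycle version saves 25% of the cycles (a speed-up of about 1.33), and the 2-cycle version saves 50% (a speed-up of 2).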

Triple processor-bus: Can the processing speed be improved?
[Figure: triple-bus processor. Buses 1, 2 and 3 connect R1, R2, R3, ALU1, ALU2, ADDER, ALU3 and the other components (Control Unit, IR, PC, MAR, MBR). Please notice the direction of the arrows.]
If all the CPU components (registers, ALUs and adder) can do their work (transferring bits, adding numbers) in one third (1/3) of a clock cycle, how many clock cycle(s) are needed to complete an addition operation (ADD R1,R2)? Write down the micro-instruction steps in “register transfer” language!

Program Execution
• A scientific program written in assembly language is run on a microprocessor with a 1 GHz clock. To complete the program, it needs to execute:
a. 150,000 arithmetic instructions (e.g. ADD R1,R2; MUL R1,R3; etc.)
b. 250,000 register transfer instructions (e.g. MOV R1,R2; etc.)
c. 100,000 memory access instructions (e.g. LOAD R1,X; STORE R2,Y; etc.)
If an average arithmetic instruction needs 2 clocks to complete, an average register transfer instruction needs 1 clock and an average memory access instruction needs 10 clocks, calculate the average CPI (clocks per instruction) of the program. How long does the program take to complete (in seconds)?
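One way to set up the calculation for the exercise above is sketched here. It uses the usual weighted-average definition of CPI (total clocks divided by total instructions) and assumes no overlap between instructions; the variable names are illustrative.

```python
CLOCK_HZ = 1_000_000_000   # 1 GHz clock, from the problem statement

# (instruction count, average clocks per instruction) for each class
instruction_mix = {
    "arithmetic":        (150_000, 2),
    "register transfer": (250_000, 1),
    "memory access":     (100_000, 10),
}

total_instructions = sum(count for count, _ in instruction_mix.values())
total_clocks = sum(count * cpi for count, cpi in instruction_mix.values())

average_cpi = total_clocks / total_instructions
execution_time_s = total_clocks / CLOCK_HZ

print(total_instructions, total_clocks, average_cpi, execution_time_s)
```

Under these assumptions the program executes 500,000 instructions in 1,550,000 clocks, giving an average CPI of 3.1 and an execution time of 1.55 ms.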

Can it be “one clock”? Yes, it can!
Views of Other Books on “Micro Operations”
• The bus is called the “data path”
• It consists not only of the bus (a bunch of wires) but also of other digital devices
• Enable signals are asserted to speed up execution
• Additional (processor) cost

Datapath Example: Taken from Morris Mano's book
• Four parallel-load registers
• Two mux-based register selectors
• Register destination decoder
• Mux B for external constant input
• Buses A and B with external address and data outputs
• ALU and Shifter with Mux F for output select
• Mux D for external data input
• Logic for generating status bits V, C, N, Z
[Figure: datapath block diagram. Register file (R0 to R3 with Load inputs, destination decoder, A and B select muxes), MUX B (Constant in, MB select), Bus A and Bus B with Address out and Data out, function unit (ALU with G select, Shifter with H select, MUX F with MF select, status bits V, C, N, Z), MUX D (Data in, MD select) and Bus D. All data paths are n bits wide.]

Datapath Example: Performing a Microoperation
Microoperation: R0 ← R1 + R2
• Apply 01 to A select to place the contents of R1 onto Bus A
• Apply 10 to B select to place the contents of R2 onto B data, and apply 0 to MB select to place B data on Bus B
• Apply 0010 to G select to perform the addition G = Bus A + Bus B
• Apply 0 to MF select and 0 to MD select to place the value of G onto Bus D
• Apply 00 to Destination select to enable the Load input to R0
• Apply 1 to Load enable to force the Load input to R0 to 1, so that R0 is loaded on the clock pulse (not shown)
• The overall microoperation requires 1 clock cycle (!)
[Figure: the same datapath block diagram as on the previous slide, with the control inputs above applied.]
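The control settings listed for R0 ← R1 + R2 can be tried out on a very reduced software model of the datapath. This is a sketch under several assumptions: only the G select code 0010 (addition) and the MF select = 0 path are modeled, the word width n is taken to be 8 bits, the status bits and shifter are ignored, and the dictionary keys are names invented here, not signal names from Mano's book.

```python
WORD_BITS = 8                      # assumption: n = 8
MASK = (1 << WORD_BITS) - 1

def datapath_step(regs, cw, constant_in=0, data_in=0):
    """Apply one control word to the register file; models one clock cycle."""
    bus_a = regs[cw["A_select"]]
    b_data = regs[cw["B_select"]]
    bus_b = constant_in if cw["MB_select"] else b_data   # MUX B
    if cw["G_select"] != 0b0010:
        raise NotImplementedError("only G select 0010 (add) is modeled")
    g = (bus_a + bus_b) & MASK                           # ALU: G = A + B
    if cw["MF_select"] != 0:
        raise NotImplementedError("shifter path (MF select 1) is not modeled")
    bus_d = data_in if cw["MD_select"] else g            # MUX D
    if cw["Load_enable"]:
        regs[cw["Dest_select"]] = bus_d                  # loaded on the clock pulse
    return regs

# Control word for R0 <- R1 + R2, following the settings listed above.
control_word = {
    "A_select": 0b01, "B_select": 0b10, "MB_select": 0,
    "G_select": 0b0010, "MF_select": 0, "MD_select": 0,
    "Dest_select": 0b00, "Load_enable": 1,
}

regs = [0, 3, 4, 9]                # R0..R3, sample values
datapath_step(regs, control_word)
print(regs)
```

All selections are applied together in a single call, which mirrors the point of the slide: with a multiplexer-based datapath the whole microoperation fits in one clock cycle.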

Lesson Learned
• We could improve instruction execution speed by increasing the processor clock speed (can we?)
• We could improve instruction execution speed by implementing a dual bus (can we?)
• We can (partly) overcome the CPU-memory bottleneck by inserting cache memory between the CPU and main memory (can we?)
• Is there any other way to improve instruction execution speed (increase performance)? Pipelining
• Do these improvements need extra cost? (cost vs performance issue)

What do we get after studying Computer Architecture?
• It is always a complicated question to answer.
• Basically we learn about processor design issues, namely the hardware of a computer, but taught through “software” logic.
• At least we know the basic building blocks of a computer
• We know the trends in design development

[Figure: layers of a computer system: Application Program, Compiler, OS, ISA, CPU Design, Circuit Design, Chip Layout]
What is our topic? Instruction Set Architecture (ISA)

Chapter 1 : Introduction

1.1. Introduction: Organization & Architecture
• Organization and Architecture: two terms that are often confused
• Computer organization refers to the operational units and their interconnections that realize the architectural specifications (!)
• Computer architecture refers to those attributes of a system visible to a programmer, or put another way, those attributes that have a direct impact on the logical execution of a program (!)
• The latter definition (architecture) is more concerned with performance than the first one (organization)

1.1. Introduction - continued
• Architecture is more concerned with the basic instruction design, which may lead to better performance of the system
• Organization is the implementation of the computer system, in terms of the interconnection of its functional units: CPU, memory, bus and I/O devices.
• Example: the IBM S/370 family architecture. There are plenty of IBM products with the same architecture (S/370) but different organizations, depending on their price/performance targets. Cost and performance are what differentiate the organizations.
• So, the organization of a computer is the implementation of its architecture, tailored to fit the intended price and performance targets.

Chapter 2 : Computer Evolution and Performance

ENIAC - background
• Electronic Numerical Integrator And Computer
• Eckert and Mauchly
• University of Pennsylvania
• Trajectory tables for weapons
• Started 1943
• Finished 1946
• Too late for war effort
• Used until 1955

ENIAC - details
• Decimal (not binary)
• 20 accumulators of 10 digits
• Programmed manually by switches
• 18,000 vacuum tubes
• 30 tons
• 1,500 square feet (the figure given in Stallings's text)
• 140 kW power consumption
• 5,000 additions per second

ENIAC

Another View of ENIAC

Structure of von Neumann machine

IAS - details
• 1000 x 40 bit words
• Binary number
• 2 x 20 bit instructions
• Set of registers (storage in CPU):
  • Memory Buffer Register
  • Memory Address Register
  • Instruction Register
  • Instruction Buffer Register
  • Program Counter
  • Accumulator
  • Multiplier Quotient

2.1. Evolution and Performance - history
• 1946: von Neumann and his colleagues proposed the IAS computer (named after the Institute for Advanced Studies)
• The design included:
  • main memory
  • ALU
  • Control Unit
  • I/O
• First stored-program design, able to perform +, -, × and ÷
• The “father” of all modern computers/processors

Structure of IAS

IAS

2.1. Evolution and Performance - history
IAS components are:
• MBR (memory buffer register), MAR (memory address register), IR (instruction register), IBR (instruction buffer register), PC (program counter), AC (accumulator), MQ (multiplier quotient) and memory (1000 locations)
• 20-bit instruction: 8-bit opcode and 12-bit address (addressing one of the 1000 memory locations, 0 to 999)
• 40-bit data word: 1 sign bit plus a 39-bit value
• Operations: data transfer between registers and ALU, unconditional branch, conditional branch, arithmetic, address modify
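The instruction format above (two 20-bit instructions per 40-bit word, each with an 8-bit opcode and a 12-bit address) can be illustrated with a few bit operations. This sketch assumes the left instruction occupies the high-order 20 bits of the word; the function names and the opcode values 0x01 and 0x05 are arbitrary choices for the example, not taken from the slides.

```python
def split_word(word40):
    """Split a 40-bit IAS word into its left and right 20-bit instructions."""
    left = (word40 >> 20) & 0xFFFFF
    right = word40 & 0xFFFFF
    return left, right

def decode_instruction(instr20):
    """Split a 20-bit instruction into (8-bit opcode, 12-bit address)."""
    opcode = (instr20 >> 12) & 0xFF
    address = instr20 & 0xFFF
    return opcode, address

def encode_instruction(opcode, address):
    """Pack an opcode and an address into one 20-bit instruction."""
    return ((opcode & 0xFF) << 12) | (address & 0xFFF)

# Build a sample word: two instructions referring to locations 500 and 501.
word = (encode_instruction(0x01, 500) << 20) | encode_instruction(0x05, 501)

left, right = split_word(word)
print(decode_instruction(left), decode_instruction(right))
```

A 12-bit address field can express values up to 4095, which is more than enough to cover the 1000 memory locations (0 to 999).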

2.1. Evolution - History of Commercial Computers
• First Generation: in 1950 Mauchly & Eckert developed UNIVAC I, used by the Census Bureau
• Then came UNIVAC II, which later grew into the UNIVAC 1100 series (1103, 1104, 1105, 1106, 1108): vacuum tubes and later transistors
• Second Generation: transistors, e.g. IBM 7094 (NCR, RCA and others also tried to develop their own versions, but were not commercially successful)
• Third Generation: Integrated Circuit (IC), SSI. The IBM S/360 was the successful example
• Later generations (possibly fourth and fifth): LSI and VLSI technology

2.1. Evolution - history of commercial computers
Table 2.1
Generation   Time       Technology    Approx. speed (opr/sec)
---------------------------------------------------------------
1            1946-57    Vacuum tube           40,000
2            1958-64    Transistor           200,000
3            1965-71    SSI & MSI          1,000,000
4            1972-77    LSI               10,000,000
5            1978-      VLSI             100,000,000
---------------------------------------------------------------

Vacuum Tubes

Transistor
