Real-time Signal Processing on Embedded Systems

Real-time Signal Processing on Embedded Systems Advanced Cutting-edge Research Seminar I&III

Advances in Microprocessor Technology

Architectural improvements of microprocessors
• Pipelining
• Parallel processing exploiting ILP
• Superscalar
• VLIW
• SIMD

Procedure of instruction execution on a processor
• Instruction Fetch (IF): fetches an instruction from main memory.
• Instruction Decode (ID): decodes the fetched instruction.
• Execution (EX): executes the decoded instruction.
• Memory Access (MA): accesses main memory.
• Write Back (WB): writes data back to registers.

Operation cycles on a processor
• Single-cycle machine
  • This kind of machine executes all procedures from IF to WB in one cycle.
  • Operation speed is determined by the slowest instruction, because every instruction must complete within one cycle.
• Multi-cycle machine
  • This kind of machine executes an instruction over several cycles.
[Figure: the stages IF, ID, EX, MA, WB of one instruction laid out over successive cycles.]

Pipelined operation
• can improve the throughput of instructions.
• To realize pipelined operation, several techniques are required.
[Figure: timing diagrams of instructions passing through the stages IF, ID, EX, MA, WB, executed one after another and overlapped in a pipeline.]
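As a rough illustration of the throughput gain, the following sketch counts cycles for a run of instructions on a multi-cycle machine and on an ideal 5-stage pipeline. It assumes one cycle per stage and no hazards; the function names are hypothetical and not part of the lecture material.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def cycles_multicycle(n_instructions, n_stages=len(STAGES)):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages=len(STAGES)):
    """Ideal pipeline: the first instruction fills the pipeline,
    then one instruction completes every cycle (assumes no hazards)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

def timing_diagram(n_instructions):
    """Return one text row per instruction, shifted by one stage per row."""
    return ["   " * i + " ".join(STAGES) for i in range(n_instructions)]

if __name__ == "__main__":
    for row in timing_diagram(3):
        print(row)
    print(cycles_multicycle(3), "cycles without pipelining")
    print(cycles_pipelined(3), "cycles with pipelining")
```

For three instructions, the sketch gives 15 cycles without pipelining and 7 cycles with pipelining under these assumptions.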

Causes of pipeline hazards
• Structural hazard: the hardware cannot cope with the combination of issued instructions.
• Data hazard: a later instruction must wait for an earlier instruction to complete because it uses the earlier instruction's result.
• Control hazard: the condition that determines whether an instruction is executed depends on the result of an earlier instruction.

Structural hazard
[Figure: a CPU (PC, instruction register, instruction decoder, ALU, registers) connected to a single memory, executing four pipelined instructions, each passing through IF ID EX MA WB.]
• Conflict: the MA stage of one instruction and the IF stage of a later instruction access the memory in the same cycle.
• Resolve 1: stall the next instruction.
• Resolve 2: add another data bus to access the instruction memory, so that instruction memory (Inst Mem) and data memory (Data Mem) are separate (Harvard architecture).

Data hazard
[Figure: a CPU (PC, instruction register, instruction decoder, ALU, registers) and memory, executing two pipelined instructions, each passing through IF ID EX MA WB.]
  add $s0,$t0,$t1   ($s0=$t0+$t1)
  sub $t2,$s0,$t3   ($t2=$s0-$t3)
Registers (initial values): $t0=5, $t1=4, $t2=3, $t3=2, $t4=1; $s0=$s1=$s2=$s3=$s4=0
• Problem: sub reads $s0 before add has written its result back, so $t2=$s0-$t3 is computed with the old value: -2=0-2.
• Waiting by stalls: consumes 3 cycles.
• Resolve: forwarding. The result of add is forwarded to the ALU, so sub computes $t2=9-$t3: 7=9-2.
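The numbers on these slides can be reproduced with a small sketch. It models only the operand that sub sees for $s0 when it enters EX right after add, not a full pipeline; the function names and the `forwarding` parameter are hypothetical.

```python
def initial_registers():
    """Register values taken from the slide."""
    return {"t0": 5, "t1": 4, "t2": 3, "t3": 2, "t4": 1,
            "s0": 0, "s1": 0, "s2": 0, "s3": 0, "s4": 0}

def run_add_then_sub(forwarding):
    """Run add $s0,$t0,$t1 followed immediately by sub $t2,$s0,$t3.

    Assumption: sub reaches EX before add reaches WB, and no stall is inserted.
    """
    regs = initial_registers()
    add_result = regs["t0"] + regs["t1"]      # EX of add; not yet in the register file
    # EX of sub: take $s0 from the forwarding path, or the stale register value.
    s0_operand = add_result if forwarding else regs["s0"]
    sub_result = s0_operand - regs["t3"]
    regs["s0"] = add_result                   # WB of add
    regs["t2"] = sub_result                   # WB of sub
    return regs["t2"]

if __name__ == "__main__":
    print("without forwarding:", run_add_then_sub(False))
    print("with forwarding:   ", run_add_then_sub(True))
```

Without forwarding the sketch yields the wrong value -2 (0-2); with forwarding it yields 7 (9-2), matching the slides.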

Control hazard
An instruction sequence including a branch:
  add $s0,$t0,$t1   ($s0=$t0+$t1)
  beq $s1,$s2, 40   (if($s1==$s2){goto 40})
  or $s3,$s4,$t2    ($s3=$s4|$t2)
[Figure: a CPU (PC, instruction register, instruction decoder, ALU, registers); the three instructions each pass through IF ID EX MA WB while the PC advances 10, 11, 12.]
• The PC value for the instruction after the branch depends on the branch condition.
  • Branch is taken: PC=40
  • Not taken: PC=12
※ In this explanation, the PC uses word addresses for simplicity.

Control hazard
• Resolve 1: stall
  add $s0,$t0,$t1   ($s0=$t0+$t1)
  beq $s1,$s2, 40   (if($s1==$s2){goto 40})
  or $s3,$s4,$t2    ($s3=$s4|$t2)
  • 2-cycle stall: the number of required stall cycles is determined by the architecture.
  • 1-cycle stall: if the processor can calculate the branch target address at the ID stage.

Control hazard
• Resolve 2: branch prediction
  add $s0,$t0,$t1   ($s0=$t0+$t1)
  beq $s1,$s2, 40   (if($s1==$s2){goto 40})
  or $s3,$s4,$t2    ($s3=$s4|$t2)
  • In this example, the next PC is predicted as if the branch is always untaken, so the PC advances 10, 11, 12.
  • If the prediction misses, in other words if the branch is taken, a stall occurs and the PC is set to 40.
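The always-untaken prediction can be sketched as a fetch trace. This is only a model of which addresses are fetched, assuming word-addressed PCs as on the slides and assuming one wrong-path fetch before the branch outcome is known; `fetch_trace` and its parameters are hypothetical names.

```python
def fetch_trace(branch_taken, first_pc=10, branch_pc=11, target_pc=40,
                wrong_path_fetches=1):
    """Return (pc, status) pairs for the fetch sequence around one branch.

    The fetch unit predicts "untaken" and keeps fetching sequentially.
    If the branch turns out to be taken, the wrong-path fetches are
    squashed (this is the stall on the slide) and fetching restarts
    at the branch target.
    """
    trace = [(pc, "executed") for pc in range(first_pc, branch_pc + 1)]
    fall_through = branch_pc + 1
    if branch_taken:
        for pc in range(fall_through, fall_through + wrong_path_fetches):
            trace.append((pc, "squashed"))
        trace.append((target_pc, "executed"))
    else:
        trace.append((fall_through, "executed"))
    return trace

if __name__ == "__main__":
    print("untaken:", fetch_trace(False))
    print("taken:  ", fetch_trace(True))
```

When the prediction is correct no cycle is lost; the penalty is paid only on a taken branch.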

Control hazard
• More practical scheme: dynamic branch prediction
  • n-bit counter-based prediction
  • Branch History Table: the lower i bits of the address of a branch instruction select an n-bit saturating up/down counter.

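1-bit counter-based prediction
[Figure: state transition diagram with two states, 1 and 0; transitions are labeled "Branch is taken" and "Branch is untaken".]
• State 1: predict the branch will be taken.
• State 0: predict the branch will be untaken.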

2-bit counter-based prediction
[Figure: state transition diagram with four states 00, 01, 10, 11; transitions are labeled "Branch is taken" and "Branch is untaken".]
• States 10 and 11: predict the branch will be taken.
• States 00 and 01: predict the branch will be untaken.
• This scheme is adopted in Intel Pentium, Sun UltraSPARC, MIPS R10000, etc.
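The counter-based predictors can be sketched as a small table of saturating counters indexed by the lower bits of the branch address. This is a minimal model, assuming counters start at 0 and that the upper half of the counter range predicts taken; the class and function names are hypothetical.

```python
class SaturatingCounterPredictor:
    """Branch History Table of n-bit saturating up/down counters."""

    def __init__(self, n_bits=2, index_bits=4):
        self.max_count = (1 << n_bits) - 1       # e.g. 3 for 2 bits
        self.threshold = 1 << (n_bits - 1)       # upper half predicts taken
        self.index_mask = (1 << index_bits) - 1  # lower i bits of the address
        self.table = [0] * (1 << index_bits)

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        return self.table[pc & self.index_mask] >= self.threshold

    def update(self, pc, taken):
        """Count up on a taken branch, down on an untaken one, saturating."""
        i = pc & self.index_mask
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

def count_mispredictions(predictor, pc, outcomes):
    """Predict each outcome, then train the predictor with the real result."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses

if __name__ == "__main__":
    # A loop branch: taken three times, then untaken once, repeated four times.
    pattern = [True, True, True, False] * 4
    print(count_mispredictions(SaturatingCounterPredictor(n_bits=1), 11, pattern))
    print(count_mispredictions(SaturatingCounterPredictor(n_bits=2), 11, pattern))
```

On this loop-like pattern the 1-bit counter mispredicts twice per loop run (at the exit and again at re-entry), while the 2-bit counter, once warmed up, mispredicts only at the exit.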

Control hazard
• Resolve 3: delayed branch
  add $s0,$t0,$t1   ($s0=$t0+$t1)
  beq $s1,$s2, 40   (if($s1==$s2){goto 40})
  Inserted instruction
  or $s3,$s4,$t2    ($s3=$s4|$t2)
  • An instruction that has no dependency is inserted after the branch (PC 12).
  • After the inserted instruction, the instruction at the determined address is executed (PC 13 or 40).

Exploiting ILP (Instruction Level Parallelism)
• Superscalar: issuing multiple instructions per cycle with hardware support.
  • Advantage: binary compatibility.
• VLIW: issuing multiple instructions per cycle with compiler support.
  • Advantage: simple hardware.

Types of data dependence
• True data dependence (RAW: Read After Write): difficult to remove.
    i1: r2=r1+r3
    i2: r4=r2+1
• Anti-dependence (WAR: Write After Read)
• Output dependence (WAW: Write After Write)
  • Anti- and output dependences can be removed by register renaming. They are called artificial dependences.
    i1: r1=r2+r3
    i2: r2=r4+1   (anti-dependence on i1 through r2)
    i3: r1=r4+2   (output dependence on i1 through r1)
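The three dependence types can be checked mechanically. The sketch below compares every pair of instructions in program order; it is deliberately simple and does not account for intervening writes to the same register. `Instr` and `find_dependences` are hypothetical names.

```python
from collections import namedtuple

# name: label such as "i1"; dst: destination register; srcs: source registers
Instr = namedtuple("Instr", ["name", "dst", "srcs"])

def find_dependences(instrs):
    """Return (earlier, later, kind, register) for each pairwise dependence."""
    deps = []
    for j, later in enumerate(instrs):
        for earlier in instrs[:j]:
            if earlier.dst in later.srcs:       # later reads what earlier writes
                deps.append((earlier.name, later.name, "RAW", earlier.dst))
            if later.dst in earlier.srcs:       # later overwrites what earlier reads
                deps.append((earlier.name, later.name, "WAR", later.dst))
            if later.dst == earlier.dst:        # both write the same register
                deps.append((earlier.name, later.name, "WAW", later.dst))
    return deps

if __name__ == "__main__":
    true_dep = [Instr("i1", "r2", ("r1", "r3")),
                Instr("i2", "r4", ("r2",))]
    artificial = [Instr("i1", "r1", ("r2", "r3")),
                  Instr("i2", "r2", ("r4",)),
                  Instr("i3", "r1", ("r4",))]
    print(find_dependences(true_dep))
    print(find_dependences(artificial))
```

Applied to the two slide examples, the first sequence reports one RAW dependence through r2, and the second reports a WAR dependence (i1, i2) through r2 and a WAW dependence (i1, i3) through r1.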

Basic Architecture of Superscalar Processor
[Figure: block diagram of a superscalar processor showing the frontend (instruction cache, instruction decode, branch prediction, register renaming), dispatch, the Ex-core (instruction window, issue, function units), the backend (reorder buffer, commit), registers, and data cache.]

Basic function of Frontend
• provides enough instructions.
• predicts the next instruction address when a branch instruction appears.
• resolves artificial dependences by register renaming.
• analyzes true data dependences after register renaming.
• transfers instructions after the above operations.
  • This operation is called “dispatch”.
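Register renaming can be sketched as follows. Every destination receives a fresh physical register, and every source reads the most recent mapping. This is a simplified model that assumes an unlimited supply of physical registers (named p0, p1, ... here purely for illustration) and never frees them; `rename_registers` is a hypothetical name.

```python
def rename_registers(instrs):
    """Rename a list of (dst, srcs) instructions given in program order.

    Sources that have not been written inside the sequence keep their
    architectural name.
    """
    mapping = {}      # architectural register -> latest physical register
    renamed = []
    next_phys = 0
    for dst, srcs in instrs:
        # Read sources first, so an instruction like r1=r1+1 uses the old r1.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dst = f"p{next_phys}"
        next_phys += 1
        mapping[dst] = new_dst
        renamed.append((new_dst, new_srcs))
    return renamed

if __name__ == "__main__":
    # i1: r1=r2+r3, i2: r2=r4+1, i3: r1=r4+2 (anti and output dependences)
    print(rename_registers([("r1", ("r2", "r3")), ("r2", ("r4",)), ("r1", ("r4",))]))
    # i1: r2=r1+r3, i2: r4=r2+1 (true dependence)
    print(rename_registers([("r2", ("r1", "r3")), ("r4", ("r2",))]))
```

In the first sequence the three destinations become distinct registers, so the anti- and output dependences disappear. In the second, i2 still reads the register written by i1, so the true dependence remains.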

Basic function of Ex-core
• finds as many independent instructions as possible among those stored in the “instruction window”.
  • In this operation, dynamic scheduling is performed to satisfy several restrictions: data dependences, resources, predefined priority, etc.
• executes independent instructions in parallel.
  • The operation that transfers an instruction to a function unit is called “issue”.

Basic function of Backend
• updates the processor state.
  • Results obtained out of order are reordered into program order.
  • The processor state is updated precisely.
• Updating the processor state based on an execution result is called “commit”.
• The removal of an instruction from the processor is called “retire”.
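The in-order commit performed by the backend can be sketched with a queue that holds instructions in program order. This is a minimal model of a reorder buffer, assuming each entry only tracks whether its result is available; the class and method names are hypothetical.

```python
from collections import deque

class ReorderBuffer:
    """Holds dispatched instructions in program order until they commit."""

    def __init__(self):
        self.entries = deque()

    def dispatch(self, name):
        """Append an instruction at the tail, in program order."""
        self.entries.append({"name": name, "done": False})

    def complete(self, name):
        """Mark an instruction as executed; this may happen out of order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True

    def commit(self):
        """Commit finished instructions from the head only, so the
        processor state is always updated in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["name"])
        return committed

if __name__ == "__main__":
    rob = ReorderBuffer()
    for name in ("i1", "i2", "i3"):
        rob.dispatch(name)
    rob.complete("i3")
    print(rob.commit())   # i3 finished first but cannot commit yet
    rob.complete("i1")
    print(rob.commit())
    rob.complete("i2")
    print(rob.commit())
```

Although i3 finishes first in this usage example, it is committed only after i1 and i2.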

Dynamic instruction scheduling
• Instruction scheduling means determining the order in which instructions are issued and when they are issued.
• In superscalar processors, dynamic instruction scheduling is performed on the instructions stored in the instruction buffer.
In the following slides, dynamic scheduling will be explained using several types of processors: 1-way in-order processor, i-way in-order processor, and i-way out-of-order processor.

1-way in-order issue
• At most 1 instruction is issued per cycle.
• The size of the instruction window is 1, because if an instruction cannot be issued, none of the subsequent instructions can be issued either.
• Only true and output dependences need to be checked, because anti-dependences are always resolved.

Control by R flag
• The R flag is used to check true and output dependences.
[Figure: an instruction with fields op, dst, src1, src2; each register number selects an entry of the register file, and each entry holds an R flag and a value.]
• R==false means the register is reserved but the result has not been stored yet. In this case, the operand is not available.
• The instruction is issued only when R(dst)==true && R(src1)==true && R(src2)==true. (This condition is called “ready”.)

Update sequence of the R flag
• The R bit of the destination becomes false when an instruction is issued.
• The R bit of the destination becomes true when a result is stored in the destination.
By the above update,
• instructions using unavailable registers as source registers are not issued; true dependence is resolved.
• instructions using an unavailable register as a destination register are not issued; output dependence is resolved.
In practice, resource restrictions must also be satisfied to issue instructions, in addition to the dependency check. In this lecture, only the restriction on function units is considered to simplify the discussion.
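The R flag rules can be tried out with a small simulation of 1-way in-order issue. It is a sketch under simplifying assumptions: each operation has a fixed latency after which its result is stored, and function unit restrictions are ignored. `simulate_in_order_issue` and the latency table are hypothetical.

```python
from collections import defaultdict

def simulate_in_order_issue(instrs, latency):
    """Return the cycle in which each instruction is issued.

    instrs:  list of (op, dst, src1, src2) in program order
    latency: dict mapping op to the number of cycles until the result is stored
    """
    r_flag = defaultdict(lambda: True)   # R flag per register, initially true
    pending = []                         # (cycle when result is stored, dst)
    issue_cycles = []
    cycle = 0
    pc = 0
    while pc < len(instrs):
        # Results stored by this cycle set the R flag of their destination.
        for finish, dst in list(pending):
            if finish <= cycle:
                r_flag[dst] = True
                pending.remove((finish, dst))
        op, dst, src1, src2 = instrs[pc]
        # "ready": destination and both sources must have R == true.
        if r_flag[dst] and r_flag[src1] and r_flag[src2]:
            r_flag[dst] = False          # reserve the destination at issue
            pending.append((cycle + latency[op], dst))
            issue_cycles.append(cycle)
            pc += 1
        cycle += 1                       # at most one issue per cycle
    return issue_cycles

if __name__ == "__main__":
    latency = {"mul": 3, "add": 1}
    program = [("mul", "r1", "r2", "r3"),
               ("add", "r4", "r1", "r5"),   # true dependence on r1
               ("add", "r1", "r6", "r7")]
    print(simulate_in_order_issue(program, latency))
```

In the usage example, the second instruction waits until the multiplication has stored r1, and the third instruction is issued one cycle after it.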
