CS5365

CS5365 Pipelining

Pipelining • Divide a task into a sequence of subtasks. Each subtask is executed by a stage (segment) of the pipe.

Linear Pipeline Structure • All stages simultaneously execute different subtasks. • A stage is specialized hardware: combinational circuits, arithmetic/logic (A/L) operations, processors, etc.

Pipelining • Ideally, all stages take the same time to execute their task. Otherwise, the pipe operates at the speed of the slowest subtask.

Pipelining

Clock Period
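Because the pipe runs at the speed of its slowest subtask, the clock period can be sketched as the largest stage delay. The Python function below is a minimal illustration; the optional `latch_delay` term (time to clock results through an interstage latch) and the name `clock_period` are assumptions added for illustration, not taken from these slides.

```python
def clock_period(stage_delays, latch_delay=0):
    """Return the pipeline clock period.

    The clock must be long enough for the slowest stage to finish,
    plus (by assumption) the delay of the latch between stages.
    """
    if not stage_delays:
        raise ValueError("a pipeline needs at least one stage")
    return max(stage_delays) + latch_delay


if __name__ == "__main__":
    # Stages of 1, 3 and 1 time units: the 3-unit stage sets the pace.
    print(clock_period([1, 3, 1]))
    print(clock_period([1, 3, 1], latch_delay=0.5))
```

Speeding up any stage other than the slowest one leaves the clock period unchanged.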

Note: • Once the pipeline is full, it yields one result every clock period. • A linear pipeline with k stages can process n tasks in $k + (n - 1)$ clock periods, where k cycles are used to fill up the pipe and complete the first task, and n - 1 additional cycles are needed to complete the remaining n - 1 tasks.
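The cycle count above can be checked with a one-line helper. This is a minimal sketch; the name `pipeline_cycles` is hypothetical.

```python
def pipeline_cycles(k, n):
    """Clock periods for a k-stage linear pipeline to finish n tasks.

    k periods fill the pipe and complete the first task; each of the
    remaining n - 1 tasks then completes one period later.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return k + (n - 1)


if __name__ == "__main__":
    # A 4-stage pipe processing 10 tasks: 4 + 9 = 13 clock periods.
    print(pipeline_cycles(4, 10))
```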

Speedup • Denote by $T_{seq}$ the time required by a non-pipelined uniprocessor to execute n tasks; then $T_{seq} = n \cdot k$ clock periods, • assuming each of the k operations needs the same length of time (one clock period) to execute.

Speedup • Otherwise, if operation i takes time $\tau_i$: $T_{seq} = n \sum_{i=1}^{k} \tau_i$

Speedup • The speedup S(k) obtained by a k-stage pipelined processor is given as follows: $S(k) = \frac{T_{seq}}{k + (n - 1)} = \frac{nk}{k + n - 1}$

Ideal Pipeline • Speedup $S(k) = \frac{nk}{k + n - 1}$ • Note that when $n \gg k$, then $S(k) \to k$.
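To see S(k) approach k, the sketch below evaluates $nk / (k + n - 1)$ for growing n, assuming equal stage times of one clock period. It uses `fractions.Fraction` so the values are exact rather than rounded floats; the function name is hypothetical.

```python
from fractions import Fraction


def speedup(k, n):
    """Speedup of a k-stage pipeline over a non-pipelined processor on n tasks.

    Assumes every stage takes one clock period, so the non-pipelined time is
    n * k periods and the pipelined time is k + (n - 1) periods.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return Fraction(n * k, k + n - 1)


if __name__ == "__main__":
    k = 4
    for n in (1, 10, 100, 10_000):
        # S(k) climbs toward k = 4 as n grows, but never reaches it.
        print(n, float(speedup(k, n)))
```

With a single task (n = 1) there is no overlap at all, so the speedup is exactly 1.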

Ideal Pipeline • However, the maximum (ideal) speedup is not achievable in practice because of overhead due to: • Data dependencies between tasks. • Interrupts. • Program branches.

Space-time diagram
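A space-time diagram can also be generated programmatically. The sketch below assumes an ideal linear pipeline in which task j (counting from 0) occupies stage s (counting from 0) during clock period j + s; the function name and the text layout are illustrative choices.

```python
def space_time_diagram(k, n):
    """Return rows of a space-time diagram for a k-stage linear pipeline.

    Row s (stage s+1) lists, for each clock period, the 1-based task
    number occupying that stage, or None when the stage is idle.
    Task j (0-based) enters stage s at clock period j + s.
    """
    total = k + n - 1  # clock periods needed for all n tasks
    rows = []
    for s in range(k):
        row = []
        for t in range(total):
            j = t - s  # the task that reaches stage s at period t
            row.append(j + 1 if 0 <= j < n else None)
        rows.append(row)
    return rows


if __name__ == "__main__":
    k, n = 3, 4
    for s, row in enumerate(space_time_diagram(k, n), start=1):
        cells = " ".join(f"T{x}" if x is not None else " ." for x in row)
        print(f"S{s}: {cells}")
```

The idle cells in the lower-left and upper-right corners are the fill and drain periods of the pipe.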

Efficiency • The efficiency is obtained by dividing the speedup by the number of stages k: $E = \frac{S(k)}{k} = \frac{n}{k + n - 1}$
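Dividing the speedup by k gives the efficiency directly. A minimal sketch, again assuming equal stage times so that $S(k) = nk / (k + n - 1)$; the function name is hypothetical.

```python
from fractions import Fraction


def efficiency(k, n):
    """E = S(k) / k, with S(k) = n*k / (k + n - 1) for equal stage times."""
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    speedup = Fraction(n * k, k + n - 1)
    return speedup / k  # simplifies to n / (k + n - 1)


if __name__ == "__main__":
    # 4 stages, 10 tasks: E = 10/13, roughly 0.77
    print(efficiency(4, 10))
```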

Efficiency

Efficiency • Note also that:

Throughput • The throughput ω is defined as the number of tasks completed per unit of time: $\omega = \frac{n}{(k + n - 1)\tau}$, where τ is the pipe clock period.
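A minimal sketch of throughput, assuming n tasks finish in k + (n - 1) clock periods of length `tau`. As n grows, ω approaches one task per clock period, 1/τ. The value of `tau` in the usage example is an arbitrary assumption.

```python
def throughput(k, n, tau):
    """Tasks completed per unit time: n tasks in (k + n - 1) periods of length tau."""
    if k < 1 or n < 1 or tau <= 0:
        raise ValueError("k, n and tau must be positive")
    return n / ((k + n - 1) * tau)


if __name__ == "__main__":
    tau = 2.0  # assumed clock period, in arbitrary time units
    for n in (1, 100, 100_000):
        # Approaches 1 / tau = 0.5 tasks per time unit for large n.
        print(n, throughput(4, n, tau))
```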

Throughput • Techniques to increase throughput. • Consider a pipeline with three stages S1, S2, S3 where $T_1 = T_3 = T$ and $T_2 = 3T$. Clearly, the bottleneck is S2, with a delay of 3T.

Throughput • Recall that the throughput ω is inversely proportional to the pipe clock period τ, and $\tau = \max\{T_1, T_2, T_3\}$. So $\tau = 3T$, with a total delay of $3 \times 3T = 9T$, and for a large number of tasks (steady-state operation) $\omega \approx \frac{1}{3T}$.
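The three-stage example can be worked through numerically. This sketch clocks every stage at the slowest delay and ignores latch delays (an assumption); the helper name is hypothetical.

```python
def pipeline_metrics(stage_delays):
    """Clock period, per-task latency and steady-state throughput of a linear pipeline.

    Every stage is clocked at the slowest stage's delay, so a task spends
    one full clock period in each stage (latch delays ignored).
    """
    tau = max(stage_delays)
    latency = len(stage_delays) * tau
    steady_throughput = 1 / tau
    return tau, latency, steady_throughput


if __name__ == "__main__":
    T = 1  # unit delay
    tau, latency, omega = pipeline_metrics([T, 3 * T, T])
    # tau = 3T, total delay = 9T, omega = 1/(3T)
    print(tau, latency, omega)
```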

Throughput • How might the throughput be increased? • Subdivisions?

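Throughput • Subdivisions • What is the total delay now? S2 is subdivided into three stages of delay T each, so all five stages take T and $\tau = T$, with a total delay of $5 \times T = 5T$.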
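Subdividing the bottleneck stage can be modeled as replacing one delay with several equal, smaller ones. This sketch assumes an ideal split with no extra latch cost; `subdivide` is a hypothetical helper.

```python
def subdivide(stage_delays, index, parts):
    """Replace stage `index` by `parts` equal sub-stages (ideal split, no latch cost)."""
    sub = stage_delays[index] / parts
    return stage_delays[:index] + [sub] * parts + stage_delays[index + 1:]


if __name__ == "__main__":
    T = 1.0
    stages = subdivide([T, 3 * T, T], index=1, parts=3)
    tau = max(stages)            # new clock period
    latency = len(stages) * tau  # total delay per task
    print(stages, tau, latency)  # [1.0, 1.0, 1.0, 1.0, 1.0] 1.0 5.0
```

The clock period drops from 3T to T, so steady-state throughput triples, while the per-task delay falls from 9T to 5T.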

Throughput • Subdivisions

Throughput • What are the disadvantages of this solution? • Increased hardware: additional latches are needed between the new stages.

Throughput • Replication: • Stage 2 is replicated into three stages, which are then interleaved.
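Replication can be simulated by timing each task through S1, one of three S2 copies, and S3. The sketch assumes the example's delays (T, 3T, T) and that interleaving means dispatching tasks to the S2 copies round-robin; both the dispatch policy and the function name are illustrative assumptions.

```python
def replicated_completion_times(n, T=1):
    """Completion times of n tasks when the 3T stage S2 is replicated three times.

    Assumed configuration from the example: S1 and S3 take T, S2 takes 3T,
    and tasks are dispatched to the three S2 copies round-robin.
    """
    copies = 3
    s2_delay = 3 * T
    copy_free_at = [0] * copies  # time at which each S2 copy becomes free
    s3_free_at = 0               # time at which S3 becomes free
    done = []
    for j in range(n):
        s1_end = (j + 1) * T     # S1 accepts one new task every period T
        c = j % copies           # round-robin (interleaved) choice of S2 copy
        s2_start = max(s1_end, copy_free_at[c])
        s2_end = s2_start + s2_delay
        copy_free_at[c] = s2_end
        s3_start = max(s2_end, s3_free_at)
        s3_end = s3_start + T
        s3_free_at = s3_end
        done.append(s3_end)
    return done


if __name__ == "__main__":
    # First result after 5T, then one result every T
    # (without replication, results would emerge only every 3T).
    print(replicated_completion_times(5))
```

Each S2 copy is busy for 3T but receives a new task only every 3T, so none of them ever holds up S1.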

Space-time diagram • replication

Control strategies and configurations • Unifunctional vs. multi-functional pipelines • Unifunctional pipelines execute a fixed, dedicated function. • A multi-functional pipeline may perform several functions, either at the same time or at different times. Multiple functions are made possible by interconnecting (reconfiguring) several stages at different times.

Control strategies and configurations • Static vs. dynamic pipelines • A static pipeline may assume only one functional configuration (unifunctional or multi-functional) at a time. • A dynamic pipeline allows several functional configurations at the same time (multi-functional), which requires more complex control mechanisms than those required for static pipelines.

Control strategies and configurations • Scalar vs. vector pipelines • Scalar pipelines process a sequence of scalar operands under the control of a DO loop. Instructions are prefetched and stored in an instruction buffer; as instructions are executed, operands are fetched from a data cache. • Vector pipelines (vector processors) handle vector instructions over vector operands under firmware and hardware control.

Levels of processing • Arithmetic pipelines: ALUs are partitioned for pipelined operations. Examples: 4-stage pipes are used in the Star-100, the Cray-1 uses 14 pipeline stages, the Cyber 205 uses 26 stages, etc. • Instruction pipelines (instruction lookahead): overlap the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. • Processor pipelines: a cascade of processors, each executing a different task (a job is divided into different tasks).

Floating-Point Arithmetic Pipeline

Processor pipelining

Instruction Pipeline

Instruction Pipelining

Instruction Pipelining • Consider the execution of a single instruction in a uniprocessor system. • A sequence of steps can be identified and implemented using a pipeline design:

Problems?

Problems • Instruction dependency • Pipeline Stalling • Branching • Conflicts • Interrupts

Instruction dependency • An instruction I+1 being fetched may need the results of a previous instruction I that is still in the pipe, so I+1 must be delayed until those results are known.

Stalling • An instruction I+1 must not destroy data that a previous instruction I, still in the pipe, may need.
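Both conditions can be expressed as simple register checks between I and I+1. In this sketch the `Instr` record, the assembly-like syntax, and the register names are all hypothetical. It compares register names only: whether a real pipeline must actually stall also depends on when each stage reads and writes, which the sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: one destination register and its source registers."""
    text: str
    dest: str
    srcs: tuple


def must_delay(i, i_next):
    """True if I+1 reads a result that I produces (I+1 must wait for it)."""
    return i.dest in i_next.srcs


def would_destroy(i, i_next):
    """True if I+1 overwrites a register that I still needs to read."""
    return i_next.dest in i.srcs


if __name__ == "__main__":
    i1 = Instr("ADD R1, R2, R3", dest="R1", srcs=("R2", "R3"))
    i2 = Instr("SUB R4, R1, R5", dest="R4", srcs=("R1", "R5"))
    i3 = Instr("MUL R2, R6, R7", dest="R2", srcs=("R6", "R7"))
    print(must_delay(i1, i2))     # True: SUB needs R1 from ADD
    print(would_destroy(i1, i3))  # True: MUL writes R2, which ADD reads
```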

Dependency

Stalling

Stalling • [Figure: several pipeline stages labeled Memory Access]

Stalling

Stalling • Assume a data cache access: stall condition? • Assume an instruction cache access.

Branching • This problem is caused by conditional and unconditional branches.