PIPELINING AND I/O ORGANISATION
PIPELINING
A technique of decomposing a sequential process
into suboperations, with each subprocess being
executed in a special dedicated segment that
operates concurrently with all other segments.
Example: compute Ai * Bi + Ci for i = 1, 2, 3, ..., 7
R1 <- Ai, R2 <- Bi          Load Ai and Bi
R3 <- R1 * R2, R4 <- Ci     Multiply and load Ci
R5 <- R3 + R4               Add
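[Figure: example of pipeline processing.
Segment 1: Ai and Bi are loaded from memory into R1 and R2.
Segment 2: the multiplier output goes to R3 and Ci is loaded from memory into R4.
Segment 3: the adder output goes to R5.]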
ARITHMETIC PIPELINING
OPERATIONS IN EACH PIPELINE STAGE
Clock pulse   Segment 1     Segment 2          Segment 3
number        R1    R2      R3         R4      R5
1             A1    B1      -          -       -
2             A2    B2      A1 * B1    C1      -
3             A3    B3      A2 * B2    C2      A1 * B1 + C1
4             A4    B4      A3 * B3    C3      A2 * B2 + C2
5             A5    B5      A4 * B4    C4      A3 * B3 + C3
6             A6    B6      A5 * B5    C5      A4 * B4 + C4
7             A7    B7      A6 * B6    C6      A5 * B5 + C5
8             -     -       A7 * B7    C7      A6 * B6 + C6
9             -     -       -          -       A7 * B7 + C7
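The register contents in the table can be reproduced with a short simulation. This is only a sketch: the function name run_pipeline and the use of None for an empty register are illustrative choices, not part of the slides. On every clock pulse all five registers are updated together from the values held before the pulse.

```python
def run_pipeline(A, B, C):
    """Simulate the three-segment pipeline computing A[i] * B[i] + C[i].

    Returns one tuple (clock, R1, R2, R3, R4, R5) per clock pulse.
    None marks a register that holds no valid data on that pulse.
    """
    n = len(A)
    R1 = R2 = R3 = R4 = R5 = None
    trace = []
    for clock in range(1, n + 3):  # n tasks, 3 segments -> n + 2 pulses
        # Compute every new value from the OLD register contents first,
        # because all registers are clocked at the same time.
        new_R5 = R3 + R4 if R3 is not None else None       # segment 3: add
        new_R3 = R1 * R2 if R1 is not None else None       # segment 2: multiply
        new_R4 = C[clock - 2] if 2 <= clock <= n + 1 else None  # segment 2: load Ci
        new_R1 = A[clock - 1] if clock <= n else None      # segment 1: load Ai
        new_R2 = B[clock - 1] if clock <= n else None      # segment 1: load Bi
        R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
        trace.append((clock, R1, R2, R3, R4, R5))
    return trace


if __name__ == "__main__":
    for row in run_pipeline([1, 2, 3, 4, 5, 6, 7], [2] * 7, [10] * 7):
        print(row)
```

With seven input triples the trace has nine rows, and the first result appears in R5 on clock pulse 3, as in the table.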
GENERAL PIPELINE
General Structure of a 4-Segment Pipeline
[Figure: Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4, with a common clock applied to the registers.]
Space-Time Diagram
Clock cycles:  1    2    3    4    5    6    7    8    9
Segment 1:     T1   T2   T3   T4   T5   T6
Segment 2:          T1   T2   T3   T4   T5   T6
Segment 3:               T1   T2   T3   T4   T5   T6
Segment 4:                    T1   T2   T3   T4   T5   T6
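A space-time diagram for any number of segments and tasks can be generated mechanically. The sketch below assumes an ideal pipeline with no stalls; the function name space_time is hypothetical. Task j occupies segment s during clock cycle j + s - 1.

```python
def space_time(k, n):
    """Return the space-time diagram of an ideal k-segment pipeline running n tasks.

    rows[s - 1][c - 1] is the task label in segment s during clock cycle c,
    or an empty string when the segment is idle.
    """
    total_cycles = k + n - 1
    rows = []
    for segment in range(1, k + 1):
        row = []
        for cycle in range(1, total_cycles + 1):
            task = cycle - segment + 1  # task currently in this segment
            row.append(f"T{task}" if 1 <= task <= n else "")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for number, row in enumerate(space_time(4, 6), start=1):
        print(f"Segment {number}: " + " ".join(f"{cell:<3}" for cell in row))
```

For k = 4 and n = 6 each row has nine columns, matching the nine clock cycles in the diagram.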
The behavior of a pipeline can be illustrated with a space-time diagram,
which shows the segment utilization as a function of time.
Space-time diagram
• The horizontal axis shows time in clock cycles and the
vertical axis gives the segment number.
• The diagram shows six tasks (T1 to T6) executed in four
segments.
Task
A task is defined as the total operation performed in going
through all the segments of the pipeline.
Consider
• A k-segment pipeline with clock cycle time t_p is used to execute n tasks.
• The first task T1 requires a time equal to k t_p to complete its operation,
since there are k segments in the pipe.
• The remaining n - 1 tasks emerge from the pipe at the rate of one
task per clock cycle, so they are completed after a further time of
(n - 1) t_p.
• Therefore, completing n tasks on a k-segment pipeline
requires k + (n - 1) clock cycles.
• Example: with 4 segments and 6 tasks, the time required is
4 + (6 - 1) = 9 clock cycles.
• Consider a nonpipeline unit that performs the same operation and takes a
time equal to t_n to complete each task.
• The total time required for n tasks is n t_n.
• The speedup of pipeline processing over equivalent nonpipeline
processing is defined by the ratio
  S = n t_n / ((k + n - 1) t_p)
• As the number of tasks increases, n becomes much larger than k - 1, and
k + n - 1 approaches the value of n. Under this condition the speedup
becomes S = t_n / t_p.
• If we assume that the time it takes to process a task is the same in
the pipeline and nonpipeline circuits, then t_n = k t_p.
• With this assumption the speedup reduces to S = k t_p / t_p = k.
• This shows that the theoretical maximum speedup that a pipeline can
provide is k, where k is the number of segments in the pipeline.
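The formulas above are easy to check numerically. In the sketch below the function names and the 20 ns clock period are illustrative assumptions; only the formulas come from the slides. The nonpipeline time per task is set to t_n = k t_p, the assumption under which the speedup tends to k.

```python
def pipeline_cycles(k, n):
    """Clock cycles needed to run n tasks on a k-segment pipeline."""
    return k + (n - 1)


def speedup(k, n, t_n, t_p):
    """Speedup S = n * t_n / ((k + n - 1) * t_p) over a nonpipeline unit."""
    return (n * t_n) / (pipeline_cycles(k, n) * t_p)


if __name__ == "__main__":
    k = 4            # number of segments
    t_p = 20         # assumed clock period in ns (illustrative value)
    t_n = k * t_p    # assumption from the slides: t_n = k * t_p
    for n in (6, 100, 10_000, 1_000_000):
        print(n, pipeline_cycles(k, n), speedup(k, n, t_n, t_p))
```

As n grows, the printed speedup approaches but never reaches k = 4, which is the theoretical maximum stated above.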
Multiple Functional Units
[Figure: multiple functional units in parallel. Units P1, P2, P3 and P4 receive Ii, Ii+1, Ii+2 and Ii+3 respectively.]
ARITHMETIC PIPELINE
Floating-point adder
X = A x 2^a
Y = B x 2^b
[1] Compare the exponents
[2] Align the mantissas
[3] Add/subtract the mantissas
[4] Normalize the result
[Figure: four-segment floating-point adder pipeline with registers (R) between segments. Inputs: exponents a, b and mantissas A, B.
Segment 1: Compare exponents by subtraction.
Segment 2: Choose exponent; align mantissas using the exponent difference.
Segment 3: Add or subtract mantissas.
Segment 4: Normalize result; adjust exponent.]
ARITHMETIC PIPELINE
Reasons why a pipeline cannot operate at its maximum theoretical rate
• Different segments take different times to complete their
suboperations.
• The clock cycle must be equal to the delay of the segment with
the maximum propagation time.
• This causes all other segments to waste time while waiting for
the next clock pulse.
• Moreover, it is not always correct to assume that a nonpipeline
circuit has the same delay as an equivalent pipeline
circuit.
• Many of the intermediate registers are not required in a single-unit
circuit, which can usually be built entirely as a combinational circuit.
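The effect of unequal segment delays can be illustrated with a small calculation. All numbers below are made-up illustrative values, and the function names are hypothetical. The sketch assumes that each pipeline segment is followed by a register with delay t_r, while the nonpipeline circuit needs only a single output register.

```python
def pipeline_clock_period(segment_delays, t_r):
    """Clock period: slowest segment plus its interface register delay."""
    return max(segment_delays) + t_r


def nonpipeline_task_time(segment_delays, t_r):
    """Time per task for a single combinational unit with one output register."""
    return sum(segment_delays) + t_r


if __name__ == "__main__":
    delays = [60, 70, 100, 80]   # assumed segment delays in ns (illustrative)
    t_r = 10                     # assumed register delay in ns (illustrative)
    t_p = pipeline_clock_period(delays, t_r)
    t_n = nonpipeline_task_time(delays, t_r)
    # For a long stream of tasks the speedup approaches t_n / t_p.
    print(t_p, t_n, t_n / t_p)
```

With these values the limiting speedup t_n / t_p is well below the segment count of 4, because three of the four segments sit idle for part of every clock cycle.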
4-STAGE FLOATING POINT ADDER
A = a x 2^p    B = b x 2^q
C = A + B = c x 2^r = d x 2^s
(r = max(p, q), 0.5 <= d < 1)
[Figure: four-stage floating-point adder, stages S1 to S4. In data-flow order: an exponent subtractor produces t = |p - q| and r = max(p, q); a fraction selector routes the fraction with min(p, q) to a right shifter; a fraction adder combines the shifted fraction with the other fraction to produce c; a leading zero counter and a left shifter produce d; an exponent adder produces s from r.]
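The data flow of the floating-point adder can be followed step by step in software. This is a minimal sketch using the symbols from the slide (a, p, b, q, t, r, c, d, s). It assumes positive fractions held as exact Fraction values, and it also handles a sum c >= 1 with a right shift, which is an addition not shown on the slide. The function name fp_add is hypothetical.

```python
from fractions import Fraction


def fp_add(a, p, b, q):
    """Add A = a * 2**p and B = b * 2**q.

    Returns (d, s) such that A + B = d * 2**s and 0.5 <= d < 1.
    Assumes a and b are positive fractions.
    """
    # Exponent subtractor: difference and larger exponent.
    t = abs(p - q)
    r = max(p, q)
    # Fraction selector: pick the fraction that belongs to the smaller exponent.
    if p >= q:
        other, smaller = a, b
    else:
        other, smaller = b, a
    # Right shifter: align the selected fraction by t bit positions.
    aligned = smaller / 2**t
    # Fraction adder: A + B = c * 2**r.
    c = other + aligned
    # Normalization: shift until 0.5 <= d < 1, adjusting the exponent.
    d, s = c, r
    while d >= 1:                      # sum overflowed: shift right (assumption)
        d, s = d / 2, s + 1
    while 0 < d < Fraction(1, 2):      # leading zeros: shift left
        d, s = d * 2, s - 1
    return d, s


if __name__ == "__main__":
    # 0.5 * 2**3 + 0.75 * 2**1 = 4 + 1.5 = 5.5
    print(fp_add(Fraction(1, 2), 3, Fraction(3, 4), 1))
```

In a hardware pipeline each of these steps would work on a different pair of operands during the same clock cycle; the sketch only shows the order of operations for one pair.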
INSTRUCTION CYCLE
Six Phases* in an Instruction Cycle
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place
* Some instructions skip some phases
* Effective address calculation can be done as
part of the decoding phase
* Storage of the operation result into a register
is done automatically in the execution phase
==> 4-Stage Pipeline
[1] FI: Fetch an instruction from memory
[2] DA: Decode the instruction and calculate
the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation
INSTRUCTION PIPELINE
Execution of Three Instructions in a 4-Stage Pipeline
Conventional
i      FI DA FO EX
i+1                FI DA FO EX
i+2                            FI DA FO EX
Pipelined
i      FI DA FO EX
i+1       FI DA FO EX
i+2          FI DA FO EX
INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE
Step:            1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1    FI  DA  FO  EX
            2        FI  DA  FO  EX
   (Branch) 3            FI  DA  FO  EX
            4                FI  -   -   FI  DA  FO  EX
            5                    -   -   -   FI  DA  FO  EX
            6                                    FI  DA  FO  EX
            7                                        FI  DA  FO  EX
[Figure: flowchart of the four-segment instruction pipeline.
Segment 1: Fetch instruction from memory.
Segment 2: Decode instruction and calculate effective address.
Segment 3: Fetch operand from memory.
Segment 4: Execute instruction.
Decision boxes: Branch? (after decoding) and Interrupt? (after execution), leading to interrupt handling, Update PC and Empty pipe.]
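The cost of the branch in the timing diagram can be estimated with a simple model. The sketch below assumes that the instruction following a branch cannot be usefully fetched until the branch has finished its EX step; the function names are hypothetical. It returns the step of each instruction's final FI and the total number of steps.

```python
def fetch_steps(n, k, branch_at=None):
    """Step at which each of n instructions performs its final FI.

    k is the number of pipeline stages. branch_at is the 1-based position
    of a branch instruction, or None when there is no branch.
    """
    steps = []
    next_step = 1
    for i in range(1, n + 1):
        steps.append(next_step)
        if i == branch_at:
            next_step += k   # wait until the branch has completed EX
        else:
            next_step += 1   # normal overlap: one new fetch per step
    return steps


def total_steps(n, k, branch_at=None):
    """Step in which the last instruction completes its final stage."""
    return fetch_steps(n, k, branch_at)[-1] + k - 1


if __name__ == "__main__":
    print(fetch_steps(7, 4, branch_at=3), total_steps(7, 4, branch_at=3))
    print(fetch_steps(7, 4), total_steps(7, 4))
```

For seven instructions in a four-stage pipeline the model gives 10 steps without a branch and 13 steps when instruction 3 is a branch, which agrees with the 13 steps shown in the diagram.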
RISC PIPELINE
RISC
- Machine with a very fast clock cycle that
executes at the rate of one instruction per cycle
<- Simple instruction set, fixed-length instruction format,
   register-to-register operations
Instruction Cycles of Three-Stage Instruction Pipeline
Data Manipulation Instructions
  I: Instruction Fetch
  A: Decode, Read Registers, ALU Operations
  E: Write a Register
Load and Store Instructions
  I: Instruction Fetch
  A: Decode, Evaluate Effective Address
  E: Register-to-Memory or Memory-to-Register
Program Control Instructions
  I: Instruction Fetch
  A: Decode, Evaluate Branch Address
  E: Write Register (PC)
DELAYED LOAD
Three-segment pipeline timing
Pipeline timing with data conflict
Clock cycle:     1  2  3  4  5  6
1. Load R1       I  A  E
2. Load R2          I  A  E
3. Add R1+R2           I  A  E
4. Store R3               I  A  E
Pipeline timing with delayed load
Clock cycle:     1  2  3  4  5  6  7
1. Load R1       I  A  E
2. Load R2          I  A  E
3. NOP                 I  A  E
4. Add R1+R2              I  A  E
5. Store R3                  I  A  E
LOAD:  R1 <- M[address 1]
LOAD:  R2 <- M[address 2]
ADD:   R3 <- R1 + R2
STORE: M[address 3] <- R3
The data dependency is taken care of by the compiler rather
than by the hardware.
Example:
address 1 = 2000, address 2 = 2001, address 3 = 2003
M[2000] = 5, M[2001] = 8
R1 = 5
R2 = 8
R3 = 5 + 8 = 13
M[2003] <- R3, so M[2003] = 13
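The compiler's part in a delayed load can be sketched as a pass over the instruction list. The tuple format (operation, destination register, source registers) and the function name insert_delay_slots are hypothetical. The sketch assumes the three-segment timing shown above, where a loaded value is not yet available to the instruction that immediately follows the load.

```python
def insert_delay_slots(program):
    """Insert a NOP after each load whose result is used by the next instruction.

    program is a list of (op, dest, sources) tuples, for example
    ("ADD", "R3", ("R1", "R2")). Returns a new list.
    """
    result = []
    for index, (op, dest, sources) in enumerate(program):
        result.append((op, dest, sources))
        if op == "LOAD" and index + 1 < len(program):
            next_sources = program[index + 1][2]
            if dest in next_sources:
                # The next instruction would read the register too early.
                result.append(("NOP", None, ()))
    return result


if __name__ == "__main__":
    program = [
        ("LOAD", "R1", ()),             # R1 <- M[address 1]
        ("LOAD", "R2", ()),             # R2 <- M[address 2]
        ("ADD", "R3", ("R1", "R2")),    # R3 <- R1 + R2
        ("STORE", None, ("R3",)),       # M[address 3] <- R3
    ]
    for instruction in insert_delay_slots(program):
        print(instruction)
```

For the four-instruction program on the slide, the pass places a single NOP between the second load and the add, giving the five-instruction sequence in the delayed-load timing table. A real compiler would first try to move an independent instruction into that slot instead of a NOP.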
DELAYED BRANCH
Using no-operation instructions
Clock cycles:    1  2  3  4  5  6  7  8  9  10
1. Load A        I  A  E
2. Increment        I  A  E
3. Add                 I  A  E
4. Subtract               I  A  E
5. Branch to X               I  A  E
6. NOP                          I  A  E
7. NOP                             I  A  E
8. Instr. in X                        I  A  E
Rearranging the instructions
Clock cycles:    1  2  3  4  5  6  7  8
1. Load A        I  A  E
2. Increment        I  A  E
3. Branch to X         I  A  E
4. Add                    I  A  E
5. Subtract                  I  A  E
6. Instr. in X                  I  A  E
The compiler analyzes the instructions before and after
the branch and rearranges the program sequence by
inserting useful instructions in the delay steps.
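The two ways of filling the branch delay steps can be compared with a short sketch. It assumes two delay slots, as in the three-segment timing above, and it assumes that the instructions moved after the branch do not affect the branch itself. A real compiler would have to verify that. All function names are hypothetical.

```python
def pad_with_nops(before_branch, branch, slots=2):
    """Keep the program order and fill the delay slots with NOPs."""
    return before_branch + [branch] + ["NOP"] * slots


def rearrange(before_branch, branch, slots=2):
    """Move the last `slots` instructions before the branch into the delay slots.

    Assumes the moved instructions are independent of the branch.
    Any slot that cannot be filled is padded with a NOP.
    """
    cut = max(len(before_branch) - slots, 0)
    kept = before_branch[:cut]
    moved = before_branch[cut:]
    return kept + [branch] + moved + ["NOP"] * (slots - len(moved))


def cycles(instruction_count, k=3):
    """Clock cycles for a k-segment pipeline with no stalls."""
    return instruction_count + k - 1


if __name__ == "__main__":
    body = ["Load A", "Increment", "Add", "Subtract"]
    with_nops = pad_with_nops(body, "Branch to X") + ["Instr. in X"]
    reordered = rearrange(body, "Branch to X") + ["Instr. in X"]
    print(with_nops, cycles(len(with_nops)))
    print(reordered, cycles(len(reordered)))
```

For the program on the slide, the NOP version needs 10 clock cycles up to and including the first instruction at X, while the rearranged version needs 8.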
