Pipelining
Modern CPUs implement a technique called pipelining in order to improve performance.
Although you are taught that the CPU cycle consists of four stages (fetch, decode, execute, store) – the reality is more complex.
CPUs are created from billions of logic gates, which are combined to create circuits, each of which has a specific purpose. The reality is that for an instruction to be completely processed by the CPU, it passes through multiple smaller steps. This means that at any one time, a large portion of the CPU is unused.
As an example, a 5 stage pipeline would mean that the execution of an instruction is split into five parts, which occur one after the other. In this case, processing one instruction will only use 20% of the capability of the CPU at any one time: if the CPU waits until the instruction has completed the fifth step before it fetches the next one, we have a lot of wasted capacity.
Instead, the CPU will fetch the next instruction as soon as it can; once the first instruction reaches stage 2 of the pipeline, the next instruction can begin stage 1 of processing: this way, after five instructions, the CPU is being fully utilised. (Assuming the CPU is idle to start with, after one instruction it is 20% utilised, after 2 instructions this rises to 40%, after three it reaches 60% and so on)
Pipeline stages
Taking the entire execution stage of a CPU and splitting it into multiple smaller parts results in smaller, simpler circuits, but more of them. In the past, Intel used this as a strategy for increasing the clock speed of its CPUs (in the Pentium 4 Netburst/Prescott era). The smaller circuits were able to be clocked at a higher speed, allowing them to claim higher clock speeds for their CPUs. (Because each stage of the pipeline has to do less work, it doesn’t take as long, which is why the clock speed can be increased).
However, in order to achieve this, the CPU required a long pipeline: that is, 20 or more stages, meaning parts of twenty instructions could be in operation at any one time on a single core.
This is important.
Look at the following piece of assembly code:
LOAD #100, R1 ADD #101, R1 CMP R1, 200 BGT 100 OTHER INSTRUCTIONS...
This code would load the value from memory location 100 into register R1, and then add the contents of memory location 101 to the value in R1.
It is then compared to the value 200. BGT means ‘branch if greater than’ – so if the combined value of the two memory locations is greater than 200, branch forwards by 100 bytes. If the value is not greater than 200, execution will continue with the creatively named ‘OTHER INSTRUCTIONS…’.
The problem here is that the CPU has no way of knowing in advance what the outcome of the comparison will be. In order to keep the pipeline full, the CPU must keep fetching instructions, decoding them, and executing them. CPUs attempt to do this through the use of branch prediction – that is, they use algorithms to decide the most likely response of the comparison and assume that it will happen. However, in the event that the prediction is wrong, all instructions already being executed in the pipeline are now redundant, and must be flushed (cleared). The CPU will continue execution with an empty pipeline, which must now re-fill before any output can be received.
TLDR; having a longer pipeline allows designers to increase the clock speed, but the longer the pipeline, the more chance of having to predict future instructions. The more frequently future instructions are predicted, the more often the prediction will be wrong. Every time this happens, the pipeline will be emptied, leading to a delay in processing.
There is a tradeoff between lengthening the pipeline to achieve higher clock speeds and reducing efficiency due to failed branch predictions.
Pipelining is a complex and interesting topic; read more about it here.