Analysis / Central Processing Unit

The Expanded Infodump About the CPU

Instruction Sets

Early computers had nothing resembling instruction sets. To program a computer, one had to painstakingly wire the inputs to the proper execution units (adders, logic gates, etc.). Of course, this grew into a test of sanity once job requests became more frequent and the calculations went beyond making artillery tables. Once some form of memory became available, it was possible to have the computer "rewire" itself through the use of instruction sets. In the beginning this was something simple, like instruction 0000 meaning "move some data," 0001 meaning "add data," and so on. The "rewiring" part came from an instruction translator of sorts, which can be thought of as a bunch of logic gates determining where the inputs and outputs should go.

There was just one problem. While computation speeds were increasing quite nicely, memory speeds and capacity were not. As higher-level programming languages developed and took off, newer processors were built to support them directly, to make the executable code as compact as possible. This went on until about the mid-1980s, when computer scientists figured out that a sequence of simpler instructions could be executed much more quickly, often at one instruction per clock cycle. Processors built on this new design paradigm were given the name Reduced Instruction Set Computing (RISC); "reduced" here refers to reducing the time it takes to complete each instruction. Retroactively, processors made before then were labeled Complex Instruction Set Computing (CISC). RISC proved such an efficient method of execution that by the late 1990s most manufacturers had made or were making RISC processors. x86 is one of the only, if not the only, CISC ISA still in widespread use. However, modern x86 processors are designed such that instructions are decoded into micro-ops and executed in a RISC-like manner.

Later improvements were mostly aimed at improving instruction-level parallelism, i.e., doing more instructions at once. The major stumbling block is that the program being run has to be highly predictable for this to pay off.

Note that instruction sets are primarily the interface between software and hardware. It is the "language" that the two use to speak to each other. How well the CPU actually performs depends on the implementation, not the instruction set itself. For example, while ARM is RISC based, for the longest time none of its implementations could beat an x86 one, despite x86 being the "slower" CISC instruction set.

Instruction Set Types

Complex Instruction Set Computer (CISC)

Examples: Intel x86, AMD x86-64

In the early days of high-level programming, memory was expensive, small in capacity, and the speed gap between it and the CPU was growing. The idea of CISC was to cram several operations into a single instruction (such as "compare and exchange 8 bytes") and to allow multiple ways of accessing the data (encoded as part of the instruction, in a register, or in memory). As the name implies, though, CISC ISAs are complex to implement, because instructions can have variable lengths and their behavior depends on the operands passed to them.

Reduced Instruction Set Computer (RISC)

Examples: ARM, IBM PowerPC, AMD GCN

Developed in the 1970s, the principle of RISC is to reduce the time it takes to complete each instruction. It was found that, even on a CISC machine, executing a string of simpler instructions was often faster than executing one complex instruction that did everything those simpler instructions did. For example, a CISC CPU can have an operation like "add the numbers in memory locations A and B" under the same "ADD" mnemonic. But "ADD" can also mean "add the numbers in register A and memory location B" or "add the value A to memory location B." The CISC CPU has to spend extra cycles decoding the actual intent of the instruction.

Soon after, processors were designed with the aim of reducing the time it takes to complete an instruction, the goal being one instruction per clock. To do this, RISC architectures simplify the instruction execution process by:

  • Making instruction sizes uniform. While a single fixed instruction size for the entire ISA is often the goal, some ISAs allow variable-length instructions, with mechanisms in place to limit the variability. For instance, ARM has 16-bit and 32-bit instruction encodings, but a given CPU mode can only execute one kind. Another example is MIPS, which introduced 16, 32, and 48-bit instruction sizes, but different opcodes represent the same operation at different sizes (e.g., a 16-bit ADD instruction and a 48-bit ADD instruction are distinct opcodes). This contrasts with, say, x86, where the same operation maps to the same opcode but may come in multiple encodings of different sizes, with some instructions being up to 15 bytes long.
  • Reducing the kinds of memory access an instruction can perform. Operations that modify data can only do so on registers or with an immediate value (a value that's part of the instruction); memory is only touched through dedicated load and store instructions. Contrast this with x86, where a single instruction can mix registers, immediate values, and memory locations.
  • Supporting only integer and floating-point data types, not special types like binary-coded decimal or strings.
  • Having a lot of general-purpose registers rather than a smaller set of registers padded out with special-purpose ones. Originally x86 had 4 of its 14 registers meant for general-purpose operations, compared to ARM's 8-12 general-purpose registers. That said, most special-purpose registers can be read and written by software without restriction, so they gradually became special-purpose in name only (though the convention still helps compilers know which registers to use for a given purpose).

By the 2000s, RISC had largely taken over. While x86 remains in widespread use, most implementations decode and translate it into something more RISC-like, with the execution core acting more like a RISC processor.

Very Long Instruction Word (VLIW)

Examples: Intel IA-64, Transmeta Crusoe, AMD Terascale, MCST Elbrus

Conceptualized in the 1980s, VLIW was designed to take advantage of multiple, parallel computation units. The instruction word itself contains a number of operations, similar to packing several RISC instructions together. This may sound similar to CISC, but each sub-instruction is still a simple, independent operation, whereas a single CISC instruction tries to do many things at once. A major issue is that programs have to be highly predictable, and they often have to be compiled and sorted ahead of time so they can run in a parallel fashion (something called static scheduling). While it was tried with varying degrees of success from around 2000 to the early 2010s, it's now rare to encounter a native VLIW processor.

No Instruction Set Computing (NISC)

Examples: NVIDIA's GPUs since the GeForce 600

An evolution of VLIW, the idea of NISC is that instead of compiling software into instructions and letting the processor figure out how to issue those instructions on its resources, the compiler figures out where the data being fed into the processor goes, and the "instructions" only tell the processor how that data flow should work. That is, if you think of a processor's execution units (adders, shifters, etc.) as modules, you only need to say "the input of this module reads from here and the output of this module goes there." While this technically still needs instructions, they only direct data flow, whereas traditional CPU instructions direct control flow.

For a real-life analogy, imagine you're in a cafeteria with various stations, each staffed by someone handing out food. Instead of saying "get some turkey, get some green beans, and get some mashed potatoes," which requires a person to recognize what those things are and where they're found, you could say "go to stations 1, 3, and 6." You get the same result, but the latter is simpler to resolve. This has the obvious downside that NISC-based software only works for one exact hardware implementation.

While it's experimental in CPUs, it has seen adoption in GPUs, namely NVIDIA's since the GeForce 600 series. Like VLIW, this kind of instruction set requires a highly predictable program for maximum effectiveness, and graphics processing happens to be exactly that kind of workload.

There's also the similar-sounding "zero instruction set computer" (ZISC), but that describes a computer that runs entirely off of a neural network.

Multitasking

Early computers could only do one thing at a time, and often they would do that one thing until it was done: something like loading instructions and data, executing them, and saving the results. Loading and storing took a lot of time, during which the computer wasn't doing anything really useful. IBM had the clever idea of separating those three tasks. The loading and storing steps were handled by two cheaper machines whose only job was to do those things, leaving the actual processor free to work on the job at hand. Even better, with multiple load/store machines, one could queue up a string of jobs to feed the main computer.

Once computers got powerful enough and memory plentiful enough, the idea of context switching came about. Context switching is when the processor saves its state (usually the important registers) into main memory and switches to another task by loading that task's state into itself. Computers around this time were still big honking mainframes with dumb terminals connected to them. With context switching, if one terminal had a job for the computer, it would run it until some block came up (usually waiting for data or for a peripheral to be free). When that happened, the mainframe would switch to another user's task that was ready to execute, repeating the process if that one also blocked. These early implementations quickly exposed a number of problems:

  • Someone could hog the mainframe with an extremely long task, or take it down with a poorly written program containing, say, an infinite loop.
  • Someone could tie up a resource, like a printer or storage unit, for extended periods of time, blocking others who were waiting on it.
  • Two users could each want a resource the other holds, with neither giving theirs up until the other goes first (a deadlock).

To combat this, several scheduling methods were created:

  • First in, first out (FIFO): The earliest implementation. Jobs are simply done in the order they are received.
  • Shortest Job First: All tasks are sorted by estimated completion time, and the job with the shortest time to complete is scheduled first.
  • Round Robin: The processor gives a slice of time to a task; once the time is up, it moves on to another task. When the time slices are short enough, it appears to the user that the computer is doing many things at once.
  • Priority scheduling: Every task is given a priority. When a task is done, the scheduler runs the next highest-priority task, and a lower-priority task can be preempted by a higher-priority one.

Which led to two primary types of multitasking:

  • Cooperative: While the computer can have multiple programs loaded in memory and ready to run, the program currently running has complete control of the processor and has to voluntarily give it up for another one to get a turn. This means that if the program hangs or crashes, it can bring down the rest of the system.
  • Preemptive: Typically combines round-robin and priority scheduling, with the system forcibly switching tasks when a time slice expires. This allows multiple programs to run seamlessly together, giving the impression that they're all running at once (a minimal sketch of round-robin time slicing follows this list).
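
To make the round-robin idea concrete, here is a minimal sketch in C. The task list, time quantum, and remaining-time bookkeeping are all invented for illustration; a real scheduler tracks far more state and is driven by timer interrupts rather than a simple loop.

#include <stdio.h>

#define NUM_TASKS 3
#define QUANTUM   4   /* time slice handed to each task, in arbitrary ticks */

int main(void) {
    int remaining[NUM_TASKS] = {10, 3, 7};   /* made-up amounts of work left per task */
    int tasks_left = NUM_TASKS;
    int clock = 0;

    while (tasks_left > 0) {
        for (int i = 0; i < NUM_TASKS; i++) {
            if (remaining[i] == 0)
                continue;                            /* this task already finished */
            int slice = remaining[i] < QUANTUM ? remaining[i] : QUANTUM;
            clock += slice;                          /* task i runs for its slice */
            remaining[i] -= slice;
            printf("t=%2d: task %d ran %d ticks, %d left\n", clock, i, slice, remaining[i]);
            if (remaining[i] == 0)
                tasks_left--;                        /* done: drop it from the rotation */
        }
    }
    return 0;
}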

Other Improvements

Most of the improvements listed here are designed to prevent one thing: execution bubbles. Simply put, a bubble is a stall in execution because something has to be waited on, be it data from main memory or the result of an earlier operation.

Pipelining

Pipelining in a processor is a way of emulating an assembly line in order to increase throughput, often with the goal of completing one instruction per clock cycle. Virtually every modern CPU is pipelined; what differs is how many stages an instruction goes through before being completed. Some processor families have a two-stage pipeline (instruction fetch -> execute), the classic RISC pipeline uses five stages, and in practice modern processors tend to have at least 8-10 stages.

To copy from The Other Wiki on how this works:

Consider the assembly of a car: assume that certain steps in the assembly line are to install the engine, install the hood, and install the wheels (in that order, with arbitrary interstitial steps). A car on the assembly line can have only one of the three steps done at once. After the car has its engine installed, it moves on to having its hood installed, leaving the engine installation facilities available for the next car. The first car then moves on to wheel installation, the second car to hood installation, and a third car begins to have its engine installed. If engine installation takes 20 minutes, hood installation takes 5 minutes, and wheel installation takes 10 minutes, then finishing all three cars when only one car can be assembled at once would take 105 minutes. On the other hand, using the assembly line, the total time to complete all three is 75 minutes. At this point, additional cars will come off the assembly line at 20 minute increments.
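
That arithmetic can be reproduced with a short sketch. This is a simplified model in C; the stage times come straight from the quote, and the "first car plus 20 minutes per additional car" formula only reflects this particular example.

#include <stdio.h>

int main(void) {
    const int stage[] = {20, 5, 10};   /* engine, hood, wheels, in minutes */
    const int nstages = 3;
    const int ncars   = 3;

    int per_car = 0, slowest = 0;
    for (int s = 0; s < nstages; s++) {
        per_car += stage[s];
        if (stage[s] > slowest)
            slowest = stage[s];
    }

    /* One car at a time: every car pays the full 35 minutes. */
    int sequential = ncars * per_car;

    /* Pipelined: the first car takes the full 35 minutes, after which a finished
       car rolls off every 20 minutes, because the slowest station limits the
       throughput of the whole line. */
    int pipelined = per_car + (ncars - 1) * slowest;

    printf("one at a time: %d minutes\n", sequential);   /* 105 */
    printf("assembly line: %d minutes\n", pipelined);    /* 75  */
    return 0;
}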

The main downside to pipelining, especially when a CPU has a large number of stages (known as a deep pipeline), is that if any branching needs to be done, such as an if-statement, the pipeline stalls until the branch is resolved. Branch prediction (discussed later) can help avoid these stalls, but if it predicts wrong, there's still a penalty. Taking the car factory example: if the factory can only work on one type of product at a time and wants to make another, it has to wait for everything already on the assembly line to finish before it can retool itself to produce the new product.

The number of pipeline stages also determines, of all things, how fast the CPU should be clocked. Aside from the physical limitations on how fast CPUs can be clocked, there's a theoretical limit based on how long each stage of the pipeline takes to complete, and ideally every stage should take the same amount of time. If, for example, we have a three-stage pipeline where two stages take one clock cycle but the third takes two, that slower stage creates a gap (a bubble) behind it every time an instruction passes through, so the pipeline can only move as fast as its slowest stage; the alternative is to slow the clock until even that stage fits in a single cycle.

Cache

Caching is a response to RAM improving in performance more slowly than the CPU. A cache stores recently used instructions and data so that the CPU can refer to them instead of going out to RAM, and it is built from much faster memory. However, cache memory is expensive, and adding more doesn't always improve performance. To give an example of how cache memory works:

Consider a library. Popular books are constantly being checked out. To save time so that the librarian doesn't have to process the book, store it back, and have the next person who wants to borrow it find the shelf it's in, recently returned books are put in a cart next to the checkout area. A person who wants to borrow the book can then quickly find it there and get out sooner. If the book is no longer requested as much, it eventually gets put back. If the recently returned cart is too large, then the person or librarian may spend more time looking through it than it would take for the person to go find the book in its normal location.
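
In hardware terms, the "recently returned cart" is a small table indexed by part of the memory address. The following is a toy sketch of a direct-mapped cache lookup in C; the line count, line size, and the pretend "RAM" array are made up for illustration, and a real cache also has to deal with writes, eviction, and keeping multiple cores in sync.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NUM_LINES 64                  /* cache lines in this toy cache      */
#define LINE_SIZE 64                  /* bytes held per line                */

struct cache_line {
    int      valid;                   /* has this line been filled yet?     */
    uint32_t tag;                     /* which chunk of RAM the line holds  */
    uint8_t  data[LINE_SIZE];
};

static struct cache_line cache[NUM_LINES];
static uint8_t ram[1 << 20];          /* pretend 1 MiB of "slow" main memory */

uint8_t read_byte(uint32_t addr) {
    uint32_t offset = addr % LINE_SIZE;
    uint32_t index  = (addr / LINE_SIZE) % NUM_LINES;
    uint32_t tag    = addr / (LINE_SIZE * NUM_LINES);

    struct cache_line *line = &cache[index];
    if (!line->valid || line->tag != tag) {
        /* Miss: copy the whole line in from slow RAM and keep it around. */
        memcpy(line->data, &ram[addr - offset], LINE_SIZE);
        line->tag   = tag;
        line->valid = 1;
    }
    /* Hit (or freshly filled line): serve the byte from the fast copy. */
    return line->data[offset];
}

int main(void) {
    ram[1234] = 42;
    printf("%d\n", read_byte(1234));  /* miss: fills a line from RAM                  */
    printf("%d\n", read_byte(1235));  /* hit: the neighboring byte is already cached  */
    return 0;
}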

On a semi-related note, if one looks at the specs of a CPU, they'll likely find that it has two types of L1 cache: one for instructions and one for data. This might seem odd, since everything above this level in the Memory Hierarchy allows instructions and data to coexist. The separation comes from two schools of thought about how CPUs access memory. The first is the von Neumann architecture, where data and instructions live in the same memory pool. The second is the Harvard architecture, which gives data and instructions their own memory pools. The von Neumann architecture's main benefit is cost (only one pool of memory and one bus to access it are needed), while the Harvard architecture's main benefit is performance (the CPU can fetch instructions and data independently and doesn't have to figure out which is which). Modern CPUs use the so-called modified Harvard architecture, where the execution core itself is Harvard but everything else is von Neumann. In addition, modern memory controllers can emulate the Harvard architecture on a von Neumann design by marking sections of the memory space as "no execute," making them data-only.
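
That "no execute" marking is exposed to software through the operating system's memory-protection flags. As a minimal sketch (Linux and other Unix-likes; error handling kept brief), a program can ask for a page of memory that is readable and writable but deliberately not executable:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* Request one page that can be read and written but NOT executed
       (PROT_EXEC is left out). Jumping into this page would fault on
       hardware and operating systems that enforce the marking. */
    size_t len = 4096;
    void *page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    strcpy(page, "data only");        /* using it as data is perfectly fine */
    printf("%s\n", (char *)page);
    munmap(page, len);
    return 0;
}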

Branch Prediction and Speculative Execution

In order to help keep the pipeline flowing, logic was added to the instruction decoding and scheduling parts of the CPU to deal with branches ahead of time. Branch prediction aims to guess which way a branch will go and queues up those instructions. The guessing is based on several methods, from simply always assuming one direction to keeping a counter that tracks how many times the branch has gone one way or the other. With branch prediction came speculative execution, which goes ahead and executes those instructions early, even though the CPU doesn't yet know whether they're actually needed.
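
One of the simplest real-world predictors is a two-bit saturating counter per branch: it takes two wrong guesses in a row to flip the prediction, so a loop that almost always repeats isn't flustered by its single exit. A sketch of that logic in C (the table size and the way the branch address is hashed are arbitrary choices here):

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024

/* One 2-bit counter per slot: 0-1 mean "predict not taken", 2-3 mean "predict taken". */
static uint8_t counters[TABLE_SIZE];

int predict(uint32_t branch_addr) {
    return counters[branch_addr % TABLE_SIZE] >= 2;
}

void update(uint32_t branch_addr, int taken) {
    uint8_t *c = &counters[branch_addr % TABLE_SIZE];
    if (taken && *c < 3)
        (*c)++;                       /* saturate at 3: "strongly taken"     */
    if (!taken && *c > 0)
        (*c)--;                       /* saturate at 0: "strongly not taken" */
}

int main(void) {
    uint32_t branch = 0x400123;                 /* hypothetical branch address   */
    int history[]  = {1, 1, 1, 0, 1, 1};        /* a branch that's usually taken */
    for (int i = 0; i < 6; i++) {
        printf("predicted %d, actual %d\n", predict(branch), history[i]);
        update(branch, history[i]);
    }
    return 0;
}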

Issues with speculative execution came about in 2017, which is discussed in the section below.

Out of Order execution (OOE)

A type of instruction reordering where instructions queued for execution can cut in line if they don't depend on the results of an earlier instruction. This prevents cases where an earlier instruction stalls the processor while waiting for something even though later instructions could run right away. OOE's main issue is that it requires complex hardware to ensure the output retains the logical ordering, which eats into die space and power consumption. Historically it was kept out of processors meant for small electronics, but it has crept its way in because the performance boost it offers now outweighs its drawbacks.
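
As a small illustration (the variable names are invented, and the reordering happens inside the CPU, not in the source code), consider a statement that depends on a slow memory load and one that doesn't:

int example(const int *big_array, int i, int x, int y) {
    int a = big_array[i];  /* (1) a load that may miss the cache and take hundreds of cycles */
    int b = a + 10;        /* (2) depends on (1), so it has to wait for the load             */
    int c = x * y;         /* (3) independent: an out-of-order core can start this while     */
                           /*     (1) is still waiting on memory                             */
    return b + c;
}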

Note this is different from speculative execution: out-of-order execution re-orders the instruction queue, rather than queuing up instructions the CPU isn't sure it will even need.

Superscalar execution and simultaneous multithreading (SMT)

Superscalar execution is when the execution units of a processor are duplicated, allowing it to run more instructions at once for the same program. For instance, if there are a number of operations that don't depend on each other's results, they can all execute at once (see the sketch below). Strictly speaking, this isn't an implementation of a Multi-Core Processor, as the instructions must all come from the same unit of execution of a program (typically a thread).
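
For instance (a contrived sketch), the four additions below don't use each other's results, so a core with enough duplicated adders could, in principle, issue all of them in the same clock cycle:

void sum4(const int *in, int *out) {
    /* No addition here reads another's output, so they can all proceed in parallel. */
    out[0] = in[0] + in[4];
    out[1] = in[1] + in[5];
    out[2] = in[2] + in[6];
    out[3] = in[3] + in[7];
}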

However, some processor manufacturers found a way to simulate a multicore processor via a technique called simultaneous multithreading (SMT). By allowing multiple units of execution to live in a single CPU core, any execution resources left over after scheduling one unit of execution can be handed to another. You can think of this like two children sharing a box of crayons, or technicians sharing a toolbox rather than everyone having their own. One famous implementation is Intel's Hyper-Threading, as Intel was the first to bring the feature to consumer computers, though most others simply call it SMT.

Microcode, Micro-instructions, and Micro-operations

These three terms are related, but the gist is that they solve the problem of a CPU not behaving as intended, whether due to a poor design or a complex implementation gone wrong.

In simpler, traditional CPUs, the control unit, which handles how instructions are executed and how data is directed, acted directly on the instructions from the ISA. However, as ISAs got more complicated and innovative ways of executing instructions were invented, having the control unit driven directly by the ISA started to prove a limiting factor. In addition, since the instruction decoder is often hard-wired, any change in how the ISA works, or any bug in how instructions are decoded, requires an expensive redesign of that part of the chip. This is especially painful with bugs, since processors already in customers' hands keep the problem.

This is where microcode, micro-instructions, and micro-operations come in. Rather than handling the ISA directly, the control unit and execution unit work on their own internal sort of ISA, typically called micro-instructions, which map to the micro-operations the execution unit actually performs. The ISA is translated via microcode, which in its simplest implementation is a ROM containing a mapping of which ISA instructions correspond to which micro-instruction sequences. The main benefit is that if there's an issue with how the ISA is translated, the microcode can be patched with an update. In most modern CPUs that employ this, the microcode is loaded into SRAM inside the CPU, so any update has to be re-applied on every boot; this is partly a safeguard so a bad microcode update can't permanently cripple the CPU, and it lets the user decide whether the fix is worth any downsides (as was the case with the Spectre/Meltdown updates).
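
As a toy illustration of the translation idea (the opcode names, micro-op names, and table layout below are entirely invented and bear no resemblance to any real CPU's microcode):

#include <stdio.h>

/* Hypothetical micro-operations the execution core understands. */
enum uop { UOP_LOAD, UOP_ADD, UOP_STORE, UOP_END };

/* Hypothetical ISA-level instructions visible to software. */
enum isa_op { ISA_ADD_REG = 0, ISA_ADD_MEM = 1 };

/* The "microcode ROM": each ISA instruction maps to a short micro-op sequence.
   In a real microcoded CPU, an update patches a table like this one. */
static const enum uop microcode[][4] = {
    [ISA_ADD_REG] = { UOP_ADD, UOP_END },                        /* register + register */
    [ISA_ADD_MEM] = { UOP_LOAD, UOP_ADD, UOP_STORE, UOP_END },   /* register + memory   */
};

void decode(enum isa_op op) {
    for (int i = 0; microcode[op][i] != UOP_END; i++)
        printf("issue micro-op %d\n", microcode[op][i]);
}

int main(void) {
    decode(ISA_ADD_MEM);   /* one ISA instruction expands into several micro-ops */
    return 0;
}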

Heterogeneous design

An approach that started with mobile devices, heterogeneous designs (also known as hybrid CPU architectures) pair a few large, high-performance CPU cores with a larger number of small, high-efficiency cores. The idea is that the high-performance cores (typically called P-cores) handle tasks where faster performance or low response times are needed, while the high-efficiency cores (typically called E-cores) handle background or time-insensitive tasks, or help boost the P-cores. Some hybrid designs include three or more performance tiers, but usually only two are used.

Companies started considering this as they realized that, for certain applications, only a few tasks need high performance or low latency. This is especially true in mobile applications, where the only times the device needs high performance are to service requests with a hard time limit (such as talking to a cell phone tower) or to run games (at least those with more flair than Candy Crush Saga). Since processor design has to trade off between performance and efficiency, and higher-performing cores gulp down more power, it didn't make sense to service background tasks on those cores. Hence the idea of throwing such tasks onto power-efficient cores, where people wouldn't really notice the loss in performance but would, in theory, get better battery life. The approach has since made its way to desktop and laptop computers, such as with Intel's Alder Lake.

The trick with this design is to make sure the E-cores aren't so performance-deficient that they end up spending more energy overall on a task. That is, if an E-core is only 50% as fast as a P-core, it needs to consume well below 50% of the power; otherwise it's no better than running the task on a P-core, or worse. In addition, you need to be able to fit several E-cores in the same die space as a P-core to make the value proposition better. In Intel's case, they were able to fit 4 E-cores in the space of one P-core, with an E-core performing about 60% as well as a P-core while consuming a quarter of the power (so 4 E-cores roughly equal 1 P-core in power consumption). In theory this means a 4 E-core cluster performs a task more efficiently than a single P-core.
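
Plugging in the figures above as a rough sanity check (these are the quoted ratios, not measurements):

#include <stdio.h>

int main(void) {
    /* Ratios quoted above, relative to one P-core. */
    double e_perf  = 0.60;   /* an E-core delivers about 60% of a P-core's performance */
    double e_power = 0.25;   /* ...while drawing about 25% of the power                */

    /* Energy = power x time, and time scales as 1/performance, so the energy an
       E-core spends on a task, relative to a P-core, is power / performance. */
    printf("E-core energy per task: %.0f%% of a P-core's\n", e_power / e_perf * 100);  /* ~42% */

    /* Four E-cores fit in one P-core's area and power budget, so the cluster's
       combined throughput relative to a single P-core is: */
    printf("4 E-core cluster throughput: %.1fx a P-core\n", 4 * e_perf);               /* 2.4x */
    return 0;
}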

Another major issue is that software has to be aware of this design, as software generally assumes there's only one type of core, and you don't want a high-performance task landing on an E-core. The OS usually handles scheduling, though, so most developers won't have to think about it.

Branching and data loading

Branching is when a computer decides between different code paths to take; data loading is fetching data from RAM. When performance is a concern, both should be avoided as much as possible, for the following reasons:

  • If the code takes a different path than predicted, any pre-processing done due to instruction pipelining has to be thrown away.
  • Certain types of processors, such as the Graphics Processing Unit, are designed as single-instruction, multiple-data (SIMD) processors, meaning they run the same instruction over a large data set. Branching, while possible, means that only part of the data is processed at a time, which reduces the efficiency of the processor.
  • Loading data is often slower than just calculating what's needed. For example, when placing a GUI element on screen, it's cheaper to set an anchor point and calculate an offset using constant values encoded in the instructions than it is to store where the object should be and load those coordinates from memory. Another example is physics, such as launching an object: rather than storing its current conditions (coordinates, momentum, etc.) in RAM and retrieving them every time its position is updated, use only the starting parameters and calculate from the object's age. Since the starting parameters don't change, they can live in cache or be encoded as part of the instructions, and since most physics calculations are time-based, the CPU can compute a new position faster than it could fetch and store a bunch of data points (see the sketch after this list).
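
Here's a sketch of that "recompute instead of reload" idea from the last bullet. The numbers are arbitrary; the point is that the position is derived from a couple of starting parameters plus elapsed time, so nothing about the object's current state ever has to be fetched back from RAM.

#include <stdio.h>

/* Height of a launched object, computed purely from its starting speed and its age.
   The constants can live in registers, cache, or the instruction stream itself. */
double height_at(double v0, double t) {
    const double g = 9.81;             /* gravity, m/s^2            */
    return v0 * t - 0.5 * g * t * t;   /* h(t) = v0*t - (g*t^2)/2   */
}

int main(void) {
    double launch_speed = 20.0;        /* arbitrary starting parameter, m/s */
    for (double t = 0.0; t <= 2.0; t += 0.5)
        printf("t=%.1fs  height=%.2fm\n", t, height_at(launch_speed, t));
    return 0;
}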

There have been partial programming solutions to this problem. While branching in code is expensive, selecting between results is not; most modern compilers and assembly programmers produce code that calculates everything in both branches and multiplies each result by 0 or 1 depending on whether it's the valid one. So, for example:

if (number < 0)
number = number - 1
else
number = number + 1

Can be written as

a = number < 0 # Returns 1 if number < 0, 0 otherwise
b = number - 1 # The result if the branch is taken
c = a ^ 1 # XOR operation, c becomes 1 if a is 0 and vice versa
d = number + 1 # The result if the branch is not taken
number = (a * b) + (c * d)

(Please note, don't actually do this in practice. The first version is easier to read, and most compilers will optimize it into something like the second anyway.)

Not only is this done for performance reasons; in light of the issues with speculative execution, it has become even more important for security. While hardware can make sure the proper security checks are done, it's easier and faster to make sure the work being done is issue-free from the start.

Another way of looking at it: branching, especially if-statements, is an eyesore and a headache for humans to read. Every possible branch at least doubles the number of paths through the program, and nested if-statements make it harder to track what is being done. If it's easier for you to understand the program, it's stupid easy for the computer to execute it.

An alternative to branching is something called predication. Instead of evaluating a value and jumping to different parts of the code depending on the result, the result is saved in a predicate register, and the affected instructions are tagged with that register to determine whether they actually execute. Here's an example of what this might look like:

# Original code
if (number < 0)
number = number - 1
else
number = number + 1

# Branching code
cmp number, 0 # Compare number to 0
blt Path_1 # If number < 0, jump to Path_1
jmp Path_2 # Jump to Path_2 otherwise
Path_1:
sub 1, number # Subtract 1 from number
jmp End
Path_2:
add 1, number # Add 1 to number
End:
ret # Return

# Predicated code
cmp number, 0, p1, p2 # Compare number to 0; p1 = (number < 0), p2 = the opposite
(p1) sub 1, number # If p1 is true, subtract 1 from number
(p2) add 1, number # If p2 is true, add 1 to number
ret

Some CPUs over time have implemented forms of predication, but a major disadvantage is that you're executing the same number of instructions regardless of which path is taken. This penalizes short paths: if the longer path is long enough, predication performs worse than simply branching.

Hardware bugs and security issues

Many parts of the processor are essentially hardware "programs." Because of this, bugs have shown up in hardware that prevent processors from running correctly. These bugs result either from invalid instructions (an instruction value that maps to nothing, causing undefined behavior) or from running a series of instructions that paints the CPU into a corner, like disabling interrupts and then executing a "sleep until interrupted" instruction. Infamously, these are called "Halt and Catch Fire" routines: urban legend has it that if a processor ran into such a condition, it would execute a single instruction over and over until the electricity flowing through the processor was so concentrated on the part servicing that instruction that it heated up to the point of catching fire.

As processors got more complex, these hardware bugs became more costly to fix, because a fix requires another revision of the hardware. By the time a bug is discovered and characterized, it's likely that thousands upon thousands of units have been sold, leaving plenty of people stuck with the issue; often the company must offer a free-replacement program to avoid being sued. Software can mitigate the problem by not running the offending instructions or by working around them, but as long as the hardware bug exists, there's always the potential of the computer crashing, or worse. This is where microcoded designs come in, but since microcode only affects how instructions are handled before they're sent off to the execution unit, any bugs past the decoding stage can still cause problems.

This all came to a head in 2017, when it was discovered that many processor families implementing pipelining, speculative execution, and out-of-order execution had done so without security considerations in mind. That is, certain instructions could be executed before security checks were done, allowing a malicious program to snoop in areas of memory it's not supposed to be able to see. This was the basis for the Meltdown and Spectre security vulnerabilities.

A note about "Process Node"

The term "Process Node" generally refers to how small the smallest feature of an integrated circuit (IC) is, and everything else is built around that. So for instance, a "10nm process" is meant to depict the idea that the smallest feature in a 10nm built circuit is 10nm. However, this hasn't held true since 1994 and the measurements are a marketing term to avoid having too-technical of a way to describe the way the IC was built. For instance, would you rather have "45nm process" or "High-k dielectric process"?

In addition, not every manufacturing firm produces the same result, and their techniques yield different characteristics despite being marketed under the same process node. For instance, AMD's Zen+ processors are built on a 12nm process, but the chips themselves are the same size as the previous-generation Zen processors built on a 14nm process; the key difference was that AMD wanted more of a buffer between active components to help with heat transfer. This also led to confusion over where Intel stood versus its immediate competitors, TSMC and Samsung: Intel's 10nm process sounds worse than TSMC's or Samsung's 7nm processes, but Intel was able to achieve a transistor density similar to its competitors' 7nm offerings.

Simply put, it's best to interpret the number as nothing more than a generational number, rather than anything of technical value.

