History: Analysis/CentralProcessingUnit



Back to UsefulNotes/CentralProcessingUnit

to:

Back to MediaNotes/CentralProcessingUnit



Superscalar execution is when the execution units of a processor are duplicated. This allows it to run more instructions at once for the same program. For instance, if there are a number of operations that don't depend on the results of each other, then they can all execute at once. Strictly speaking though, this isn't an implementation of a UsefulNotes/MultiCoreProcessor though, as the instructions must come from the same unit of execution of a program (typically a thread).

to:

Superscalar execution is when the execution units of a processor are duplicated. This allows it to run more instructions at once for the same program. For instance, if there are a number of operations that don't depend on the results of each other, then they can all execute at once. Strictly speaking though, this isn't an implementation of a MediaNotes/MultiCoreProcessor, as the instructions must come from the same unit of execution of a program (typically a thread).
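To make this concrete, here's a small C sketch (the variable names and the "three free ALUs" assumption are purely for illustration) of the kind of independent work a superscalar core can overlap:

@@#include <stdio.h>\\
\\
int main(void) {\\
    int a = 2, b = 3, c = 5, d = 7, e = 11, f = 13;\\
    int x = a + b;      // independent of the two lines below\\
    int y = c * d;      // independent\\
    int z = e - f;      // independent\\
    int w = x + y + z;  // depends on x, y and z, so it has to wait for them\\
    printf("%d\n", w);  // a core with three free ALUs could do x, y and z in one go\\
    return 0;\\
}\\
@@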



* Certain types of processors, such as the MediaNotes/GraphicsProcessingUnit, are designed as [[UsefulNotes/FlynnsTaxonomy single-instruction, multiple data (SIMD) processors]], meaning they run the same instruction for a large data set. Branching, while possible, means that only some of the data is processed which reduces the efficiency of the processor.

to:

* Certain types of processors, such as the MediaNotes/GraphicsProcessingUnit, are designed as [[MediaNotes/FlynnsTaxonomy single-instruction, multiple data (SIMD) processors]], meaning they run the same instruction for a large data set. Branching, while possible, means that only some of the data is processed which reduces the efficiency of the processor.
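For a rough idea of what that looks like in code, here's a sketch using the GCC/Clang vector extensions (the v4si type comes from their documentation; everything else here is made up for illustration). One add covers four elements at once, and a "branch" has to be handled by computing both sides and picking the answer lane by lane:

@@#include <stdio.h>\\
\\
typedef int v4si __attribute__((vector_size(16)));  // four 32-bit ints in one register\\
\\
int main(void) {\\
    v4si a = {1, 2, 3, 4};\\
    v4si b = {10, 20, 30, 40};\\
    v4si sum = a + b;                    // one SIMD add handles all four lanes\\
    // "Branching" per element: compute both paths, then select lane by lane,\\
    // so divergent data still pays for every path.\\
    v4si mask = a > (v4si){2, 2, 2, 2};  // lanes where a > 2 become all ones\\
    v4si result = (mask & (a - b)) | (~mask & sum);\\
    for (int i = 0; i < 4; i++)\\
        printf("%d %d\n", sum[i], result[i]);\\
    return 0;\\
}\\
@@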


A semi-related note, if one looks into the specs of the CPU, they'll likely find that the CPU has two types of L1 cache: one for instructions and one for data. This might seem odd since everything above this in the [[UsefulNotes/MemoryHierarchy Memory Hierarchy]] allows for both instructions and data to exist in it. This separation mostly has to do with two schools of thought with regards to how [=CPUs=] access memory. The first is the von Neumann architecture, which is where data and instructions can live in the same memory pool. The second is the Harvard architecture, which separates data and instructions in their own memory pools. von Neumann architecture's main benefit is cost (only need one pool of memory and one bus to access it), while Harvard architecture's main benefit is performance (CPU can fetch both instructions and data independently and doesn't have to figure out which one is which). Modern [=CPUs=] use the so-called modified Harvard architecture, where the execution core itself is Harvard, but everything else is von Neumann. In addition, modern memory controllers can also emulate the Harvard architecture with von Neumann implementations by marking sections of memory space as "no execute," making it a data-only space.

to:

A semi-related note, if one looks into the specs of the CPU, they'll likely find that the CPU has two types of L1 cache: one for instructions and one for data. This might seem odd since everything above this in the [[MediaNotes/MemoryHierarchy Memory Hierarchy]] allows for both instructions and data to exist in it. This separation mostly has to do with two schools of thought with regards to how [=CPUs=] access memory. The first is the von Neumann architecture, which is where data and instructions can live in the same memory pool. The second is the Harvard architecture, which separates data and instructions in their own memory pools. von Neumann architecture's main benefit is cost (only need one pool of memory and one bus to access it), while Harvard architecture's main benefit is performance (CPU can fetch both instructions and data independently and doesn't have to figure out which one is which). Modern [=CPUs=] use the so-called modified Harvard architecture, where the execution core itself is Harvard, but everything else is von Neumann. In addition, modern memory controllers can also emulate the Harvard architecture with von Neumann implementations by marking sections of memory space as "no execute," making it a data-only space.
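As a rough sketch of that "no execute" marking (assuming a POSIX system; the exact API varies by OS), a program can ask for memory that is readable and writable but that the hardware will refuse to run code from:

@@#include <stdio.h>\\
#include <string.h>\\
#include <sys/mman.h>\\
\\
int main(void) {\\
    // Request a page with no PROT_EXEC: it's a data-only space.\\
    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,\\
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);\\
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }\\
    strcpy(buf, "just data");        // reading and writing work fine\\
    printf("%s\n", (char *)buf);\\
    // Jumping into buf as if it held instructions would fault, because the\\
    // page tables mark it non-executable.\\
    munmap(buf, 4096);\\
    return 0;\\
}\\
@@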


* Certain types of processors, such as the UsefulNotes/GraphicsProcessingUnit, are designed as [[UsefulNotes/FlynnsTaxonomy single-instruction, multiple data (SIMD) processors]], meaning they run the same instruction for a large data set. Branching, while possible, means that only some of the data is processed which reduces the efficiency of the processor.

to:

* Certain types of processors, such as the MediaNotes/GraphicsProcessingUnit, are designed as [[UsefulNotes/FlynnsTaxonomy single-instruction, multiple data (SIMD) processors]], meaning they run the same instruction for a large data set. Branching, while possible, means that only some of the data is processed which reduces the efficiency of the processor.


There was just one problem. While computation speeds were increasing quite nicely, memory speeds and capacity were not. As [[UsefulNotes/ProgrammingLanguage higher level programming languages]] were developing and taking off, newer processors were built to support these directly to make the execution code as compact as possible. This happened until about the mid [=1980s=] when computer scientists began to figure out that performing simpler tasks in sequence could be done much quicker, often one instruction per clock cycle. Processors were built upon this new design paradigm and given the name Reduced Instruction Set Computing (RISC). Reduced in this case means the time it takes to complete an instruction is reduced. Retroactively, processors made before then were given the name Complex Instruction Set Computing. RISC proved to be such an efficient method of execution that many manufacturers by the late [=1990s=] had or were making RISC processors. x86 is the only, if not one of the only, CISC [=ISAs=] still in widespread use. However, modern x86 processors are designed such that the instructions are decoded into micro-ops and performed in a RISC-like manner.

to:

There was just one problem. While computation speeds were increasing quite nicely, memory speeds and capacity were not. As [[MediaNotes/ProgrammingLanguage higher level programming languages]] were developing and taking off, newer processors were built to support these directly to make the execution code as compact as possible. This happened until about the mid [=1980s=] when computer scientists began to figure out that performing simpler tasks in sequence could be done much quicker, often one instruction per clock cycle. Processors were built upon this new design paradigm and given the name Reduced Instruction Set Computing (RISC). Reduced in this case means the time it takes to complete an instruction is reduced. Retroactively, processors made before then were given the name Complex Instruction Set Computing. RISC proved to be such an efficient method of execution that many manufacturers by the late [=1990s=] had or were making RISC processors. x86 is one of the only, if not the only, CISC [=ISAs=] still in widespread use. However, modern x86 processors are designed such that the instructions are decoded into micro-ops and performed in a RISC-like manner.



In the early days of [[UsefulNotes/ProgrammingLanguage high level programming]], memory was expensive, small in capacity, and the speed gap between it and the CPU was growing. The idea of CISC was to cram several operations into a single instruction (such as "Compare and Exchange 8 bytes") and to allow multiple ways to access the data (encoded as part of the instruction, in a register, or in memory). As the name implies though, there are complexities implementing CISC [=ISAs=] because operations can have variable length instructions and their behavior depends on the parameters passed into it.

to:

In the early days of [[MediaNotes/ProgrammingLanguage high level programming]], memory was expensive, small in capacity, and the speed gap between it and the CPU was growing. The idea of CISC was to cram several operations into a single instruction (such as "Compare and Exchange 8 bytes") and to allow multiple ways to access the data (encoded as part of the instruction, in a register, or in memory). As the name implies though, there are complexities implementing CISC [=ISAs=] because operations can have variable length instructions and their behavior depends on the parameters passed into them.
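As a rough illustration of the difference (the assembly in the comments is hand-waved, not real compiler output), the same one-line C statement can become a single x86 instruction that both reads and writes memory, while a load/store (RISC) machine splits it into separate steps:

@@// One C statement...\\
void add_in_memory(int *a, int *b) {\\
    *b += *a;\\
    // x86 (CISC-ish):    mov eax, [a]\\
    //                    add [b], eax      ; one instruction touches memory twice\\
    // ARM-style (RISC):  ldr r0, [a]       ; load\\
    //                    ldr r1, [b]       ; load\\
    //                    add r1, r1, r0    ; math happens only between registers\\
    //                    str r1, [b]       ; store\\
}\\
@@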



Examples: Intel x86, AMD x64

In the early days of [[UsefulNotes/ProgrammingLanguage high level programming]], memory was expensive, small in capacity, and the speed gap between it and the CPU was growing. The idea of CISC was to cram several operations into a single instruction (such as "Compare and Exchange 8 bytes") and to allow multiple ways to access the data (encoded as part of the instruction, in a register, or in memory). As the name implies though, there are complexities implementing CISC [=ISAs=] because instructions are variable length and their behavior depends on the parameters passed into it.

to:

Examples: Intel x86, AMD x86-64

In the early days of [[UsefulNotes/ProgrammingLanguage high level programming]], memory was expensive, small in capacity, and the speed gap between it and the CPU was growing. The idea of CISC was to cram several operations into a single instruction (such as "Compare and Exchange 8 bytes") and to allow multiple ways to access the data (encoded as part of the instruction, in a register, or in memory). As the name implies though, there are complexities implementing CISC [=ISAs=] because operations can have variable length instructions and their behavior depends on the parameters passed into it.



* Supporting integers and floating point only and not special types like Binary Coded Decimal or Strings
* Having a lot of general-purpose registers rather than a smaller set of register with more special-purpose registers. In the beginning, x86 only had 4 out of 14 registers meant for general purpose operations, compared to ARM which has 8-12 registers for general purpose operations. Though most special-purpose registers can be read from/written to by software without restriction, so they gradually became special-purpose in name only (though it helps compilers out when it comes to knowing which registers to use for a certain purpose)

By the 2000s, RISC had largely taken over. While x86 remains in widespread use, most implementations decode and translate it into something more RISC-like.

to:

* Supporting only integers and floating point data types and not special types like Binary Coded Decimal or Strings
* Having a lot of general-purpose registers rather than a smaller set of registers with more special-purpose registers. Originally x86 only had 4 out of 14 registers meant for general purpose operations, compared to ARM which has 8-12 registers for general purpose operations. Though most special-purpose registers can be read from/written to by software without restriction, so they gradually became special-purpose in name only (though it helps compilers out when it comes to knowing which registers to use for a certain purpose)

By the 2000s, RISC had largely taken over. While x86 remains in widespread use, most implementations decode and translate it into something more RISC-like, with the execution core acting more like a RISC processor.



Examples: Intel IA-64, Transmeta Crusoe, MCST Elbrus

to:

Examples: Intel IA-64, Transmeta Crusoe, AMD Terascale, MCST Elbrus



An evolution of VLIW, the idea of NISC is to instead of compiling software into instructions and let the processor figure out how issue those instructions on its resources, the compiler can figure out where the data being fed into the processor goes and the instructions are only telling the processor how this data flow should work. That is, if you think of a processor's execution units (add, shifting, etc) as modules, you only need to say "the input of this module reads from here and the output of this module goes to there." While technically needing instructions to direct the data flow, these instructions direct data flow, whereas traditional CPU instructions direct control flow.

to:

An evolution of VLIW, the idea of NISC is that instead of compiling software into instructions and letting the processor figure out how to issue those instructions on its resources, the compiler can figure out where the data being fed into the processor goes and the instructions are only telling the processor how this data flow should work. That is, if you think of a processor's execution units (add, shifting, etc) as modules, you only need to say "the input of this module reads from here and the output of this module goes to there." While technically needing instructions, these only direct the data flow, whereas traditional CPU instructions direct control flow.



Most of the improvements listed here are designed to prevent one thing: execution bubbles. Or simply put, execution stalls because something happened, be it waiting for data from main memory or the result of an earlier operation to be completed.



Pipelining in a processor is a way of emulating an assembly line in order to increase throughput. To copy from Website/TheOtherWiki:

to:

Pipelining in a processor is a way of emulating an assembly line in order to increase throughput, often with the goal being one instruction per clock cycle. Every CPU has pipelining, but what differs is how many stages the instruction goes through before being completed. Some processor families have a two-stage pipeline (instruction fetch -> execute). The classic RISC pipeline uses five stages. In practice, modern processors tend to have at least 8-10 stages.

To copy from Website/TheOtherWiki on how this works:



Pipelining does have its problems though. What kills it is branching such as due to if-statements and loops. If there has to be any branching, then everything that preceded the branch has to be thrown away. Taking the car assembly line example, what if suddenly the insane CEO decides trucks are to be made and cease production on all cars?

to:

The main downside to pipelining, especially when a CPU has a lengthy number of stages (known as a deep pipeline), is if there's any branching that needs to be done, such as an if-statement, then the pipeline stalls until the branch is resolved. While branch prediction (discussed later) can help avoid stalls, if it predicts wrong, then there'll still be a penalty. Taking the car factory example, if the factory can only work on one type of product at a time and it wants to make another, it has to wait for all the previous stages of the assembly line to complete before it can rework itself to produce the new product.

The number of pipeline stages also determines, of all things, how fast the CPU ''should'' be clocked. Aside from the physical limitations on why [=CPUs=] can't be clocked really fast, there's a theoretical limit based on how long each stage of the pipeline takes to complete. Ideally every stage should take the same amount of time. If for example we have a 3-stage pipeline where two stages take one clock cycle while the other takes two, that slower pipeline stage starts creating a gap that grows up to the two cycles it takes to complete it.
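Here's a toy calculation of that limit (the stage latencies are made-up numbers): the clock period has to cover the slowest stage, so an unbalanced pipeline leaves the faster stages idle part of the time:

@@#include <stdio.h>\\
\\
int main(void) {\\
    double stage_ns[3] = {1.0, 1.0, 2.0};   // hypothetical 3-stage pipeline\\
    double slowest = 0.0;\\
    for (int i = 0; i < 3; i++)\\
        if (stage_ns[i] > slowest) slowest = stage_ns[i];\\
    printf("Max clock with these stages: %.2f GHz\n", 1.0 / slowest);  // 0.50 GHz\\
    printf("Max clock if balanced at 1ns: %.2f GHz\n", 1.0 / 1.0);     // 1.00 GHz\\
    return 0;\\
}\\
@@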



A semi-related note, if one looks into the specs of the CPU, they'll likely find that the CPU has two types of L1 cache: one for instructions and one for data. This might seem odd since everything above this in the [[UsefulNotes/MemoryHierarchy Memory Hierarchy]] allows for both instructions and data to exist in it. This separation mostly has to do with two schools of thought with regards to how [=CPUs=] access memory. The first is the von Neumann architecture, which is where data and instructions can live in the same memory pool. The second is the Harvard architecture, which separates data and instructions in their own memory pools. von Neumann architecture's main benefit is cost (only need one pool of memory and one bus to access it), while Harvard architecture's main benefit is performance (CPU can fetch both instructions and data independently and doesn't have to figure out which one is which). Modern [=CPUs=] use the so-called modified Harvard architecture, where the execution core itself is Harvard, but everything else is von Neumann. In addition, modern memory controllers can also emulate the Harvard architecture with von Neumann implementations by marking sections of memory space as "no execute," making it a data-only space.



!!! Out of Order execution
A type of instruction reordering where some instructions queued for execution in the future can cut in line if they're not dependent on results from an earlier instruction. This is to prevent cases where an earlier instruction can stall the processor if it's waiting for something but later instructions can run right away. The big issue it has is it requires complex hardware to ensure that the output retains the logical ordering, often eating into die space and power consumption. Historically it was kept out of processors meant for small electronics, but has crept its way back in because performance boost it offers now outweighs its drawbacks.

to:

!!! Out of Order execution (OOE)
A type of instruction reordering where some instructions queued for execution in the future can cut in line if they're not dependent on results from an earlier instruction. This is to prevent cases where an earlier instruction can stall the processor if it's waiting for something but later instructions can run right away. OOE's main issue is it requires complex hardware to ensure that the output retains the logical ordering, often eating into die space and power consumption. Historically it was kept out of processors meant for small electronics, but has crept its way back in because the performance boost it offers now outweighs its drawbacks.
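A small C sketch of the kind of situation OOE helps with (the function and variable names are just for illustration): the load may miss the cache and sit waiting, but the unrelated math below it has no reason to wait:

@@int compute(const int *p, int a, int b) {\\
    int loaded = *p;             // may stall for a long time on a cache miss\\
    int total = loaded + a;      // dependent: has to wait for the load\\
    int unrelated = a * b + 42;  // independent: OOE hardware can let this cut in line\\
    return total + unrelated;\\
}\\
@@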



Superscalar execution is when the execution units of a processor are duplicated. This allows it to run more instructions in a program. For instance, if there are a few math operations that don't depend on each other and there's enough execution units the do them all, they can all run at once. Strictly speaking though, this isn't an implementation of a UsefulNotes/MultiCoreProcessor though, as the CPU core still contains a single program's state.

However, some processor manufacturers found a way to simulate a multicore processor via a technique called simultaneous multithreading (SMT). By allowing multiple program states to be loaded on a CPU core, if there's enough free resources after scheduling work for one state, another state can run on whatever's left. You can think of this like two children sharing a box of crayons or technicians sharing a toolbox rather than everyone having their own.

!!! Microcode, Micro-operations, and Micro-instructions

to:

Superscalar execution is when the execution units of a processor are duplicated. This allows it to run more instructions at once for the same program. For instance, if there are a number of operations that don't depend on the results of each other, then they can all execute at once. Strictly speaking though, this isn't an implementation of a UsefulNotes/MultiCoreProcessor though, as the instructions must come from the same unit of execution of a program (typically a thread).

However, some processor manufacturers found a way to simulate a multicore processor via a technique called simultaneous multithreading (SMT). By allowing multiple units of execution to live in a CPU core, after scheduling one unit of execution to run, if there are any execution resources left, another unit of execution can run on those. You can think of this like two children sharing a box of crayons or technicians sharing a toolbox rather than everyone having their own. One famous implementation of this is Intel's [=HyperThreading=] as they were the first company to implement this feature for consumer computers, though most others simply call it SMT.

!!! Microcode, Micro-instructions, and Micro-operations



In simpler, traditional [=CPUs=], the control unit, which handles how instructions are executed and how data is directed, used to act directly on the instructions from the ISA. However, as [=ISAs=] get more complicated, along with innovative ways to execute instructions get created, having the control unit be directly controlled by the ISA was starting to prove to be a limiting factor. In addition, since the instruction decoder is often hard-wired, any change in how the ISA works or if there's a bug in how the instructions are decoded means an expensive redesign of that part. This is especially true with bugs, as this means older processors will still have a problem.

This is where the micro-code/operation/instruction comes in. Rather than handle the ISA directly, the control unit and execution unit work on its own unique sort of ISA typically called micro-instructions. These map to micro-operations that the execution unit handles. The ISA is translated via microcode, which in its simplest implementation, is a ROM containing a mapping of which ISA instructions map to which micro-instructions. The main benefit of this is that if there's an issue with how the ISA is translated via microcode, it can be patched using an update. Though in most modern [=CPUs=] that employ this, it's stored in SRAM inside the CPU, so any microcode updates will have to be applied again on boot, mostly as a means to make sure the CPU isn't made worse by a bad microcode update or if the user wants to decide whether or not the benefit is worth it (as was the case with Spectre/Meltdown updates).

to:

In simpler, traditional [=CPUs=], the control unit, which handles how instructions are executed and how data is directed, used to act directly on the instructions from the ISA. However, as [=ISAs=] got more complicated and innovative ways to execute instructions were created, having the control unit be directly controlled by the ISA was starting to prove to be a limiting factor. In addition, since the instruction decoder is often hard-wired, if there is any change in how the ISA works or if there's a bug in how the instructions are decoded, it'll require an expensive redesign of that part. This is especially true with bugs, as this means older processors will still have a problem.

This is where the micro-code/instruction/operation comes in. Rather than handle the ISA directly, the control unit and execution unit work on their own unique sort of ISA typically called micro-instructions. These map to micro-operations that the execution unit handles. The ISA is translated via microcode, which in its simplest implementation, is a ROM containing a mapping of which ISA instructions map to which micro-instructions. The main benefit of this is that if there's an issue with how the ISA is translated via microcode, it can be patched using an update. Though in most modern [=CPUs=] that employ this, it's stored in SRAM inside the CPU, so any microcode updates will have to be applied again on boot, mostly as a means to make sure the CPU isn't made worse by a bad microcode update or if the user wants to decide whether or not the benefit is worth it (as was the case with Spectre/Meltdown updates).
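As a very loose toy model (this is not how any real CPU lays out its microcode, and the opcodes here are invented), you can picture the microcode ROM as a lookup table from ISA opcodes to lists of micro-ops; patching the table changes behavior without new silicon:

@@#include <stdio.h>\\
\\
enum uop { UOP_LOAD, UOP_ADD, UOP_STORE, UOP_END };\\
\\
// Invented mapping: ISA opcode -> micro-op sequence the execution unit runs.\\
static const enum uop microcode_rom[][4] = {\\
    [0x01] = {UOP_LOAD, UOP_ADD, UOP_STORE, UOP_END},  // "add to memory" style op\\
    [0x02] = {UOP_ADD, UOP_END},                        // register-only add\\
};\\
\\
int main(void) {\\
    const enum uop *seq = microcode_rom[0x01];\\
    for (int i = 0; seq[i] != UOP_END; i++)\\
        printf("issue micro-op %d\n", seq[i]);\\
    return 0;\\
}\\
@@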
Adding a point about process nodes.

Added DiffLines:

!! A note about "Process Node"
The term "Process Node" generally refers to how small the smallest feature of an integrated circuit (IC) is, and everything else is built around that. So for instance, a "10nm process" is meant to depict the idea that the smallest feature in a 10nm built circuit is 10nm. However, this hasn't held true since 1994 and the measurements are a marketing term to avoid having too-technical of a way to describe the way the IC was built. For instance, would you rather have "45nm process" or "High-k dielectric process"?

In addition, not every manufacturing firm produces the same result and their techniques cause various differences despite being marketed with the same process node. For instance, AMD's Zen+ based processors are built from a 12nm process, but the chips themselves are the same size as the previous generation Zen processors which were built from a 14nm process. The key difference was AMD wanted more of a buffer between active components to help with heat transfer. This also led to confusion with where Intel was versus their immediate competitors of TSMC and Samsung. For instance, the Intel 10 process suggests it's worse than TSMC or Samsung's 7nm process, but Intel's been able to achieve a transistor density similar to their competition's 7nm process.

Simply put, it's best to interpret the number as nothing more than a generational number, rather than anything of technical value.















An evolution of VLIW, the idea of NISC is to instead of compiling software into instructions and let the processor figure out how issue those instructions on its resources, the compiler can figure out where the data being fed into the processor goes and the instructions are only telling the processor how this data flow should work. That is, if you think of a processor's execution units (add, shifting, etc) as modules, you only need to say "the input of this module reads from here and the output of this module goes to there." While technically needing instructions to direct the data flow, this is not the same as the processor needing to figure out an instruction is an ADD instruction.

to:

An evolution of VLIW, the idea of NISC is that instead of compiling software into instructions and letting the processor figure out how to issue those instructions on its resources, the compiler can figure out where the data being fed into the processor goes and the instructions are only telling the processor how this data flow should work. That is, if you think of a processor's execution units (add, shifting, etc) as modules, you only need to say "the input of this module reads from here and the output of this module goes to there." While technically needing instructions to direct the data flow, these instructions direct data flow, whereas traditional CPU instructions direct control flow.



!!! Hybrid core design
An approach that started with mobile devices, hybrid core designs pair fewer, larger high performance CPU cores with more, smaller high efficiency cores. The idea is that the high performance cores (typically called P-cores) can handle tasks that where faster performance is preferable or where low response times are needed. The high efficiency cores (typically called E-cores) can handle background or time-insensitive tasks or help boost the performance of the P-cores. Some hybrid designs may include 3 or more performance layers, but usually only 2 are used.

to:

!!! Microcode, Micro-operations, and Micro-instructions
These three terms are related, but the gist is that this solves a problem when the CPU doesn't behave as intended, either due to a poor design or a complex implementation gone wrong.

In simpler, traditional [=CPUs=], the control unit, which handles how instructions are executed and how data is directed, used to act directly on the instructions from the ISA. However, as [=ISAs=] get more complicated, along with innovative ways to execute instructions get created, having the control unit be directly controlled by the ISA was starting to prove to be a limiting factor. In addition, since the instruction decoder is often hard-wired, any change in how the ISA works or if there's a bug in how the instructions are decoded means an expensive redesign of that part. This is especially true with bugs, as this means older processors will still have a problem.

This is where the micro-code/operation/instruction comes in. Rather than handle the ISA directly, the control unit and execution unit work on its own unique sort of ISA typically called micro-instructions. These map to micro-operations that the execution unit handles. The ISA is translated via microcode, which in its simplest implementation, is a ROM containing a mapping of which ISA instructions map to which micro-instructions. The main benefit of this is that if there's an issue with how the ISA is translated via microcode, it can be patched using an update. Though in most modern [=CPUs=] that employ this, it's stored in SRAM inside the CPU, so any microcode updates will have to be applied again on boot, mostly as a means to make sure the CPU isn't made worse by a bad microcode update or if the user wants to decide whether or not the benefit is worth it (as was the case with Spectre/Meltdown updates).

!!! Heterogeneous design
An approach that started with mobile devices, heterogeneous designs (also known as a hybrid CPU architecture) pair fewer, larger high performance CPU cores with more, smaller high efficiency cores. The idea is that the high performance cores (typically called P-cores) can handle tasks where faster performance is preferable or where low response times are needed. The high efficiency cores (typically called E-cores) can handle background or time-insensitive tasks or help boost the performance of the P-cores. Some hybrid designs may include 3 or more performance layers, but usually only 2 are used.



As processors got more complex, these hardware bugs become more costly to fix because it requires another revision of the hardware. Also by the time the bug is discovered and characterized, it's likely thousands and thousands of units were sold and so you're left with people who have the issue. Often the company must have a free-replacement program in order to not be sued. Software can mitigate the problem by not running the offending instructions or working around them, but as long as the hardware bug exists, there's always the potential of the computer crashing or worse.

Then CPU manufacturers started designing [=ISAs=] to translate instructions into microcodes to run micro-ops. These [=CPUs=] were also designed so the microcode can be updated. Since most of the core operations are handled by microcodes, bugs or problems with the microcode can be fixed with an update. However, microcodes only apply to the execution side of the CPU, not the instruction decoding and scheduling side, so there's still the potential of hardware bugs if they exist there.

to:

As processors got more complex, these hardware bugs become more costly to fix because it requires another revision of the hardware. Also by the time the bug is discovered and characterized, it's likely thousands and thousands of units were sold and so you're left with people who have the issue. Often the company must have a free-replacement program in order to not be sued. Software can mitigate the problem by not running the offending instructions or working around them, but as long as the hardware bug exists, there's always the potential of the computer crashing or worse.

This is where microcode designs come in, but since this only affects how instructions are decoded before being sent off to the execution unit, any bugs past decoding can still present problems.


* Having a lot of general-purpose registers rather than a smaller set of register with more special-purpose registers. In the beginning, x86 only had 4 out of 14 registers meant for general purpose operations, compared to ARM which has 8-12 registers for general purpose operations. Though most special-purpose registers can be read from/written to by software without restriction, so they evolved to special-purpose in name (though it helps compilers out when it comes to knowing which registers to use for a certain purpose)

to:

* Having a lot of general-purpose registers rather than a smaller set of registers with more special-purpose registers. In the beginning, x86 only had 4 out of 14 registers meant for general purpose operations, compared to ARM which has 8-12 registers for general purpose operations. Though most special-purpose registers can be read from/written to by software without restriction, so they gradually became special-purpose in name only (though it helps compilers out when it comes to knowing which registers to use for a certain purpose)


* Making uniform instruction sizes. While using fixed instruction sizes for the entire ISA is often the goal, some [=ISAs=] may allow for variable length instructions, but there are mechanisms in place to limit the variability. For instance, ARM has 16, 32, and 64-bit instructions, but it can only execute them in a specific CPU mode. MIPS introduced 16, 32, and 48-bit instruction sizes, the same operation (such as a 16-bit ADD instruction vs a 48-bit one) is represented by different opcodes. This is unlike x86, where instruction sizes can be up to 15 bytes and there must be some decoding beforehand to determine this.

to:

* Making uniform instruction sizes. While using fixed instruction sizes for the entire ISA is often the goal, some [=ISAs=] may allow for variable length instructions, but there are mechanisms in place to limit the variability. For instance, ARM has 16, 32, and 64-bit instructions, but it can only execute them in a specific CPU mode. Another example is MIPS, which introduced 16, 32, and 48-bit instruction sizes, but different opcodes represent the same operation at different instruction sizes (e.g., a 16-bit ADD instruction or a 48-bit ADD instruction). This contrasts with, say, [=x86=], where the same operation maps to the same opcode, but it may have multiple versions of different sizes, with some instructions being up to 15 bytes in length.



* Making most registers general purpose rather than have special-purpose registers. In the beginning, x86 only had 4 out of 14 registers meant for general purpose operations, compared to ARM which has 8-12 registers for general purpose operations.

to:

* Having a lot of general-purpose registers rather than a smaller set of registers with more special-purpose registers. In the beginning, x86 only had 4 out of 14 registers meant for general purpose operations, compared to ARM which has 8-12 registers for general purpose operations. Though most special-purpose registers can be read from/written to by software without restriction, so they evolved to special-purpose in name (though it helps compilers out when it comes to knowing which registers to use for a certain purpose)




to:

!!! Hybrid core design
An approach that started with mobile devices, hybrid core designs pair fewer, larger high performance CPU cores with more, smaller high efficiency cores. The idea is that the high performance cores (typically called P-cores) can handle tasks where faster performance is preferable or where low response times are needed. The high efficiency cores (typically called E-cores) can handle background or time-insensitive tasks or help boost the performance of the P-cores. Some hybrid designs may include 3 or more performance layers, but usually only 2 are used.

Companies started considering this as they realized that for certain applications, only a few tasks need high performance or low latency. This is especially true in mobile applications, where the only time the device needs high performance is to service requests with a hard time limit (such as talking to a cell phone tower) or with games (at least those with more flair than ''VideoGame/CandyCrushSaga''). Since processor design has to slide between performance and efficiency, and since higher performing cores gulp up more power, it didn't make sense to service background tasks on these cores. So the idea is to throw tasks onto power efficient cores, where people wouldn't really notice a loss in performance but would, in theory, get better battery life. It has since made its way to desktop and laptop computers, such as Intel's Alder Lake.

The trick with this design is to make sure the E-cores aren't so performance deficient that they spend more energy overall doing the task. That is, if an E-core is only 50% as performant as a P-core, it needs to consume well below 50% of the power, otherwise it's no different than running it on a P-core, or worse. In addition, you need to be able to stuff more E-cores in the same die space as a P-core to make the value proposition better. In Intel's case, they were able to fit 4 E-cores in the same space as a P-core, with the E-core performing about 60% as well as a P-core while consuming a quarter of the power (so 4 E-cores = 1 P-core in power consumption). In theory this means that a 4 E-core cluster has better efficiency in performing a task than 1 P-core.
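Running the quoted figures (treat them as rough assumptions rather than measurements) shows why this works out: energy is power multiplied by time, and the E-core's lower power more than covers its longer runtime:

@@#include <stdio.h>\\
\\
int main(void) {\\
    double p_power = 1.00, p_perf = 1.00;   // P-core as the baseline\\
    double e_power = 0.25, e_perf = 0.60;   // E-core: quarter the power, 60% the speed\\
    printf("P-core energy per task: %.2f\n", p_power * (1.0 / p_perf));  // 1.00\\
    printf("E-core energy per task: %.2f\n", e_power * (1.0 / e_perf));  // ~0.42\\
    return 0;\\
}\\
@@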

Another major issue is the software has to be aware of this, as software generally thinks there's only one core type. You don't want a high performance task running on an E-core after all. The OS usually handles scheduling software though, so most developers won't have to think about this.


* Making uniform instruction sizes. Note that while ARM in its entirely technically doesn't have this, as it has 16, 32, and 64-bit instructions, they can only be executed in appropriate CPU modes. This is unlike x86, where its instructions can be up to 15 bytes long and be executed in any mode as long as it supports that instruction.

to:

* Making uniform instruction sizes. While using fixed instruction sizes for the entire ISA is often the goal, some [=ISAs=] may allow for variable length instructions, but there are mechanisms in place to limit the variability. For instance, ARM has 16, 32, and 64-bit instructions, but it can only execute them in a specific CPU mode. MIPS introduced 16, 32, and 48-bit instruction sizes, the same operation (such as a 16-bit ADD instruction vs a 48-bit one) is represented by different opcodes. This is unlike x86, where instruction sizes can be up to 15 bytes and there must be some decoding beforehand to determine this.
Wiki/ namespace clean up.


Pipelining in a processor is a way of emulating an assembly line in order to increase throughput. To copy from Wiki/TheOtherWiki:

to:

Pipelining in a processor is a way of emulating an assembly line in order to increase throughput. To copy from Website/TheOtherWiki:

Added DiffLines:

An alternative to branching is something called predication. That is, instead of evaluating a value and jumping to certain parts of the code depending on the value, you save the result in a register, then tag the affected instructions with that register to determine whether they will be executed or not. Here's an example of what this might look like:
@@if (number < 0)\\
number = number - 1\\
else\\
number = number + 1\\
# Branching code\\
cmp number, 0 # Compare number to 0\\
blt Path_1 # If number < 0, jump to Path_1\\
jmp Path_2 # Jump to Path_2 otherwise\\
Path_1:\\
sub 1, number # Subtract 1 from number\\
jmp End\\
Path_2:\\
add 1, number # Add 1 to number\\
End:\\
ret # Return\\
\\
# Predicated code\\
cmp number, 0, p1, p2 # Compare number to 0, store result in p1, the opposite result in p2\\
(p1) sub 1, number # If p1 is true, subtract 1 from number\\
(p2) add 1, number # If p2 is true, add 1 to number\\
ret\\
@@

Some [=CPUs=] over time have implemented forms of predication, but a major disadvantage of it is you're still executing the same number of instructions regardless of which path you take. This penalizes taking shorter paths because if the longer path is long enough, predication will perform worse than simply branching.
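For what it's worth, the same idea shows up at the C level: write both outcomes without a jump and a compiler may emit a conditional move or predicated instructions instead of a branch (whether it actually does depends on the compiler and target, so treat this as a sketch):

@@int adjust(int number) {\\
    // Same effect as: if (number < 0) number -= 1; else number += 1;\\
    int delta = (number < 0) ? -1 : 1;   // often becomes a compare + conditional move\\
    return number + delta;\\
}\\
@@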



Note that instruction sets are primarily the interface between software and hardware. It is the "language" that the two use to speak to each other. How well the CPU actually performs depends on the implementation, not the instruction set itself. For example, while ARM is RISC based, for the longest time none of its implementations could beat an x86 one, despite x86 being the "slower" CISC instruction set.



Developed in the 70s, the principle of RISC is to reduce the time it takes to complete each instruction. It was found even on a CISC machine, that doing simpler instructions were faster than doing one complex instruction. Soon after processors were designed with the aim of reducing the amount of time it takes to complete an instruction, with goal being one instruction per clock. To do this, RISC architectures typically simplify the instruction decoding process by making uniform instruction sizes, reducing memory access types that an instruction can do, supporting fewer data types (e.g., integers and floating point only, no special types like Binary Coded Decimal or Strings), and most registers can be used for anything, rather than having registers for specific purposes. Ultimately to achieve this simplicity, RISC operations that modify data can only perform said operations within its registers, using load/store instructions to move data around. Appropriately, this is called a [[ExactlyWhatItSaysOnTheTin load/store architecture]]

By the 2000s, RISC had largely taken over. While the x86 remains in widespread use, most implementations of it by then used an instruction translator to convert it to RISC type instructions before being processed.

to:

Developed in the 70s, the principle of RISC is to reduce the time it takes to complete each instruction. It was found, even on a CISC machine, that doing a string of simpler instructions was faster than doing one complex instruction that handles everything those simpler instructions were doing. For example, a CISC CPU can have an operation like "add the numbers in memory locations A and B" using the same "ADD" mnemonic. But "ADD" can also mean "add the numbers in register A and memory location B" or "add the value A to memory location B". The CISC CPU has to spend extra cycles decoding the actual intent of the instruction.

Soon after, processors were designed with the aim of reducing the amount of time it takes to complete an instruction, with the goal being one instruction per clock. To do this, RISC architectures typically simplify the instruction execution process by:
* Making uniform instruction sizes. Note that ARM technically doesn't have this in its entirety, as it has 16, 32, and 64-bit instructions, but they can only be executed in appropriate CPU modes. This is unlike x86, where its instructions can be up to 15 bytes long and be executed in any mode as long as it supports that instruction.
* Reducing memory access types that an instruction can do. Operations that modify data can only do so on registers or with an immediate value (a value that's part of the instruction). Contrast to x86 which can do a combination of registers, immediate values, or memory locations.
* Supporting integers and floating point only and not special types like Binary Coded Decimal or Strings
* Making most registers general purpose rather than have special-purpose registers. In the beginning, x86 only had 4 out of 14 registers meant for general purpose operations, compared to ARM which has 8-12 registers for general purpose operations.

By the 2000s, RISC had largely taken over. While x86 remains in widespread use, most implementations decode and translate it into something more RISC-like.



An evolution of VLIW, the idea of NISC is to instead of compiling software into instructions and let the processor figure out how issue those instructions on its resources, the compiler can figure out where the data being fed into the processor goes and the instructions are only telling the processor how this data flow should work. That is, if you think of a processor's execution units (add, shifting, etc) as modules, you only need to say "the input of this module reads from here and the output of this module goes to there."

to:

An evolution of VLIW, the idea of NISC is that instead of compiling software into instructions and letting the processor figure out how to issue those instructions on its resources, the compiler can figure out where the data being fed into the processor goes and the instructions are only telling the processor how this data flow should work. That is, if you think of a processor's execution units (add, shifting, etc) as modules, you only need to say "the input of this module reads from here and the output of this module goes to there." While technically needing instructions to direct the data flow, this is not the same as the processor needing to figure out an instruction is an ADD instruction.

For a real life analogy, imagine you're in a cafeteria with various stations and someone staffing them to hand out food. Instead of saying "get some turkey, get some green beans, and get some mashed potatoes", which requires a person to recognize what and where they are, you could say "go to stations 1, 3, and 6." You get the same result, but the latter is simpler to resolve. This has the obvious downside that NISC-based software only works for an exact implementation of hardware.
