A superscalar CPU architecture implements a form of parallelism on a single chip, thereby allowing the system as a whole to run much faster than it would otherwise be able to at a given clock speed. A superscalar architecture fetches, executes, and returns results from more than one (standard) instruction during a single pipeline stage (typically this means a single clock cycle).
The simplest processors are scalar processors. A scalar processor processes one data item at a time. In a vector processor, by contrast, a single instruction operates simultaneously on multiple data items. The difference is analogous to the difference between scalar and vector arithmetic. A superscalar processor combines aspects of the two. Each instruction still processes one data item, but there are multiple processing units, so several instructions can process separate data items at the same time.
A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But just processing multiple instructions at the same time does not make an architecture superscalar. Simple pipelining, where a CPU may be loading an instruction while doing arithmetic for the previous one and storing the results from the one before that (thus executing 3 instructions at the same time) is not superscalar processing.
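The distinction can be made concrete with an idealized cycle count. The sketch below is a textbook-style model, not a description of any real CPU: it assumes no stalls, no dependencies between instructions, and one cycle per stage, and the function names are invented for illustration. With a width of 1 the pipeline approaches but never exceeds one instruction per cycle; only a width above 1 (superscalar issue) passes that limit.

```python
import math

def cycles_unpipelined(n_instructions: int, stages: int) -> int:
    # Each instruction passes through every stage before the next one starts.
    return n_instructions * stages

def cycles_pipelined(n_instructions: int, stages: int, width: int = 1) -> int:
    # Idealized model: `width` instructions enter the pipeline each cycle.
    # The first group needs `stages` cycles; each later group adds one cycle.
    groups = math.ceil(n_instructions / width)
    return stages + groups - 1

def ipc(n_instructions: int, cycles: int) -> float:
    # Average instructions completed per cycle.
    return n_instructions / cycles

n, stages = 100, 3  # three stages, as in the load/compute/store example
scalar_ipc = ipc(n, cycles_pipelined(n, stages))           # just under 1
two_wide_ipc = ipc(n, cycles_pipelined(n, stages, width=2))  # above 1
```

In this model, 100 instructions take 300 cycles unpipelined, 102 cycles in a simple three-stage pipeline, and 52 cycles when two instructions are issued per cycle.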
In a superscalar CPU, there are several functional units of the same type, along with additional circuitry to dispatch instructions to the units. For instance, most superscalar designs include more than one integer unit (typically referred to as an ALU). The dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching them to the available units.
Performance of the dispatcher is key to the overall performance of a superscalar design. The task is not a simple one. The instructions a = b + c; d = e + f can be run in parallel because neither result depends on the other calculation. However, the instructions a = b + c; d = a + f cannot simply be issued together, because the second needs the result of the first; how much they can overlap depends on when the first instruction completes as the two move through the units.
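The check the dispatcher has to make for the two instruction pairs above can be sketched as a comparison of destination and source names. This is a minimal illustration, not how any particular CPU implements it; the `Instr` type and function names are hypothetical, and the hazard labels (RAW, WAR, WAW) are standard terminology that the text above does not use.

```python
from typing import NamedTuple, Set, Tuple

class Instr(NamedTuple):
    dest: str               # name the instruction writes
    srcs: Tuple[str, ...]   # names the instruction reads

def hazards(first: Instr, second: Instr) -> Set[str]:
    """Return the dependencies of `second` on `first` (program order)."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads what first writes (true dependency)
    if second.dest in first.srcs:
        found.add("WAR")  # second overwrites something first still reads
    if first.dest == second.dest:
        found.add("WAW")  # both write the same destination
    return found

def independent(first: Instr, second: Instr) -> bool:
    # Two instructions can be issued side by side only if no hazard exists.
    return not hazards(first, second)

add_a = Instr("a", ("b", "c"))        # a = b + c
add_d_free = Instr("d", ("e", "f"))   # d = e + f
add_d_dep = Instr("d", ("a", "f"))    # d = a + f
```

Here `independent(add_a, add_d_free)` holds, while `hazards(add_a, add_d_dep)` reports a RAW dependency on `a`.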
Much of modern CPU design is dedicated to increasing the effectiveness of the dispatcher, allowing it to keep the multiple units in use at all times. This has become increasingly important as the number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU, a modern design like the PowerPC 970 includes four ALUs and two FPUs, as well as two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system as a whole will suffer greatly.
Seymour Cray's CDC 6600 from 1965 is often mentioned as the first superscalar design. The Intel i960CA (1989) and the AMD 29000-series 29050 (1990) microprocessors were the first commercial single-chip superscalar microprocessors. RISC CPUs like these brought the superscalar concept to microcomputers because the RISC design results in a simple core, allowing straightforward instruction dispatch and the inclusion of multiple functional units (such as ALUs) on a single CPU within the constrained design rules of the time. This was a major reason that RISC designs were faster than CISC designs through the 1980s and into the 1990s.
Except for CPUs used in some battery-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar. Beginning with the "P6" (Pentium Pro and Pentium II) implementation, Intel's 80386 architecture microprocessors have implemented a CISC instruction set on a superscalar RISC micro-architecture. Complex instructions are internally translated into RISC-like "micro-ops", allowing the processor to take advantage of the higher-performance underlying core while remaining compatible with earlier Intel processors.
Dramatic improvements in the quality of the control unit now appear unlikely, limiting future improvements in the speed of the basic superscalar design. One potential solution to this problem is to move the dispatcher logic out of the chip and into the compiler, which can spend considerably more time and effort on making the best decisions possible. This is the basic premise of very long instruction word (VLIW) CPU designs, an approach also known as static superscalar or compile-time scheduling.
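Compile-time scheduling can be illustrated with a small greedy packer that groups instructions into fixed-width bundles ahead of time. The sketch rests on simplifying assumptions that are not from the text: every destination is written only once, every operation takes one cycle, all units are identical, and only read-after-write dependencies matter. The function name and program format are hypothetical, and real VLIW compilers are far more elaborate.

```python
from typing import Dict, List, Sequence, Tuple

Instruction = Tuple[str, Tuple[str, ...]]  # (destination, sources)

def schedule(program: Sequence[Instruction], width: int = 2) -> List[List[str]]:
    """Pack instructions into bundles of at most `width` operations each."""
    bundles: List[List[str]] = []
    produced_in: Dict[str, int] = {}  # destination -> index of its bundle
    for dest, srcs in program:
        # An instruction must come after every bundle that produces an input.
        earliest = max(
            (produced_in[s] + 1 for s in srcs if s in produced_in), default=0
        )
        slot = earliest
        # Skip bundles that are already full.
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        if slot == len(bundles):
            bundles.append([])
        bundles[slot].append(dest)
        produced_in[dest] = slot
    return bundles

program = [
    ("a", ("b", "c")),  # a = b + c
    ("d", ("e", "f")),  # d = e + f  (independent of a)
    ("g", ("a", "d")),  # g = a + d  (needs both results)
    ("h", ("g", "b")),  # h = g + b  (needs g)
]
bundles = schedule(program, width=2)
```

For this program the packer produces three bundles: `a` and `d` together, then `g`, then `h`. The hardware then only has to issue each bundle as a unit, since the dependency analysis was done in advance.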
This article is licensed under the GNU Free Documentation License.
It uses material from the Wikipedia article "Superscalar".