Talk:Superscalar processor

From Wikipedia, the free encyclopedia

Timelines

'Beginning with the 'P6' (Pentium Pro and Pentium II) implementation, Intel's 80386'

Uh - this is wrong. The 386 is way before any Pentium (586) or successor. -- The preceding unsigned comment was added by 62.1.133.99 (talk • contribs) 18:05, 19 February 2007 (UTC).

Uh - that is an incomplete quotation. The full quotation is "Beginning with the "P6" (Pentium Pro and Pentium II) implementation, Intel's 80386 architecture microprocessors" (emphasis mine); the clause with "80386" in it is referring to the 32-bit x86 (IA-32) architecture and processors that implement it, of which the first was the 80386, followed by the 80486, the Pentium, the Pentium Pro, etc. Guy Harris 03:55, 21 February 2007 (UTC)
Note that the first x86 superscalar was the Pentium, not the Pentium Pro. Quote from the 1995 Developer Manual, Volume 3, Chapter 2, paragraph 2:

The Intel Pentium processor, like its predecessor the Intel486 microprocessor, is fully software compatible with the installed base of over 100 million compatible Intel Architecture systems. In addition, the Intel Pentium processor provides new levels of performance to new and existing software through a reimplementation of the Intel 32-bit instruction set architecture using the latest, most advanced, design techniques. Optimized, dual execution units provide one-clock execution for "core" instructions, while advanced technology, such as superscalar architecture, branch prediction, and execution pipelining, enables multiple instructions to execute in parallel with high efficiency. Separate code and data caches combined with wide 128-bit and 256-bit internal data paths and a 64-bit, burstable, external bus allow these performance levels to be sustained in cost-effective systems. The application of this advanced technology in the Intel Pentium processor brings "state of the art" performance and capability to existing Intel architecture software as well as new and advanced applications.

Who was first?

I'm having problems with the statement that the Cray was the first superscalar system. I'm currently studying the paper by Tomasulo written in 1965, which describes a modified IBM 360 system that was developed to be superscalar. The paper introduces the concepts of reservation stations and the common data bus, so the fact that it has not been mentioned makes me wonder what's going on in the history section. It's likely that the Cray was developed in parallel. Also, this modified IBM was likely only ever used for research purposes, but that sure as hell doesn't mean it wasn't the first superscalar computer. The paper is dated September 16th, 1965, and is called "An efficient algorithm for exploiting multiple arithmetic units". In this paper he makes no reference to any then-existing superscalar systems, which led me to question the position of his research in the superscalar development timeline. If anyone wants to clear up this mess, be my guest; I have to finish this review in a few hours and sleep. -- Preceding unsigned comment added by 82.10.136.208 (talk) 01:57, 27 October 2008 (UTC)

Definition of "superscalar"

This definition of the term "superscalar" is too loose. Implementing "a form of parallelism" fails to distinguish it from even pipelined architectures, let alone VLIW architectures. Also, lack of a good definition appears to have led to the arguments over the Intel i860. Hennessy and Patterson (Computer Architecture - A Quantitative Approach, 2nd Ed., 1996) indicate that to be superscalar, a machine needs to issue more than one instruction per cycle (which is obviously not the same as executing more than one per cycle). This would then classify machines that have multiple functional units capable of operating in parallel, but only issue one instruction per cycle, as scalar (e.g. Sun's microSPARC-II). VLIW architectures are also effectively multiple-issue, but the number of instructions issued is fixed (and determined by the compiler), whereas in a superscalar machine the number can vary and is typically determined dynamically. Having said this, modern compilers schedule for superscalar machines based on knowledge of the capabilities of the dependency-checking mechanisms of the target processors, so it would be more correct to say that a combination of static and dynamic scheduling is used.

84.92.139.115 17:17, 26 December 2005 (UTC) Marcus
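To make the issue-rate criterion above concrete, here is a minimal Python sketch of an in-order issue stage. It is a toy model under stated assumptions, not a description of any real processor: the `issue_groups` function, the register names, and the example `PROGRAM` are all hypothetical, every instruction is assumed to take one cycle, and structural limits (number of functional units, fetch bandwidth) are ignored. With a width of 1 the model is scalar; with a larger width the number of instructions issued per cycle varies with the dependencies, which is the dynamic behaviour the comment contrasts with VLIW's fixed issue count.

```python
from typing import List, Tuple

# (destination register, source registers); names are purely illustrative
Instr = Tuple[str, Tuple[str, ...]]


def issue_groups(program: List[Instr], width: int) -> List[int]:
    """Return how many instructions are issued in each cycle.

    In-order issue: each cycle takes up to `width` consecutive instructions,
    stopping at the first one that reads or rewrites a register written by
    an instruction already issued in the same cycle.
    """
    if width < 1:
        raise ValueError("issue width must be at least 1")
    groups = []
    i = 0
    while i < len(program):
        written = set()  # destinations of instructions issued this cycle
        count = 0
        while i < len(program) and count < width:
            dest, srcs = program[i]
            if dest in written or any(s in written for s in srcs):
                break  # dependency on this cycle's group: wait a cycle
            written.add(dest)
            count += 1
            i += 1
        groups.append(count)
    return groups


PROGRAM: List[Instr] = [
    ("r1", ("r2", "r3")),   # r1 = r2 + r3
    ("r4", ("r5", "r6")),   # r4 = r5 + r6   (independent)
    ("r7", ("r8",)),        # r7 = r8 >> 1   (independent)
    ("r9", ("r1", "r4")),   # r9 = r1 * r4   (needs r1 and r4)
    ("r10", ("r9", "r7")),  # r10 = r9 + r7  (needs r9)
]

print(issue_groups(PROGRAM, 1))  # scalar: [1, 1, 1, 1, 1]
print(issue_groups(PROGRAM, 3))  # 3-wide: [3, 1, 1]
```

In this model a machine with several functional units but `width == 1` still counts as scalar, which matches the classification of the microSPARC-II given above.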

No one who was actually there during the RISC wars of the mid- to late 1980s has any doubt about the definition of superscalar. The citation from Hennessy & Patterson quoted above is correct. The trolls in the Intel i860 page have no idea what they are talking about. Things like scheduled superscalar are, most generously, academic fictions invented after the fact. All early RISC processors required compiler scheduling. To repeat my credentials -- I was on the Intel i960 design team, and wrote and presented papers about superscalar architecture. All that said, the page is (as you rightly point out) vague and weasel-worded. It could use some tightening and abbreviation. -- Gnetwerker 08:28, 27 December 2005 (UTC)

Unfortunately, the original definition of superscalar seems to have been loosely applied, even in academic circles, and there appears to be some confusion among the manufacturers. For example, Sun's own documentation for the microSPARC-II processor ("The microSPARC-II Processor - Technology White Paper", 1995) first argues against the use of superscalar architectures for low-end processor designs, and then (when comparing with the MIPS R4600) refers to the microSPARC-II as "the superscalar, pipelined microSPARC-II". (The comparison with the R4600 isn't in the 1994 edition of the same white paper, so Sun's marketing department could be at fault here.) As another example, take the i960. The Wikipedia page says "The i960 architecture also anticipated a superscalar implementation, with instructions being simultaneously dispatched to more than one unit within the processor." Are we distinguishing between instruction issue and dispatch as Sima does (Dezső Sima, "Superscalar Instruction Issue", IEEE Micro, 17(5), 1997) or something else? As far as I can tell, the i960 has a peak throughput of 1 instruction per cycle (25 MIPS @ 25 MHz), and thus issues at most one instruction per cycle, making it scalar. Since you were on the i960 team, you should be able to provide me with a definitive explanation here! :o) 84.92.139.115 19:20, 27 December 2005 (UTC) Marcus

A few things are clear: first, the chip companies, Intel chief among them, are guilty of using the term "Superscalar" as more of a marketing term than anything else. Second, the term has joined "RISC" in becoming a general synonym for "good". However, I would point to Mike Johnson's book Superscalar Microprocessor Design, which should be the definitive text. It was published in December 1990 -- roughly contemporaneously with the i960CA and the AMD29050 implementations.

Regarding the i960 page, the phrase referred to ("The i960 architecture also anticipated a superscalar implementation") refers to the fact that the original architecture (i.e. macro-architecture or ISA), though a simple and even CISC-like scalar implementation, contained a RISC subset that was amenable to the ultimate (i960CA) superscalar implementation. Glen Myers, who wrote "Advances in Computer Architecture" (1978), was the 960 designer, and had studied and written about (and worked on) some of the superscalar mainframes like the Stretch. When we built the i960CA, it could issue an ALU instruction, a memory load or store, and a branch in one cycle. It could not sustain this 3-instruction speed, though, since you couldn't instruction-fetch and dispatch a branch every 3 insns. However, under the right conditions (i.e. reading code from the cache and no load/store delays) it could sustain 2 instructions/cycle. The first i960CA ran at 33MHz, hence the "66 MIPS" t-shirt hanging in my closet. We were rightly criticized at the time that this was a "guaranteed not to exceed" speed. Hope this answers your questions. -- Gnetwerker 23:23, 27 December 2005 (UTC)
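The MIPS figures traded in this thread are simply clock rate multiplied by instructions issued per cycle. A small Python sketch of that arithmetic follows; `peak_mips` is a hypothetical helper name, and the result is a peak ("guaranteed not to exceed") figure, not a measured one.

```python
def peak_mips(clock_mhz: float, instructions_per_cycle: float) -> float:
    """Peak millions of instructions per second: clock (MHz) x issue rate."""
    return clock_mhz * instructions_per_cycle


# Figures quoted in the comments above
print(peak_mips(25, 1))  # scalar i960 at 25 MHz: 25
print(peak_mips(33, 2))  # i960CA at 33 MHz, 2 instructions/cycle: 66
```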

Thanks for the clarification (and the reference... it's in the library, so I'll have a read)! Thanks again for your help. :o) 84.92.139.115 14:12, 28 December 2005 (UTC) Marcus

Superscalar dispatch limit

Attempted to explain a key limitation of the superscalar approach to further performance increases. A common and obvious question is: if superscalar works, why not just keep doing it even more? Tried to explain that. Joema 14:58, 17 November 2006 (UTC)

The one thing missing from your otherwise excellent explanation is the issue of non-interlocked simultaneous dispatch. This requires the compiler to schedule instructions in order to achieve the correct result, on the assumption that the compiler has better information about such interdependencies. On the other hand, the 5-6x limit is correct, because of underlying interdependencies in scalar code, not because of the implementation complexity of interlocking. This is discussed in Mike Johnson's book (the result of his thesis), and elsewhere at the time (from memory, I don't have a cite). Your para leads the reader to assume that superscalar reached its design endpoint, when in fact it got somewhat pushed to the side, as Intel and AMD, who were (and are) the dominant uP designers, both shifted their RISC design teams to x86 architectures, which require interlocks because of legacy code and dumb compilers. I haven't changed anything. If you want to take a swing at it, please do (assuming you agree), otherwise I'll wait a while for a response and then give it a whack. -- Gnetwerker 18:14, 17 November 2006 (UTC)
Oops, I made a few more changes before I read the above. Please feel free to change anything you want. I'm not an expert in this area, just trying to make the main points and issues of superscalar design vs other approaches more crisp.
However I've always thought the 5-6x dispatch limit was due to dispatcher implementation complexity and associated delay factors. The degree of dispatcher cost varies based on several assumptions: whether out-of-order-issuing, instruction set cardinality, etc. However it seems to rise geometrically with dispatch width. The Cotofana paper seems to corroborate that (sorry, PostScript format only): [1]. If I'm wrong, feel free to change the article as needed. Just trying to answer two obvious questions I think many readers will have: (1) What's the difference between superscalar vs other approaches, and (2) "if superscalar works, why not just do it more rather than fool with VLIW, etc?". Joema 23:56, 17 November 2006 (UTC)
Figure 3-3 on pg. 40 of Mike Johnson's book (see References in the article) shows that across 18 selected programs, the inherent maximum execution rates (i.e. the underlying parallelism) had a mean of about 5.6 instructions/cycle. There are additional hardware considerations (cache, pipeline hazard detection) that combine to lower the practical limit for an implementation to about 2.5x a similar scalar machine. Johnson is the primary published author on this topic. He, John Mashey (MIPS), and Steven McGeady (Intel) did much of the research on this in the 1980s. -- Gnetwerker 00:50, 18 November 2006 (UTC)
Thanks, yes we're limited by the intrinsic instruction level parallelism in existing code. But my point was aside from this, there's a hardware limit imposed by the geometrically increasing overhead required for dependency checking in an out-of-order superscalar design. Programs can be recompiled and improved compiler technology can expose more parallelism. However an out-of-order superscalar design that requires hardware dependency checking imposes a limit that can't be circumvented without a fundamental architectural change. What this limit is varies based on instruction set size and issue width. However the limit (mainly associated gate/wiring delays) rises so quickly with issue width that it caps clock speed. Is that not the correct understanding? Joema 04:49, 18 November 2006 (UTC)
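One simplified way to see why dependency-checking cost climbs with issue width is to count register comparisons. The Python sketch below is an illustration under assumptions that are not in the thread: every instruction has one destination and two source registers, and each source must be compared against the destination of every earlier instruction in the same issue group. Under that model the count is w(w-1), i.e. quadratic in issue width w. It says nothing about gate or wiring delay, and `raw_comparators` is a hypothetical name.

```python
def raw_comparators(issue_width: int, sources_per_instr: int = 2) -> int:
    """Comparisons needed to check each instruction's sources against the
    destinations of all earlier instructions in the same issue group."""
    # The k-th instruction in the group (0-based) has k earlier instructions.
    return sum(sources_per_instr * earlier for earlier in range(issue_width))


for width in (1, 2, 4, 8):
    print(width, raw_comparators(width))
# 1 0
# 2 2
# 4 12
# 8 56
```

Doubling the width from 4 to 8 more than quadruples the comparator count in this model, which is consistent with the point that the cost grows much faster than the width itself.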
Well, Johnson's research is too complex to go into in great detail here, but one of his findings is that caches, rather than pipeline hazards, as he calls them, are more the limiting factor. He does devote a whole chapter to software vs. hardware implementation of dependency checking, so this may be a non-answer for you. If you assume that you can achieve near-optimal scheduling at compile-time, then you'll never worry about hardware complexity in pipeline interlocking, you'll always find the problem elsewhere. There is, of course, much research from early out-of-order work (Tomasulo, et al) on the overall complexity of that, but superscalar is a small subset thereof. I really suggest you read Johnson's book. -- Gnetwerker 06:47, 18 November 2006 (UTC)
I'll try to get Johnson's book. Made further changes to clarify hardware dependency checking isn't the only limitation to achievable superscalar speedup: Even given infinitely fast dependency checking hardware, intrinsic parallelism in the instruction stream still limits available speedup. If there's anything worded wrong, please feel free to change it.
As you said, if you do compile time scheduling, you thus avoid the burden of hardware dependency checks -- assuming you never run legacy code or run it in only a compatibility mode. But barring this, I think the checks must be done, even if there's only a small probability of dependencies. IOW you can only jettison the dependency checking logic (and associated clock speed cost) if compile time scheduling is perfect, as it's assumed to be with VLIW. Let me know if I'm off base on this. Joema 23:20, 19 November 2006 (UTC)
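The point that intrinsic parallelism caps speedup even with infinitely fast dependency-checking hardware can be illustrated with a small Python sketch. It assumes unit-latency instructions, unlimited execution units, and an acyclic, non-empty dependency graph; the `ilp_limit` function and the example graph are hypothetical and unrelated to the measured figures in Johnson's book.

```python
from typing import Dict, List


def ilp_limit(deps: Dict[str, List[str]]) -> float:
    """Upper bound on instructions per cycle for a dependency graph.

    `deps` maps each instruction name to the names it depends on.
    The bound is instruction count divided by the longest dependency chain,
    since each link in a chain must execute in a separate cycle.
    """
    depth: Dict[str, int] = {}

    def chain(name: str) -> int:
        # Length of the longest chain ending at `name` (memoised).
        if name not in depth:
            depth[name] = 1 + max((chain(d) for d in deps[name]), default=0)
        return depth[name]

    longest = max(chain(name) for name in deps)
    return len(deps) / longest


EXAMPLE = {
    "i0": [],
    "i1": [],
    "i2": [],
    "i3": ["i0", "i1"],
    "i4": ["i3", "i2"],
    "i5": [],
}

print(ilp_limit(EXAMPLE))  # 6 instructions, longest chain 3 -> 2.0
```

No issue width, however large, lets this example exceed 2 instructions per cycle; only a different instruction stream (for instance, one produced by a better compiler) would change the bound.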
I'm not sure that introducing VLIW into the discussion is appropriate -- VLIW is explicit parallelism and most superscalar discussions deal more with implicit. That is, a VLIW "instruction" is still just one, very long, instruction. The advantage of hardware versus software instruction scheduling is best seen in the Intel Itanium -- instructions that can't be scheduled in parallel, due to dependencies, result in NO-OPs in the instruction words (and this results in underutilization of the various execution units, see http://www.computer.org/portal/web/csdl/doi/10.1109/IMSCCS.2006.37). Thus executable size is dependent on lack of parallelism. The less available parallelism there is, the larger the executable. Tall Girl (talk) 03:08, 5 June 2011 (UTC)
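The relationship between available parallelism and code size described above can be sketched in Python with a toy bundle packer. This is an assumption-laden illustration: it uses generic fixed-width bundles with `None` standing for a NO-OP slot, it does not model the real Itanium bundle or template format, and `pack_bundles`, `nop_count`, and both example programs are hypothetical.

```python
from typing import List, Optional, Tuple

# (destination register, source registers); names are purely illustrative
Instr = Tuple[str, Tuple[str, ...]]
Bundle = List[Optional[Instr]]


def pack_bundles(program: List[Instr], slots: int) -> List[Bundle]:
    """Pack instructions in order into fixed-width bundles.

    An instruction that depends on one already in the current bundle starts
    a new bundle; unused slots are padded with None (a NO-OP).
    """
    if slots < 1:
        raise ValueError("a bundle needs at least one slot")
    bundles: List[Bundle] = []
    current: List[Instr] = []
    written = set()  # destinations written inside the current bundle
    for instr in program:
        dest, srcs = instr
        conflict = dest in written or any(s in written for s in srcs)
        if conflict or len(current) == slots:
            bundles.append(current + [None] * (slots - len(current)))
            current, written = [], set()
        current.append(instr)
        written.add(dest)
    if current:
        bundles.append(current + [None] * (slots - len(current)))
    return bundles


def nop_count(bundles: List[Bundle]) -> int:
    """Number of padding (NO-OP) slots across all bundles."""
    return sum(slot is None for bundle in bundles for slot in bundle)


PARALLEL = [("r1", ("a",)), ("r2", ("b",)), ("r3", ("c",)), ("r4", ("d",))]
SERIAL = [("r1", ("a",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]

for name, prog in (("parallel", PARALLEL), ("serial", SERIAL)):
    packed = pack_bundles(prog, slots=2)
    print(name, len(packed), "bundles,", nop_count(packed), "NO-OPs")
# parallel 2 bundles, 0 NO-OPs
# serial 4 bundles, 4 NO-OPs
```

Both programs contain four real instructions, but the fully dependent one needs twice as many bundles in this model, half of whose slots are padding.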

Initial description

Tried to improve initial description, esp. regarding superscalar vs pipelining. While virtually all superscalar CPUs are pipelined, it's important to differentiate the two in order to clarify what superscalar actually is. Joema (talk) 14:13, 5 December 2007 (UTC)

I don't really understand why the difference between a functional unit and a processor core should be explained here, since there is little potential for confusion. Rilak (talk) 14:42, 5 December 2007 (UTC)
That was my thought, but someone requested clarification, thus my effort: [2] Joema (talk) 02:49, 6 December 2007 (UTC)
The requested clarification only said that superscalar and pipelining should be explained more clearly... How did processor cores come into this? I think that the statement in question should be changed to this: "A superscalar processor executes more than one instruction per clock cycle by simultaneously issuing multiple instructions to multiple execution units." Rilak (talk) 07:32, 6 December 2007 (UTC)
He mentioned processor cores in his question. Saying "issuing multiple instructions to multiple execution units" could imply multiple cores, unless clarified. We need to describe superscalar in a way that's unambiguous and distinct from pipelining and multi-core CPUs. Ideally it should be worded such that a technically literate, but non-professional casual reader can understand it: Wikipedia:Make technical articles accessible. If you can re-word in a way that accomplishes all these, feel free to make the changes. Joema (talk) 13:23, 6 December 2007 (UTC)
There are two entirely different concepts being discussed -- processor cores, and the individual execution units within each core. And that's something that any technically literate user should understand -- that a core contains multiple components, each of which performs a different function. For example, adders, shifters/rotators, multipliers, etc. Within that single core it is possible, provided the bus width and other supporting logic is available, to issue two instructions, one to an adder and another to a shifter, provided there are no dependencies between the two operations. For example, the statement "a = (b + c) * (d / 2)" can be dispatched so that "b + c" is calculated by the adder and "d / 2" by the shifter. The hypothetical multiplier can't produce the product yet, since it is dependent on the results of the first two instructions (which are being executed simultaneously in different execution units). BUT, if an "e * 3" term were added, and the input bus made wide enough (and the logic needed to handle the new dependency calculation complexity added) to fetch all three instructions in one go, it could be assigned to the multiplier unit and all three terms computed in parallel. Tall Girl (talk) 03:08, 5 June 2011 (UTC)
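The dispatch example above can be walked through with a short Python sketch. It is a toy model under stated assumptions: one adder, one shifter, and one multiplier inside a single core, every operation taking exactly one cycle, and enough fetch bandwidth to see all operations at once. The `schedule` function and the operation tables are hypothetical, and how the "e * 3" term would feed into the final result is left out because the comment does not specify it.

```python
from typing import Dict, List, Tuple

# operation name -> (execution unit, names of operations it depends on)
Ops = Dict[str, Tuple[str, List[str]]]


def schedule(ops: Ops) -> List[List[str]]:
    """Greedy cycle-by-cycle dispatch: an operation issues when its unit is
    free this cycle and all of its inputs finished in an earlier cycle."""
    done = set()
    remaining = list(ops)
    cycles: List[List[str]] = []
    while remaining:
        busy_units = set()
        issued: List[str] = []
        for name in remaining:
            unit, deps = ops[name]
            if unit not in busy_units and all(d in done for d in deps):
                busy_units.add(unit)
                issued.append(name)
        if not issued:
            raise ValueError("cyclic or unsatisfiable dependencies")
        done.update(issued)  # results become visible from the next cycle
        remaining = [n for n in remaining if n not in issued]
        cycles.append(issued)
    return cycles


BASE: Ops = {
    "b + c": ("adder", []),
    "d / 2": ("shifter", []),
    "(b + c) * (d / 2)": ("multiplier", ["b + c", "d / 2"]),
}

EXTENDED: Ops = {
    "b + c": ("adder", []),
    "d / 2": ("shifter", []),
    "e * 3": ("multiplier", []),
    "(b + c) * (d / 2)": ("multiplier", ["b + c", "d / 2"]),
}

print(schedule(BASE))
# [['b + c', 'd / 2'], ['(b + c) * (d / 2)']]
print(schedule(EXTENDED))
# [['b + c', 'd / 2', 'e * 3'], ['(b + c) * (d / 2)']]
```

In the first cycle of the extended case all three units are busy at once within one core, which is the distinction from multi-core execution that the comment draws.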

India Education Program course assignment

This article was the subject of an educational assignment supported by Wikipedia Ambassadors through the India Education Program.

The above message was substituted from {{IEP assignment}} by PrimeBOT (talk) on 19:57, 1 February 2023 (UTC)