SSE2 (Streaming SIMD Extensions 2) is one of the IA-32 SIMD instruction sets, first introduced by Intel with the initial version of the Pentium 4 in 2000. It extends the earlier SSE instruction set and is intended to fully supplant MMX. SSE2 has itself been extended by SSE3, also known as "Prescott New Instructions", which Intel introduced on the Pentium 4 in early 2004. SSE2 adds 144 new instructions to SSE, which itself has 70 instructions.
Rival chip-maker AMD added support for SSE2 with the introduction of their Opteron and Athlon 64 ranges of 64-bit CPUs in 2003, and in 2005 added support for the SSE3 instruction set with an updated "E" revision of their processors.
SSE2 adds support for 64-bit double-precision floating point and for 64, 32, 16 and 8-bit integer operations on the eight 128-bit XMM registers first introduced with SSE. SSE2 adds no additional program state to that provided by SSE.
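As an illustration of the double-precision support described above, the following is a minimal sketch in C using the SSE2 intrinsics from `<emmintrin.h>`. The function name `add_f64` and the `HAVE_SSE2` macro are hypothetical names chosen for this example; the scalar loop doubles as a fallback on targets where SSE2 is not enabled.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1      /* hypothetical macro used only in this sketch */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n doubles. */
void add_f64(double *dst, const double *a, const double *b, size_t n)
{
    size_t i = 0;
    assert(dst != NULL && a != NULL && b != NULL);
#ifdef HAVE_SSE2
    /* Two 64-bit doubles fit in one 128-bit XMM register. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* unaligned load of 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }
#endif
    /* Scalar tail, and portable fallback when SSE2 is unavailable. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Calling `add_f64(d, a, b, 3)` processes the first two elements as one packed addition and the third in the scalar tail.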
The addition of 128-bit integer SIMD operations allows the programmer to completely avoid the eight 64-bit MMX registers, which are "aliased" onto the original IA-32 (x87) floating point register stack. This permits mixing integer SIMD and scalar floating point operations without the mode switching required between MMX and x87 floating point operations. However, this is overshadowed by the value of being able to perform integer SIMD operations on the wider SSE registers.
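The point about mixing integer SIMD with scalar floating point can be sketched as follows. This is an illustrative C example, not taken from any particular codebase: `brighten_and_mean` and `HAVE_SSE2` are hypothetical names. It performs a saturating 8-bit addition on sixteen bytes per XMM register and then goes straight into scalar floating point arithmetic, with no EMMS instruction in between because no MMX register is touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1      /* hypothetical macro used only in this sketch */
#endif

/* Hypothetical helper: add delta to every byte with unsigned saturation,
   then return the mean of the result as a float. */
float brighten_and_mean(uint8_t *px, size_t n, uint8_t delta)
{
    size_t i = 0;
    float sum = 0.0f;
    assert(px != NULL);
#ifdef HAVE_SSE2
    const __m128i vd = _mm_set1_epi8((char)delta);   /* delta in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(px + i));
        _mm_storeu_si128((__m128i *)(px + i), _mm_adds_epu8(v, vd));
    }
#endif
    /* Scalar tail and fallback with the same saturating behaviour. */
    for (; i < n; ++i) {
        unsigned s = (unsigned)px[i] + delta;
        px[i] = (uint8_t)(s > 255u ? 255u : s);
    }
    /* Scalar floating point follows directly; no EMMS is needed here. */
    for (i = 0; i < n; ++i)
        sum += (float)px[i];
    return n ? sum / (float)n : 0.0f;
}
```

With MMX the equivalent integer loop would have to be followed by EMMS before any x87 code could run safely.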
Other SSE2 extensions include a set of cache-control instructions intended primarily to minimize cache pollution when processing indefinite streams of information, and a sophisticated complement of numeric format conversion instructions.
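Both kinds of extension can be combined in one short sketch. The C example below is an assumption-laden illustration rather than a recommended routine: `convert_stream` and `HAVE_SSE2` are hypothetical names. It converts doubles to 32-bit integers by truncation (`_mm_cvttpd_epi32`) and writes the results with non-temporal stores (`_mm_stream_si128`), which ask the processor to bypass the cache for data that will not be reused soon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1      /* hypothetical macro used only in this sketch */
#endif

/* Hypothetical helper: dst[i] = (int32_t)src[i], truncating toward zero.
   dst must be 16-byte aligned because the streaming store requires it. */
void convert_stream(int32_t *dst, const double *src, size_t n)
{
    size_t i = 0;
    assert(((uintptr_t)dst & 15u) == 0);
#ifdef HAVE_SSE2
    for (; i + 4 <= n; i += 4) {
        /* Each conversion yields two int32 values in the low 64 bits. */
        __m128i lo = _mm_cvttpd_epi32(_mm_loadu_pd(src + i));
        __m128i hi = _mm_cvttpd_epi32(_mm_loadu_pd(src + i + 2));
        /* Pack the two halves and store them without polluting the cache. */
        _mm_stream_si128((__m128i *)(dst + i), _mm_unpacklo_epi64(lo, hi));
    }
    _mm_sfence();   /* make the streaming stores visible before returning */
#endif
    /* Scalar tail and fallback: a C cast also truncates toward zero. */
    for (; i < n; ++i)
        dst[i] = (int32_t)src[i];
}
```

Non-temporal stores tend to help only for large buffers that are written once; for small or soon-reused data an ordinary store is usually the better choice.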
AMD's implementation of SSE2 on the AMD64 platform includes an additional 8 registers, doubling the total number to 16 (XMM0 through XMM15). These additional registers are only visible when running in 64-bit mode. Intel adopted these additional registers as part of their support for AMD64 architecture (renamed EM64T) in 2004.
A notable difference between the x87 FPU and SSE2 appears when a compiler must evaluate a mathematical expression consisting of several operations (addition, subtraction, division, multiplication). Depending on the compiler and the optimizations used, intermediate results of the expression may need to be saved temporarily and reloaded later. On the x87 FPU each such save truncates the value from the 80-bit register format to 64 bits (or to 32 bits for single-precision variables). Depending on where this truncation happens, the final numerical result may differ. SSE2 arithmetic works at the operand's own precision throughout, so no such truncation occurs. The following Fortran code compiled with G95 is offered as an example.
      program hi
      real a,b,c,d
      real x,y,z
      a=.013
      b=.027
      c=.0937
      d=.79
      y=-a/b + (a/b+c)*EXP(d)
      print *,y
      z=(-a)/b + (a/b+c)*EXP(d)
      print *,z
      x=y-z
      print *,x
      end

    # g95 -o hi -mfpmath=387 -fzero -ftrace=full -fsloppy-char hi.for
    # ./hi
     0.78587145
     0.7858714
     5.9604645E-8

    # g95 -o hi -mfpmath=sse -msse2 -fzero -ftrace=full -fsloppy-char hi.for
    # ./hi
     0.78587145
     0.78587145
     0.
The effect is not specific to one platform or programming language. It is just as easily shown elsewhere, for example with the Windows version of G95.
    C:\>g95 -o hi -mfpmath=387 -fzero -ftrace=full -fsloppy-char hi.for
    C:\>hi
     0.78587145
     0.7858714
     5.9604645E-8
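The mechanism behind these differing results can be imitated in C without relying on a particular compiler's code generation. The sketch below is a simplified stand-in for the Fortran program, not a port of it: it replaces `EXP(d)` with a plain multiplication by `d` to stay free of the math library, and the function names are hypothetical. One function rounds every intermediate to a 32-bit `float`, as scalar SSE arithmetic on single-precision operands does. The other keeps intermediates in `long double` and rounds once at the end, as the x87 FPU may do while values stay in registers. Note that `long double` is the 80-bit x87 format only on some platforms and compilers.

```c
#include <assert.h>

/* Every intermediate is rounded to a 32-bit float. The volatile qualifier
   forces each value to be stored at float precision. */
float eval_rounded(float a, float b, float c, float d)
{
    assert(b != 0.0f);
    volatile float q = a / b;
    volatile float s = q + c;
    volatile float p = s * d;
    volatile float r = -q + p;
    return r;
}

/* Intermediates are kept in long double; rounding to float happens once. */
float eval_extended(float a, float b, float c, float d)
{
    assert(b != 0.0f);
    long double q = (long double)a / b;
    long double r = -q + (q + c) * d;
    return (float)r;
}
```

Calling both functions with the inputs from the Fortran example (`0.013f, 0.027f, 0.0937f, 0.79f`) and subtracting the results shows whether the two rounding strategies agree on a given platform. Any difference is at most a few units in the last place, the same order of magnitude as the `5.9604645E-8` printed above.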
The Intel C++ Compiler can automatically generate SSE and SSE2 code without the use of hand-coded assembly, letting programmers focus on algorithmic development instead of assembly-level implementation. Since its introduction, it has greatly increased adoption of SSE2 in Windows application development.
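A loop of the kind that vectorizing compilers can typically turn into packed SSE/SSE2 instructions is sketched below in plain C. The function name `scale_add` is hypothetical, and whether a particular compiler actually vectorizes it depends on the compiler version and the optimization flags in use. The `restrict` qualifiers tell the compiler that the arrays do not overlap, which removes a common obstacle to automatic vectorization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y[i] = a * x[i] + y[i] for n floats.
   No intrinsics or assembly: the compiler is free to emit packed
   single-precision instructions that process four floats at a time. */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    assert(y != NULL && x != NULL);
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Inspecting the generated assembly, or a vectorization report where the compiler offers one, is the usual way to confirm that packed instructions were emitted.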
The following CPUs do not support SSE2.