The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation, and is followed by many CPU and FPU implementations. The standard defines formats for representing floating-point numbers (including ±zero and denormals) and special values (infinities and NaNs) together with a set of floating-point operations that operate on these values. It also specifies four rounding modes and five exceptions (including when the exceptions occur, and what happens when they do occur).
IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (≥ 43-bit, not commonly used) and double-extended precision (≥ 79-bit, usually implemented with 80 bits). Only 32-bit values are required by the standard; the others are optional. Many languages specify that IEEE formats and arithmetic be implemented, although sometimes it is optional. For example, the C programming language, which pre-dated IEEE 754, now allows but does not require IEEE arithmetic (the C float typically is used for IEEE single-precision and double uses IEEE double-precision).
The full title of the standard is IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985), and it is also known as IEC 60559:1989, Binary floating-point arithmetic for microprocessor systems (originally the reference number was IEC 559:1989).
The following is a description of the standard's format for floating-point numbers.
Bits within a word of width W are indexed by integers in the range 0 to W−1 inclusive. The bit with index 0 is drawn on the right. The lowest indexed bit is usually the least significant.
Binary floating-point numbers are stored in sign-magnitude form, with three fields laid out from the most significant bit to the least significant bit:
sign | exponent | mantissa
where the most significant bit is the sign bit, exponent is the biased exponent, and mantissa is the significand minus its most significant bit.
The exponent is biased by 2^(w−1) − 1, where w is the width of the exponent field in bits. Biasing is done because exponents have to be signed values in order to represent both tiny and huge values, but two's complement, the usual representation for signed values, would make comparison harder. To solve this, the exponent is biased before being stored: a constant is added so that the stored value falls within an unsigned range suitable for comparison.
For example, to represent a number which has an exponent of 17, the stored exponent is 17 + 2^(w−1) − 1; in single precision (w = 8) that is 17 + 127 = 144.
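As a minimal sketch of the biasing rule, the C helpers below compute the bias for an exponent field of a given width, 2^(width−1) − 1, which yields the values 127 and 1023 used later in this article, and the stored value for a given true exponent. The function names are illustrative only, and the second helper is meaningful only for normalized numbers.

```c
/* Bias for an exponent field that is `width` bits wide: 2^(width-1) - 1.
   Assumes 1 <= width < the number of bits in an unsigned int. */
unsigned exponent_bias(unsigned width)
{
    return (1u << (width - 1)) - 1u;
}

/* Stored (biased) exponent field for a true exponent.
   Only valid for normalized numbers; the all-0s and all-1s field
   values are reserved for zero/denormals and infinity/NaN. */
unsigned biased_exponent(int true_exponent, unsigned width)
{
    return (unsigned)(true_exponent + (int)exponent_bias(width));
}
```

With an 8-bit field, a true exponent of 17 is stored as 144, and the smallest normalized exponent, −126, is stored as 1.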
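The most significant bit of the significand is not stored; it is determined by the value of exponent. If exponent is neither all 0s nor all 1s, the most significant bit of the significand is 1, and the number is said to be normalized. If exponent is 0, the most significant bit of the significand is 0 and the number is said to be denormalized. Three special cases arise: if exponent is 0 and mantissa is 0, the number is ±0 (depending on the sign bit); if exponent is all 1s and mantissa is 0, the number is ±infinity (again depending on the sign bit); if exponent is all 1s and mantissa is non-zero, the value is not a number (NaN).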
This can be summarized as:
| Type | Exp | Fraction |
|---|---|---|
| Zeroes | 0 | 0 |
| Denormalised numbers | 0 | non zero |
| Normalised numbers | neither 0 nor all 1s | any |
| Infinities | all 1s | 0 |
| NaN | all 1s | non zero |
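The summary can be read as a decision procedure on the two stored fields. The following is a sketch in C that applies it to an already-extracted Exp and Fraction; the enum and function names are hypothetical, and exp_width stands for the exponent field width in bits (8 for single precision, 11 for double precision).

```c
enum ieee_kind {
    KIND_ZERO, KIND_DENORMAL, KIND_NORMAL, KIND_INFINITY, KIND_NAN
};

/* Classify a value from its biased exponent field and its fraction field.
   exp_width is the number of bits in the exponent field (assumed < 32). */
enum ieee_kind classify_fields(unsigned long exp, unsigned long long frac,
                               unsigned exp_width)
{
    unsigned long all_ones = (1UL << exp_width) - 1UL;

    if (exp == 0)                       /* zero or denormalized */
        return frac == 0 ? KIND_ZERO : KIND_DENORMAL;
    if (exp == all_ones)                /* infinity or NaN */
        return frac == 0 ? KIND_INFINITY : KIND_NAN;
    return KIND_NORMAL;                 /* any other exponent, any fraction */
}
```

The sign bit plays no part in the classification, which is why zeroes and infinities each come in a positive and a negative form.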
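A single-precision binary floating-point number is stored in a 32-bit word: 1 sign bit (bit 31), an 8-bit biased exponent Exp (bits 30 to 23) and a 23-bit Fraction (bits 22 to 0).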
The exponent is biased by 127 in this case, so that exponents in the range −126 to +127 are representable. An exponent of −127 would be biased to the value 0, but this is reserved to encode that the value is a denormalized number or zero. An exponent of 128 would be biased to the value 255, but this is reserved to encode an infinity or not a number.
For normalised numbers, the most common, Exp is the biased exponent and Fraction is the fractional part of the significand. The number has value v:
v = s × 2^e × m
where
s = +1 (positive numbers) when the sign bit is 0
s = −1 (negative numbers) when the sign bit is 1
e = Exp − 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")
m = 1.Fraction in binary (that is, the significand is the binary number 1 followed by the radix point followed by the binary bits of Fraction). Therefore, 1 ≤ m < 2.
Notes:
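Let us encode the decimal number −118.625 using the IEEE 754 system. The number is negative, so the sign bit is 1. The magnitude 118.625 is 1110110.101 in binary; moving the radix point left until a single 1 remains before it gives 1.110110101 × 2^6. The Fraction is the part after the radix point, padded with zeros on the right to 23 bits: 11011010100000000000000. The exponent is 6, so Exp = 6 + 127 = 133, which is 10000101 in binary.
Putting them all together, the 32 bits are 1 10000101 11011010100000000000000, or C2ED4000 in hexadecimal.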
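The following C function decodes four bytes, most significant first, into a float:

    #include <math.h>   /* ldexp, NAN, INFINITY (C99) */

    /* The source listing breaks off after the NaN test; the later branches
       are reconstructed from the format rules above. */
    float arrayToFloat(const unsigned char *data)
    {
        unsigned long src, f;
        int s, e;
        double value;

        src = ((unsigned long)data[0] << 24) + ((unsigned long)data[1] << 16)
            + ((unsigned long)data[2] << 8) + (unsigned long)data[3];
        s = (int)((src & 0x80000000UL) >> 31);
        e = (int)((src & 0x7F800000UL) >> 23);
        f = src & 0x007FFFFFUL;

        if (e == 255)                  /* NaN or infinity */
            value = (f != 0) ? NAN : INFINITY;
        else if (e > 0)                /* normalized: restore the hidden bit */
            value = ldexp((double)(f | 0x00800000UL), e - 150);
        else                           /* denormalized or zero */
            value = ldexp((double)f, -149);

        return (float)(s ? -value : value);
    }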
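The worked example can be cross-checked in the other direction. For −118.625 = −1.110110101 (binary) × 2^6, the fields are sign 1, Exp 6 + 127 = 133 and Fraction 11011010100000000000000 (0x6D4000). The sketch below packs three fields into a 32-bit word and, separately, reads back the bit pattern the compiler stores for a float. Both function names are hypothetical, and float_bits assumes that float is the IEEE 754 single-precision format, which C allows but does not require.

```c
#include <stdint.h>
#include <string.h>

/* Assemble a single-precision bit pattern from its three fields:
   1 sign bit, 8-bit biased exponent, 23-bit fraction. */
uint32_t pack_single(uint32_t sign, uint32_t exp, uint32_t frac)
{
    return ((sign & 1u) << 31) | ((exp & 0xFFu) << 23) | (frac & 0x007FFFFFu);
}

/* Return the raw bits of a float.
   Assumption: float is a 32-bit IEEE 754 single-precision value. */
uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* copy the bytes without converting the value */
    return bits;
}
```

On a platform where the assumption holds, pack_single(1, 133, 0x6D4000) and float_bits(-118.625f) both give 0xC2ED4000.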
Double precision is essentially the same except that the fields are wider: a 64-bit word holds 1 sign bit, an 11-bit Exp and a 52-bit Fraction.
The mantissa is much larger, while the exponent is only slightly larger. This is because precision is more valued than range, according to the creators of the standard.
NaNs and Infinities are represented with Exp being all 1s (2047).
For normalised numbers the exponent bias is +1023 (so e is Exp − 1023). For denormalised numbers the exponent is −1022, the minimum exponent for a normalised number; it is not −1023 because normalised numbers have a leading 1 digit before the binary point and denormalised numbers do not. As before, both infinity and zero are signed.
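A short sketch in C of the double-precision rules: it splits a double into its sign, 11-bit Exp and 52-bit Fraction, then applies the exponent rule just described (Exp − 1023 for normalised numbers, a fixed −1022 for denormalised ones). The struct and function names are hypothetical, and the code assumes that double is the 64-bit IEEE 754 format on the target platform.

```c
#include <stdint.h>
#include <string.h>

struct double_fields {
    unsigned sign;   /* 1 bit */
    unsigned exp;    /* 11-bit biased exponent, 0..2047 */
    uint64_t frac;   /* 52-bit fraction */
};

/* Split a double into its stored fields.
   Assumption: double is a 64-bit IEEE 754 double-precision value. */
struct double_fields split_double(double x)
{
    uint64_t bits;
    struct double_fields f;

    memcpy(&bits, &x, sizeof bits);
    f.sign = (unsigned)(bits >> 63);
    f.exp  = (unsigned)((bits >> 52) & 0x7FFu);
    f.frac = bits & 0x000FFFFFFFFFFFFFULL;
    return f;
}

/* Exponent e to use in the value formula.
   Not meaningful when exp is 2047 (infinity or NaN). */
int true_exponent(struct double_fields f)
{
    return f.exp == 0 ? -1022 : (int)f.exp - 1023;
}
```

Note that Exp values 0 and 1 both map to e = −1022; what changes between them is whether the significand has a leading 0 or a leading 1.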
Notes:
Comparing floating-point numbers is usually best done using floating-point instructions. However, this representation makes comparisons of some subsets of numbers possible on a byte-by-byte basis, if they share the same byte order and the same sign, and NaNs are excluded.
For example, for two positive numbers a and b, a < b is true whenever the unsigned binary integers with the same bit patterns and the same byte order as a and b are also ordered a < b. In other words, two positive floating-point numbers (known not to be NaNs) can be compared with an unsigned binary integer comparison using the same bits, provided the floating-point numbers use the same byte order (this ordering, therefore, cannot be used in portable code through a union in the C programming language). This is an example of lexicographic ordering.
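The integer comparison can be sketched in C as follows. The function name is hypothetical, and the code rests on three assumptions: float is IEEE 754 single precision, floats and 32-bit integers share the same byte order on the platform, and the caller has already ensured that both arguments are positive and not NaN. It copies the bytes with memcpy rather than going through a union, but it is no more portable than the assumptions it rests on.

```c
#include <stdint.h>
#include <string.h>

/* Return non-zero if a < b, comparing the raw bit patterns as unsigned integers.
   Preconditions (not checked): a and b are positive and neither is a NaN. */
int less_than_positive(float a, float b)
{
    uint32_t ua, ub;

    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return ua < ub;   /* larger exponent, then larger fraction, compares greater */
}
```

The ordering works because the biased exponent occupies more significant bits than the fraction, so a larger exponent always dominates; it also orders denormalised numbers below normalised ones.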
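The IEEE standard has four different rounding modes: round to nearest (with ties going to the even value; this is the default), round toward 0, round toward +∞ and round toward −∞.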
Note that the IEEE 754 standard is currently under revision. See: IEEE 754r
This article is licensed under the GNU Free Documentation License. It uses material from the article "IEEE floating-point standard".