For the IEEE binary floating-point standard, see its page.
Floating-point is a way of representing a subset of the real numbers in terms of an exponent, which is an integer, and a significand, which is a string of some chosen number of digits. Floating-point representation is very common because of computers, but it has its roots in scientific notation.
As an example, the revolution period of Jupiter's moon Io could be represented in scientific notation as 1.528535047 × 105 seconds. The string of digits "1528535047" is the significand, and the exponent is 5.
Now this could be represented as any of
The main benefit of scientific notation is that it makes representations like the last one unnecessary, by allowing the decimal point to be put in a convenient place. True floating-point notation uses a precise specification that the point is always just to the right of the leftmost digit of the significand, so the correct representation is
This, plus the requirement that the leftmost digit of the significand be nonzero, is called normalization. By doing this, one no longer needs to say where the point is; it is deduced from the exponent. In decimal floating-point notation with precision of 10, the revolution period of Io is simply
When a (nonzero) floating-point number is normalized, its leftmost digit is nonzero. The value of the significand obeys 1 ≤ s < b. (Zero can't be represented in normalized floating-point notation, and computers have to deal with it separately.)
The mathematical value of a floating-point number is
In binary radix, the significand is a string of bits (1's and 0's) of length p, of which the leftmost bit is 1. The number π, represented in binary, is
(There is no reason to write the exponent in any form other than a plain decimal number. When floating-point numbers are stored in a computer, a suitable binary encoding is used.)
However, there is a common belief that
Double as well as single precision floating point representations, in fact, carry extreme amounts of precision (even if we accept them as approximations). However, repeated operations of multiplication, division, addition, and, most importantly of all, subtraction, accumulate very large amounts of error. It is these floating point arithmetic operations that are the real source of imprecision, and it is they that make the very large representations necessary and even insufficient.
For example, one million repeated multiplications of numbers which have 7 significant decimal digits (ie IEEE single precision) will, on average, produce a final result accurate to just 4. In the worst case, a result may be accurate to just 1 significant decimal digit. Worst of all, it will be impossible to tell just how accurate the result turned out to be! Given that a modern processor can compute in the magnitude of 10 billion floating point operations per second, even double-precision numbers may not be safe in regard to multiplication, division, and addition.
However, the worst arithmetic operation to give imprecision is subtraction between two numbers of near-equal value. For example, 1.234567e0 - 1.234888e0 results, in case of decimal representations, in -3.210000e-4. Each one of those zeroes represents a digit whose value is unknown. In just a single operation, the precision of arithmetic has been halfed!
Other operations, such as powers, square roots, trigonometric function, etc. carry various amounts of error and risk.
Vital in avoiding the pitfalls of floating point numbers is the choice of algorithm. Different algorithms, computing the same problem, may have various degrees of numerical instability.
Lastly, there is a very important caveat in trying to employ floating-point numbers in "infinitely exact" arithmetic (although they are as capable of it as integers). It stems from the fact that floating point numbers are in fact binary fractions, while we commonly use decimals. Thus, 0.5 multiplied by 0.5 will exactly equal 0.25, with no caveats or gotchas. Comparisons for equality may be safely made, just as one is accustomed to do with integers. However, 0.1*0.1 will not equal 0.01, primarily because there isn't actually a way of representing 0.1 or 0.01 exactly as a binary fraction. Nor will the 0.1 approximation multiplied by itself equal the 0.01 approximation. Note, this is like adding 1/3 (0.333) to itself and getting 0.666 instead of 0.667 (2/3). Thus, in most cases, if one is trying to perform comparisons of arithmetic results, even those which seemingly are safely bound by the precision of the format, one must take the absolute value of the difference between them and compare it against a small number. (ie, in C, instead of if(a==b), write if(abs(a-b)<0.00001)).
(This section describes some general issues, but mostly follows the IEEE standard.)
To represent a floating-point number in a computer datum, the exponent has to be encoded into a bit field. Since the exponent could be negative, one could use two's complement representation. Instead, what is commonly done is something very similar: a fixed constant is added to the exponent, with the intention that the result be a positive number capable of being packed into a fixed bit field. For the common 32 bit "single precision" or "float" format of the IEEE standard, this constant is 127, so the exponent is said to be represented in "excess 127" format. The result of this addition is placed in an 8 bit field.
Since the leftmost significand bit of a (normalized) floating-point number is always 1, that bit is not actually placed in the computer datum. The computer's arithmetic hardware acts as though that "1" had been provided. This is the "implicit bit" or "hidden bit" of the IEEE standard. Because of this, a 24 bit significand is actually placed in a 23 bit field.
Finally, a sign bit is required. This is one to indicate that the entire floating-point number is negative, or zero to indicate that it is positive. (In the past, some computers have used a kind of two's complement encoding for the entire number, rather than simple "sign/magnitude" format, but it didn't work very well.)
The entire floating-point number is packed into a 32 bit word, with the sign bit leftmost, followed by the exponent in excess 127 format in the next 8 bits, followed by the significand (without the hidden bit) in the rightmost 23 bits.
For the approximation to π, we have
In the common 64 bit "double precision" or "double" format of the IEEE standard, the offset added to the exponent is 1023, and the result is placed into an 11 bit field. The precision is 53 bits. After removal of the hidden bit, 52 bits remain. The result comprises 1+11+52=64 bits. The approximation to π is
The necessity to pack the offset exponent into a fixed-size bit field places limits on the exponent. For the standard 32 bit format, e+127 must fit into an 8 bit field, so -127 ≤ e ≤ 128. The values -127 and +128 are reserved for special meanings, so the actual range for normalized floating-point numbers is -126 ≤ e ≤ 127. This means that the smallest normalized number is
Any floating-point computation that gives a result (after rounding to a representable value) higher than the upper limit is said to overflow. Under the IEEE standard, such result is set to a special value "infinity", which has the appropriate sign bit, the reserved exponent +128, and a bit pattern in the significand (typically zero) to indicate that this is infinity. Such numbers are generally printed as "+INF" or "-INF".
Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as
A floating-point computation that (after rounding) gives a nonzero result lower than the lower limit is said to underflow. This could happen, for example, if 10-25 is multiplied by 10-25 in single precision. Under the IEEE standard, the reserved exponent -127 is used, and the significand is set as follows:
First, if the number is zero, it is represented by an exponent of -127 and a significand field of all zeros. This means that zero is represented in hexadecimal as 00000000.
Otherwise, if normalizing the number would lead to an exponent of -127 or less, it is only normalized until the exponent is -127. That is, instead of shifting the significand bits left until the leftmost bit is 1, they are shifted until the exponent reaches -127. For example, the smallest non-underflowing number is
The smallest possible (subnormal) nonzero number is
The handling of the number zero can be seen to be a completely ordinary case of a subnormal number.
The creation of denorms is often called "gradual underflow". As numbers get extremely small, significand bits are slowly sacrificed. The alternative is "sudden underflow", in which any number that can't be normalized is simply set to zero by the computer hardware. Gradual underflow is difficult for computer hardware to handle, so hardware often uses software to assist it, through interrupts. This creates a serious performance penalty, and sometimes sudden underflow is used instead. This is a source of controversy.
The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.
The arithmetical distance between two consecutive representable floating point numbers is called an "ULP", for Unit in the Last Place. For example, the numbers represented by 45670123 and 45670124 hexadecimal is one ULP. An ULP is about 10-7 in single precision, and 10-16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within 1/2 of an ULP.
The intention of the INF and NaN values is that, under the most common circumstances, they can just propagate from one operation to the next (any operation with NaN as an operand produces NaN as a result), and they only need to be attended to at a point that the programmer chooses.
In addition to the creation of exceptional values, there are "events" that may occur, though some of them are quite benign:
Computer hardware is typically able to raise exceptions ("traps") when these events occur. How this is done is very system-dependent. Usually all exceptions are masked (disabled). Sometimes overflow, zerodivide, and operand error are enabled.
While floating point addition and multiplication are commutative, they are not associative. This means that in general for floating point numbers x, y, and z:
Floating point arithmetic is also not distributive. This means that in general:
For example, with most floating-point implementations
The reason is that, in the first case, the initial subtraction gives zero, since that is the true result and is representable, and 1 can be added to it successfully. In the second case, the floating point approximation to 1e100, when 1 is added, results in the same approximation, so the subtraction yields zero.
In short, the order in which operations are carried out can change the output of a floating point calculation. This is important in numerical analysis since two mathematically equivalent formulas may not produce the same numerical output, and one may be substantially more accurate than the other.
In addition to careful design of programs, careful handling by the compiler is essential. Certain "optimizations" that compilers might make (for example, reordering operations) will lead to errors. There is much controversy about the failings of compilers and language designs in this area.
Floating point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/ The "decimal" data type of the C# and Java programming languages, and the IEEE 854 standard, are designed to avoid the problems of binary floating point, and make the arithmetic always behave as expected when numbers are printed in decimal.
The general types of things that can cause trouble are:
Unlike the fixed-point counterpart, the application of dither in a floating point environment is nearly impossible. See external references for more information about the difficulty of applying dither and the rounding error problems in floating point systems.
The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM Mainframes, which support IBM's own format (in addition to IEEE 754 data types), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.
The standard allows for many different precision levels, of which the 32 bit ("single") and 64 bit ("double") are by far the most common, since they are supported in common programming languages. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit format, with 15 exponent bits and 64 significand bits, with no hidden bit. Software vendors may also provide additional extended formats, such as the H-P "quad" format (1 sign bit, 15 exponent bits, and 113 significand bits, 1 of which is hidden.) There is controversy about the failure of programming languages to make these hardware formats available to programmers.
As of 2000, the IEEE 754 standard is currently under revision. See: IEEE 754r
Stage Sides Inscribed (2I + C)/3 Circumscribed 1 6 3.00000000000000000 3.15470053837925153 3.46410161513775459 2 12 3.10582854123024915 3.14234913054465692 3.21539030917347248 3 24 3.13262861328123820 3.14163905621999229 3.15965994209750048 4 48 3.13935020304686720 3.14159554040838980 3.14608621513143498 5 96 3.14103195089050961 3.14159283380879585 3.14271459964536834 6 192 3.14145247228546194 3.14159266485024924 3.14187304997982385 7 384 3.14155760791185764 3.14159265429352108 3.14166274705684795 8 768 3.14158389214831805 3.14159265363377626 3.14161017660469268 9 1536 3.14159046322805003 3.14159265359253476 3.14159703432150421 10 3072 3.14159210599928383 3.14159265358995624 3.14159374877130105 11 6144 3.14159251669204966 3.14159265358971935 3.14159292738505874 12 12288 3.14159261936512077 3.14159265358966492 3.14159272203875323 13 24576 3.14159264503460936 3.14159265358907182 3.14159267069799674 14 49152 3.14159265144937596 3.14159265359313918 3.14159265788066561 15 98304 3.14159265303352628 3.14159265355780571 3.14159265460636456 16 196608 3.14159265332534345 3.14159265356534707 3.14159265404535431 17 393216 3.14159265332534345 3.14159265308533982 3.14159265260533258 18 786432 3.14159265332534345 3.14159265178665197 3.14159264870926902 19 1572864 3.14159265599338609 3.14159265130664472 3.14159264193316198 20 3145728 3.14159264532121550 3.14159264645056668 3.14159264870926902 21 6291456 3.14159264532121550 3.14159264419186433 3.14159264193316198 22 12582912 3.14159298683065634 3.14159264645055406 3.14159196569034949 23 25165824 3.14159503588652176 3.14159423790206850 3.14159264193316198 24 50331648 3.14159776795893004 3.14159219110183233 3.14158103738763692
Second, form b) for both: Stage Sides Inscribed (2I + C)/3 Circumscribed 1 6 3.00000000000000000 3.15470053837925153 3.46410161513775459 2 12 3.10582854123024915 3.14234913054465692 3.21539030917347248 3 24 3.13262861328123820 3.14163905621999229 3.15965994209750048 4 48 3.13935020304686721 3.14159554040838980 3.14608621513143497 5 96 3.14103195089050964 3.14159283380879586 3.14271459964536830 6 192 3.14145247228546208 3.14159266485024934 3.14187304997982387 7 384 3.14155760791185765 3.14159265429352127 3.14166274705684853 8 768 3.14158389214831841 3.14159265363377545 3.14161017660468954 9 1536 3.14159046322805010 3.14159265359254211 3.14159703432152615 10 3072 3.14159210599927155 3.14159265358996504 3.14159374877135203 11 6144 3.14159251669215745 3.14159265358980398 3.14159292738509703 12 12288 3.14159261936538396 3.14159265358979391 3.14159272203861382 13 24576 3.14159264503369090 3.14159265358979328 3.14159267070199805 14 49152 3.14159265145076765 3.14159265358979324 3.14159265786784442 15 98304 3.14159265305503684 3.14159265358979324 3.14159265465930603 16 196608 3.14159265345610414 3.14159265358979324 3.14159265385717144 17 393216 3.14159265355637096 3.14159265358979324 3.14159265365663779 18 786432 3.14159265358143767 3.14159265358979324 3.14159265360650438 19 1572864 3.14159265358770435 3.14159265358979324 3.14159265359397102 20 3145728 3.14159265358927102 3.14159265358979324 3.14159265359083769 21 6291456 3.14159265358966268 3.14159265358979324 3.14159265359005435 22 12582912 3.14159265358976060 3.14159265358979324 3.14159265358985852 23 25165824 3.14159265358978508 3.14159265358979324 3.14159265358980956 24 50331648 3.14159265358979120 3.14159265358979324 3.14159265358979732
4*atan(1) = 3.14159265358979324 Good digits: 3.14159265358979323846264338327950288
Data types | Computer arithmetic
Gleitkommazahl | Coma flotante | Virgule flottante | 부동소수점 | Virgola mobile | נקודה צפה | Real (informatica) | 浮動小数点数 | Liczba zmiennoprzecinkowa | Vírgula flutuante | Pohyblivá desatinná čiarka | Liukuluku | Flyttal | 浮点数
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Floating point".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world