The term computer numbering formats refers to the schemes implemented in digital computer and calculator hardware and software to represent numbers. A common mistake made by non-specialist computer users is a certain faith in the infallibility of numerical computations.
For example, if one multiplies (1/3) × 3, one might perhaps expect to get a result of exactly 1, which is the correct answer when applying an exact rational number or algebraic model. In practice, however, the result on a digital computer or calculator may prove to be something such as 0.9999999999999999 (as one might find when doing the calculation on paper) or, in certain cases, perhaps 0.99999999923475.
The latter result seems to indicate a bug, and it is a surprise to many that this is the way it is designed to work if you use a binary floating-point approximation. Decimal floating-point, computer algebra systems, and certain bignum systems would give either the answer of 1 or 0.9999999999999999...
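The difference between binary and decimal floating-point can be seen in a minimal Python sketch. It assumes a standard Python build in which `float` is a binary double-precision value; `decimal` is a standard-library module, and the values 0.1, 0.2 and 1/3 are chosen here only for illustration.

```python
from decimal import Decimal

# Binary floating-point: 0.1 and 0.2 have no exact binary representation,
# so their sum is not exactly the value stored for 0.3.
binary_sum = 0.1 + 0.2
binary_exact = (binary_sum == 0.3)

# Decimal floating-point: the same sum is exact...
decimal_sum = Decimal("0.1") + Decimal("0.2")
decimal_exact = (decimal_sum == Decimal("0.3"))

# ...but 1/3 still has to be rounded, so multiplying back by 3
# gives a string of nines rather than exactly 1.
decimal_third_times_three = Decimal(1) / Decimal(3) * 3

print(binary_exact)
print(decimal_exact)
print(decimal_third_times_three)
```

Neither result is a bug: each format rounds to the nearest value it can store, and the two formats simply differ in which values those are.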
Almost all computer users understand the concept of a bit (that is, a 1 or 0 value encoded by the setting of a switch of some kind). A single bit can represent two states: 0 1
Therefore, if you take two bits, you can use them to represent four unique states:
00 01 10 11
And, if you have three bits, then you can use them to represent eight unique states:
000 001 010 011 100 101 110 111
With every bit you add, you double the number of states you can represent. Therefore, the expression for the number of states with n bits is 2^n. Most computers operate on information in groups of 8 bits, or some other power of two bits (such as 16, 32, or 64 bits) at a time. A group of 8 bits is now widely used as a fundamental unit, and has been given the name of octet. A computer's smallest addressable memory unit (a byte) is typically an octet, so the word byte is now generally understood to mean an octet.
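The doubling rule can be checked by enumerating the patterns directly. This is a small sketch; the helper name `bit_patterns` is hypothetical and introduced only for illustration.

```python
from itertools import product

def bit_patterns(n):
    """Return every n-bit pattern as a string, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n)]

three_bit = bit_patterns(3)
print(three_bit)

# An octet (8 bits) should have 2**8 distinct patterns.
print(len(bit_patterns(8)) == 2 ** 8)
```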
A unit of four bits, or half an octet, is often called a nibble (or nybble). It can encode 16 different values, such as the numbers 0 to 15. Any arbitrary sequence of bits could be used in principle, but in practice the most common scheme is:

| binary | decimal | binary | decimal |
|---|---|---|---|
| 0000 | 0 | 1000 | 8 |
| 0001 | 1 | 1001 | 9 |
| 0010 | 2 | 1010 | 10 |
| 0011 | 3 | 1011 | 11 |
| 0100 | 4 | 1100 | 12 |
| 0101 | 5 | 1101 | 13 |
| 0110 | 6 | 1110 | 14 |
| 0111 | 7 | 1111 | 15 |

This order (rather than Gray code) is used because it is a positional notation, like the decimal notation that humans are more used to. For example, given the decimal number:
is commonly interpreted as:
or, using powers-of-10 notation:
(Note that any non-zero number to the power zero is 1.)
Each digit in the number represents a value from 0 to 9 (hence ten different possible values) which is why this is called a decimal or base-10 number. Each digit also has a weight of a power of ten associated with its position.
Similarly, in the binary number encoding scheme mentioned above, the (decimal) value 13 is encoded as:
1101
Each bit can only have a value of 1 or 0 (hence only two possible values) so this is a binary, or base-2 number. Accordingly, the positional weighting is as follows:
1101 = (1 × 2^3) + (1 × 2^2) + (0 × 2^1) + (1 × 2^0) = (1 × 8) + (1 × 4) + (0 × 2) + (1 × 1) = 13 decimal
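Positional weighting works the same way in any base, which a short sketch can make concrete. The function name `positional_value` is hypothetical, and the decimal number 7531 is an illustrative value chosen here, not one taken from the text.

```python
def positional_value(digits, base):
    """Sum digit * base**position, with position 0 at the rightmost digit."""
    total = 0
    for position, digit in enumerate(reversed(digits)):
        total += digit * base ** position
    return total

binary_example = positional_value([1, 1, 0, 1], 2)     # the bits of 1101
decimal_example = positional_value([7, 5, 3, 1], 10)   # illustrative decimal digits

print(binary_example)
print(decimal_example)
```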
Notice the values of powers of 2 used here: 1, 2, 4, 8. Experienced computer programmers generally know the powers of 2 up to the 16th power because they use them often:

| power | value | power | value |
|---|---|---|---|
| 2^0 | 1 | 2^8 | 256 |
| 2^1 | 2 | 2^9 | 512 |
| 2^2 | 4 | 2^10 | 1,024 |
| 2^3 | 8 | 2^11 | 2,048 |
| 2^4 | 16 | 2^12 | 4,096 |
| 2^5 | 32 | 2^13 | 8,192 |
| 2^6 | 64 | 2^14 | 16,384 |
| 2^7 | 128 | 2^15 | 32,768 |
| | | 2^16 | 65,536 |

Sometimes, in this context (and unlike in the SI, the International System of Units), the value 2^10 = 1,024 is referred to as Kilo, or simply K (sometimes referred to as a kibibyte), so any higher powers of 2 are often conveniently referred to as multiples of that value:

| power | multiple | value |
|---|---|---|
| 2^11 | 2 K | 2,048 |
| 2^12 | 4 K | 4,096 |
| 2^13 | 8 K | 8,192 |
| 2^14 | 16 K | 16,384 |
| 2^15 | 32 K | 32,768 |
| 2^16 | 64 K | 65,536 |

Similarly, the value 2^20 = 1,024 × 1,024 = 1,048,576 is referred to as a Meg, or simply M (sometimes referred to as a mebibyte):
2^21 = 2 M
2^22 = 4 M
and the value 2^30 is referred to as a Gig, or simply G (sometimes referred to as a gibibyte).
However, in December 1998 the International Electrotechnical Commission produced new units for these power-of-two values, in order to bring prefixes such as kilo- and mega- back to their SI definitions. (See Binary prefix.)
(There is another subtlety in this discussion. If we use 16 bits, we can have 65,536 different values, but the values are from 0 to 65,535. Humans start counting at one, machines start counting from zero, since it is easier to program them this way. This detail often confuses.)
The binary scheme just outlined defines a simple way to count with bits, but it has a few restrictions:
Despite these limitations, such unsigned integer numbers are very useful in computers for counting things one-by-one. They are very simple for the computer to manipulate.
See also Base 64.
Octal and hexadecimal are convenient ways to represent binary numbers, as used by computers. People who work with computers often need to write out binary quantities, but in practice writing out a binary number such as 1001001101010001 is tedious and prone to errors. Therefore, binary quantities are written in a base-8 ("octal") or, much more commonly, a base-16 ("hexadecimal" or "hex") number format.
In the decimal system, there are 10 digits (0 through 9) which combine to form numbers as follows:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ...
In an octal system, there are only 8 digits (0 through 7):
0 1 2 3 4 5 6 7 10 11 12 13 14 15 16 17 20 21 22 23 24 25 26 ...
That is, an octal "10" is the same as a decimal "8", an octal "20" is a decimal 16, and so on.
In a hex system, there are 16 digits (0 through 9 followed, by convention, with A through F):
0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 ...
That is, a hex "10" is the same as a decimal "16" and a hex "20" is the same as a decimal "32".
octal 756 = (7 × 8^2) + (5 × 8^1) + (6 × 8^0) = (7 × 64) + (5 × 8) + (6 × 1) = 448 + 40 + 6 = decimal 494
hex 3b2 = (3 × 16^2) + (11 × 16^1) + (2 × 16^0) = (3 × 256) + (11 × 16) + (2 × 1) = 768 + 176 + 2 = decimal 946
Thus, an octal digit has a perfect correspondence to a 3-bit binary number:

| binary | octal |
|---|---|
| 000 | 0 |
| 001 | 1 |
| 010 | 2 |
| 011 | 3 |
| 100 | 4 |
| 101 | 5 |
| 110 | 6 |
| 111 | 7 |

Similarly, a hex digit has a perfect correspondence to a 4-bit binary number:

| binary | hex | binary | hex |
|---|---|---|---|
| 0000 | 0 | 1000 | 8 |
| 0001 | 1 | 1001 | 9 |
| 0010 | 2 | 1010 | a |
| 0011 | 3 | 1011 | b |
| 0100 | 4 | 1100 | c |
| 0101 | 5 | 1101 | d |
| 0110 | 6 | 1110 | e |
| 0111 | 7 | 1111 | f |

So it is easy to convert a long binary number, such as 1001001101010001, to octal:
001 001 001 101 010 001 binary = 1 1 1 5 2 1 = 111521 octal
and easier to convert that number to hex:
1001 0011 0101 0001 binary = 9 3 5 1 = 9351 hexadecimal
but it is harder to convert it to decimal (37713).
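The grouping procedure described above can be sketched in a few lines of Python. The helper name `regroup` is hypothetical; it pads the bit string on the left so that it divides evenly into groups of three (octal) or four (hex) bits, then translates each group into one digit.

```python
def regroup(bits, group_size):
    """Convert a binary string to octal (group_size=3) or hex (group_size=4)
    by left-padding with zeros and translating each group to one digit."""
    digits = "0123456789abcdef"
    padding = (-len(bits)) % group_size
    padded = "0" * padding + bits
    groups = [padded[i:i + group_size] for i in range(0, len(padded), group_size)]
    return "".join(digits[int(group, 2)] for group in groups)

value = "1001001101010001"
as_octal = regroup(value, 3)
as_hex = regroup(value, 4)
as_decimal = int(value, 2)  # built-in conversion, for comparison

print(as_octal, as_hex, as_decimal)
```

No arithmetic on the whole number is needed for the octal and hex forms, which is why they are convenient shorthand for binary, while the decimal form requires a full conversion.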
Conversion of numbers from hex or octal to decimal can also be done by using the following pattern.
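( ... ((d1 * base + d2) * base + d3) ... ) * base + dn
Here the first digit of the number is multiplied by the number's base and added to the second digit; that result is multiplied by the base and added to the third digit, and so on. To convert numbers with three digits or more, the pattern is simply continued until the last digit has been added.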
Examples of this are shown below.
hex A1
d1=A(or decimal 10) d2=1 base=16
d1 * base + d2
10 * 16 + 1= decimal 161
hex 129
d1=1 d2=2 d3=9 base=16
(d1 * base + d2) * base + d3= (1 * 16 + 2) * 16 + 9= decimal 297
The same method can be applied to conversion of octal and binary numbers:
binary 1011
d1=1 d2=0 d3=1 d4=1 base=2
((d1 * base + d2) * base + d3) * base + d4=
((1 * 2 + 0) * 2 + 1) * 2 + 1= decimal 11
octal 1232
d1=1 d2=2 d3=3 d4=2 base=8
(((d1 * base) + d2) * base + d3) * base + d4=
((1 * 8 + 2) * 8 + 3) * 8 + 2= decimal 666
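The nested multiply-and-add pattern used in these examples translates directly into a loop. The following is a sketch with a hypothetical function name, `to_decimal`; Python's built-in `int(text, base)` performs the same conversion and is used here only as a cross-check.

```python
def to_decimal(text, base):
    """Evaluate a digit string in the given base using the nested pattern
    (d1 * base + d2) * base + d3 ..."""
    digits = "0123456789abcdef"
    value = 0
    for char in text.lower():
        digit = digits.index(char)  # raises ValueError for unknown characters
        if digit >= base:
            raise ValueError(f"digit {char!r} is not valid in base {base}")
        value = value * base + digit
    return value

print(to_decimal("A1", 16))
print(to_decimal("129", 16))
print(to_decimal("1011", 2))
print(to_decimal("1232", 8))
```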
Binary numbers have no inherent way of representing negative numbers in a computer. In order to create these "signed integers", a few different systems have been developed. In each, a special bit is set aside as the "sign bit", which is usually the leftmost (most significant) bit. If the sign bit is 1 the number is negative; if 0, positive.
A side effect of both this system (ones' complement, as the example shows) and the previous one is that there are two representations for zero, which is one of the reasons neither is well suited to computing:
0000 = +0
1111 = -0
Thus:

| binary | decimal | binary | decimal |
|---|---|---|---|
| 0000 | 0 | 1000 | -8 |
| 0001 | 1 | 1001 | -7 |
| 0010 | 2 | 1010 | -6 |
| 0011 | 3 | 1011 | -5 |
| 0100 | 4 | 1100 | -4 |
| 0101 | 5 | 1101 | -3 |
| 0110 | 6 | 1110 | -2 |
| 0111 | 7 | 1111 | -1 |

Using this system, 16 bits will encode numbers from -32,768 to 32,767, while 32 bits will encode -2,147,483,648 to 2,147,483,647.
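The 4-bit table and the 16-bit range quoted above match the scheme commonly called two's complement. A minimal sketch follows, with hypothetical helper names, assuming the leftmost bit of the pattern is the sign bit.

```python
def encode_twos_complement(value, bits):
    """Return the bit pattern for a signed integer, or raise if out of range."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise OverflowError(f"{value} does not fit in {bits} bits")
    # Masking maps negative values onto the upper half of the unsigned range.
    return format(value & ((1 << bits) - 1), f"0{bits}b")

def decode_twos_complement(pattern):
    """Interpret a bit string as a signed integer."""
    raw = int(pattern, 2)
    # A leading 1 means the value lies 2**bits below its unsigned reading.
    return raw - (1 << len(pattern)) if pattern[0] == "1" else raw

print(encode_twos_complement(-3, 4))
print(decode_twos_complement("1000"))
print(decode_twos_complement("1" + "0" * 15), decode_twos_complement("0" + "1" * 15))
```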
Fixed-point formats are often used in business calculations (such as with spreadsheets or COBOL), where floating-point with insufficient precision is unacceptable when dealing with money. It is helpful to study them to see how fractions can be stored in binary.
A number of bits sufficient for the precision and range required must be chosen to store the fractional and integer parts of a number. For example, using a 32-bit format, 16 bits might be used for the integer and 16 for the fraction.
The fractional bits continue the pattern set by the integer bits: if the eight's bit is followed by the four's bit, then the two's bit, then the one's bit, then of course the next bit is the half's bit, then the quarter's bit, then the 1/8's bit, et cetera.
Examples:

| value | integer bits | fractional bits |
|---|---|---|
| 0.5 = 1/2 | 00000000 00000000 | 10000000 00000000 |
| 1.25 = 1 1/4 | 00000000 00000001 | 01000000 00000000 |
| 7.375 = 7 3/8 | 00000000 00000111 | 01100000 00000000 |

However, using this form of encoding means that some numbers cannot be represented exactly in binary. For example, for the fraction 1/5 (in decimal, this is 0.2), the closest one can get is:
13107 / 65536 = 00000000 00000000.00110011 00110011 = 0.1999969... in decimal
13108 / 65536 = 00000000 00000000.00110011 00110100 = 0.2000122... in decimal
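The 16.16 layout used in these examples can be sketched by storing each value as an integer count of 1/65536 units. The names `SCALE`, `to_fixed` and `fixed_to_bits` are hypothetical, and the sketch handles non-negative values only.

```python
SCALE = 1 << 16  # 16 fractional bits, so one unit is 1/65536

def to_fixed(value):
    """Round a non-negative number to the nearest 16.16 fixed-point integer."""
    return round(value * SCALE)

def fixed_to_bits(raw):
    """Show the 32 stored bits with a point between integer and fraction."""
    text = format(raw, "032b")
    return f"{text[0:8]} {text[8:16]}.{text[16:24]} {text[24:32]}"

print(fixed_to_bits(to_fixed(7.375)))

# 0.2 is not a multiple of 1/65536, so it is stored as the nearest one.
stored = to_fixed(0.2)
print(stored, stored / SCALE)
```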
And even with more digits, an exact representation is impossible. Consider the number 1/3. If you were to write the number out as a decimal (0.333333...) it would continue indefinitely. If you were to stop at any point, the number written would not exactly represent the number 1/3.
The point is: some fractions cannot be expressed exactly in binary notation... not unless you use a special trick. The trick is to store a fraction as two numbers, one for the numerator and one for the denominator, and then use arithmetic to add, subtract, multiply, and divide them. However, this arithmetic will not let you do higher math (such as square roots) with fractions, nor will it help you if the lowest common denominator of two fractions is too big a number to handle. This is why there are advantages to using the fixed-point notation for fractional numbers.
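Python's standard-library `fractions.Fraction` type is one existing implementation of this numerator-and-denominator trick, and it also shows the limitation just mentioned. This is a sketch with illustrative values only.

```python
from fractions import Fraction
import math

third = Fraction(1, 3)
fifth = Fraction(1, 5)

total = third + fifth        # exact rational addition
product = third * 3          # exact rational multiplication

# A square root is generally not rational, so the result falls back
# to an approximate floating-point value.
root = math.sqrt(third)

print(total, product, type(root).__name__)
```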
While both unsigned and signed integers are used in digital systems, even a 32-bit integer is not enough to handle all the range of numbers a calculator can handle, and that's not even including fractions. To approximate the greater range and precision of real numbers we have to abandon signed integers and fixed-point numbers and go to a "floating-point" format.
In the decimal system, we are familiar with floating-point numbers of the form:
1.1030402 × 10^5
or, more compactly:
1.1030402E5
which means "1.1030402 times 1 followed by 5 zeroes". We have a certain numeric value (1.1030402) known as a "significand", multiplied by a power of 10 (E5, meaning 10^5 or 100,000), whose power is known as the "exponent". If we have a negative exponent, that means the number is multiplied by a 1 that many places to the right of the decimal point. For example:
The advantage of this scheme is that by using the exponent we can get a much wider range of numbers, even if the number of digits in the significand, or the "numeric precision", is much smaller than the range. Similar binary floating-point formats can be defined for computers. There are a number of such schemes; the most popular has been defined by the IEEE (Institute of Electrical and Electronics Engineers, a US-based professional and standards organization). The IEEE 754 standard defines a 64-bit floating-point format with a sign bit, an 11-bit exponent stored in "excess-1023" format, and a 52-bit significand.
Let's see what this format looks like by showing how such a number would be stored in 8 bytes of memory:
byte 0: S x10 x9 x8 x7 x6 x5 x4
byte 1: x3 x2 x1 x0 m51 m50 m49 m48
byte 2: m47 m46 m45 m44 m43 m42 m41 m40
byte 3: m39 m38 m37 m36 m35 m34 m33 m32
byte 4: m31 m30 m29 m28 m27 m26 m25 m24
byte 5: m23 m22 m21 m20 m19 m18 m17 m16
byte 6: m15 m14 m13 m12 m11 m10 m9 m8
byte 7: m7 m6 m5 m4 m3 m2 m1 m0
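where "S" denotes the sign bit, "x" denotes an exponent bit, and "m" denotes a significand bit. Once the bits here have been extracted, they are converted with the computation:
(-1)^S × (1 + m × 2^-52) × 2^(x - 1023)
where x and m are the exponent and significand fields read as unsigned integers, and 1023 is the offset with which the exponent is stored. This form applies to normalized numbers.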
This scheme provides numbers valid out to about 15 decimal digits, with the following range of numbers:

| | maximum | minimum |
|---|---|---|
| positive | 1.7976931348623157E+308 | 4.940656458412465E-324 |
| negative | -4.940656458412465E-324 | -1.7976931348623157E+308 |

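
The 64-bit layout can be inspected from Python with the standard `struct` module. This is a sketch that assumes Python's `float` is an IEEE 754 double (true on common platforms); the function name `decode_double` is hypothetical, and the reconstruction step covers normalized numbers only.

```python
import struct
import sys

def decode_double(value):
    """Split a float into its sign, exponent and significand fields,
    then rebuild the value from those fields (normalized numbers only)."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", value))
    sign = raw >> 63                      # 1 bit
    exponent = (raw >> 52) & 0x7FF        # 11 bits, stored with offset 1023
    fraction = raw & ((1 << 52) - 1)      # 52 bits
    rebuilt = (-1) ** sign * (1 + fraction / 2 ** 52) * 2.0 ** (exponent - 1023)
    return sign, exponent, fraction, rebuilt

sign, exponent, fraction, rebuilt = decode_double(-6.25)
print(sign, exponent, hex(fraction), rebuilt)

# Extremes of the format: the largest finite value, and the smallest
# positive value (a denormalized number, outside the formula above).
largest = sys.float_info.max
smallest_subnormal = 2.0 ** -1074
print(largest, smallest_subnormal)
```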
The spec also defines several special values that are not defined numbers, and are known as NaNs, for ‘Not A Number’. These are used by programs to designate invalid operations and the like. You will rarely encounter them and NaNs will not be discussed further here.
Some programs also use 32-bit floating-point numbers. The most common scheme uses a 23-bit significand with a sign bit, plus an 8-bit exponent in "excess-127" format, giving 7 valid decimal digits.
byte 0: S x7 x6 x5 x4 x3 x2 x1
byte 1: x0 m22 m21 m20 m19 m18 m17 m16
byte 2: m15 m14 m13 m12 m11 m10 m9 m8
byte 3: m7 m6 m5 m4 m3 m2 m1 m0
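The bits are converted to a numeric value with the computation:
(-1)^S × (1 + m × 2^-23) × 2^(x - 127)
(for normalized numbers, with x and m read as unsigned integers)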
leading to the following range of numbers:

| | maximum | minimum |
|---|---|---|
| positive | 3.402823E+38 | 1.401298E-45 |
| negative | -1.401298E-45 | -3.402823E+38 |

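
The reduced precision of the 32-bit format can be observed by storing a value in four bytes and reading it back. This is a sketch using the standard `struct` module; the helper names are hypothetical, and the two hex patterns are the bit patterns of the largest finite and the smallest positive 32-bit values.

```python
import struct

def round_to_float32(value):
    """Store a Python float in 32 bits and read it back."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

def float32_from_hex(hex_bytes):
    """Interpret four bytes (given as hex text) as a big-endian 32-bit float."""
    return struct.unpack(">f", bytes.fromhex(hex_bytes))[0]

# 0.1 survives to only about 7 decimal digits in 32 bits.
stored = round_to_float32(0.1)
print(stored)

largest = float32_from_hex("7f7fffff")   # all exponent bits but one, full significand
smallest = float32_from_hex("00000001")  # only the lowest significand bit set
print(largest, smallest)
```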
Such floating-point numbers are known as "reals" or "floats" in general, but with a number of inconsistent variations, depending on context:
A 32-bit float value is sometimes called a "real32" or a "single", meaning "single-precision floating-point value".
A 64-bit float is sometimes called a "real64" or a "double", meaning "double-precision floating-point value".
The term "real" without any elaboration generally means a 64-bit value, while the term "float" similarly generally means a 32-bit value.
Once again, remember that bits are bits. If you have 8 bytes stored in computer memory, it might be a 64-bit real, two 32-bit reals, four 16-bit signed or unsigned integers, or some other kind of data that fits into 8 bytes.
The only difference is how the computer interprets them. If the computer stored four unsigned integers and then read them back from memory as a 64-bit real, it almost always would be a perfectly valid real number, though it would be junk data.
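The same eight bytes can be read back under several interpretations with the standard `struct` module. In this sketch the four integer values are arbitrary, and treating them as 16-bit unsigned integers in big-endian order is an assumption made for illustration.

```python
import struct

# Eight bytes written as four unsigned 16-bit integers (arbitrary values).
raw = struct.pack(">4H", 1, 2, 3, 4)

# The same bytes, reinterpreted without changing a single bit.
as_ints = struct.unpack(">4H", raw)
(as_double,) = struct.unpack(">d", raw)
as_two_floats = struct.unpack(">2f", raw)

print(raw.hex())
print(as_ints)
print(as_double)       # a valid but meaningless number
print(as_two_floats)
```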
So now our computer can handle positive and negative numbers with fractional parts. However, even with floating-point numbers you run into some of the same problems that you did with integers:
Low-level programmers have to worry about unsigned and signed, fixed and floating-point numbers. They have to write wildly different code, with different opcodes and operands, to add two floating point numbers compared to the code to add two integers.
However, high-level programming languages such as LISP and Python offer an abstract number that may be an expanded type such as rational, bignum, or complex. Programmers in LISP or Python (among others) have some assurance that their program code will Do The Right Thing with mathematical operations. Due to operator overloading, mathematical operations on any number (whether signed, unsigned, rational, floating-point, fixed-point, integral, or complex) are written exactly the same way. Other languages such as REXX and Java provide decimal floating-point, which avoids many 'unexpected' results.
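A short Python sketch of this uniformity: one function, written once, is applied to an integer too large for 64 bits, a rational, a complex number, and a float. The function name `square_plus_one` is hypothetical and the inputs are illustrative.

```python
from fractions import Fraction

def square_plus_one(x):
    """The same expression works for every numeric type that supports * and +."""
    return x * x + 1

big = square_plus_one(2 ** 64)             # exact arbitrary-precision integer
ratio = square_plus_one(Fraction(1, 3))    # exact rational
imaginary = square_plus_one(1j)            # complex arithmetic
real = square_plus_one(0.5)                # binary floating-point

print(big, ratio, imaginary, real)
```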
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Computer numbering formats".