The international standard ISO/IEC 10646 defines the Universal Character Set (UCS) as a character encoding. It contains nearly a hundred thousand abstract characters, each identified by an unambiguous name and an integer number called its code point.
Characters, symbols, glyphs, letters, numbers, ideograms, and logograms are drawn from languages, scripts, and traditions around the world and collected in the UCS. Characters from the remaining, less widely known writing systems continue to be added, and existing entries are updated frequently.
Since 1991, the Unicode Consortium has worked with ISO to develop The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Version 2.0 of Unicode exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After the publication of Unicode 3.0 in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000.
The UCS has over 1.1 million code points, but only the first 65,536 (the Basic Multilingual Plane, or BMP) had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) mandated in 2000 that computer systems sold in its territory must support GB18030, which in turn requires support for code points beyond the BMP.
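The split between the BMP and the rest of the code space can be checked with a simple comparison. The following is a minimal sketch; the names `BMP_SIZE`, `MAX_CODE_POINT`, and `in_bmp` are illustrative and do not come from the standard.

```python
# Illustrative constants: the BMP is the first 65,536 code points,
# and the UCS code space ends at hexadecimal 10FFFF.
BMP_SIZE = 0x10000
MAX_CODE_POINT = 0x10FFFF


def in_bmp(code_point: int) -> bool:
    """Return True if the code point lies in the Basic Multilingual Plane."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a UCS code point")
    return code_point < BMP_SIZE


if __name__ == "__main__":
    print(MAX_CODE_POINT + 1)   # total number of code points
    print(in_bmp(0x20AC))       # EURO SIGN, inside the BMP
    print(in_bmp(0x20000))      # a CJK ideograph outside the BMP
```

A system that only handles 16-bit code values can represent the first example directly but not the second, which is the kind of limitation the GB18030 requirement exposed.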
The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimize conflicts with other encoding forms.
The first amendment to the original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Each pair consists of an "RC-element" (a two-octet sequence comprising the R-octet and the C-octet from the four octet sequence that corresponds to a cell in the coding space of a coded character set) from the high-half zone and an "RC-element" from the low-half zone. Unicode also adopted UTF-16, but in Unicode terminology, the high-half zone elements become "high surrogates" and the low-half zone elements become "low surrogates".
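The pairing described above can be sketched in a few lines. This is an illustrative sketch, not text from either standard: the function names are hypothetical, and the ranges D800 to DBFF (high) and DC00 to DFFF (low) are the surrogate ranges defined for UTF-16.

```python
from typing import Tuple

HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high-half zone / high surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low-half zone / low surrogates


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point outside the BMP into a (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000            # a 20-bit value
    high = HIGH_START + (offset >> 10)       # top 10 bits
    low = LOW_START + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the original code point."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


if __name__ == "__main__":
    high, low = to_surrogates(0x1D11E)       # MUSICAL SYMBOL G CLEF
    print(f"{high:04X} {low:04X}")
    print(f"{from_surrogates(high, low):X}")
```

Because the two zones do not overlap each other or any assigned character, a decoder can always tell from a single 16-bit value whether it is a complete character, the first half of a pair, or the second half.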
Another encoding, UCS-4, uses a single code value between 0 and (theoretically) hexadecimal 7FFFFFFF for each character (although the UCS stops at 10FFFF and ISO/IEC 10646 has stated that all future assignments of characters will also take place in that range). UCS-4 allows representation of each value as exactly four bytes (one 32-bit word). UCS-4 thereby permits a binary representation of every code point in the UCS, including those outside the BMP. As in UCS-2, every encoded character has a fixed length in bytes, which makes it simple to manipulate, but of course it requires twice as much storage as UCS-2.
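The fixed-width property can be illustrated with a short sketch. It assumes big-endian byte order, which is a choice made for this example, and the function names are hypothetical.

```python
import struct


def ucs4_encode(text: str) -> bytes:
    """Encode each code point as one big-endian 32-bit word (assumed byte order)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


def nth_code_point(data: bytes, n: int) -> int:
    """Read the n-th code point directly; no scanning is needed."""
    return struct.unpack_from(">I", data, 4 * n)[0]


if __name__ == "__main__":
    # One ASCII letter, one BMP character, one character outside the BMP.
    data = ucs4_encode("A\u20ac\U0001d11e")
    print(len(data))                         # four bytes per character
    print(f"{nth_code_point(data, 2):X}")    # third character, found by offset
```

Indexing by a fixed offset is what makes the encoding simple to manipulate; the cost is four bytes even for characters that would fit in one or two.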
Occasionally, articles about Unicode will mistakenly refer to UCS-2 as "UCS-16". UCS-16 does not exist; the authors who make this error usually intend to refer to UCS-2 or to UTF-16.
One could code the characters of this primordial ISO 10646 standard in one of three ways:
In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO 10646. The software companies refused to accept the complexity and size requirement of the ISO standard and were able to convince a number of ISO National Bodies to vote against it. The ISO standardisers realised they could not continue to support the standard in its current state and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the limitation upon characters (prohibition of control character values), thus permitting characters like 0x0000101F; and the synchronisation of the repertoire of the Basic Multilingual Plane with that of Unicode.
Meanwhile, the situation changed in the Unicode standard itself: 65,536 characters came to appear insufficient, and from version 2.0 onwards the standard supports encoding of 1,112,064 characters by means of the UTF-16 surrogate mechanism. For that reason, ISO 10646 was limited to contain only as many characters as UTF-16 can encode, that is, a little over a million characters instead of over 2,000 million. The UCS-4 encoding of ISO 10646 was incorporated into the Unicode standard, restricted to the UTF-16 range, under the name UTF-32. As for UTF-1, no one used it, because of its bad design (no way of distinguishing between single bytes, lead bytes and trail bytes, a problem similar to that of the Shift-JIS encoding of Japanese) and its poor performance (many division operations). Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed mixed-width encoding, which came to be called UTF-8.
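The figure of 1,112,064 can be reproduced with simple arithmetic. The sketch below is illustrative; the constant names are made up for this example, and the surrogate range D800 to DFFF is the one reserved for UTF-16.

```python
# Code points reachable at all: 0 through hexadecimal 10FFFF.
TOTAL_CODE_POINTS = 0x10FFFF + 1

# Code points reserved as surrogates, which can never stand for characters.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1

# What UTF-16 can encode: everything except the surrogates themselves.
ENCODABLE = TOTAL_CODE_POINTS - SURROGATE_COUNT

# The same total, counted the other way: non-surrogate BMP values
# plus every combination of 1,024 high and 1,024 low surrogates.
VIA_PAIRS = (0x10000 - SURROGATE_COUNT) + 1024 * 1024

# The original, unrestricted UCS-4 range of 31 bits.
ORIGINAL_UCS4_RANGE = 0x7FFFFFFF + 1

if __name__ == "__main__":
    print(f"{ENCODABLE:,}")
    print(f"{VIA_PAIRS:,}")
    print(f"{ORIGINAL_UCS4_RANGE:,}")
```

Both ways of counting agree, and the last value shows the "over 2,000 million" positions that the 31-bit UCS-4 range originally allowed.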
Some applications support ISO 10646 characters but do not fully support Unicode. One such application, Linux xterm, can properly display all ISO 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly. For instance, selecting text in certain scripts in Mozilla Firefox causes the text to jump around.
See §D.1 of The Unicode Standard for more detail.
This article is licensed under the GNU Free Documentation License. It uses material from the article "Universal Character Set".