UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode created by Ken Thompson and Rob Pike. It is able to represent any universal character in the Unicode standard, yet is backwards compatible with ASCII. For this reason, it is steadily becoming the preferred encoding for email, web pages, and other places where characters are stored or streamed.
UTF-8 uses one to four bytes (strictly, octets) per character, depending on the Unicode symbol. Only one byte is needed to encode the 128 US-ASCII characters (Unicode range U+0000 to U+007F). Two bytes are needed for Latin letters with diacritics, for combining marks, and for characters in the Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana alphabets (Unicode range U+0080 to U+07FF). Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode.
Four bytes may seem like a lot for one character (code point). However, code points outside the Basic Multilingual Plane are generally very rare. Furthermore, UTF-16 (the main alternative to UTF-8) also needs four bytes for these code points. Whether UTF-8 or UTF-16 is more efficient depends on the range of code points being used. The differences between encoding schemes can also become negligible when a general-purpose compression scheme such as DEFLATE is applied. For short items of text, where such algorithms do not perform well and size is important, the Standard Compression Scheme for Unicode could be considered instead.
The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, with UTF-8 as at least one supported encoding. The Internet Mail Consortium (IMC) recommends that all email programs be able to display and create mail using UTF-8.
There are several current, slightly different definitions of UTF-8 in various standards documents:
They supersede the definitions given in the following obsolete works:
They are all the same in their general mechanics with the main differences being on issues such as allowed range of code point values and safe handling of invalid input. The bits of a Unicode character are divided into several groups which are then divided among the lower bit positions inside the UTF-8 bytes.
A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII. In other cases, up to four bytes are required. The uppermost bit of these bytes is 1, to prevent confusion with 7-bit ASCII characters and therefore keep standard byte-oriented string processing safe.
| Code range (hexadecimal) | Scalar value (binary) | UTF-8 (binary) | Notes |
|---|---|---|---|
| 000000–00007F | 0xxxxxxx (seven x) | 0xxxxxxx (seven x) | ASCII equivalence range; the byte begins with 0 |
| 000080–0007FF | 00000xxx xxxxxxxx (three x, eight x) | 110xxxxx 10xxxxxx (five x, six x) | First byte begins with 110; the following byte begins with 10 |
| 000800–00FFFF | xxxxxxxx xxxxxxxx (eight x, eight x) | 1110xxxx 10xxxxxx 10xxxxxx (four x, six x, six x) | First byte begins with 1110; the following bytes begin with 10 |
| 010000–10FFFF | 000xxxxx xxxxxxxx xxxxxxxx (five x, eight x, eight x) | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (three x, six x, six x, six x) | First byte begins with 11110; the following bytes begin with 10 |
So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.
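The bit layout in the table above can be sketched as a small encoder. This is a minimal illustration rather than a reference implementation: the function name `encode_utf8` is hypothetical, and the sketch does not reject surrogate code points (U+D800 to U+DFFF), which a conforming encoder would refuse.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point using the four rows of the table above.

    Hypothetical helper for illustration; surrogates are not rejected here.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if code_point < 0x80:
        # One byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x10348):
        print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ')}")
```

For non-surrogate code points the result should agree with Python's built-in `chr(cp).encode("utf-8")`, which is a convenient cross-check.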
By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the universal character set). However, UTF-8 was restricted by RFC 3629 to use only the area covered by the formal Unicode definition, U+0000 to U+10FFFF, in November 2003. With these restrictions, the following byte values never appear in a legal UTF-8 sequence:
| Codes (binary) | Codes (hexadecimal) | Notes |
|---|---|---|
| 1100000x | C0, C1 | Overlong encoding: lead-byte of a 2 byte sequence, but code point <= 127 |
| 1111111x | FE, FF | Invalid: lead-byte of a 7/8 byte sequence |
| 111110xx 1111110x | F8, F9, FA, FB, FC, FD | Restricted by RFC 3629: lead-byte of a 5/6 byte sequence |
| 11110101 1111011x | F5, F6, F7 | Restricted by RFC 3629: lead byte of codepoint above 10FFFF |
While the last two categories were technically allowed by earlier UTF-8 specifications, no characters were ever assigned to the code points they represent so they should never have appeared in actual text.
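As a rough sketch, the byte values listed in the table above can be used as a cheap first-pass check before full decoding. The names `NEVER_VALID` and `first_forbidden_byte` are hypothetical, and passing this check does not by itself mean the data is valid UTF-8.

```python
# Byte values from the table above: C0, C1 and F5 through FF.
NEVER_VALID = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def first_forbidden_byte(data: bytes) -> int:
    """Return the index of the first byte that can never occur in legal
    UTF-8 under RFC 3629, or -1 if no such byte is present."""
    for index, value in enumerate(data):
        if value in NEVER_VALID:
            return index
    return -1


if __name__ == "__main__":
    print(first_forbidden_byte("naïve".encode("utf-8")))  # no forbidden byte
    print(first_forbidden_byte(b"a\xc0\x80"))             # C0 at index 1
```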
Java uses a non-standard variant of UTF-8, called modified UTF-8, for object serialization, for the Java Native Interface, and for embedding constants in class files. There are two differences between modified and standard UTF-8. The first difference is that the null character (U+0000) is encoded with two bytes instead of one, specifically as 11000000 10000000. This ensures that there are no embedded nulls in the encoded string, presumably to address the concern that if the encoded string is processed in a language such as C, where a null byte signifies the end of a string, an embedded null would cause the string to be truncated.
The second difference is in the way characters outside the Basic Multilingual Plane are encoded. In standard UTF-8 these characters are encoded using the four-byte format above. In modified UTF-8 these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence as in CESU-8. The reason for this modification is more subtle. In Java a character is 16 bits long; therefore some Unicode characters require two Java characters in order to be represented. This aspect of the language predates the supplementary planes of Unicode; however, it is important for performance as well as backwards compatibility, and is unlikely to change. The modified encoding ensures that an encoded string can be decoded one UTF-16 code unit at a time, rather than one Unicode code point at a time. Unfortunately, this also means that characters requiring four bytes in UTF-8 require six bytes in modified UTF-8.
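The two differences described above can be sketched in a few lines. This is an illustrative approximation, not Java's own implementation: the function name `encode_modified_utf8` is hypothetical, and the sketch assumes the input string contains no lone surrogates (Python's strict encoder would raise on them).

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text in the modified UTF-8 form described above (sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # Difference 1: U+0000 becomes 11000000 10000000.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # BMP characters are encoded exactly as in standard UTF-8.
            out += ch.encode("utf-8")
        else:
            # Difference 2: split into a UTF-16 surrogate pair, then
            # encode each surrogate as a three-byte sequence.
            v = cp - 0x10000
            for unit in (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)):
                out += bytes([0xE0 | (unit >> 12),
                              0x80 | ((unit >> 6) & 0x3F),
                              0x80 | (unit & 0x3F)])
    return bytes(out)


if __name__ == "__main__":
    print(encode_modified_utf8("\x00").hex(" "))
    print(encode_modified_utf8("\U00010348").hex(" "))
```

A supplementary character such as U+10348 comes out as six bytes here, compared with four in standard UTF-8, which matches the size penalty noted above.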
The encoding of UTF-8 is based loosely on Huffman coding, a way of representing frequency-sorted binary trees. As a consequence of the exact mechanics of UTF-8, the following properties of multi-byte sequences hold:
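- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
- The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.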
UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although this property adds redundancy to UTF-8–encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently. This also means that if one or more complete bytes are lost due to error or corruption, one can resynchronize at the beginning of the next character and thus limit the damage.
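The resynchronization property can be illustrated with a short sketch. The helper names `is_continuation` and `next_boundary` are hypothetical; the only assumption is the rule stated above, that continuation bytes carry 10 in their two most significant bits.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index at or after pos that is not a continuation
    byte, that is, the start of the next character (or len(data))."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


if __name__ == "__main__":
    data = "héllo".encode("utf-8")   # 68 c3 a9 6c 6c 6f
    damaged = data[2:]               # the lead byte of "é" was lost
    start = next_boundary(damaged, 0)
    print(damaged[start:].decode("utf-8"))
```

Only the damaged character is lost; decoding resumes cleanly at the next lead byte. The same byte-level structure is what makes a plain `bytes` substring search safe on UTF-8 data.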
Also due to the design of the byte sequences, if a sequence of bytes supposed to represent text validates as UTF-8 then it is fairly safe to assume it is UTF-8. The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII is 1 in 32 for a 2 byte sequence, 5 in 256 for a 3 byte sequence and even lower for longer sequences.
While natural languages encoded in traditional encodings are far from random byte sequences they are also unlikely to produce byte sequences that would pass a UTF-8 validity test and then be misinterpreted (obviously pure ASCII text would pass a UTF-8 validity test but provided the legacy encodings under consideration are also ASCII based this is not a problem). For example for ISO-8859-1 text to be misrecognized as UTF-8 the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol.
You can use the bit patterns to identify UTF-8 characters. If the first hex digit of a character's first byte is 0–7, it is an ASCII character. If it is C or D, the character has up to 11 bits (expressed in two bytes). If it is E, the character has up to 16 bits (expressed in three bytes), and if it is F, it has up to 21 bits (expressed in four bytes). The digits 8 through B cannot begin a first byte, but every following byte must begin with a hex digit from 8 through B. Thus, you can tell at a glance that "0xA9" is not a valid UTF-8 character, but that "0x54" and "0xE3 0xB4 0xB1" are.
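The "at a glance" rule can be written down as a lookup on the high hex digit of the first byte. The function name `sequence_length` is hypothetical, and this is only the quick test from the paragraph above: it does not reject overlong forms or lead bytes F5 and above.

```python
def sequence_length(first_byte: int) -> int:
    """Guess the sequence length from the high hex digit of the first byte.

    Returns 0 for 8 through B, which can only be continuation bytes.
    """
    high = first_byte >> 4
    if high <= 0x7:
        return 1   # ASCII
    if high in (0xC, 0xD):
        return 2   # up to 11 bits
    if high == 0xE:
        return 3   # up to 16 bits
    if high == 0xF:
        return 4   # up to 21 bits
    return 0       # 8..B: not a valid first byte


if __name__ == "__main__":
    for sample in (b"\xa9", b"\x54", b"\xe3\xb4\xb1"):
        expected = sequence_length(sample[0])
        print(sample.hex(" "), "->", expected, "byte(s) expected")
```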
The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:
It is possible for a decoder to behave in different ways for different types of invalid input.
Overlong forms are one of the most troublesome types of data. The current RFC says they must not be decoded but older specifications for UTF-8 only gave a warning and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high profile products including Microsoft's IIS web server. Therefore, care must be taken to avoid security issues if validation is performed before conversion from UTF-8.
To maintain security in the case of invalid input there are two options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, returns either an error or text that the application considers to be harmless. Another possibility is to avoid conversion out of UTF-8 altogether but this relies on any other software that the data is passed to safely handling the invalid data.
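A minimal sketch of the first option, decoding before validating, is shown below. The function `safe_path_component` and its path checks are hypothetical; the sketch relies on Python's built-in strict UTF-8 decoder, which raises `UnicodeDecodeError` on overlong forms such as C0 AF (an overlong encoding of "/").

```python
def safe_path_component(raw: bytes) -> str:
    """Decode first, then validate the decoded text (illustrative check)."""
    # Strict decoding: invalid input, including overlong forms, raises
    # UnicodeDecodeError instead of being silently converted.
    text = raw.decode("utf-8")
    # Validation runs on decoded characters, so no alternative byte
    # spelling of "/" can slip past it.
    if "/" in text or ".." in text:
        raise ValueError("path separator or parent reference not allowed")
    return text


if __name__ == "__main__":
    print(safe_path_component(b"report.txt"))
    try:
        safe_path_component(b"..\xc0\xafsecret")
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

Checking the raw bytes for "/" before decoding with a lenient decoder would miss the overlong form, which is the class of bug described above.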
Another consideration is error recovery. To guarantee correct recovery after corrupt or lost bytes, decoders must be able to recognise the difference between lead and trail bytes, rather than just assuming that bytes will be of the type allowed in their position.
UTF-8 was first officially presented at the USENIX conference in San Diego, January 25–29, 1993.
Microsoft's specification for Cab (MS Cabinet) from 1996 explicitly allows UTF-8 encoded strings everywhere (although this was before UTF-8 was formally standardised), but the encoder never actually implemented it.