This page compares Unicode encodings. Two situations are considered: eight-bit-clean environments and environments like Simple Mail Transfer Protocol that forbid use of byte values that have the high bit set. Originally such prohibitions were to allow for links that used only seven data bits, but they remain in the standards and so software must generate messages that comply with the restrictions. Standard Compression Scheme for Unicode and Binary Ordered Compression for Unicode are excluded from the comparison tables because it is difficult to simply quantify their size.
For seven-bit environments, UTF-7 is generally more space-efficient than combining another Unicode encoding with quoted-printable or base64.
Fixed-size characters can be helpful, but even when every code point has a fixed width (as in UTF-32), displayed characters do not, because of combining characters. If you work heavily with a particular API that has standardised on a particular Unicode encoding, it is generally a good idea to use that encoding, to avoid converting before every call to the API. Similarly, if you are writing server-side software, it may simplify matters to process text in the same format that you communicate in.
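The point about combining characters can be illustrated with a small sketch. The sample string is an assumption chosen for illustration: the letter "e" followed by U+0301 COMBINING ACUTE ACCENT, which is normally displayed as a single character.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT (illustrative sample text).
text = "e\u0301"

# Two code points, so UTF-32 uses a fixed 4 bytes for each of them.
code_points = len(text)
utf32_bytes = len(text.encode("utf-32-be"))

# A non-zero combining class marks a code point that attaches to the
# preceding base character instead of being displayed on its own.
combining = [unicodedata.combining(ch) != 0 for ch in text]

print(code_points, utf32_bytes, combining)
```

Here the text occupies two fixed-width code points, yet a reader sees one character, so indexing by code point does not index by displayed character.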
UTF-16 is popular because many APIs date from the time when Unicode was a 16-bit fixed-width encoding. Unfortunately, using UTF-16 makes characters outside the Basic Multilingual Plane (BMP) a special case, which increases the risk of oversights in their handling.
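The special case for characters outside the BMP is the surrogate pair. A minimal sketch of the split follows; the function name `to_utf16_units` is hypothetical and used only for illustration.

```python
def to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        # BMP: a single 16-bit code unit.
        return [cp]
    # Outside the BMP: subtract 0x10000 and split the remaining
    # 20 bits into a high and a low surrogate of 10 bits each.
    v = cp - 0x10000
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]

units = to_utf16_units(0x1F600)
print([hex(u) for u in units])
```

Code that assumes one code unit per character works for the first branch and silently mishandles the second, which is the kind of oversight the paragraph above warns about.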
UTF-16 and UTF-32 are also not byte-oriented, so a byte order must be selected when transmitting them over a byte-oriented network or storing them in a byte-oriented file. This may be achieved by standardising on a single byte order, by specifying the endianness as part of external metadata (for example, the MIME charset registry has distinct UTF-16BE and UTF-16LE registrations as well as the plain UTF-16 one), or by using a byte order mark at the start of the text.
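The three options can be seen with Python's built-in codecs, used here only as one convenient illustration. The sample text is an assumption.

```python
import codecs

text = "A"  # illustrative sample text

# Explicit byte order, carried by the codec name (external metadata).
big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

# The plain "utf-16" codec writes a byte order mark first, in the
# native byte order of the machine running the code.
with_bom = text.encode("utf-16")

# On decoding, the plain codec reads the mark to pick the byte order.
from_be = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
from_le = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16")

print(big_endian, little_endian, with_bom, from_be, from_le)
```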
Finally, if the byte stream is subject to corruption, some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard, as they can always resynchronise at the start of the next good character. UTF-16 and UTF-32 handle corrupt bytes well (again recovering at the next good character), but a lost byte will garble all following text. GB18030 may be thrown out of sync by a corrupt or missing byte and has no designed-in recovery.
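A rough sketch of why UTF-8 resynchronises while UTF-16 does not: UTF-8 continuation bytes all have the bit pattern `10xxxxxx`, so a decoder can skip them until it reaches the start of the next character. The helper name `resync_utf8` and the sample strings are assumptions made for illustration.

```python
def resync_utf8(data: bytes, pos: int) -> int:
    """Return the first index at or after pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

# Lose the lead byte of the three-byte euro sign in "aé€b".
good = "a\u00e9\u20acb".encode("utf-8")
damaged = good[:3] + good[4:]
next_start = resync_utf8(damaged, 3)
recovered = damaged.decode("utf-8", errors="replace")

# Lose one byte of UTF-16 text: every later code unit is misaligned.
garbled = "abcd".encode("utf-16-le")[1:].decode("utf-16-le", errors="replace")

print(next_start, recovered, garbled)
```

In the UTF-8 case only the damaged character is lost and the trailing "b" survives. In the UTF-16 case none of the original letters come back, because the decoder has no way to tell that it is one byte out of step.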
Bytes per code point in eight-bit environments:

| Code range (hexadecimal) | UTF-8 | UTF-16 | UTF-32 | UTF-EBCDIC | GB18030 |
|---|---|---|---|---|---|
| 000000 – 00007F | 1 | 2 | 4 | 1 | 1 |
| 000080 – 00009F | 2 | 2 | 4 | 1 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 4 for everything else |
| 0000A0 – 0003FF | 2 | 2 | 4 | 2 | 2 or 4, as above |
| 000400 – 0007FF | 2 | 2 | 4 | 3 | 2 or 4, as above |
| 000800 – 003FFF | 3 | 2 | 4 | 3 | 2 or 4, as above |
| 004000 – 00FFFF | 3 | 2 | 4 | 4 | 2 or 4, as above |
| 010000 – 03FFFF | 4 | 4 | 4 | 4 | 4 |
| 040000 – 10FFFF | 4 | 4 | 4 | 5 | 4 |
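The eight-bit sizes can be expressed as simple range checks. The sketch below uses hypothetical function names and an arbitrary set of sample code points, and cross-checks the UTF-8, UTF-16, UTF-32 and GB18030 figures against Python's built-in codecs. Python ships no UTF-EBCDIC codec, so that function is written from the table alone and is not verified against an encoder.

```python
def utf8_len(cp: int) -> int:
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

def utf16_len(cp: int) -> int:
    # Two bytes inside the BMP, a four-byte surrogate pair outside it.
    return 2 if cp < 0x10000 else 4

def utf32_len(cp: int) -> int:
    return 4

def utf_ebcdic_len(cp: int) -> int:
    # Thresholds taken from the table; not checked against a codec.
    if cp < 0xA0:
        return 1
    if cp < 0x400:
        return 2
    if cp < 0x4000:
        return 3
    if cp < 0x40000:
        return 4
    return 5

def measured_len(cp: int, codec: str) -> int:
    """Size in bytes of one code point as produced by a Python codec."""
    return len(chr(cp).encode(codec))

SAMPLES = [0x41, 0xE9, 0x20AC, 0x4E2D, 0x1F600]  # arbitrary sample code points
for cp in SAMPLES:
    print(f"U+{cp:06X}", utf8_len(cp), utf16_len(cp), utf32_len(cp),
          utf_ebcdic_len(cp), measured_len(cp, "gb18030"))
```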
Bytes per code point in seven-bit environments:

| Code range (hexadecimal) | UTF-7 | UTF-8 quoted-printable | UTF-8 base64 | UTF-16 quoted-printable | UTF-16 base64 | UTF-32 quoted-printable | UTF-32 base64 | GB18030 quoted-printable | GB18030 base64 |
|---|---|---|---|---|---|---|---|---|---|
| 000000 – 000020 | same as 000080 – 00FFFF | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 000021 – 00003C | 1 for "direct characters" and possibly "optional direct characters" (depending on the encoder setting), 2 for +, otherwise same as 000080 – 00FFFF | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
| 00003D (equals sign) | as for 000021 – 00003C | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 00003E – 00007E | as for 000021 – 00003C | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
| 00007F | 5 for an isolated character inside a run of single-byte characters; for runs, 2⅔ per character, plus padding to make a whole number of bytes, plus 2 to start and finish the run | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 000080 – 0007FF | as for 00007F | 6 | 2⅔ | 2–6, depending on whether the byte values need to be escaped | 2⅔ | 8–12, depending on whether the final two byte values need to be escaped | 5⅓ | 4–6 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 8 for everything else | 2⅔ for characters inherited from GB2312/GBK (e.g. most Chinese characters), 5⅓ for everything else |
| 000800 – 00FFFF | as for 00007F | 9 | 4 | 2–6, as above | 2⅔ | 8–12, as above | 5⅓ | 4–6 or 8, as above | 2⅔ or 5⅓, as above |
| 010000 – 10FFFF | same as two characters from 000080 – 00FFFF | 12 | 5⅓ | 8–12, depending on whether the low bytes of the surrogates need to be escaped | 5⅓ | 8–12, as above | 5⅓ | 8 | 5⅓ |
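The seven-bit figures can be spot-checked by measuring actual encoder output. The sketch below assumes Python's standard `quopri` and `base64` modules and its built-in `utf-7` codec as stand-ins for a mail encoder; the function name `seven_bit_sizes` is hypothetical. Base64 sizes for a single character include padding, so the fractional per-character figures in the table only appear when averaged over a run.

```python
import base64
import quopri

def seven_bit_sizes(text: str) -> dict:
    """Encoded size in bytes of text under several seven-bit-safe schemes."""
    sizes = {"UTF-7": len(text.encode("utf-7"))}
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        raw = text.encode(codec)
        sizes[codec + " quoted-printable"] = len(quopri.encodestring(raw))
        sizes[codec + " base64"] = len(base64.b64encode(raw))
    return sizes

# An isolated two-byte UTF-8 character and a plain ASCII letter.
accented = seven_bit_sizes("\u00e9")
ascii_letter = seven_bit_sizes("A")

# A run of three characters shows the amortised base64 cost:
# 6 UTF-8 bytes become 8 base64 bytes, i.e. 2⅔ per character.
run_base64 = len(base64.b64encode(("\u00e9" * 3).encode("utf-8")))

for name, size in accented.items():
    print(name, size)
```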
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Comparison of Unicode encodings".