In computer science, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol. It was developed by David A. Huffman as a Ph.D. student at MIT in 1952, and published in A Method for the Construction of Minimum-Redundancy Codes. Huffman was Professor Emeritus of Computer Science at the Baskin School of Engineering at the University of California, Santa Cruz until his death from cancer in 1999.
Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix-free code (that is, the bit string representing some particular symbol is never a prefix of the bit string representing any other symbol) that expresses the most common characters using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code. (Huffman coding is such a widespread method for creating prefix-free codes that the term "Huffman code" is widely used as a synonym for "prefix-free code" even when such a code was not produced by Huffman's algorithm.)
For a set of symbols with a uniform probability distribution and a number of members which is a power of two, Huffman coding is equivalent to simple binary block encoding.
Assertions of the optimality of Huffman coding should be phrased carefully, because its optimality can sometimes accidentally be over-stated. For example, arithmetic coding ordinarily has better compression capability, because it does not require the use of an integer number of bits for encoding each source symbol. LZW coding can also often be more efficient, particularly when the input symbols are not independently-distributed, because it does not depend on encoding each input symbol one at a time (instead, it batches up a variable number of input symbols into each encoded syntax element). The efficiency of Huffman coding also depends heavily on having a good estimate of the true probability of the value of each input symbol.
In 1951, David Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree, and quickly proved this method the most efficient.
In doing so, the student outdid his professor, who had worked with information theory inventor Claude Shannon to develop a similar code. Huffman avoided the major flaw of Shannon-Fano coding by building the tree from the bottom up instead of from the top down.
| Input | Alphabet | a | b | c | d | e | f | g | h | i | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Costs | 10 | 15 | 5 | 15 | 20 | 5 | 15 | 30 | 5 | ||
| Output | Codewords | 000
| 010
| 0010
| 011
| 111
| 00110
| 110
| 10
| 00111
| |
| Weighted path length | 10 * 3 | 15 * 3 | 5 * 4 | 15 * 3 | 20 * 3 | 5 * 5 | 15 * 3 | 30 * 2 | 5 * 5 | = 355 |
The technique works by creating a binary tree of nodes. These can be stored in a regular array, the size of which depends on the number of symbols(N). A node can be either a leaf node or an internal node. Initially, all nodes are leaf nodes, which contain the symbol itself, the weight (frequency of appearance) of the symbol and optionally, a link to a parent node which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain symbol weight, links to two child nodes and the optional link to a parent node. As a common convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has N leaf nodes and N−1 internal nodes.
A fast way to create a Huffman tree is to use the heap data structure, which keeps the nodes in partially sorted order according to a predetermined criterion. In this case, the node with the lowest weight is always kept at the root of the heap for easy access.
Creating the tree:
It is generally beneficial to minimize the variance of codeword length. For example, a communication buffer receiving Huffman-encoded data may need to be larger to deal with especially long symbols if the tree is especially unbalanced. To reduce variance every newly generated node must be favored among same weight nodes and placed as high as possible. This modification will retain the mathematical optimality of the Huffman coding while minimizing the length of the longest character code.
Huffman coding is optimal when the probability of each input symbol is a negative power of two. Prefix-free codes tend to have slight inefficiency on small alphabets, where probabilities often fall between these optimal points. "Blocking", or expanding the alphabet size by coalescing multiple symbols into "words" of fixed or variable length before Huffman coding, can help a bit, especially when adjacent symbols are correlated (as in the case of natural language text). Except in some cases with very short codewords, Huffman encoding is robust with respect to inaccuracies in estimated codeword frequencies: the effect of inappropriately long codewords is almost exactly cancelled out by the effect of inappropriately short codewords. The worst case for Huffman coding can happen when the probability of a symbol exceeds 2-1 = 0.5, making the upper limit of inefficiency unbounded. These situations often respond well to a form of blocking called run-length encoding.
Extreme cases of Huffman codes are connected with Fibonacci and Lucas numbers and Wythoff array.
Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have seldom been large enough to offset arithmetic coding's higher computational complexity and patent royalties. (As of November 2001, IBM owns patents on the core concepts of arithmetic coding in several jurisdictions.)
Huffman coding with unequal letter costs is the generalization in which this assumption is no longer assumed true: the letters of the encoding alphabet may have non-uniform lengths, due to characteristics of the transmission medium. An example is the encoding alphabet of Morse code, where a 'dash' takes longer to send than a 'dot', and therefore the cost of a dash in transmission time is higher. The goal is still to minimize the weighted average codeword length, but it is no longer sufficient just to minimize the number of symbols used by the message.
Huffman coding today is often used as a "back-end" to some other compression method. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding.
Lossless compression algorithms | Coding theory
Huffmanovo kódování | Shannon-Fano-Kodierung | Codificación Huffman | Codage de Huffman | 허프만 코딩 | Codifica di Huffman | קוד הופמן | Huffmancodering | ハフマン符号 | Kodowanie Huffmana | Codificação de Huffman | Huffmanin koodaus | Алгоритм Хаффмана | Huffmankodning | รหัสฮัฟแมน และ รหัสแชนนอน-ฟาโน | 哈夫曼树
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Huffman coding".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world