Source coding is a mapping from (a sequence of) symbols from an information source to a sequence of alphabet symbols (usually bits) such that the source symbols can be exactly recovered from the binary bits (lossless source coding) or recovered within some distortion (lossy source coding). This is the concept behind data compression.
In information theory, the source coding theorem (Shannon 1948) informally states that:
Given is an i.i.d. source, its time series X1, ..., Xn is i.i.d. with entropy H(X) in the discrete-valued case and differential entropy in the continuous-valued case. The Source coding theorem states that for any for any rate larger than the entropy of the source, there is large enough and an encoder that takes i.i.d. repetition of the source, , and maps it to binary bits such that the source symbols are recoverable from the binary bits with probability at least .
Proof for achievablity
Fix some for any . The typical set,, is defined as follows:
AEP shows that for large enough n, the probability that a sequence generated by the source lies in the typical set,, defined as approaches one. In particular there for large enough n, (See AEP for a proof):
The definition of typical sets implies that those sequences that lie in the typical set satisfy:
Note that:
Since bits are enough to point to any string in this set.
The encoding algorithm: The encoder checks if the input sequence lies within the typical set; if yes, it outputs the index of the input sequence within the typical set; if not, the encoder outputs an arbitrary digit number. As long as the input sequence lies within the typical set (with probability at least ), the encoder doesn't make any error. So, the probability of error of the encoder is bounded above by
Proof for converse The converse is proved by showing that any set of size smaller than (in the sense of exponent) would cover a set of probability bounded away from 1.
A variable length source code maps source symbols to a variable number of bits. This allows sources to be compressed and decompressed with zero error. It is possible to compress an i.i.d. source arbitrarily close to its entropy with variable length codes.
Some examples of well-known variable length coding strategies are Huffman coding, Lempel-Ziv coding and Arithmetic coding.
Variable length codes can be strictly nested in order of decreasing generality as non-singular, uniquely decodable and instantaneous (prefix free). Instantaneous codes are always uniqely decodable, which in turn are always non-singular.
A code is non-sinuglar if each source symbol is mapped to a different string, i.e. the mapping from source symbols to bit strings is one-to-one.
The extension of a code is the mapping on finite length source sequences to finite length bit strings that is induced by the original code by applying concatenating the codewords of each symbol. A code is uniquely decodable if its extension is non-singular.
A code is 'instantaneous' if no codeword is a prefix of another codeword. This means that symbols can be decoded instantaneously after their entire codeword is received.
An example of an instantaneous variable length code is shown below.
The source is one of four symbols: a,b,c or d. The binary codeword for each symbol is given. The encoding and decoding of a portion of an example source sequence is given below.
The advantage of a variable length code is that unlikely source symbols can be assigned longer codewords and likely source symbols can be assigned shorter codewords, thus giving a low expected codeword length. For the above example, if the probabilites of a,b,c and d were , the expected number of bits used to represent a source symbol would be . The entropy of this source is 1.75 bits per symbol, so this code compresses the source as much as possible so that the source can be recovered with zero error.
Define typical set as:
Then, for given , for n large enough, . Now we just encode the sequences in the typical set, and usual methods in source coding show that the cardinality of this set is smaller than . Thus, on an average, bits suffice for encoding with probability greater than , where can be made arbitrarily small, by making n larger.
This article is licensed under the GNU Free Documentation License.
It uses material from the
"Source coding".
Home Page • arts • business • computers • games • health • hospitals • home • kids & teens • news • physicians • recreation• reference • regional • science • shopping • society • sports • world