In computer science, tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
Take, for example, the following string. Unlike humans, a computer cannot intuitively 'see' that there are 9 words. To a computer this is only a series of 43 characters.
The quick brown fox jumps over the lazy dog
A process of tokenization could be used to split the sentence into word tokens. There are many ways to store tokenized input; XML markup is one of them.
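As a minimal sketch of this idea, the Python snippet below splits the example sentence on whitespace and stores the resulting tokens as XML. The element names `sentence` and `word` and the helper names `tokenize_words` and `tokens_to_xml` are assumptions chosen for illustration, and splitting on whitespace is only the simplest possible word tokenizer.

```python
import xml.etree.ElementTree as ET


def tokenize_words(text):
    """Split text into word tokens on runs of whitespace."""
    return text.split()


def tokens_to_xml(tokens):
    """Store tokens as XML; the element names are illustrative only."""
    sentence = ET.Element("sentence")
    for token in tokens:
        word = ET.SubElement(sentence, "word")
        word.text = token
    # encoding="unicode" returns a str rather than bytes
    return ET.tostring(sentence, encoding="unicode")


text = "The quick brown fox jumps over the lazy dog"
tokens = tokenize_words(text)
xml_output = tokens_to_xml(tokens)

print(len(text), "characters,", len(tokens), "word tokens")
print(xml_output)
```

The same token list could just as easily be stored as a plain list, as JSON, or as rows in a table; XML is used here only because it makes the token boundaries visible.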
The term is also used when, during the parsing of source code in some programming languages, symbols such as commands are converted into a shorter representation that uses less memory. Most BASIC interpreters did this to save space: a command such as print would be replaced by a single number, which takes up much less room in memory than the spelled-out word. Many lossless compression systems, notably dictionary-based ones, use a comparable form of substitution, although it is typically not referred to as tokenization.
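The keyword substitution described above can be sketched roughly as follows. The token values in `KEYWORD_TOKENS` are hypothetical and do not correspond to any particular BASIC dialect, and the function names `crunch` and `expand` are likewise assumptions. The sketch assumes ASCII input and only shows how replacing each keyword with one byte shortens the stored line.

```python
import re

# Hypothetical keyword-to-byte table; real interpreters each used their own values.
KEYWORD_TOKENS = {"PRINT": 0x80, "GOTO": 0x81, "IF": 0x82, "THEN": 0x83}
TOKEN_KEYWORDS = {value: key for key, value in KEYWORD_TOKENS.items()}

# Match a quoted string literal, a run of letters, or any other single character.
_PIECE = re.compile(r'"[^"]*"?|[A-Za-z]+|.', re.DOTALL)


def crunch(line):
    """Replace whole-word keywords outside string literals with one-byte tokens."""
    out = bytearray()
    for piece in _PIECE.findall(line):
        token = KEYWORD_TOKENS.get(piece.upper())
        if token is not None:
            out.append(token)                   # keyword: store a single byte
        else:
            out.extend(piece.encode("ascii"))   # anything else: store as typed
    return bytes(out)


def expand(data):
    """Turn token bytes back into keywords, as when listing a program."""
    return "".join(TOKEN_KEYWORDS.get(byte, chr(byte)) for byte in data)


line = 'IF X THEN PRINT "print me"'
crunched = crunch(line)
print(len(line), "bytes before,", len(crunched), "bytes after")
print(expand(crunched))
```

Text inside the quoted string is left untouched, so the word print in the string literal is not replaced by a token.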
In the study of human cognition, tokenization often refers to the process of converting a sensory stimulus into a cognitive "token" suitable for internal processing. A stimulus that is not correctly tokenized may not be processed, or may be incorrectly merged with other stimuli.
This article is licensed under the GNU Free Documentation License.
It uses material from the Wikipedia article "Tokenization".