In textual criticism and bibliography, collation is the reading of two (or more) texts side-by-side in order to note their differences.
In printing and photocopying, collation is the arrangement of pages in order when several copies of a document are bound after printing or copying.
Collation can also refer to the detailed bibliographical description of a book or the comparison of the physical makeup of two copies of a book.
Collation differs from classification in that classification is concerned with arranging information into logical categories, while collation is concerned with the ordering of those categories.
Collation differs from a sorting algorithm in that a sorting algorithm decides which pairs of elements to compare, whereas a collation defines the order (a total order ≤ on the elements, usually a lexicographical one) that the algorithm consults to determine when to swap two elements. In fact, sorting algorithms are often implemented to take a collation as an input.
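The division of labour described above can be sketched in a few lines of Python. This is only an illustration: the function names are invented for the example, and the length-based collation is a hypothetical rule chosen to show that the same algorithm yields different results under different collations.

```python
from typing import Callable, List

# A collation here is any function that answers "may a come before b?".
Collation = Callable[[str, str], bool]

def insertion_sort(items: List[str], precedes_or_equal: Collation) -> List[str]:
    """Sort a copy of items; the collation is supplied by the caller."""
    result = list(items)
    for i in range(1, len(result)):
        j = i
        # The algorithm decides WHICH pair to compare (neighbours j-1 and j);
        # the collation decides whether that pair is in order.
        while j > 0 and not precedes_or_equal(result[j - 1], result[j]):
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result

def by_code_point(a: str, b: str) -> bool:
    return a <= b

def by_length_then_code_point(a: str, b: str) -> bool:
    # Hypothetical collation: shorter words first, ties broken by code point.
    return (len(a), a) <= (len(b), b)

words = ["pear", "fig", "banana", "kiwi"]
print(insertion_sort(words, by_code_point))              # ['banana', 'fig', 'kiwi', 'pear']
print(insertion_sort(words, by_length_then_code_point))  # ['fig', 'kiwi', 'pear', 'banana']
```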
While numerical sorting might appear to work only for numbers, computers can apply it to any textual information, since they internally use character sets that assign a numeric code point to each letter or glyph. For example, a computer using ASCII code (or any of its supersets such as Unicode) and numerical sorting would collate the list of characters a · b · C · d · $ to $ · C · a · b · d.
The numerical values that ASCII uses are $ = 36, a = 97, b = 98, C = 67, and d = 100, resulting in ASCIIbetical order.
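The values quoted above can be checked with Python's built-in ord, which returns a character's code point. The variable names are chosen only for this example.

```python
chars = ["a", "b", "C", "d", "$"]

# Map each character to its code point (identical to its ASCII value here).
code_points = {c: ord(c) for c in chars}
print(code_points)  # {'a': 97, 'b': 98, 'C': 67, 'd': 100, '$': 36}

# Sorting by code point gives the "ASCIIbetical" order.
collated = sorted(chars, key=ord)
print(collated)     # ['$', 'C', 'a', 'b', 'd']
```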
This style of collation is commonly used, often with the refinement of converting uppercase letters to lowercase before comparing ASCII values, since most people do not expect capitalised words to jump to the head of the list.
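A minimal sketch of that refinement in Python, using an invented word list: the comparison is made on lowercased copies while the original spellings are kept in the result.

```python
words = ["banana", "Cherry", "apple"]

# Plain code-point order: the capitalised word jumps to the head of the list.
plain = sorted(words)
print(plain)    # ['Cherry', 'apple', 'banana']

# Lowercase each word for comparison only; the output keeps the original case.
folded = sorted(words, key=str.lower)
print(folded)   # ['apple', 'banana', 'Cherry']
```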
For example, the list of words foo · bar · bibble collates to bar · bibble · foo because (1) f comes after b so bar and bibble both precede foo and (2) a comes before i so bar precedes bibble.
Numerical sorting (not to be confused with sorting of numbers, see below) on a computer and alphabetical sorting often produce the same ordering for English.
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet.
For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are alphabetized as digraphs, that is, as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, which is also incorrect under pre-1994 alphabetization.
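The Spanish rules can be approximated by ranking each letter according to an explicit alphabet instead of its code point. The sketch below is an illustration under simplifying assumptions: the names MODERN, TRADITIONAL and make_key are invented here, the word list is made up, and accented vowels, spaces and punctuation are not handled.

```python
import unicodedata

# Modern order: ñ is a letter of its own, placed after n.
MODERN = list("abcdefghijklmnñopqrstuvwxyz")
# Pre-1994 order: ch and ll are additionally single letters after c and l.
TRADITIONAL = (list("abc") + ["ch"] + list("defghijkl") + ["ll"]
               + list("mnñopqrstuvwxyz"))

def make_key(alphabet):
    """Build a sort key that ranks symbols by their position in alphabet."""
    rank = {symbol: i for i, symbol in enumerate(alphabet)}

    def key(word):
        text = unicodedata.normalize("NFC", word).lower()
        ranks, i = [], 0
        while i < len(text):
            pair = text[i:i + 2]
            # Prefer a two-letter symbol (ch, ll) when the alphabet has one.
            symbol = pair if pair in rank else text[i]
            ranks.append(rank[symbol])  # KeyError for unsupported characters
            i += len(symbol)
        return ranks

    return key

words = ["cuna", "chico", "ñu", "nube", "zorro", "llama", "luz"]
print(sorted(words))                             # ñu ends up after zorro
print(sorted(words, key=make_key(MODERN)))       # ñu between nube and zorro
print(sorted(words, key=make_key(TRADITIONAL)))  # cuna before chico, luz before llama
```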
Similar differences between computer numeric sorting and alphabetic sorting occur in Danish and Norwegian (in some cases, aa is ordered as å at the end of the alphabet), German (ß is ordered as s + s; ä, ö, ü are ordered as a + e, o + e, u + e in phone books, but as a, o, u elsewhere, and after az, oz, uz in Austria), Icelandic (ð follows d), English (æ is ordered as a + e), and many other languages.
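Rules of the "ordered as a + e" kind can be sketched by expanding each special letter before comparison. The example below covers only the German phone-book convention mentioned above; the table name, the function name and the sample surnames are assumptions made for the illustration.

```python
import unicodedata

# German phone-book convention: umlauts expand to vowel + e, ß to ss.
PHONE_BOOK = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def phone_book_key(name: str) -> str:
    # Normalise to precomposed characters, lowercase, then expand.
    return unicodedata.normalize("NFC", name).lower().translate(PHONE_BOOK)

names = ["Mutter", "Müller", "Muller"]
print(sorted(names))                      # ['Muller', 'Mutter', 'Müller']
print(sorted(names, key=phone_book_key))  # ['Müller', 'Muller', 'Mutter']
```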
Usually the spaces or hyphens between words are ignored.
See also Latin alphabet for a list of collating rules for Latin-based alphabets.
Languages that use a syllabary or abugida instead of an alphabet (for example, Cherokee) can use approximately the same system if there is a set ordering for the symbols.
Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as Chinese hanzi and Japanese kanji, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character for "mother" (媽) is sorted as a thirteen-stroke character under the three-stroke primary radical (女).
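As a rough sketch, radical-and-stroke ordering amounts to sorting on a two-part key. The small table below is entered by hand for illustration only (Kangxi radical number, radical, total stroke count); a real implementation would need a full character database, and RADICAL_TABLE and radical_stroke_key are names invented for this example.

```python
# character: (Kangxi radical number, radical, total stroke count)
RADICAL_TABLE = {
    "媽": (38, "女", 13),
    "好": (38, "女", 6),
    "他": (9, "人", 5),
    "林": (75, "木", 8),
}

def radical_stroke_key(character: str):
    radical_number, _radical, strokes = RADICAL_TABLE[character]
    # Group by primary radical first, then order by stroke count within it.
    return (radical_number, strokes)

characters = ["媽", "林", "好", "他"]
print(sorted(characters, key=radical_stroke_key))  # ['他', '好', '媽', '林']
```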
The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are only a few characters, all unambiguous. The choice of which components of a logograph comprise separate radicals, and which radical is primary, is not clear-cut. As a result, languages written with logographs often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word Tōkyō (東京), the Japanese name of Tokyo, can be sorted as if it were spelled out in the hiragana syllabary as "to-u-ki-yo-u" (とうきょう).
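Sorting by phonetic conversion can be sketched as a lookup followed by an ordinary sort. The READINGS table is a hand-made assumption covering three place names, and comparing hiragana by code point only approximates the conventional kana order (it ignores, for instance, the usual treatment of voiced and small kana).

```python
# Hand-entered readings for illustration; a real system needs a dictionary.
READINGS = {
    "東京": "とうきょう",  # Tōkyō
    "大阪": "おおさか",    # Ōsaka
    "京都": "きょうと",    # Kyōto
}

def reading_key(word: str) -> str:
    # Sort on the hiragana spelling rather than on the kanji themselves.
    return READINGS[word]

places = ["東京", "大阪", "京都"]
print(sorted(places))                   # by kanji code point: ['京都', '大阪', '東京']
print(sorted(places, key=reading_key))  # by reading:          ['大阪', '京都', '東京']
```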
Nevertheless, the radical-and-stroke system is the only practical method for constructing dictionaries that someone may use to look up a logograph whose pronunciation is unknown.
A similar complication arises when special characters such as hyphens or apostrophes appear in words or names. Any of the same rules as above can be used in this case as well; however, the strict ASCII sorting no longer corresponds exactly to any of the rules.
In telephone directories in English-speaking countries, surnames beginning with Mc are sometimes sorted as if they started with Mac, and so are placed between "Mabxxx" and "Madxxx". Under these rules, the telephone directory order of the following names would be: Maam, McAllan, Macbeth, MacCarthy, McDonald, Macy, Mboko.
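The Mc rule can be expressed as a sort key that rewrites the prefix before comparing. This is a minimal sketch; directory_key is a name invented for the example, and it handles only the Mc prefix described above.

```python
def directory_key(surname: str) -> str:
    # Treat a leading "Mc" as if it were spelled "Mac", then ignore case.
    if surname.startswith("Mc"):
        surname = "Mac" + surname[2:]
    return surname.lower()

names = ["Mboko", "Macy", "McDonald", "MacCarthy", "Macbeth", "McAllan", "Maam"]
print(sorted(names, key=directory_key))
# ['Maam', 'McAllan', 'Macbeth', 'MacCarthy', 'McDonald', 'Macy', 'Mboko']
```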
In certain contexts, very common words (such as articles) at the beginning of a sequence of words are not considered for ordering, or are moved to the end. So "The Shining" is treated as "Shining" or "Shining, The" when alphabetizing and is therefore ordered before "Summer of Sam". This rule is fairly easy to capture in an algorithm, but many programs rely instead on simple lexicographic ordering. One fairly quaint exception to this rule is the flying of the flag of The Former Yugoslav Republic of Macedonia at the United Nations between those of Thailand and Timor-Leste.
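One way the article rule might be captured is a key function that drops a leading article before comparing. The list of articles and the third title are assumptions added for the illustration.

```python
# English articles to ignore at the start of a title (an assumed, minimal list).
ARTICLES = ("the ", "a ", "an ")

def title_key(title: str) -> str:
    lowered = title.lower()
    for article in ARTICLES:
        if lowered.startswith(article):
            # Compare as if the leading article were absent.
            return lowered[len(article):]
    return lowered

titles = ["Summer of Sam", "The Shining", "Taxi Driver"]
print(sorted(titles))                 # ['Summer of Sam', 'Taxi Driver', 'The Shining']
print(sorted(titles, key=title_key))  # ['The Shining', 'Summer of Sam', 'Taxi Driver']
```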
For example, Windows XP orders integers embedded in file names by their numerical value when sorting file names. Sorting decimals properly is a bit more difficult, because different locales use different symbols for the decimal point, and sometimes the character used as a decimal point is also used as a separator, as in "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.
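Ordering embedded integers by value can be sketched by splitting each string into text and digit runs. This is an illustrative key function, not a description of how any particular operating system implements it; natural_key and the file names are invented for the example, and decimals and negative numbers are deliberately not handled.

```python
import re

def natural_key(text: str):
    # re.split with a capturing group alternates: text, digits, text, digits, ...
    parts = re.split(r"(\d+)", text)
    key = []
    for index, part in enumerate(parts):
        if index % 2:   # odd positions hold the digit runs
            key.append((1, int(part), ""))
        else:           # even positions hold ordinary text (possibly empty)
            key.append((0, 0, part.lower()))
    return key

files = ["file10.txt", "file2.txt", "file1.txt"]
print(sorted(files))                   # ['file1.txt', 'file10.txt', 'file2.txt']
print(sorted(files, key=natural_key))  # ['file1.txt', 'file2.txt', 'file10.txt']
```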
Similarly, "-13" comes alphabetically after "-12" although −13 is the smaller number. With negative numbers, making ascending numerical order correspond with alphabetical sorting requires more drastic measures, such as adding a constant to all numbers to make them all positive.
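The constant-offset idea can be sketched as follows. The offset and field width are arbitrary assumptions, and the zero-padding is an extra step added here so that every label has the same length, which the alphabetical comparison needs.

```python
def sortable_label(number: int, offset: int = 1000, width: int = 4) -> str:
    """Shift number to be non-negative and pad it to a fixed width."""
    shifted = number + offset
    if not 0 <= shifted < 10 ** width:
        raise ValueError("number outside the range covered by offset and width")
    return f"{shifted:0{width}d}"

numbers = [-12, 5, -13, 0]

# Plain string comparison puts "-12" before "-13".
print(sorted(str(n) for n in numbers))      # ['-12', '-13', '0', '5']

# Alphabetical order of the shifted labels matches numerical order.
print(sorted(numbers, key=sortable_label))  # [-13, -12, 0, 5]
```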
This article is licensed under the GNU Free Documentation License. It uses material from the article "Collation".