Mass lexical comparison or mass comparison is a highly controversial method developed by the well-known linguist Joseph Greenberg to find genetic relationships among languages in the remote past, beyond the limits of the traditional comparative method, or in situations where there are too many languages to practically apply the latter without many generations of work.
For instance, one could prove that Spanish is related to Italian by showing that many words of the former can be mapped to corresponding words of the latter by a relatively small set of replacement rules, such as replacing initial es- with s-, final -os with -i, etc. Many similar correspondences exist between the grammars of the two languages. Since those systematic correspondences are extremely unlikely to be random coincidences, the most likely explanation by far is that the two languages have evolved from a single ancestral tongue (Latin, in this case).
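The idea of a small set of replacement rules can be illustrated with a short Python sketch. The two rules are the ones mentioned above; the word pairs are hand-picked cases where the rules happen to give the exact Italian form, and the names RULES and apply_rules are hypothetical. Real sound correspondences are more numerous and depend on context.

```python
import re

# The two illustrative Spanish-to-Italian rules from the text, written as
# (regex pattern, replacement). Applied blindly they overgenerate; this is
# only a toy model of systematic correspondence.
RULES = [
    (r"^es", "s"),  # initial es- becomes s-
    (r"os$", "i"),  # final -os becomes -i
]

def apply_rules(word):
    """Apply every rule in order to a single Spanish word."""
    for pattern, replacement in RULES:
        word = re.sub(pattern, replacement, word)
    return word

# Hand-picked examples (sword, scale, books, walls).
SPANISH = ["espada", "escala", "libros", "muros"]
ITALIAN = ["spada", "scala", "libri", "muri"]

predicted = [apply_rules(w) for w in SPANISH]
for es, guess, it in zip(SPANISH, predicted, ITALIAN):
    print(es, "->", guess, "(attested:", it + ")")
```

What counts as evidence is that the same few rules cover many word pairs; a single matching pair on its own would prove nothing.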
Most pre-historical language groupings that are widely accepted today, such as the Indo-European, Algonquian, and Bantu families, have been proved in this way, although many (such as Niger-Congo, and until quite recently Afro-Asiatic and Sino-Tibetan) have not, and some families whose proponents claim to have proved them in this way (e.g. Nostratic) have not been widely accepted.
As those sporadic changes accumulate, they will increasingly obscure the systematic ones, just as enough dirt and scratches on a photograph will eventually make the face unrecognizable. Presumably for this reason, the comparative method has not been able to provide reliable evidence of genetic relationship between languages that split more than 10,000 years ago. Considering that humans have probably been speaking fully developed languages for at least 60,000 years (since Australia was first populated), it is hardly surprising that many languages and language families still have no known relationship with other groups.
Departing from the traditional criterion, Greenberg did not look for any systematic trend in these similarities, trusting that a sufficiently large percentage of sufficiently similar pairs among the samples would be enough to prove a common origin for the two languages. This assumption is valid in principle, because the percentage of similar pairs is expected to be higher for languages that have split off more recently, and to decrease as the split recedes into the past. The chief difficulty lies in deciding what constitutes "sufficient" similarity, particularly bearing in mind that many similarities are due to borrowing between languages and, far more commonly than is often realised, to coincidence.
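The tally described above, a percentage of "sufficiently similar" pairs, might be sketched in Python as follows. The similarity criterion used here (the two words begin with consonants of the same broad class) is a deliberately crude, hypothetical stand-in and is not Greenberg's actual criterion; the consonant classes and the five-word English and German samples are likewise assumptions chosen for illustration.

```python
# Hypothetical broad consonant classes, keyed by spelling rather than by
# real phonetics; a rough proxy for "sufficiently similar".
CLASSES = {
    "labial": set("pbfvmw"),
    "dental": set("tdsznlr"),
    "velar": set("kgxhq"),
}

def initial_class(word):
    """Return the class of the word's first letter, or None if unclassified."""
    first = word[0].lower()
    for name, members in CLASSES.items():
        if first in members:
            return name
    return None

def similar(word_a, word_b):
    """Toy criterion: both words start with a consonant of the same class."""
    class_a = initial_class(word_a)
    return class_a is not None and class_a == initial_class(word_b)

def similarity_percentage(lexicon_a, lexicon_b):
    """Percentage of shared meanings whose two words count as similar."""
    shared = [meaning for meaning in lexicon_a if meaning in lexicon_b]
    if not shared:
        return 0.0
    hits = sum(similar(lexicon_a[m], lexicon_b[m]) for m in shared)
    return 100.0 * hits / len(shared)

# Toy samples: meaning -> word.
english = {"water": "water", "father": "father", "two": "two",
           "night": "night", "dog": "dog"}
german = {"water": "wasser", "father": "vater", "two": "zwei",
          "night": "nacht", "dog": "hund"}

print(similarity_percentage(english, german))
```

In this sample four of the five pairs pass the toy criterion. The figure by itself says nothing until it is compared with the rate expected by chance and from borrowing, which is exactly the difficulty noted above.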
Thus, paradoxically, the lexical comparison method becomes more accurate as the investigation recedes into the past, which offsets to some extent the increased level of statistical noise in the measurements. This stands in contrast to the traditional comparative method, which becomes less reliable as it is applied to broader language groups, since the structural comparisons must be applied to increasingly dubious, inaccurate, and incomplete reconstructed proto-languages.
The mass lexical comparison method also has the advantage that it can reconstruct the broad phylogeny for a large set of languages directly from raw lexical samples, without the need to wait for detailed morphological studies of each language or the reconstruction of proto-languages for each branch — which in the case of Native American languages, for example, would take an enormous amount of work.
Words for "modern" concepts — such as "wine", "horse", and "steel" — may show spurious similarities between unrelated languages, due to the name being imported by a culture together with the thing; e.g. Spanish pan and Japanese pan ("bread"). Alternatively, the names of recently imported concepts may get invented separately in related languages, such as computadora ("computer") in Latin American Spanish and ordinateur in French. Either way, such words would only add noise and bias to the comparison.
Unfortunately, this computation is very difficult to do. For one thing, the similarity level is expected to depend on the phonetic repertoires of the two languages; thus, for instance, one expects more chance resemblances between two languages that have few vowels and many consonants, than between a vowel-rich and a vowel-poor language. Similar biases can be expected when comparing languages that allow consonant clusters with those that don't, or polysyllabic languages with monosyllabic ones. It follows that deciding what would be a significant level of similarity would require a stochastic model for a "random lexicon" that took into account letter frequencies, syllable structure, and many other similar statistics.
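A very reduced version of such a "random lexicon" model can be simulated to show the effect of vowel inventories on chance resemblance. Everything in the sketch below is an assumption made for illustration: words are two consonant-vowel syllables, phonemes are drawn uniformly and independently, both languages share one hypothetical inventory of ten consonants, and two words "resemble" each other when their first syllables are identical. A realistic model would need phoneme frequencies, syllable structure, and much more.

```python
import random

CONSONANTS = "ptkmnslrwh"  # hypothetical shared inventory of 10 consonants
VOWEL_POOR = "aiu"         # hypothetical 3-vowel system
VOWEL_RICH = "aeiou"       # hypothetical 5-vowel system

def random_word(consonants, vowels, syllables, rng):
    """Build a word of CV syllables from uniformly drawn phonemes."""
    return "".join(rng.choice(consonants) + rng.choice(vowels)
                   for _ in range(syllables))

def chance_match_rate(vowels_a, vowels_b, pairs=100_000, seed=0):
    """Estimate how often unrelated random words share their first syllable."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(pairs):
        word_a = random_word(CONSONANTS, vowels_a, 2, rng)
        word_b = random_word(CONSONANTS, vowels_b, 2, rng)
        if word_a[:2] == word_b[:2]:
            matches += 1
    return matches / pairs

poor_vs_poor = chance_match_rate(VOWEL_POOR, VOWEL_POOR)
poor_vs_rich = chance_match_rate(VOWEL_POOR, VOWEL_RICH)
print(poor_vs_poor, poor_vs_rich)
```

Under these assumptions the expected rates are 1/30 for the two vowel-poor languages and 1/50 for the vowel-poor and vowel-rich pair, so the same observed similarity level would carry different weight in the two comparisons.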
At the same time, the correspondences used in the method are often tenuous, to say the least, requiring at times a match in only one phoneme, or even in only one feature (labial, dental, etc.). A wide semantic range is also allowed; for example, in his book on the American languages, Greenberg compared words meaning "arm", "shoulder", "armpit", "forearm", "elbow", etc. Thus, using this method, Lyle Campbell, a linguist specializing in the languages of the Americas and author of a review of Greenberg's book, was able to establish a correspondence between the proposed Amerind language and Finnish, and others were able to do so with Latin and many other languages obviously not related to those of the Americas.
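The effect of a wide semantic range on chance matches can be made concrete with a small calculation. As a simplifying assumption, suppose each candidate word has the same independent probability p of resembling the target form by accident; then the probability that at least one of k candidates matches is 1 - (1 - p)^k. The function name and the figures below are hypothetical and not drawn from Greenberg's data.

```python
def chance_of_any_match(p_single, candidates):
    """Probability that at least one of several independent candidates
    resembles the target by chance, each with probability p_single."""
    return 1.0 - (1.0 - p_single) ** candidates

# Hypothetical figures: a 5% chance per candidate word, comparing one strict
# meaning against five loosely related meanings (arm, shoulder, armpit, ...).
strict = chance_of_any_match(0.05, 1)
loose = chance_of_any_match(0.05, 5)
print(strict, loose)
```

With these made-up numbers the chance of an accidental match rises from 5% to roughly 23% per item, which suggests how comparisons with obviously unrelated languages can succeed once the criteria are loose enough.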
Ideally, such words ought to be excluded from the sample lexicon; but the onomatopoeic origin of a word may be hard to recognize in its present form. Even basic words like "milk" or "wind" have been claimed to reflect the corresponding sounds (those of sucking and blowing, respectively). Unfortunately, the impact of these "natural false cognates" on the similarity measure is hard to estimate.
As a consequence of these semantic shifts and synonymies, the construction of the representative lexicon for a language typically involves many choices that must often be made on subjective criteria. These choices may be unconsciously biased towards words that are similar to those previously chosen for other languages, thus artificially inflating the similarity measure. Unfortunately, the impact of this factor, too, is hard to quantify.
Finally, Greenberg's African classification is by no means the success that it is made out to be. Specialists have grave doubts about the unity of both Nilo-Saharan and Khoi-San, two of Greenberg's four families. The third, Afro-Asiatic, was taken over wholesale from previous work, not proposed anew by Greenberg. A number of languages included by Greenberg in his four families are considered isolates today. Furthermore, current classifications differ in major respects in the subgrouping of the major language families. In sum, although Greenberg's African classification represented an advance on the classification well known at the time, it was neither as successful nor as original as Greenberg and his advocates suggest and does not serve as a good model for evaluating the use of his approach.
A further consideration is that, insofar as mass lexical comparison is a legitimate scientific method, it must work when applied by others, not just Joseph Greenberg. In fact, it has been used by many others, with no discernible difference in application, and has produced results that are either not accepted or are considered to be clearly wrong. Among the examples that we may cite are cases of languages being wrongly classified as Indo-European discussed by Poser and Campbell (1992). To take another example, virtually no one today accepts the proposal by Radin (1919) that all of the languages of North America are related.
Although mass lexical comparison has a few ardent proponents among linguists and somewhat greater acceptance among non-linguists, it is rejected by most historical linguists, who view the comparative method as the only legitimate way to establish pre-historical common ancestry for languages.