New approaches to language and prehistory from typology, genetics, and quantitative linguistics. Søren Wichmann, MPI-EVA & Leiden University
Lecture IV: The utility of phylogenetic algorithms and software: some case studies
Case study A Can the algorithms help us in refining lexicostatistics? Let's compare a phylogeny based on traditional methods and one based on lexicostatistics using modern phylogenetic methods. Language family studied: Mixe-Zoquean
Classification criteria: shared (mostly phonological) innovations:
1. Not defined
2. /h/ inserted before final consonant
3. Vowel length is lost
4. Word-final vowels lost
5. Palatalizing effect of front vowels
6. Apparently some morphological and lexical innovations (not clear)
7. (Mostly implicit by the language being intermediate in several respects, and also having its own innovations, see 12)
8a. Syllable-initial nasals become prenasalised stops
8b. /t/ and /ts/ merge before /i/
8c. Final devoicing
9a. Development of a quantity distinction in consonants
9b. An analogical extension involving verb classes
12. Umlaut, syncope, anaptyxis, stress changes
22. An /h/ is inserted into verb roots whose final consonant is a stop
(from Wichmann 1995)
Case study B Sweet dreams and crude reality: evaluating Dunn et al. (2005) on Austronesian and Papuan
What does it take for a match between two trees to be "close"? A crude test of how well two trees match is the Robinson-Foulds distance, or "symmetric difference". This is a count of how many nodes are in one tree but not in the other. First tree A is compared to tree B, then tree B to tree A, and the result is divided by two (implemented in TreeDist.exe in the PHYLIP package, among others).
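The count described above can be sketched in a few lines of Python. This is only an illustration, not the TreeDist implementation: it assumes rooted trees written as nested tuples of leaf names and compares clades, whereas programs such as TreeDist may treat trees as unrooted, so their numbers can differ. The names `clades` and `halved_rf` are made up for this sketch.

```python
def clades(tree):
    """Collect every inner node of a nested-tuple tree as a frozenset of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):          # a leaf is a plain string
            return frozenset([node])
        under = frozenset().union(*(leaves(child) for child in node))
        found.add(under)                   # record the clade below this inner node
        return under

    all_leaves = leaves(tree)
    found.discard(all_leaves)              # the root clade is shared by every tree
    return found


def halved_rf(tree_a, tree_b):
    """Nodes in A but not B, plus nodes in B but not A, divided by two."""
    a, b = clades(tree_a), clades(tree_b)
    return (len(a - b) + len(b - a)) / 2


# Two toy trees that differ only in how A, B and C are grouped.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
print(halved_rf(tree_a, tree_b))  # 1.0
```

In the toy example each tree has exactly one clade the other lacks ({A, B} versus {A, C}), so the halved count is 1.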
The distance between the "traditional" and the "typological" Austronesian trees is 4. Now we may ask: if we generate 10,000 random trees with 16 taxa, how likely are you to draw a random pair from this pool that has 4 or fewer differences? I carried out this test (in collaboration with Mihai Albu, who generated the trees, and Thomas Mailund, who ran the trees through his program, which is similar to TreeDist.exe).
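A test of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the procedure actually used: the slides do not say how the random trees were generated, so the random-joining model in `random_tree` is an assumption, as are the function names, the number of sampled pairs, and the rooted-clade comparison.

```python
import random


def clades(tree):
    """Inner nodes of a nested-tuple tree as frozensets of leaf names (root excluded)."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        under = frozenset().union(*(leaves(child) for child in node))
        found.add(under)
        return under

    found.discard(leaves(tree))
    return found


def random_tree(taxa, rng):
    """Assumed model: repeatedly join two randomly chosen subtrees until one tree is left."""
    nodes = list(taxa)
    while len(nodes) > 1:
        i, j = rng.sample(range(len(nodes)), 2)
        merged = (nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)]
        nodes.append(merged)
    return nodes[0]


def estimate_p(n_taxa=16, n_trees=10_000, n_pairs=20_000, threshold=4, seed=0):
    """Share of random tree pairs whose halved difference count is <= threshold."""
    rng = random.Random(seed)
    taxa = [f"t{i}" for i in range(n_taxa)]
    pool = [clades(random_tree(taxa, rng)) for _ in range(n_trees)]
    hits = 0
    for _ in range(n_pairs):
        a, b = rng.sample(pool, 2)          # draw a random pair from the pool
        if len(a ^ b) / 2 <= threshold:     # symmetric difference, divided by two
            hits += 1
    return hits / n_pairs
```

Calling `estimate_p()` gives an estimate for 16 taxa and a threshold of 4 under this particular tree model; a different way of generating random trees would give a different figure.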
The conclusion seems to be in favor of Dunn et al., but... the time depth of the Melomelanesian subgroup of Austronesian is very shallow, perhaps 1000 years or so (this is to be checked). The time depth of the Papuan group, if it exists at all, could be 10 times as large. How well does a method work at such a time depth if it only barely works at a shallow level?
On a more optimistic note: If the exact same dataset that Dunn et al. used (supplied online along with their paper) is subjected to a Bayesian analysis, the Robinson-Foulds distance is down to 3! (Thanks to Arpiar Saundars for carrying out the analysis)
Figure: Traditional tree; tree produced by Bayesian analysis of typological data
The probability of a Robinson-Foulds distance of 3 is around 0.01
Intermediate conclusion Given that a reasonably good tree can be obtained by using typological data, the method could perhaps work. And it could work even better using an adequate algorithm...
A little problem not to be overlooked: Hm, low bootstrap values...
Bootstrap values of all inner nodes
0.221   1 3 6 7 8 9 13 15
0.274   1 7 8 9
0.308   1 3 6 7 8 9 13 14 15
0.362   1 3 5 6 7 8 9 13 14 15
0.433   1 3 7 8 9
0.506   1 3 4 5 6 7 8 9 10 13 14 15
0.524   1 2 3 4 5 7 8 9 10 11 12 14
0.596   1 2 3 5 6 7 8 9 11 12 13 14 15
0.661   1 2 3 4 5 6 8 10 11 12 13 14 15
0.673   1 7 9
0.701   1 3 4 5 6 7 8 9 10 12 13 14 15
0.939   1 2 3 4 5 7 8 9 10 11 12 14 15
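A quick tally of the values listed above shows how many inner nodes fall short of a 90% cut-off. The numbers are copied from the slide. The 0.90 threshold is the one the lecture applies to these values, and the variable names are just for this sketch.

```python
# Bootstrap support of the 12 inner nodes, as listed on the slide.
bootstrap = [0.221, 0.274, 0.308, 0.362, 0.433, 0.506,
             0.524, 0.596, 0.661, 0.673, 0.701, 0.939]

threshold = 0.90
weak = [value for value in bootstrap if value < threshold]

print(f"{len(weak)} of {len(bootstrap)} inner nodes fall below {threshold:.2f}")
# 11 of 12 inner nodes fall below 0.90
```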
What have Dunn et al. accomplished?
• They are the first to have published phylogenetic trees using typological data as input
• They have produced a nice dataset, including new data from fieldwork
BUT
• The comparison between an Austronesian tree based on the comparative method and one based on typological data is not carried out in a rigorous manner
• The algorithm used (Maximum Parsimony) is the worst one available
• The data are organised in binary variables, which is the worst possible way because the chance factor increases as the possible number of values of a feature decreases
• They argue that a fit between the proposed phylogeny and geographical patterns is in favor of the proposed phylogeny being real and not due to diffusion. But diffused items are precisely what we expect to pattern geographically. And actually the fit is poor.
• They ask a program to produce a tree. It obeys. But it also produces bootstrap values where 11 out of 12 inner nodes are below or way below 90%. This is a tree that doesn't want to be a tree. Yet they accept it at face value.
CONCLUSION (1)
• Nothing substantial has been accomplished, neither methodologically nor empirically
CONCLUSION (2)
• Don't believe everything you read in Science and, trust me, don't necessarily trust people who work at Max Planck institutes
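The point about binary variables can be made concrete with a small calculation. This is a sketch under a simplifying assumption that is not in the slides: two unrelated languages each pick a feature value independently, with the probabilities given by the value frequencies. The function name `chance_agreement` is made up for the illustration.

```python
def chance_agreement(frequencies):
    """Probability that two independent draws land on the same feature value."""
    total = sum(frequencies)
    return sum((f / total) ** 2 for f in frequencies)


# Equally frequent values: the fewer the values, the more agreement by chance alone.
for k in (2, 3, 5, 10):
    print(k, round(chance_agreement([1] * k), 3))
```

With equally frequent values the chance of agreement is 1/k, so a binary feature matches by accident half of the time, and a skewed binary feature matches even more often.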
Case study C Let's dream on... Towards a subgrouping of proto-New World
Step 1 Make a selection of languages belonging to the West Coast, as defined by speakers being dependent on the Pacific for subsistence or navigating on it. Assumption: there could be a group within the New World family which is mostly confined to the Pacific Coast. The list: Haida, Squamish, Makah, Quileute, Coos (Hanis), Karok, Wappo, Maricopa, Huave, Quechua, Aymara, Epena Pedee, Awa Pit, Mapudungun, Qawasqar
Step 2 Find out whether there are traits among the American founder traits that are significantly better represented in this group of languages. Result: two traits: fusion of Agent and Patient markers; inflectional synthesis of the verb: 8-9 categories per word.
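The slides do not say which significance test was used in Step 2. One possible way to check whether a trait is over-represented in a group is a one-sided Fisher exact test, sketched here with the standard library only. The function name and the counts in the usage example are invented for illustration and are not data from the lecture.

```python
from math import comb


def fisher_one_sided(in_group_with, in_group_total, all_with, all_total):
    """P(at least `in_group_with` trait-bearing languages in the group) if the
    group were a random sample of size `in_group_total` from all languages."""
    max_k = min(in_group_total, all_with)
    ways = sum(
        comb(all_with, k) * comb(all_total - all_with, in_group_total - k)
        for k in range(in_group_with, max_k + 1)
    )
    return ways / comb(all_total, in_group_total)


# Hypothetical counts: 9 of 15 coastal languages have the trait,
# against 20 of 100 languages in the whole sample.
p_value = fisher_one_sided(9, 15, 20, 100)
print(p_value)
```

A small value would suggest the trait is better represented in the group than chance alone would predict; with many traits tested, some correction for multiple comparisons would also be needed.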
Step 3 Extend the set of Pacific languages to Pacific-Style languages by the criterion of sharing one of the two "significantly Pacific" features
Step 4 Reduce the set by removing languages that don't share at least 25% of all WALS features that have a significantly Pacific distribution
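Steps 3 and 4 amount to a two-stage filter, which might look like the sketch below. Everything here is illustrative: the language names, feature names, and values are placeholders, and the choice to count a missing feature as "not shared" is an assumption the slides do not settle.

```python
def pacific_style(languages, pacific_values, key_features, min_share=0.25):
    """Keep languages that (Step 3) show at least one key feature and
    (Step 4) share at least `min_share` of all significantly Pacific features."""
    kept = []
    for name, feats in languages.items():
        # Step 3: at least one of the key "significantly Pacific" features.
        if not any(feats.get(f) == pacific_values[f] for f in key_features):
            continue
        # Step 4: share of Pacific feature values present (missing = not shared).
        shared = sum(feats.get(f) == v for f, v in pacific_values.items())
        if shared / len(pacific_values) >= min_share:
            kept.append(name)
    return kept


# Placeholder data: 8 "Pacific" feature values, 2 of which are the key features.
pacific_values = {"fusion_AP": "yes", "verb_synthesis": "8-9", "f3": "x",
                  "f4": "y", "f5": "z", "f6": "w", "f7": "v", "f8": "u"}
key_features = ["fusion_AP", "verb_synthesis"]
languages = {
    "LangA": {"fusion_AP": "yes", "f3": "x"},     # key feature, 2/8 shared
    "LangB": {"fusion_AP": "yes"},                # key feature, only 1/8 shared
    "LangC": {"f3": "x", "f4": "y", "f5": "z"},   # no key feature
}
print(pacific_style(languages, pacific_values, key_features))  # ['LangA']
```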
Step 5 Make a classification of Pacific-Style languages, using many WALS features (here 96 features)
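The slides do not state how the classification in Step 5 was computed. A common starting point for a typological classification is a pairwise distance matrix, sketched here as the proportion of differing values among the features coded for both languages. The function names and the toy data are assumptions for illustration only. A tree-building method would then be run on such a matrix.

```python
from itertools import combinations


def typological_distance(a, b):
    """Share of differing values among features coded for both languages."""
    shared = [f for f in a if f in b]
    if not shared:
        return None                      # no comparable features
    return sum(a[f] != b[f] for f in shared) / len(shared)


def distance_matrix(languages):
    """Distances for every unordered pair of languages, keyed by name pair."""
    return {
        (x, y): typological_distance(languages[x], languages[y])
        for x, y in combinations(sorted(languages), 2)
    }


# Toy data with placeholder names; a real run would use the 96 WALS features.
toy = {
    "L1": {"f1": 1, "f2": 2, "f3": 1},
    "L2": {"f1": 1, "f2": 1, "f3": 1},
    "L3": {"f1": 2, "f2": 1},
}
for pair, dist in distance_matrix(toy).items():
    print(pair, dist)
```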
Step 6 Fiddle a bit further, and interesting patterns emerge (in what follows, Haida is excluded)
Conclusion A knowledge of ancestral states at the root of the tree can significantly improve subgrouping. Such "founder traits" also lend more credibility to a phylogeny. To be able to argue for new genealogical relations by using typological data we need either (1) strongly supported roots, involving comparison with languages of the rest of the world, or (2) strong internal statistical support such as high bootstrap values. Preferably we should have both. There is light at the end of the tunnel.
Thanks. Keep in touch: wichmann@eva.mpg.de