
Modelling language evolution

Modelling language evolution. Tandy Warnow The University of Texas at Austin. Species phylogeny. From the Tree of the Life Website, University of Arizona. Orangutan. Human. Gorilla. Chimpanzee. Possible Indo-European tree (Ringe, Warnow and Taylor 2000).


Presentation Transcript


  1. Modelling language evolution Tandy Warnow The University of Texas at Austin

  2. Species phylogeny • From the Tree of the Life Website, University of Arizona • Orangutan, Human, Gorilla, Chimpanzee

  3. Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

  4. Controversies for Indo-European history • Subgrouping: Other than the 10 major subgroups, what is likely to be true? In particular, what about • Italo-Celtic, • Greco-Armenian, • Anatolian + Tocharian, • Satem Core?

  5. This talk • Empirical evidence of how estimated phylogenies depend upon both the data and the method - and can be wrong • Models of language evolution (from the earliest ones to more recent ones), why we need them, and what we still need to do. Note: simulations and estimation methods both depend upon model assumptions! • Results of simulation studies based upon some new models • Comments

  6. Nakhleh et al., Transactions of the Philological Society 2005 • Methods studied: UPGMA (lexico-statistics), neighbor joining, maximum parsimony, maximum compatibility, weighted MP, weighted MC, and Gray & Atkinson. • Datasets: four versions of the Ringe & Taylor IE data (lexical, morphological, and phonological characters): lexical only vs. all characters, screened vs. unscreened. • Observations: • UPGMA (lexico-statistics) does the worst: it splits known subgroups. • Other than UPGMA, all methods reconstruct the ten major subgroups, Anatolian + Tocharian, and Greco-Armenian. Nothing else is consistently reconstructed. • When using lexical data only, all methods group Italic, Celtic, and Germanic together. • Some methods (not all) will reconstruct different trees on different datasets. Screening datasets to remove obvious homoplasy can result in better (?) trees.

  7. Question: how to determine which phylogenies are reliable? • Data: need high quality data! • Phylogenetic reconstruction methods need to be tested before being trusted! Examples of possible tests: • Benchmark real datasets (need good benchmarks! Are there any?) • Simulated datasets (need good models!)

  8. Simulation study (cartoon)

  9. Simulation study (cartoon) • FN: false negative (missing edge) • FP: false positive (incorrect edge) • 50% error rate in the example
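The FN and FP rates on this slide can be sketched as set differences between the edges of the true tree and the edges of the estimated tree. This is a minimal sketch, not the study's own code: it assumes each internal edge is represented by the bipartition it induces on the leaves, that FN is normalised by the number of true edges and FP by the number of estimated edges, and the five-leaf trees are made up so that both rates come out to 50%. The names `error_rates` and `split` are hypothetical.

```python
def error_rates(true_edges, estimated_edges):
    """Return (FN rate, FP rate) for two collections of internal edges.

    Each edge is a bipartition of the leaf set, stored as a frozenset
    containing its two sides (each side is itself a frozenset of leaves).
    """
    true_edges, estimated_edges = set(true_edges), set(estimated_edges)
    missing = true_edges - estimated_edges      # false negatives
    incorrect = estimated_edges - true_edges    # false positives
    # Assumed normalisation: FN over true edges, FP over estimated edges.
    fn_rate = len(missing) / len(true_edges) if true_edges else 0.0
    fp_rate = len(incorrect) / len(estimated_edges) if estimated_edges else 0.0
    return fn_rate, fp_rate


def split(side, leaves):
    """Build the bipartition of `leaves` that has `side` on one side."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})


leaves = "ABCDE"
# Made-up example trees, each with two internal edges.
true_tree = {split("AB", leaves), split("ABC", leaves)}
estimated_tree = {split("AB", leaves), split("ABD", leaves)}

print(error_rates(true_tree, estimated_tree))  # (0.5, 0.5)
```

Storing both sides of each bipartition makes the comparison independent of which side happens to be listed first.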

  10. Modelling language evolution • Models of evolution allow reconstruction methods to be evaluated in simulation. This allows us to understand the conditions under which each method will perform well. • Models of evolution (for simulation purposes) need to reflect good scholarship, and should be able to reproduce the properties of real data. • Models of evolution are also present in estimation methods, whether explicitly (as in ML or Bayesian) or implicitly.

  11. Issues in modelling language evolution • Character evolution model. • Variation between characters. • Cladogenesis model: tree vs. network vs. dialect continuum?

  12. Modelling the evolution of single linguistic characters • Types of linguistic characters: • Phonological (sound changes) • Lexical (meanings based on a wordlist) • Morphological (especially inflectional) • Modelling issues: state space, lexical clock, homoplasy, and polymorphism • Easy: the lexical clock is not believed, and most linguistic characters have an infinite number of possible states. • More interesting: homoplasy, polymorphism, and variation between characters.

  13. Homoplasy-free evolution • When a character changes state, it changes to a new state not in the tree • In other words, there is no homoplasy (character reversal or parallel evolution) • First inferred for weird innovations in phonological characters and morphological characters in the 19th century. (Figure: tree with nodes labelled by states 0 and 1.)

  14. Lexical characters can also evolve without homoplasy • For every cognate class, the nodes of the tree in that class should form a connected subset, as long as there is neither undetected borrowing nor parallel semantic shift. • However, our research suggests that ~15% of lexical characters evolve homoplastically. (Figure: tree with nodes labelled by states 0, 1 and 2.)
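The connectedness condition on slides 13 and 14 can be checked directly when the state of every node (leaves and internal nodes) is known, as it is in a simulation. The following is a minimal sketch under that assumption; the function names, the six-edge tree, and the two state assignments are made up for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency map for a tree given as (node, node) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def is_homoplasy_free(adjacency, state):
    """True if, for every state, the nodes in that state are connected."""
    nodes_by_state = defaultdict(set)
    for node, s in state.items():
        nodes_by_state[s].add(node)
    for members in nodes_by_state.values():
        # Depth-first search restricted to nodes sharing this state.
        start = next(iter(members))
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for neighbour in adjacency[node]:
                if neighbour in members and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if seen != members:
            return False
    return True


# Hypothetical tree: root r, internal nodes u and v, leaves a, b, c, d.
adjacency = build_adjacency(
    [("r", "u"), ("r", "v"), ("u", "a"), ("u", "b"), ("v", "c"), ("v", "d")]
)
clean = {"r": 0, "u": 1, "a": 1, "b": 1, "v": 0, "c": 0, "d": 2}
parallel = {"r": 0, "u": 0, "a": 1, "b": 0, "v": 0, "c": 1, "d": 0}

print(is_homoplasy_free(adjacency, clean))     # True
print(is_homoplasy_free(adjacency, parallel))  # False: state 1 arises twice
```

With real data only the leaf states are observed, so deciding whether some labelling of the internal nodes makes a character homoplasy-free is a different (compatibility) problem that this sketch does not address.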

  15. Polymorphism • Polymorphism means two or more states exhibited by the same language for a character. • Most common examples are lexical: two or more words for the same basic meaning. Examples: big/large, little/small, rock/stone. • Lexical polymorphism results primarily from semantic shift, but polymorphism due to borrowing also occurs. • Incidence: lexical polymorphism is very common but transient (almost all polymorphisms are lost within a millennium). Less frequent for other types of characters.

  16. Modelling variation between characters: Rates-across-sites • If a site (i.e., character) is twice as fast as another on one edge, it is twice as fast everywhere. (Figure: two trees on leaves A, B, C, D.)

  17. Modelling variation between characters: The no-common-mechanism model • In this model, there is a separate random variable for every combination of site and edge: the underlying tree is fixed, but otherwise there are no constraints on variation between sites. (Figure: two trees on leaves A, B, C, D.)
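The difference between the two models on slides 16 and 17 can be sketched as two ways of filling a site-by-edge table of rates. This is an illustration only: the edge lengths and multipliers are made up, the exponential distribution used for the no-common-mechanism table is an arbitrary assumption, and both function names are hypothetical.

```python
import random


def rates_across_sites(edge_lengths, site_multipliers):
    """One multiplier per site, shared by all edges: rate = multiplier * length."""
    return [[m * length for length in edge_lengths] for m in site_multipliers]


def no_common_mechanism(n_sites, n_edges, rng):
    """An independent rate for every (site, edge) pair.

    The exponential distribution here is an assumption made for the sketch;
    the model itself places no constraint on how these values arise.
    """
    return [[rng.expovariate(1.0) for _ in range(n_edges)] for _ in range(n_sites)]


edge_lengths = [0.1, 0.3, 0.05]  # made-up edge lengths
ras = rates_across_sites(edge_lengths, site_multipliers=[1.0, 2.0])
ncm = no_common_mechanism(n_sites=2, n_edges=3, rng=random.Random(0))

# Under rates-across-sites, site 1 is twice as fast as site 0 on every edge.
print([fast / slow for slow, fast in zip(ras[0], ras[1])])
# Under no-common-mechanism, the ratio is free to differ from edge to edge.
print([fast / slow for slow, fast in zip(ncm[0], ncm[1])])
```

The rates-across-sites table has one free parameter per site plus one per edge, whereas the no-common-mechanism table has one per cell, which is why the latter is so much less constrained.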

  18. Homoplasy-free models without polymorphism • The earliest models were all tree models that were homoplasy-free and obeyed the lexical clock. • Ringe-Warnow: “PP” (perfect phylogeny, i.e., a homoplasy-free, no-common-mechanism, non-parametric tree model)

  19. Cladogenesis • The “speciation” model ranges from trees all the way to dialect continuums. Intermediate models include horizontal transfer (borrowing) and hybridization (creoles).

  20. Modelling borrowing: Networks and Trees within Networks

  21. Perfect Phylogenetic Network model • Nakhleh et al. Perfect Phylogenetic Network (PPN) model: all characters evolve without homoplasy down a tree contained within the network. Published in Language, 2005. • Warnow-Evans-Ringe-Nakhleh (2004): extends PPN model to allow for limited and identifiable homoplasy.

  22. “Perfect Phylogenetic Network” for IE (Nakhleh et al., Language 2005)

  23. What about polymorphism? • Our first model of polymorphism (Bonet et al., 1996) was a non-parametric, no-common-mechanism model for homoplasy-free characters, with polymorphism due to semantic shift. • Three problems: (1) because it is non-parametric, it cannot be used for simulation; (2) homoplasy is fairly frequent for lexical characters (15% of characters); (3) what about polymorphism due to borrowing?

  24. Nicholls and Gray model for polymorphism • Geoff Nicholls and Russell Gray (2006): homoplasy-free, rates-across-sites, parametric model in which the character adds and loses states under a stochastic process. The number of states in a lineage can go up and down (including down to 0 and then back up). • Problems: (1) homoplasy is frequent in lexical characters; (2) what is the linguistic process?
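The behaviour described on this slide (states gained and lost along a lineage, with no state ever gained twice) can be illustrated with a toy discrete-time process. This is not the published model or its parameterisation: the per-step gain and loss probabilities, the starting state, and the function name `simulate_lineage` are all assumptions made for the sketch.

```python
import random
from itertools import count


def simulate_lineage(n_steps, p_gain, p_loss, rng):
    """Toy gain/loss process for the set of states held by one lineage.

    Each step, every current state is lost with probability `p_loss`, and
    one brand-new state is gained with probability `p_gain`. New states get
    fresh identifiers, so a lost state can never reappear (no homoplasy).
    """
    fresh = count()
    current = {next(fresh)}          # start with a single state
    history = [set(current)]
    for _ in range(n_steps):
        current = {s for s in current if rng.random() >= p_loss}
        if rng.random() < p_gain:
            current.add(next(fresh))
        history.append(set(current))
    return history


history = simulate_lineage(n_steps=200, p_gain=0.1, p_loss=0.05, rng=random.Random(1))
sizes = [len(states) for states in history]
print(min(sizes), max(sizes))  # the number of states goes up and down over time
```

A lineage holding two or more states at a given step is polymorphic at that step, and one holding none has lost the character entirely until a new state is gained.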

  25. What needs to be done in modelling • We need parametric models of character evolution that include reasonable levels of homoplasy, in which polymorphism arises due to semantic shift (conflation of two characters), by borrowing, or due to other linguistic processes. • We also need cladogenesis models that incorporate population-level processes, and can represent dialect continuums.

  26. Simulation study (Barbancon et al.) • Simulated evolution down networks with 30 leaves, three contact edges, and with moderate levels of homoplasy and borrowing for 300 lexical characters and 60 morphological characters. • Compared trees constructed by various methods to the “genetic tree” contained in the network, for topological accuracy. • Methods compared: NJ, UPGMA, weighted and unweighted MP and MC.

  27. Standard Model Conditions • Screened dataset: lexical characters 4% homoplastic, 10% evolve with borrowing; morphological characters with neither homoplasy nor borrowing • Unscreened dataset: lexical characters 20% homoplastic, 20% borrowed; morphological characters 5% homoplastic, no borrowing • Molecular clock for the cladogenesis model • No-common-mechanism model with moderate variation between characters • Lexical weight=1, morphological weight=50
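The effect of the character weights on the last bullet can be sketched with a weighted parsimony score, using Fitch's algorithm to count the minimum number of state changes per character. This is a minimal sketch, not the study's implementation: it assumes rooted binary trees written as nested tuples with string leaf names, and the two characters and both candidate trees are made up.

```python
def fitch_changes(tree, leaf_state):
    """Minimum number of state changes for one character (Fitch's algorithm)."""
    def visit(node):
        if isinstance(node, str):                 # leaf: its observed state
            return {leaf_state[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:                                # children can agree: no change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def weighted_parsimony(tree, characters):
    """Sum of weight * changes over (weight, leaf_state) pairs."""
    return sum(w * fitch_changes(tree, states) for w, states in characters)


# Made-up characters: one lexical (weight 1), one morphological (weight 50).
characters = [
    (1, {"A": 0, "B": 1, "C": 0, "D": 1}),
    (50, {"A": 0, "B": 0, "C": 1, "D": 1}),
]
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(weighted_parsimony(tree_1, characters))  # 52
print(weighted_parsimony(tree_2, characters))  # 101
```

With equal weights the two trees would tie at three changes each; the weight of 50 lets the single morphological character decide between them.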

  28. Simulation study (cartoon) • FN: false negative (missing edge) • FP: false positive (incorrect edge) • 50% error rate in the example

  29. Clocklike data

  30. Clocklike data

  31. Points • Screening the data helps to improve the phylogenetic accuracy of most methods. • When data are generated under network models, methods which reconstruct trees do not perform well. • Modelling helps us predict the conditions under which different methods will perform well, or poorly. The more accurate the models, the more relevant the predictions. We need better models!

  32. Future research • Testing other methods in simulation (including some network construction methods) • Formulating improved (more realistic) models of language evolution • Implementing simulation tools under these improved models • Developing estimation methods under these improved models • Reanalyzing IE, and looking at some new families (or subfamilies)

  33. Acknowledgements • Funding: NSF, the David and Lucile Packard Foundation, the Radcliffe Institute for Advanced Study, the Program for Evolutionary Dynamics at Harvard, and the Institute for Cellular and Molecular Biology at UT-Austin. • Collaborators: Don Ringe, Steve Evans, Luay Nakhleh, and Francois Barbancon.

  34. For more information • Please see the Computational Phylogenetics for Historical Linguistics web site for papers, data, and additional material http://www.cs.rice.edu/~nakhleh/CPHL

  35. Differences between characters • Lexical: most easily borrowed (most borrowings are detectable), and homoplasy is relatively frequent (we estimate about 25-30% overall for our wordlist, but a much smaller percentage for basic vocabulary). Lexical characters also have a high incidence (80%) of transient polymorphism. • Phonological: can still be borrowed, but much less readily than lexical characters. Complex phonological characters are infrequently (if ever) homoplastic, although simple phonological characters are very often homoplastic. • Morphological: least easily borrowed, least likely to be homoplastic. Rarely polymorphic.
