Statistical NLP Spring 2010

Statistical NLP, Spring 2010. Lecture 25: Diachronics. Dan Klein, UC Berkeley.


Presentation Transcript


  1. Statistical NLP, Spring 2010. Lecture 25: Diachronics. Dan Klein, UC Berkeley

  2. Evolution: Main Phenomena
     • Mutations of sequences (over time)
     • Speciation (over time)

  3. Tree of Languages
     • Challenge: identify the phylogeny
     • Much work in biology, e.g. work by Warnow, Felsenstein, Steele…
     • Also in linguistics, e.g. Warnow et al., Gray and Atkinson…
     • Image source: http://andromeda.rutgers.edu/~jlynch/language.html

  4. Statistical Inference Tasks
     • Inputs: modern text
     • Outputs:
       • Ancestral word forms (e.g. focus, from fuego and feu)
       • Cognate groups / translations (e.g. fuego, oeuf, huevo, feu)
       • Phylogeny (e.g. over FR, IT, PT, ES)
       • Grammatical inference (e.g. les faits sont très clairs)

  5. Outline
     • Ancestral word forms (e.g. focus, from fuego and feu)
     • Cognate groups / translations (e.g. fuego, oeuf, huevo, feu)
     • Grammatical inference (e.g. les faits sont très clairs)

  6. Language Evolution: Sound Change
     • Latin camera /kamera/ became French chambre /ʃambʁ/
       • Deletion: /e/
       • Change: /k/ → /tʃ/ → /ʃ/
       • Insertion: /b/
     • Eng. camera comes from Latin (“camera obscura”)
     • Eng. chamber comes from Old French, borrowed before the initial /t/ dropped

  7. Diachronic Evidence
     • Yahoo! Answers [2009]: “tonight not tonite”
     • Appendix Probi [ca. 300]: “tonitru non tonotru”

  8. Synchronic (Comparative) Evidence

  9. Simple Model: Single Characters [Figure: nodes each labeled with a single character, C or G]

  10. A Model for Word Forms [BGK 07] [Figure: tree with nodes LA focus /fokus/, IB /fogo/, IT fuoco /fwɔko/, ES fuego /fweɣo/, PT fogo /fogo/]

  11. Contextual Changes [Figure: /fokus/ and /fwɔko/ aligned; after the context # f, the /o/ of /fokus/ corresponds to /w ɔ/ in /fwɔko/]

  12. Changes are Systematic [Figure: the same tree shown for two words]
     • focus: /fokus/, /fogo/, /fwɔko/, /fweɣo/, /fogo/
     • centrum: /kentrum/, /sentro/, /tʃɛntro/, /sentro/, /sentro/

  13. Experimental Setup
     • Data sets
       • Small: Romance
         • French, Italian, Portuguese, Spanish (FR, IT, PT, ES)
         • 2344 words
         • Complete cognate sets
         • Target: (Vulgar) Latin
       • Large: Oceanic
         • 661 languages
         • 140K words
         • Incomplete cognate sets
         • Target: Proto-Oceanic [Blust, 1993]

  14. Data: Romance

  15. Learning: Objective [Figure: word-form tree with /fokus/, /fogo/, /fwɔko/, /fweɣo/, /fogo/]

  16. Learning: EM
     • M-Step
       • Find parameters which fit (expected) sound change counts
       • Easy: gradient ascent on theta
     • E-Step
       • Find (expected) change counts given parameters
       • Hard: variables are string-valued

  17. Computing Expectations [Holmes 01, BGK 07]
     • Standard approach, e.g. [Holmes 2001]: Gibbs sampling each sequence
     • [Figure: example for ‘grass’]

  18. A Gibbs Sampler ‘grass’

  19. A Gibbs Sampler ‘grass’

  20. A Gibbs Sampler ‘grass’

  21. Getting Stuck. How to jump to a state where the liquids /r/ and /l/ have a common ancestor?

  22. Getting Stuck

  23. Solution: Vertical Slices [BGK 08]. Single sequence resampling vs. ancestry resampling

  24. Details: Defining “Slices”
     • The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences
     • Correct construction section(G) is non-trivial but very efficient

  25. Results: Alignment Efficiency
     • Is ancestry resampling faster than basic Gibbs?
     • Hypothesis: larger gains for deeper trees
     • Setup: fixed wall time; synthetic data, same parameters
     • [Plot: Sum of Pairs vs. depth of the phylogenetic tree]

  26. Results: Romance

  27. Learned Rules / Mutations

  28. Learned Rules / Mutations

  29. Comparison to Other Methods
     • Comparison to system from [Oakes, 2000]
       • Uses exact inference and deterministic rules
     • Reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975]
     • Evaluation metric: edit distance from a reconstruction made by a linguist (lower is better)
     • [Chart: Oakes vs. Us]
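
     The evaluation metric named on this slide, edit distance to a linguist's reconstruction, can be sketched as follows. This is a minimal illustration, not the evaluation code behind the reported results: the function names are hypothetical, unit costs for insertion, deletion and substitution are an assumption, and each Python character is treated as one phoneme symbol (so an affricate such as tʃ would need to be passed as one element of a list).

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of phonemes)."""
    prev = list(range(len(b) + 1))           # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                            # cost of deleting the first i symbols of a
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # delete x
                curr[j - 1] + 1,              # insert y
                prev[j - 1] + (x != y),       # substitute (free when symbols match)
            ))
        prev = curr
    return prev[-1]


def mean_edit_distance(pairs):
    """Average distance over (system reconstruction, reference reconstruction) pairs."""
    return sum(edit_distance(r, g) for r, g in pairs) / len(pairs)


# Illustrative inputs only: forms taken from the slides, not real system output.
print(edit_distance("fogo", "fokus"))                                   # 3
print(mean_edit_distance([("fokus", "fokus"), ("fogo", "fokus")]))      # 1.5
```

     Lower values mean the automatic reconstruction is closer to the reference form.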

  30. Data: Oceanic Proto-Oceanic

  31. Data: Oceanic http://language.psy.auckland.ac.nz/austronesian/research.php

  32. Result: More Languages Help. Distance from [Blust, 1993] reconstructions [Plot: mean edit distance vs. number of modern languages used]

  33. Results: Large Scale

  34. Visualization: Learned universals • *The model did not have features encoding natural classes

  35. Regularity and Functional Load
     • In a language, some pairs of sounds are more contrastive than others (higher functional load)
     • Example: English “p”/“d” versus “t”/“th”
       • “p”/“d”: pot/dot, pin/din, dress/press, pew/dew, ...
       • “t”/“th”: thin/tin
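
     One rough way to make "more contrastive" concrete is to count minimal pairs: word pairs that differ only in the two sounds being compared. The sketch below does this for the slide's example words, whose pairs (pot/dot, pin/din, ...) contrast /p/ with /d/. It is a simplified proxy and not the functional load measure of [King, 67] used later in the lecture; the function name, the toy lexicon, and the use of spelling in place of phonemic transcription are all assumptions made for illustration.

```python
def minimal_pairs(lexicon, a, b):
    """Return word pairs that differ only by sound `a` in one word and `b` in the other.

    Each word is a tuple of sound symbols, so multi-letter symbols like "th" work.
    """
    words = set(lexicon)
    pairs = set()
    for w in words:
        for i, sound in enumerate(w):
            if sound == a:
                other = w[:i] + (b,) + w[i + 1:]   # swap a -> b at position i
                if other in words:
                    pairs.add((w, other))
    return sorted(pairs)


# Toy lexicon built from the slide's example words (spelling stands in for phonemes).
lexicon = [tuple(w) for w in "pot dot pin din pew dew press dress tin".split()]
lexicon.append(("th", "i", "n"))                   # "thin", with "th" as one symbol

print(len(minimal_pairs(lexicon, "p", "d")))       # 4
print(len(minimal_pairs(lexicon, "t", "th")))      # 1
```

     On this toy lexicon the first contrast separates four word pairs and the second only one, which is the asymmetry the slide points at.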

  36. Functional Load: Timeline
     • Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]
     • Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67]
     • Caveat: only four languages were used in King’s study [Hockett 67; Surendran et al., 06]
     • Our work: we reexamined the question with two orders of magnitude more data [BGK, under review]

  37. Regularity and Functional Load
     • Data: only 4 languages from the Austronesian data
     • [Plot: merger posterior probability vs. functional load as computed by [King, 67]; each dot is a sound change identified by the system]

  38. Regularity and Functional Load
     • Data: all 637 languages from the Austronesian data
     • [Plot: merger posterior probability vs. functional load as computed by [King, 67]]

  39. Outline
     • Ancestral word forms (e.g. focus, from fuego and feu)
     • Cognate groups / translations (e.g. fuego, oeuf, huevo, feu)
     • Grammatical inference (e.g. les faits sont très clairs)

  40. Cognate Groups [Figure: word forms to be grouped into cognate sets, one set labeled ‘fire’: /fwɔko/, /vɛrbo/, /tʃɛntro/, /sentro/, /berβo/, /fweɣo/, /vɛrbo/, /fogo/, /sɛntro/]

  41. Model: Cognate Survival [Figure: tree over LA, IB, IT, ES, PT with survival marks; omnis (+) at LA and ogni (+) at IT, with (-) at the other nodes]

  42. Results: Grouping Accuracy [Hall and Klein, in submission] [Chart: fraction of words correctly grouped, by method]

  43. Semantics: Matching Meanings
     • EN day occurs with: “night”, “sun”, “week”
     • EN tag occurs with: “name”, “label”, “along”
     • DE tag occurs with: “nacht”, “sonne”, “woche”
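
     The idea on this slide, that German "tag" should match English "day" by context even though it matches English "tag" by form, can be sketched with a simple context-overlap score. This is only an illustration under stated assumptions: the seed lexicon, the function names, and the choice of Jaccard overlap are hypothetical and are not taken from the lecture, which does not specify the matching model.

```python
def context_overlap(ctx_a, ctx_b):
    """Jaccard overlap between two collections of context words."""
    a, b = set(ctx_a), set(ctx_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(foreign_ctx, candidates, seed):
    """Pick the candidate whose contexts best overlap the translated foreign contexts."""
    translated = [seed[w] for w in foreign_ctx if w in seed]
    return max(candidates, key=lambda w: context_overlap(translated, candidates[w]))


# Context words from the slide; the seed lexicon is a hypothetical starting dictionary.
en_contexts = {
    "day": ["night", "sun", "week"],
    "tag": ["name", "label", "along"],
}
de_tag_contexts = ["nacht", "sonne", "woche"]
seed = {"nacht": "night", "sonne": "sun", "woche": "week"}

print(best_match(de_tag_contexts, en_contexts, seed))   # day
```

     A fuller system would combine a context score like this with a form-based score, since cognates often do look alike.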

  44. Outline
     • Ancestral word forms (e.g. focus, from fuego and feu)
     • Cognate groups / translations (e.g. fuego, oeuf, huevo, feu)
     • Grammatical inference (e.g. les faits sont très clairs)

  45. Grammar Induction
     • Task: given sentences, infer grammar (and parse tree structures)
     • Example sentences: “congress narrowly passed the amended bill”; “les faits sont très clairs”

  46. Shared Prior [Figure: the example sentences “congress narrowly passed the amended bill” and “les faits sont très clairs”]

  47. Results: Phylogenetic Prior
     • Avg. rel. gain: 29%
     • [Figure: tree with internal nodes GL, IE, G, RM, WG, NG over English, Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese]

  48. Conclusion
     • Phylogeny-structured models can:
       • Accurately reconstruct ancestral words
       • Give evidence to open linguistic debates
       • Detect translations from form and context
       • Improve language learning algorithms
     • Lots of questions still open:
       • Can we get better phylogenies using these high-res models?
       • What do these models have to say about the very earliest languages? Proto-world?

  49. Thank you! nlp.cs.berkeley.edu
