Lerende Machienekes Memory-based learning and NLP



  1. Lerende Machienekes: Memory-based learning and NLP May 10, 2006 Tilburg University Antal van den Bosch

  2. Overview • A bit of linguistics • Empiricism, analogy, induction: a lightweight historical overview • Memory-based learning algorithms • Machine learning of natural language issues: case studies • Representation • Forgetting exceptions • There’s no data like more data

  3. Empiricism, analogy, induction, language • A lightweight historical overview • De Saussure: Any creation [of a language utterance] must be preceded by an unconscious comparison of the material deposited in the storehouse of language, where productive forms are arranged according to their relations. (1916, p. 165)

  4. Lightweight history (2) • Bloomfield: The only useful generalizations about language are inductive generalizations. (1933, p. 20) • Zipf: n·f² = k (1935), r·f = k (1949)

  5. Lightweight history (3) • Harris: With an apparatus of linguistic definitions, the work of linguistics is reducible […] to establishing correlations. […] And correlations between the occurrence of one form and that of other forms yield the whole of linguistic structure. (1940) • Hjelmslev: Induction leads not to constancy but to accident. (1943)

  6. Lightweight history (4) • Firth: A [linguistic] theory derives its usefulness and validity from the aggregate of experience to which it must continually refer. (1952, p. 168) • Chomsky: I don't see any way of explaining the resulting final state [of language learning] in terms of any proposed general developmental mechanism that has been suggested by artificial intelligence, sensorimotor mechanisms, or anything else. (in Piatelli-Palmarini, 1980, p. 100)

  7. Lightweight history (5) • Halliday: The test of a theory [on language] is: does it facilitate the task at hand? (1985) • Altmann: After the blessed death of generative linguistics, a linguist no longer needs to find a competent speaker. Instead, he needs to find a competent statistician. (1997)

  8. ML and Natural Language • Apparent conclusion: ML could be an interesting tool to do linguistics • Next to probability theory, information theory, statistical analysis (natural allies) • “Neo-Firthian” linguistics • More and more annotated data available • Skyrocketing computing power and memory

  9. Analogical memory-based language processing • With a memory filled with instances of language mappings • from text to speech, • from words to syntactic structure, • from utterances to acts, … • With the use of analogical reasoning, • Process new instances from input • text, words, utterances to output • speech, syntactic structure, acts

  10. Analogy (1) [diagram] sequence a maps to sequence b; sequence a’ is similar to sequence a, sequence b’ is similar to sequence b, and sequence a’ maps to sequence b’

  11. Analogy (2) [diagram] sequence a maps to sequence b; a new sequence a’ is similar to sequence a, and its mapping target (?) is still unknown

  12. Analogy (3) [diagram] several stored sequences (a, f, …, n) are similar to a new sequence and map to their own counterparts (a’, f’, …, n’); the new sequence’s unknown mapping (?) is inferred from these

  13. Memory-based parsing Input: zo werd het Grand een echt theater
  Memory (nearest-neighbor fragments):
  … zo MOD/S wordt er …
  … zo gaat HD/S het …
  … en dan werd [NP het DET <*> dus …
  … dan is het <*Naam> HD/SUBJ NP] bijna erger …
  … ergens ene keer [NP een DET echt <*> …
  … ben ik een echt MOD <*> maar …
  … een echt bedrijf HD/PREDC NP] …
  Output: zo MOD/S werd HD/S [NP het DET Grand HD/SUBJ NP] [NP een DET echt MOD theater HD/PREDC NP]

  14. CGN treebank

  15. Make data (1)
  #BOS 54 2 1011781542 0
  zo        BW     T901   MOD    502
  werd      WW1    T304   HD     502
  het       VNW3   U503b  DET    500
  Grand*v   N1     T102   HD     500
  een       LID    U608   DET    501
  echt      ADJ9   T227   MOD    501
  theater   N1     T102   HD     501
  .         LET    T007   --     0
  #500      NP     --     SU     502
  #501      NP     --     PREDC  502
  #502      SMAIN  --     --     0
  #EOS 54

  16. Make data (2) • Given context, map individual words to a function+chunk code: • zo MOD O • werd HD O • het DET B-NP • Grand HD/SU I-NP • een DET B-NP • echt MOD I-NP • theater HD/PREDC I-NP

  17. Make data (3) • Generate instances with context: • _ _ _ zo werd het Grand MOD-O • _ _ zo werd het Grand een HD-O • _ zo werd het Grand een echt DET-B-NP • zo werd het Grand een echt theater HD/SU-I-NP • werd het Grand een echt theater _ DET-B-NP • het Grand een echt theater _ _ MOD-I-NP • Grand een echt theater _ _ _ HD/PREDC-I-NP
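
  A minimal sketch (plain Python, hypothetical names) of how such fixed-width windowed instances can be generated from a tagged sentence; the window of three words left and right of the focus word, with "_" padding at the sentence edges, follows the slide above.

    # Sketch: build 7-word windowed instances (3 left, focus, 3 right) from a
    # tagged sentence, padding sentence edges with "_" as on the slide.
    def make_instances(words, tags, width=3):
        padded = ["_"] * width + list(words) + ["_"] * width
        instances = []
        for i, tag in enumerate(tags):
            window = padded[i : i + 2 * width + 1]   # 3 left + focus + 3 right
            instances.append((window, tag))
        return instances

    sentence = "zo werd het Grand een echt theater".split()
    codes = ["MOD-O", "HD-O", "DET-B-NP", "HD/SU-I-NP",
             "DET-B-NP", "MOD-I-NP", "HD/PREDC-I-NP"]
    for window, code in make_instances(sentence, codes):
        print(" ".join(window), code)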

  18. Empirical ML: 2 Flavours • Greedy • Learning • abstract model from data • Classification • apply abstracted model to new data • Lazy • Learning • store data in memory • Classification • compare new data to data in memory

  19. Greedy learning

  20. Greedy learning

  21. Lazy Learning

  22. Lazy Learning

  23. Greedy vs Lazy Learning • Greedy: • Decision tree induction (CART, C4.5) • Rule induction (CN2, Ripper) • Hyperplane discriminators (Winnow, perceptron, backprop, SVM) • Probabilistic (Naïve Bayes, maximum entropy, HMM) • (Hand-made rulesets) • Lazy: • k-Nearest Neighbour (MBL, AM) • Local regression

  24. Greedy vs Lazy Learning [diagram: methods plotted along two axes, + / - abstraction and + / - generalization; Decision Tree Induction, Hyperplane discriminators, Regression, and Handcrafting at the + abstraction end, Memory-Based Learning and Table Lookup at the - abstraction end]

  25. Greedy vs Lazy: So? • Highly relevant to ML of NL • In language data, what is core? What is periphery? • Often little or no noise; productive exceptions • (Sub-)subregularities, pockets of exceptions • “disjunctiveness” • Some important elements of language have different distributions than the “normal” one • E.g. word forms have a Zipfian distribution

  26. TiMBL • Tilburg Memory-Based Learner • Available for research and education • Lazy learning, extending k-NN and IB1 • Roots in pattern recognition: • k-NN classifier (Fix & Hodges, 1951; Cover & Hart, 1967) • Rediscovered in AI / ML: • Stanfill & Waltz, 1986 • IB1 (Aha, Kibler, & Albert, 1991) • A.k.a. SBR, EBG, CBR, local learning, …

  27. Memory-based learning and classification • Learning: • Store instances in memory • Classification: • Given new test instance X, • Compare it to all memory instances • Compute a distance between X and memory instance Y • Update the top k of closest instances (nearest neighbors) • When done, take the majority class of the k nearest neighbors as the class of X
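
  A minimal sketch of this learning/classification loop, in plain Python with hypothetical names (an illustration, not TiMBL's implementation): learning stores instances verbatim; classification ranks memory instances by distance and takes the majority class of the k nearest.

    from collections import Counter

    # Memory-based (k-NN) classifier sketch with the Overlap distance:
    # learning = storing instances; classification = majority vote over
    # the k nearest stored instances. Illustration only, not TiMBL.
    class MemoryBasedClassifier:
        def __init__(self, k=1):
            self.k = k
            self.memory = []                    # stored (features, class) pairs

        def learn(self, instances):
            self.memory.extend(instances)       # just store, no abstraction

        def distance(self, x, y):
            return sum(a != b for a, b in zip(x, y))  # count mismatching features

        def classify(self, x):
            nearest = sorted(self.memory,
                             key=lambda m: self.distance(x, m[0]))[:self.k]
            votes = Counter(cls for _, cls in nearest)
            return votes.most_common(1)[0][0]   # majority class of the k nearest

  Feature weighting, MVDM, and distance-weighted voting (slides 28-37) refine the distance computation and the vote, but leave this basic loop intact.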

  28. Similarity / distance • A nearest neighbor has the smallest distance, or the largest similarity • Computed with a distance function • TiMBL offers two basic distance functions: • Overlap • MVDM (Stanfill & Waltz, 1986; Cost & Salzberg, 1989) • Feature weighting • Exemplar weighting • Distance-weighted class voting

  29. The Overlap distance function • “Count the number of mismatching features”
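
  In formula form (standard notation; the slide's own formula was an image and is not reproduced here), the Overlap distance between instances X and Y over n features is

    \Delta(X,Y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad
    \delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}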

  30. The MVDM distance function • Estimate a numeric “distance” between pairs of values • “e” is more like “i” than like “p” in a phonetic task • “book” is more like “document” than like “the” in a parsing task
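
  A common way to write MVDM (standard formulation, assumed rather than copied from the slide): the distance between two values v_1 and v_2 of the same feature is estimated from how differently they distribute over the classes c in C,

    \delta(v_1, v_2) = \sum_{c \in C} \left| P(c \mid v_1) - P(c \mid v_2) \right|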

  31. Feature weighting • Some features are more important than others • TiMBL metrics: Information Gain, Gain Ratio, Chi Square, Shared Variance • Ex. IG: • Compute data base entropy • For each feature, • partition the data base on all values of that feature • For all values, compute the sub-data base entropy • Take the weighted average entropy over all partitioned subdatabases • The difference between the “partitioned” entropy and the overall entropy is the feature’s Information Gain
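
  The recipe above can be turned into a short sketch (plain Python, hypothetical names): compute the entropy of the class distribution, partition the instances on each feature's values, and take the difference between the overall entropy and the weighted average partitioned entropy.

    import math
    from collections import Counter, defaultdict

    def entropy(labels):
        # H(C) = -sum_c P(c) log2 P(c) over the class distribution
        counts = Counter(labels)
        total = len(labels)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    def information_gain(instances, feature_index):
        # instances: list of (feature_vector, class) pairs
        labels = [cls for _, cls in instances]
        base = entropy(labels)                               # data base entropy
        partitions = defaultdict(list)
        for features, cls in instances:
            partitions[features[feature_index]].append(cls)  # partition on values
        weighted = sum(len(part) / len(instances) * entropy(part)
                       for part in partitions.values())      # weighted average
        return base - weighted                               # Information Gain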

  32. Feature weighting: IG

  33. Feature weighting: IG • Extreme examples of IG • Suppose a data base entropy of 1.0 • An uninformative feature will have a partitioned entropy of 1.0 (nothing happens), so a gain of 0.0 • A maximally informative feature will have a partitioned entropy of 0.0, so a gain of 1.0

  34. Entropy & IG: Formulas
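
  The formulas on this slide were shown as an image; reconstructed here in standard form (with V_i the value set of feature i):

    H(C) = -\sum_{c \in C} P(c) \log_2 P(c)

    IG_i = H(C) - \sum_{v \in V_i} P(v)\, H(C \mid v)

  Gain Ratio divides IG_i by the feature's split info, si_i = -\sum_{v \in V_i} P(v) \log_2 P(v).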

  35. Feature weighting in the distance function • Mismatching on a more important feature gives a larger distance • Factor in the distance function:
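
  Written out (standard weighted Overlap form, assumed here since the slide's formula image is missing), each feature weight w_i scales its term in the distance:

    \Delta(X,Y) = \sum_{i=1}^{n} w_i\, \delta(x_i, y_i)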

  36. Exemplar weighting • Scale the distance of a memory instance by some externally computed factor • Smaller distance for “good” instances • Bigger distance for “bad” instances
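
  One purely illustrative way to formalize this (an assumption, not necessarily TiMBL's exact scheme): divide the basic distance by an externally assigned exemplar weight e_Y, so that "good" (high-weight) instances appear closer and "bad" (low-weight) ones further away:

    \Delta'(X,Y) = \Delta(X,Y) / e_Y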

  37. Distance weighting • Relation between larger k and smoothing • Subtle extension: making more distant neighbors count less in the class vote • Linear inverse of distance (w.r.t. max) • Inverse of distance • Exponential decay
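
  In formulas (standard formulations such as Dudani's linear weighting; the exact parameterization is an assumption), a neighbor at distance d_j, with d_1 the nearest and d_k the furthest of the k neighbors, gets vote weight

    \text{linear inverse:}\quad w_j = \frac{d_k - d_j}{d_k - d_1} \;\;(\text{and } w_j = 1 \text{ if } d_k = d_1)

    \text{inverse distance:}\quad w_j = \frac{1}{d_j + \epsilon}

    \text{exponential decay:}\quad w_j = e^{-\alpha d_j^{\beta}}

  where the small constant \epsilon avoids division by zero and \alpha, \beta control the rate of decay.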

  38. Current practice • Default TiMBL settings: • k=1, Overlap, GR, no distance weighting • Work well for some morpho-phonological tasks • Rules of thumb: • Combine MVDM with bigger k • Combine distance weighting with bigger k • Very good bet: higher k, MVDM, GR, distance weighting • Especially for sentence and text level tasks

  39. Representation: Who to believe? • http://ilk.uvt.nl/~antalb/ltua/week2.html

  40. Forgetting exceptions is harmful • http://lw0164.uvt.nl/~antalb/acl99tut/day4.html

  41. There’s no data like more data

  42. Overview • The More Data effect • Intermezzo: The k-NN classifier • Case study 1: learning curves and feature representations • Intermezzo: paramsearch • Case study 2: continuing learning curves with more data

  43. The More Data effect • There’s no data like more data (speech recognition motto) • Banko and Brill (2001): confusibles • Differences between algorithms flip or disappear • Differences between representations disappear • Growth of curve seems log-linear (constant improvement with exponentially more data) • Explanation sought in “Zipf’s tail”

  44. Banko and Brill (2001) • Demonstrated on {to,two,too} using 1M to 1G examples: • Initial range between 3 classifiers at • 1M: 83-85% • 1G: 96-97% • Extremely simple memory-based classifier (one word left, one word right): • 86% at 1M, 93% at 1G • an apparently constant improvement as the data grows exponentially

  45. Zipf • The frequency of the nth most frequent word is inversely proportional to n • A roughly log-linear relation between token frequencies and the numbers of types that have these frequencies
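
  A small counting sketch (plain Python; 'corpus.txt' is a placeholder file name) behind plots like the WSJ curves on the following slides: count the token frequency of every word type, then count how many types share each frequency; plotted on a log-log scale, the relation comes out roughly linear.

    from collections import Counter

    # Token frequency per word type, then number of types per frequency;
    # log(frequency) vs log(number of types) gives the Zipfian picture.
    tokens = open("corpus.txt", encoding="utf-8").read().split()
    freq_per_type = Counter(tokens)
    types_per_freq = Counter(freq_per_type.values())
    for freq, n_types in sorted(types_per_freq.items()):
        print(freq, n_types)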

  46. WSJ, first 1,000 words

  47. WSJ, first 2,000 words

  48. WSJ, first 10,000 words

  49. WSJ, first 20,000 words

  50. WSJ, first 50,000 words
