Lerende Machienekes: Memory-based learning and NLP. May 10, 2006, Tilburg University. Antal van den Bosch
Overview • A bit of linguistics • Empiricism, analogy, induction: a lightweight historical overview • Memory-based learning algorithms • Machine learning of natural language issues: case studies • Representation • Forgetting exceptions • There’s no data like more data
Empiricism, analogy, induction, language • A lightweight historical overview • De Saussure: Any creation [of a language utterance] must be preceded by an unconscious comparison of the material deposited in the storehouse of language, where productive forms are arranged according to their relations. (1916, p. 165)
Lightweight history (2) • Bloomfield: The only useful generalizations about language are inductive generalizations. (1933, p. 20). • Zipf: n·f² = k (1935), r·f = k (1949)
Lightweight history (3) • Harris: With an apparatus of linguistic definitions, the work of linguistics is reducible […] to establishing correlations. […] And correlations between the occurrence of one form and that of other forms yield the whole of linguistic structure. (1940) • Hjelmslev: Induction leads not to constancy but to accident. (1943)
Lightweight history (4) • Firth: A [linguistic] theory derives its usefulness and validity from the aggregate of experience to which it must continually refer. (1952, p. 168) • Chomsky: I don't see any way of explaining the resulting final state [of language learning] in terms of any proposed general developmental mechanism that has been suggested by artificial intelligence, sensorimotor mechanisms, or anything else. (in Piatelli-Palmarini, 1980, p. 100)
Lightweight history (5) • Halliday: The test of a theory [on language] is: does it facilitate the task at hand? (1985) • Altmann: After the blessed death of generative linguistics, a linguist no longer needs to find a competent speaker. Instead, he needs to find a competent statistician. (1997)
ML and Natural Language • Apparent conclusion: ML could be an interesting tool for doing linguistics • Next to probability theory, information theory, and statistical analysis (natural allies) • “Neo-Firthian” linguistics • More and more annotated data available • Skyrocketing computing power and memory
Analogical memory-based language processing • With a memory filled with instances of language mappings • from text to speech, • from words to syntactic structure, • from utterances to acts, … • And with the use of analogical reasoning, • process new input instances (text, words, utterances) • into output (speech, syntactic structure, acts)
Analogy (1) • (diagram) a stored sequence a maps to a sequence b; a new sequence a′ is similar to sequence a, and its output b′ is similar to sequence b
Analogy (2) • (diagram) a new sequence a′ is similar to a stored sequence a, which maps to sequence b; the output of a′ is the unknown (?) to be predicted
Analogy (3) • (diagram) a new sequence is similar to several stored sequences a, f, …, n, which map to sequences b, f′, …, n′; its unknown output (?) is predicted from these mappings
Memory-based parsing • Input: zo werd het Grand een echt theater • Nearest-neighbour fragments in memory: • … zo MOD/S wordt er … • … zo gaat HD/S het … • … en dan werd [NP het DET <*> dus … • … dan is het <*Naam> HD/SUBJ NP] bijna erger … • … ergens ene keer [NP een DET echt <*> … • … ben ik een echt MOD <*> maar … • … een echt bedrijf HD/PREDC NP] … • Output: zo MOD/S werd HD/S [NP het DET Grand HD/SUBJ NP] [NP een DET echt MOD theater HD/PREDC NP]
Make data (1)
#BOS 54 2 1011781542 0
zo BW T901 MOD 502
werd WW1 T304 HD 502
het VNW3 U503b DET 500
Grand*v N1 T102 HD 500
een LID U608 DET 501
echt ADJ9 T227 MOD 501
theater N1 T102 HD 501
. LET T007 -- 0
#500 NP -- SU 502
#501 NP -- PREDC 502
#502 SMAIN -- -- 0
#EOS 54
Make data (2) • Given context, map individual words to a function+chunk code: • zo MOD-O • werd HD-O • het DET-B-NP • Grand HD/SU-I-NP • een DET-B-NP • echt MOD-I-NP • theater HD/PREDC-I-NP
Make data (3) • Generate instances with context: • _ _ _ zo werd het Grand MOD-O • _ _ zo werd het Grand een HD-O • _ zo werd het Grand een echt DET-B-NP • zo werd het Grand een echt theater HD/SU-I-NP • werd het Grand een echt theater _ DET-B-NP • het Grand een echt theater _ _ MOD-I-NP • Grand een echt theater _ _ _ HD/PREDC-I-NP
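A minimal Python sketch of this windowing step (the window of three words of left and right context follows the instances above; the function name is illustrative):

```python
# Turn a tagged sentence into fixed-width instances: three words of left
# context, the focus word, three words of right context, plus the focus
# word's function+chunk code as the class label.

def make_instances(words, codes, left=3, right=3):
    padded = ["_"] * left + list(words) + ["_"] * right
    instances = []
    for i, code in enumerate(codes):
        window = padded[i:i + left + 1 + right]
        instances.append((tuple(window), code))
    return instances

words = ["zo", "werd", "het", "Grand", "een", "echt", "theater"]
codes = ["MOD-O", "HD-O", "DET-B-NP", "HD/SU-I-NP",
         "DET-B-NP", "MOD-I-NP", "HD/PREDC-I-NP"]

for features, label in make_instances(words, codes):
    print(" ".join(features), label)
```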
Empirical ML: 2 Flavours • Greedy: • Learning = abstract a model from the data • Classification = apply the abstracted model to new data • Lazy: • Learning = store the data in memory • Classification = compare new data to the data in memory
Greedy vs Lazy Learning • Greedy: • Decision tree induction: CART, C4.5 • Rule induction: CN2, Ripper • Hyperplane discriminators: Winnow, perceptron, backprop, SVM • Probabilistic: Naïve Bayes, maximum entropy, HMM • (Hand-made rulesets) • Lazy: • k-Nearest Neighbour: MBL, AM • Local regression
Greedy vs Lazy Learning • (diagram) approaches plotted along two axes, + / − abstraction and + / − generalization: Decision Tree Induction, Hyperplane discriminators, Regression, and Handcrafting on the + abstraction side; Memory-Based Learning and Table Lookup on the − abstraction side
Greedy vs Lazy: So? • Highly relevant to ML of NL • In language data, what is core? What is periphery? • Often little or no noise; productive exceptions • (Sub-)subregularities, pockets of exceptions • “disjunctiveness” • Some important elements of language have different distributions than the “normal” one • E.g. word forms have a Zipfian distribution
TiMBL • Tilburg Memory-Based Learner • Available for research and education • Lazy learning, extending k-NN and IB1 • Roots in pattern recognition: • k-NN classifier (Fix & Hodges, 1951; Cover & Hart, 1967) • Rediscovered in AI / ML: • Stanfill & Waltz, 1986 • IB1 (Aha, Kibler, & Albert, 1991) • A.k.a. SBR, EBG, CBR, local learning, …
Memory-based learning and classification • Learning: • Store instances in memory • Classification: • Given new test instance X, • Compare it to all memory instances • Compute a distance between X and memory instance Y • Update the top k of closest instances (nearest neighbors) • When done, take the majority class of the k nearest neighbors as the class of X
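A minimal Python sketch of this learn/classify loop (an illustration of the idea only, not TiMBL's actual implementation; the distance function is pluggable and sketched in the next sections):

```python
from collections import Counter

class MemoryBasedClassifier:
    """Bare-bones k-NN: learning stores instances, classification votes."""

    def __init__(self, distance, k=1):
        self.distance = distance   # e.g. Overlap or MVDM, see below
        self.k = k
        self.memory = []           # list of (features, class_label) pairs

    def learn(self, instances):
        # Learning = storing instances in memory.
        self.memory.extend(instances)

    def classify(self, x):
        # Compare x to all memory instances, keep the k nearest,
        # and return the majority class among them.
        nearest = sorted(self.memory,
                         key=lambda inst: self.distance(x, inst[0]))[:self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
```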
Similarity / distance • A nearest neighbor has the smallest distance, or the largest similarity • Computed with a distance function • TiMBL offers two basic distance functions: • Overlap • MVDM (Stanfill & Waltz, 1986; Cost & Salzberg, 1989) • Feature weighting • Exemplar weighting • Distance-weighted class voting
The Overlap distance function • “Count the number of mismatching features”
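As a sketch, the Overlap metric in Python, usable as the `distance` argument of the classifier sketched above:

```python
def overlap_distance(x, y):
    # Distance = number of features on which the two instances mismatch.
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

# Example: clf = MemoryBasedClassifier(distance=overlap_distance, k=1)
```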
The MVDM distance function • Estimate a numeric “distance” between pairs of values • “e” is more like “i” than like “p” in a phonetic task • “book” is more like “document” than like “the” in a parsing task
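A sketch of MVDM in Python: two values of a feature count as close when they co-occur with similar class distributions in the training data (function and variable names are illustrative):

```python
from collections import Counter, defaultdict

def value_class_distributions(instances, feature_index):
    # Estimate P(class | value) for one feature from the training data.
    counts = defaultdict(Counter)
    for features, label in instances:
        counts[features[feature_index]][label] += 1
    return {value: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for value, cnt in counts.items()}

def mvdm(v1, v2, distributions, classes):
    # delta(v1, v2) = sum over classes of |P(c | v1) - P(c | v2)|
    p1, p2 = distributions.get(v1, {}), distributions.get(v2, {})
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in classes)
```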
Feature weighting • Some features are more important than others • TiMBL metrics: Information Gain, Gain Ratio, Chi Square, Shared Variance • Example, Information Gain: • Compute the entropy of the full data base • For each feature, partition the data base on the values of that feature • For each value, compute the entropy of its sub-data base • Take the weighted average entropy over all partitioned sub-data bases • The difference between this “partitioned” entropy and the overall entropy is the feature’s Information Gain
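A sketch of this Information Gain computation in Python (Gain Ratio additionally divides by the feature's split info; names are illustrative):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(instances, feature_index):
    # IG(f) = H(data base) - weighted average of H(sub-data bases
    # obtained by partitioning on the values of feature f).
    labels = [label for _, label in instances]
    partitions = defaultdict(list)
    for features, label in instances:
        partitions[features[feature_index]].append(label)
    partitioned = sum(len(p) / len(labels) * entropy(p)
                      for p in partitions.values())
    return entropy(labels) - partitioned
```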
Feature weighting: IG • Extreme examples of IG • Suppose the data base entropy is 1.0 • A completely uninformative feature leaves the partitioned entropy at 1.0 (nothing happens), so its gain is 0.0 • A maximally informative feature brings the partitioned entropy down to 0.0, so its gain is 1.0
Feature weighting in the distance function • Mismatching on a more important feature gives a larger distance • The weight enters the distance function as a per-feature factor (sketched below):
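The formula referred to here is presumably the standard weighted distance, where each per-feature difference is multiplied by that feature's weight; a minimal Python rendering:

```python
def weighted_distance(x, y, weights,
                      per_feature=lambda a, b: 0.0 if a == b else 1.0):
    # Delta(X, Y) = sum_i w_i * delta(x_i, y_i), with delta the per-feature
    # distance (Overlap here by default; MVDM could be plugged in instead)
    # and w_i the feature weight (e.g. Information Gain or Gain Ratio).
    return sum(w * per_feature(xi, yi) for w, xi, yi in zip(weights, x, y))
```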
Exemplar weighting • Scale the distance of a memory instance by some externally computed factor • Smaller distance for “good” instances • Bigger distance for “bad” instances
Distance weighting • Relation between larger k and smoothing • Subtle extension: making more distant neighbors count less in the class vote • Linear inverse of distance (w.r.t. max) • Inverse of distance • Exponential decay
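A sketch of these three voting schemes in Python (a simplified rendering; the decay parameter and the treatment of the furthest neighbour are illustrative choices, not TiMBL's exact definitions):

```python
import math
from collections import defaultdict

def vote_weight(distance, scheme, max_distance=None, alpha=1.0):
    if scheme == "inverse_linear":     # linear inverse of distance (w.r.t. max)
        return (max_distance - distance) / max_distance if max_distance else 1.0
    if scheme == "inverse_distance":   # inverse of distance
        return 1.0 / (distance + 1e-9)
    if scheme == "exponential_decay":  # exponential decay
        return math.exp(-alpha * distance)
    return 1.0                         # plain majority voting

def distance_weighted_vote(neighbours, scheme="inverse_distance"):
    # neighbours: list of (distance, class_label) pairs for the k nearest.
    max_distance = max(d for d, _ in neighbours)
    votes = defaultdict(float)
    for distance, label in neighbours:
        votes[label] += vote_weight(distance, scheme, max_distance)
    return max(votes, key=votes.get)
```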
Current practice • Default TiMBL settings: • k=1, Overlap, GR, no distance weighting • Work well for some morpho-phonological tasks • Rules of thumb: • Combine MVDM with bigger k • Combine distance weighting with bigger k • Very good bet: higher k, MVDM, GR, distance weighting • Especially for sentence and text level tasks
Representation: Who to believe? • http://ilk.uvt.nl/~antalb/ltua/week2.html
Forgetting exceptions is harmful • http://lw0164.uvt.nl/~antalb/acl99tut/day4.html
Overview • The More Data effect • Intermezzo: The k-NN classifier • Case study 1: learning curves and feature representations • Intermezzo: paramsearch • Case study 2: continuing learning curves with more data
The More Data effect • There’s no data like more data (speech recognition motto) • Banko and Brill (2001): confusibles • Differences between algorithms flip or disappear • Differences between representations disappear • Growth of curve seems log-linear (constant improvement with exponentially more data) • Explanation sought in “Zipf’s tail”
Banko and Brill (2001) • Demonstrated on the confusible set {to, two, too} using 1M to 1G training examples: • Initial range between 3 classifiers at • 1M: 83-85% • 1G: 96-97% • An extremely simple memory-based classifier (features: one word left, one word right): • 86% at 1M, 93% at 1G • an apparently constant improvement for each tenfold increase in data
Zipf • The frequency of the nth most frequent word is inversely proportional to n • Approximately log-linear (log-log) relation between token frequencies and the number of types that have each frequency
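In formula form, a standard rendering of both variants quoted on the earlier Zipf slide (k is a corpus-dependent constant):

```latex
% Rank--frequency form (1949): frequency is inversely proportional to rank r
f(r) \approx \frac{k}{r} \;\Longleftrightarrow\; \log f(r) \approx \log k - \log r
% Frequency-of-frequencies form (1935): n_f = number of types with token frequency f
n_f \, f^{2} \approx k \;\Longleftrightarrow\; \log n_f \approx \log k - 2 \log f
```

Both are straight lines in log-log coordinates, which is the log-linear relation referred to above.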