Ch 4. Language Acquisition: Memoryless Learning (4.1 ~ 4.3)
The Computational Nature of Language Learning and Evolution, Partha Niyogi, 2004
Summarized by M.-O. Heo
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/
Contents
• 4.1 Characterizing Convergence Times for the Markov Chain Model
  • 4.1.1 Some Transition Matrices and Their Convergence Curves
  • 4.1.2 Absorption Times
  • 4.1.3 Eigenvalue Rates of Convergence
• 4.2 Exploring Other Points
  • 4.2.1 Changing the Algorithm
  • 4.2.2 Distributional Assumptions
  • 4.2.3 Natural Distributions – CHILDES Corpus
• 4.3 Batch Learning Upper and Lower Bounds: An Aside
Markov Formulation
• Parameterized grammar family with 3 parameters
• Target language
• Absorbing state
  • Loops to itself
  • Has no exit arcs
• Closed set of states
  • No arc leads from any state in the set to a state outside it
• An absorbing state is a closed set consisting of a single state
Markov Chain Criteria for Learnability
• The target grammar is Gold learnable ↔ every closed set of states includes the target state (a reachability check of this criterion is sketched below)
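A minimal Python sketch (not from the slides) of how the closed-set criterion can be checked: the target is contained in every closed set exactly when the target state is reachable from every state via positive-probability transitions, otherwise some closed set missing the target would trap the learner. The 3-state matrix below is hypothetical.

```python
import numpy as np
from collections import deque

def reachable_from(T, start, tol=1e-12):
    """States reachable from `start` via positive-probability transitions."""
    n = T.shape[0]
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if T[i, j] > tol and j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def gold_learnable(T, target):
    """Learnable in the limit iff the target is reachable from every state
    (equivalently, every closed set of states contains the target)."""
    return all(target in reachable_from(T, s) for s in range(T.shape[0]))

# Toy 3-state chain (hypothetical numbers); state 2 is the absorbing target.
T = np.array([[0.5, 0.3, 0.2],
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
print(gold_learnable(T, target=2))   # True: every state can reach state 2
```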
Some Transition Matrices and Their Convergence Curves (1/3)
• Markov chain formulation for learning the 3-parameter grammar from degree-0 strings of L5
• The transition matrix gives the probability of moving from hypothesis Li to hypothesis Lj
• Local maxima exist: non-target states in which the learner can get stuck
Some Transition Matrices and Their Convergence Curves (2/3)
• The limiting probability matrix is obtained by raising the transition matrix to the m-th power and letting m go to infinity (see the sketch below)
• If the initial state is one of L5 ~ L8, the learner converges to the target grammar L5
• If the initial state is L2 or L4, the learner converges to the other maximum, L2
• If the initial state is L1 or L3, the learner fails to converge to the target
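A small numpy sketch (not from the slides) of how the limiting matrix can be approximated numerically. The 3-state matrix is hypothetical: state 2 stands in for the target and state 1 for a local maximum.

```python
import numpy as np

def limiting_matrix(T, tol=1e-12, max_iter=10_000):
    """Approximate lim_{m->inf} T^m by repeatedly squaring the transition matrix."""
    P = T.copy()
    for _ in range(max_iter):
        P_next = P @ P
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    return P

# Hypothetical chain with two absorbing states (a target and a local maximum).
T = np.array([[0.6, 0.2, 0.2],    # state 0 can fall into either absorbing state
              [0.0, 1.0, 0.0],    # state 1: local maximum (absorbing, not the target)
              [0.0, 0.0, 1.0]])   # state 2: target grammar (absorbing)
P_inf = limiting_matrix(T)
print(np.round(P_inf, 3))
# Row 0 splits its mass between states 1 and 2: starting there, the learner
# reaches the target only with probability P_inf[0, 2].
```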
Some Transition Matrices and Their Convergence Curves (3/3)
• One example without the local-maxima problem: from every initial state the learner converges to the target
• The rate at which the convergence curve approaches 1 allows us to bound the sample complexity
Absorption Times
• Given an initial state, the number of examples taken to reach the absorbing (target) state is a random variable: the absorption time.
• If the target language is L1, the transition matrix can be written in canonical form, with a block Q of transitions among the non-target (transient) states.
• Mean: with the fundamental matrix N = (I − Q)⁻¹, the expected absorption times are t = N·1.
• Variance: the variances are (2N − I)t − t², where t² squares t elementwise.
• Using these statistics, we can read off the absorption time from the learner's most unfavorable initial state (a sketch follows below).
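A hedged Python sketch of the standard absorbing-chain computation, assuming the target is the only absorbing state and all other states are transient. The transition matrix values are hypothetical.

```python
import numpy as np

def absorption_time_stats(T, target):
    """Mean and variance of the absorption time from each transient state,
    via the fundamental matrix N = (I - Q)^{-1} of an absorbing chain."""
    n = T.shape[0]
    transient = [i for i in range(n) if i != target]   # assumes only the target is absorbing
    Q = T[np.ix_(transient, transient)]                # transitions among non-target states
    N = np.linalg.inv(np.eye(len(transient)) - Q)      # fundamental matrix
    t = N @ np.ones(len(transient))                    # expected steps to absorption
    var = (2 * N - np.eye(len(transient))) @ t - t ** 2
    return dict(zip(transient, t)), dict(zip(transient, var))

# Hypothetical chain with target state 2 (absorbing).
T = np.array([[0.6, 0.2, 0.2],
              [0.1, 0.5, 0.4],
              [0.0, 0.0, 1.0]])
means, variances = absorption_time_stats(T, target=2)
worst_state = max(means, key=means.get)   # most unfavorable initial state
print(means, variances, worst_state)
```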
Eigenvalue Rates of Convergence (1/5)
• Convergence of a finite Markov chain can be studied as an eigenvalue problem on its transition matrix.
• It is possible to show that λ = 1 is always an eigenvalue. Further, it is the largest, in that any other eigenvalue is less than 1 in absolute value, i.e., |λ| < 1.
• The multiplicity of the eigenvalue λ = 1 equals the number of closed classes in the chain.
Eigenvalue Rates of Convergence (2/5)
• Representation of T^k
• Let T be an m × m transition matrix with m linearly independent eigenvectors corresponding to eigenvalues λ1, …, λm.
• Collecting the eigenvectors into a matrix L gives a diagonalization T = L Λ L⁻¹ with Λ = diag(λ1, …, λm), hence T^k = L Λ^k L⁻¹.
• This is an eigendecomposition (diagonalization) of T, not a singular value decomposition. (A numerical sketch follows below.)
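A small numpy sketch of the T^k representation, under the slide's assumption that T has a full set of linearly independent eigenvectors. The matrix values are hypothetical.

```python
import numpy as np

def power_via_eigendecomposition(T, k):
    """Compute T^k as L @ diag(lam**k) @ inv(L), assuming T is diagonalizable."""
    lam, L = np.linalg.eig(T)          # columns of L are right eigenvectors
    return (L @ np.diag(lam ** k) @ np.linalg.inv(L)).real

T = np.array([[0.6, 0.2, 0.2],
              [0.1, 0.5, 0.4],
              [0.0, 0.0, 1.0]])
lam, _ = np.linalg.eig(T)
print(sorted(np.abs(lam), reverse=True))   # largest eigenvalue magnitude is 1
print(np.allclose(power_via_eigendecomposition(T, 50),
                  np.linalg.matrix_power(T, 50)))
```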
Eigenvalue Rates of Convergence (3/5)
• Initial Conditions and Limiting Distributions
• The learner's initial condition can be quantified by a distribution over the states of the Markov chain, according to which the learner picks its initial state; denote it by the row vector π(0).
• After k examples, the probability with which the learner is in each state is given by π(k) = π(0) T^k.
• The limiting distribution is π(∞) = lim_{k→∞} π(0) T^k.
Eigenvalue Rates of Convergence (4/5)
• Rate of Convergence
• The rate at which π(k) approaches π(∞) depends on the rate at which T^k converges to T^∞.
• We can bound this rate by the rate at which the second-largest eigenvalue (in absolute value) decays to 0, i.e., the error shrinks like |λ₂|^k (see the sketch below).
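A hedged numerical check of the second-eigenvalue bound on the same hypothetical matrix: the gap between T^k and its limit decays geometrically, essentially at rate |λ₂|^k.

```python
import numpy as np

T = np.array([[0.6, 0.2, 0.2],     # hypothetical transition matrix
              [0.1, 0.5, 0.4],
              [0.0, 0.0, 1.0]])
T_inf = np.linalg.matrix_power(T, 10_000)      # good proxy for the limit

lam = np.linalg.eigvals(T)
lam2 = sorted(np.abs(lam), reverse=True)[1]    # second-largest |eigenvalue|

for k in (5, 10, 20, 40):
    err = np.max(np.abs(np.linalg.matrix_power(T, k) - T_inf))
    print(f"k={k:3d}  error={err:.2e}  |lambda_2|^k={lam2**k:.2e}")
# The error decays geometrically at essentially the rate |lambda_2|^k,
# which is what lets us bound the number of examples needed.
```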
Eigenvalue Rates of Convergence (5/5)
• Transition Matrix Recipes: how the transition probabilities of the learning chain are computed from the languages and the input distribution (a sketch follows below)
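A hedged Python sketch of one such recipe, for the TLA with the single-value constraint and greediness: from state s, a sentence that L_s cannot analyze makes the learner flip one of the n parameters uniformly at random and move only if the flipped grammar analyzes it; otherwise it stays. The grammars, sentences, and distribution below are hypothetical toy values, not the book's 3-parameter system.

```python
import numpy as np

def tla_transition_matrix(languages, P, n_params):
    """Transition matrix of the TLA (single-value constraint + greediness).

    languages: dict mapping parameter vector (tuple of 0/1) -> set of sentences
    P:         dict mapping sentence -> probability (support inside the target language)
    """
    states = sorted(languages)
    idx = {s: i for i, s in enumerate(states)}
    T = np.zeros((len(states), len(states)))
    for s in states:
        for sentence, p in P.items():
            if sentence in languages[s]:
                T[idx[s], idx[s]] += p               # analyzable: stay put
                continue
            # Not analyzable: pick one parameter uniformly at random and flip it.
            for j in range(n_params):
                flipped = tuple(b ^ (1 if k == j else 0) for k, b in enumerate(s))
                if sentence in languages[flipped]:
                    T[idx[s], idx[flipped]] += p / n_params   # greedy move succeeds
                else:
                    T[idx[s], idx[s]] += p / n_params         # move rejected: stay
    return states, T

# Tiny hypothetical 2-parameter family (4 grammars); target grammar is (1, 1)
# with a uniform distribution over its sentences.
languages = {
    (0, 0): {"a"}, (0, 1): {"a", "b"},
    (1, 0): {"c"}, (1, 1): {"a", "b", "c"},
}
P = {s: 1 / 3 for s in languages[(1, 1)]}
states, T = tla_transition_matrix(languages, P, n_params=2)
print(states)
print(np.round(T, 3))
```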
Changing the Algorithm (Variants of TLA)
• Random walk with neither greediness nor the single-value constraint
  • If a new sentence is analyzable, the learner remains in its current state.
  • If not, the learner moves uniformly at random to any of the other states and stays there waiting for the next sentence. This is done without regard to whether the new state allows the sentence to be analyzed.
• Random walk with no greediness but with the single-value constraint
  • If a new sentence is analyzable, the learner remains in its original state.
  • If not, the learner chooses one of the parameters uniformly at random and flips it (moving to an adjacent state in the Markov structure). Again, this is done without regard to whether the new state allows the sentence to be analyzed.
• Random walk with no single-value constraint but with greediness
  • If a new sentence is analyzable, the learner remains in its original state.
  • If not, the learner moves uniformly at random to any of the other states and stays there iff the sentence can be analyzed there. If the sentence cannot be analyzed in the new state, the learner remains in its original state.
Reminder of TLA
• Triggering Learning Algorithm (TLA): on an analyzable sentence the learner keeps its current grammar; on an unanalyzable sentence it flips a single parameter chosen uniformly at random (single-value constraint) and adopts the new grammar only if that grammar can analyze the sentence (greediness). (A sketch follows below.)
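A minimal Python sketch (not from the book) of the memoryless learner: with both flags on it behaves as the TLA, and turning either flag off gives the corresponding random-walk variant from the previous slide. The 2-parameter family is hypothetical.

```python
import random

def learn(languages, target_sentences, start, n_params,
          greedy=True, single_value=True, n_examples=1000, rng=random):
    """Online memoryless learner; greedy + single_value gives the TLA,
    and turning either flag off gives the random-walk variants above."""
    state = start
    for _ in range(n_examples):
        s = rng.choice(target_sentences)             # example from the target language
        if s in languages[state]:
            continue                                 # analyzable: keep current grammar
        if single_value:                             # flip one parameter at random
            j = rng.randrange(n_params)
            candidate = tuple(b ^ (1 if k == j else 0) for k, b in enumerate(state))
        else:                                        # jump to any other grammar at random
            candidate = rng.choice([g for g in languages if g != state])
        if not greedy or s in languages[candidate]:  # greediness: only move if it helps
            state = candidate
    return state

# Hypothetical 2-parameter family; target grammar is (1, 1).
languages = {
    (0, 0): {"a"}, (0, 1): {"a", "b"},
    (1, 0): {"c"}, (1, 1): {"a", "b", "c"},
}
final = learn(languages, sorted(languages[(1, 1)]), start=(0, 0), n_params=2)
print(final)   # with enough examples this almost always ends at the target (1, 1)
```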
Distributional Assumptions
• Convergence times depend on the distribution of the example data.
• The distribution-free convergence time for the 3-parameter system is infinite.
• We can instead use a parameterized distribution:
  • The sets A, B, C, and D contain different degree-0 sentences of L1; within each subset the sentences are equally likely with respect to each other.
• The sample complexity cannot be bounded in a distribution-free manner, because by choosing a sufficiently unfavorable distribution the sample complexity can be made arbitrarily high (a small illustration follows below).
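A hedged illustration of why no distribution-free bound exists: if the sentences that can move the learner out of a wrong hypothesis have total probability p under the input distribution, the expected wait for one is 1/p, which grows without bound as p → 0. The probabilities below are hypothetical.

```python
# Expected number of examples before the learner even sees a "trigger" sentence,
# as the trigger probability p is made smaller and smaller.
for p in (0.5, 0.1, 0.01, 0.001):
    expected_wait = 1 / p          # mean of a geometric distribution
    print(f"trigger probability {p:>6}: expected wait {expected_wait:,.0f} examples")
```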
Natural Distributions – CHILDES Corpus
• Examine the fidelity of the model when the input follows a real language distribution.
• CHILDES database (MacWhinney 1996)
  • 43,612 English sentences
  • 632 German sentences
• Consider input patterns such as SVO, S Aux V, and so on, as appropriate for the target language.
• Sentences not parsable into these patterns were discarded.
• Convergence falls roughly along the TLA convergence curve: roughly 100 examples to reach the asymptote.
• The feasibility of the basic model is thus confirmed on actual caretaker input, at least in this simple case, for both English and German.
• One must add patterns to cover the predominance of auxiliary inversions and wh-questions.
• As far as we can tell, there is not yet a satisfactory parameter-setting account of V2 acquisition.
Batch Learning Upper and Lower Bounds: An Aside
• Consider upper and lower bounds for learning finite language families when the learner is allowed to remember all the strings encountered and optimize over them.
• There are n languages over an alphabet Σ; each language can be represented as a subset of Σ*.
• The learner is provided with positive data drawn according to a distribution P on the strings of a particular target language.
• Goal: identify the target.
• Question: how many samples does the learner need to see so that, with high confidence, it is able to identify the target? (A sketch of such a batch learner follows below.)
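A hedged sketch of the kind of batch learner being analyzed (the exact optimization rule from the book is not reproduced): remember every observed string and return a language consistent with all of them, breaking ties toward the smallest such language. The family and data below are hypothetical.

```python
def batch_learn(languages, observed):
    """Return a language consistent with every observed string,
    preferring the smallest such language (one natural tie-breaking rule)."""
    observed = set(observed)
    consistent = [name for name, lang in languages.items() if observed <= lang]
    return min(consistent, key=lambda name: len(languages[name])) if consistent else None

# Hypothetical finite family of n = 3 languages over a small alphabet.
languages = {
    "L1": {"a", "b"},
    "L2": {"a", "b", "c"},
    "L3": {"a", "c", "d"},
}
print(batch_learn(languages, ["a", "b"]))        # "L1": both L1 and L2 fit, L1 is smaller
print(batch_learn(languages, ["a", "b", "c"]))   # "L2": only L2 contains all three
```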
• A lower bound on the number of samples needed to be able to identify the target with high confidence.
• An upper bound on the number of samples sufficient to guarantee identification with high confidence.