Statistical Natural Language Processing. POSTECH Natural Language Processing Lab. Tutorial, August 27, 1999. Geunbae Lee (이근배)
contents • statistical vs. structured NLP • statistics for computational linguistics • POS tagging • PCFG parsing • other applications • conclusion
references • Eugene Charniak. Statistical Language Learning. MIT Press, 1993. • Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999. • (basic and applied statistics) • Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics. Internet shareware, http://www.coli.uni-sb.de/~krenn/edu.html • (general) • Abney, S. Statistical methods and linguistics. In J. Klavans and P. Resnik (eds.), The Balancing Act, MIT Press, 1996. • Church and Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics 19, 1993. • (POS tagging) • Cutting et al. A practical part-of-speech tagger. In Proceedings of the 3rd Conference on Applied Natural Language Processing, 1992. • Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the 2nd Conference on Applied Natural Language Processing, 1988. • Weischedel et al. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics 19(2), 1993. • J. Kupiec. Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language 6, 1992. • E. Brill. A simple rule-based part-of-speech tagger. Proceedings of the 3rd Conference on Applied NLP, 1992.
references • E. Roche and Y. Schabes. Deterministic part-of-speech tagging with finite state transducers. Computational Linguistics 21, 1995. • B. Merialdo. Tagging English text with a probabilistic model. Computational Linguistics 20, 1994. • (statistical parsing) • K. Lari and S. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language 4, 1990. • F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. ACL 30, 1992. • T. Briscoe and J. Carroll. Generalized probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics 19, 1993. • E. Black et al. Towards history-based grammars: using richer models for probabilistic parsing. ACL 31, 1993. • D. Magerman. Statistical decision-tree models for parsing. ACL 33, 1995. • Brill. Automatic grammar induction and parsing free text: a transformation-based approach. ACL 31, 1993.
references • (statistical disambiguation) • D. Hindle and M. Rooth. Structural ambiguity and lexical relations. Computational Linguistics 19, 1993. • K. Church and P. Hanks. Word association norms, mutual information and lexicography. ACL 28, 1990. • Alshawi and Carter. Training and scaling preference functions for disambiguation. Computational Linguistics 20(4), 1994. • (word classes and WSD) • Gale et al. Work on statistical methods for word-sense disambiguation. Proceedings of the AAAI Fall Symposium: Probabilistic Approaches to Natural Language, 1992. • Gale et al. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 30, 1992. • Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. COLING, 1992. • Yarowsky. Unsupervised word-sense disambiguation rivaling supervised methods. ACL 33, 1995. • Pereira et al. Distributional clustering of English words. ACL 31, 1993. • Dagan et al. Contextual word similarity and estimation from sparse data. ACL 31, 1993. • Dagan et al. Similarity-based estimation of word cooccurrence probabilities. ACL 32, 1994.
references • (text alignment and machine translation) • Kay and Röscheisen. Text-translation alignment. Computational Linguistics 19, 1993. • Gale and Church. A program for aligning sentences in bilingual corpora. Computational Linguistics 19, 1993. • Brown et al. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993. • Brown et al. A statistical approach to machine translation. Computational Linguistics 16, 1990. • Wu. Aligning a parallel English-Chinese corpus statistically with lexical criteria. ACL 32, 1994. • Church. Char_align: a program for aligning parallel texts at the character level. ACL 31, 1993. • Sproat et al. A stochastic finite-state word-segmentation algorithm for Chinese. ACL 32, 1994. • (lexical knowledge acquisition) • Manning. Automatic acquisition of a large subcategorization dictionary from corpora. ACL 31, 1993. • Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 1993. • (speech and others) • Brown et al. Class-based n-gram models of natural language. Computational Linguistics 18(4), 1992.
Structure or statistics? The pendulum of history • Statistical analysis: data driven, empirical, connectionist; the speech community • Structural analysis: rule driven, rationalist, symbolic; NLU, the Chomskian and Schankian AI community
Structural NLP • grammar rules + lexicons • grammatical categories (POS, syntactic category) • unification features (connectivity, agreement, semantics, ...) • chart parsing • compositional semantics • limitation: massive ambiguity • "List the sales of the products produced in 1973 with the products produced in 1972" ==> 455 parses (Martin et al., 1981)
Statistical NLP • the role of grammar at the syntactic level: which word sequences make valid sentences? • computing Pr(w1, w2, ..., wn) • p(w1) p(w2|w1) p(w3|w1,w2) ... p(wn|w1,...,wn-1) [chain rule] • p(w2|w1) = count(w1 w2) / count(w1) [MLE] • e.g. the (big, pig) dog • Shannon game: predicting the next word given the preceding word sequence • language modeling: a probability matrix over word sequences • evaluating a language model: apply the notion of cross entropy • H = -Σ p(w1,n) log pM(w1,n) • when pM(w1,n) = p(w1,n) the cross entropy is minimal and the language model M is perfect
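A minimal sketch of the MLE bigram estimate and chain-rule sentence probability described above; the toy corpus and sentence markers are invented stand-ins for real training text, not part of the tutorial.

```python
from collections import Counter

# MLE bigram estimate p(w2|w1) = count(w1 w2) / count(w1) over a toy corpus.
corpus = [
    "<s> the big dog </s>".split(),
    "<s> the big pig </s>".split(),
    "<s> the dog </s>".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

def p_bigram(w2, w1):
    """MLE estimate p(w2 | w1); zero if the pair was never seen (smoothing comes later)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def p_sentence(words):
    """Chain-rule probability under the bigram (first-order Markov) approximation."""
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= p_bigram(w2, w1)
    return p

print(p_bigram("big", "the"))                        # 2/3
print(p_sentence("<s> the big dog </s>".split()))    # product of bigram probabilities
```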
Chomsky vs. Shannon • what Shannon says: for an n-th order probability approximation Pn of sentences, grammatical(s) <--> Pn(s) > ε • what Chomsky says: there is no way to choose n and ε such that, for all sentences s, grammatical(s) <--> Pn(s) > ε • what's wrong with the n-th order Markov model? the probabilities are fine, but the finite-state machine behind them is not
contents • statistical vs. structured NLP • statistics for computational linguistics • POS tagging • PCFG parsing • other applications • conclusion
But I don't know any statistics! • Bayesian inversion formula • p(a|b) = p(a) p(b|a) / p(b) • maximum-likelihood estimation (MLE) • choose the parameters that make the observed data most probable: max_θ L(x1, x2, ..., xn; θ) ==> relative frequency, e.g. p(w2|w1) = count(w1 w2) / count(w1) • cf. MAP estimation, Bayesian estimation, function approximation using neural networks • smoothing (discounting) is required: adding one, held-out, deleted estimation, Good-Turing, Charniak's linear interpolation, Katz's backing-off, etc. • random variable • rv: (sample space) --> R (real numbers) • random process (stochastic process) • characterized by p(x_{t+1} | x1, x2, ..., x_t)
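A minimal sketch of add-one (Laplace) smoothing, the simplest of the discounting schemes listed above, applied to the bigram estimates; the corpus and vocabulary are toy assumptions.

```python
from collections import Counter

corpus = ["<s> the big dog </s>".split(), "<s> the big pig </s>".split()]
vocab = {w for sent in corpus for w in sent}
V = len(vocab)

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

def p_mle(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

def p_laplace(w2, w1):
    # Add one to every bigram count; the denominator grows by the vocabulary size,
    # so the smoothed probabilities still sum to one over w2.
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_mle("big", "the"), p_laplace("big", "the"))   # a seen bigram loses a little mass
print(p_mle("pig", "dog"), p_laplace("pig", "dog"))   # an unseen bigram gains non-zero mass
```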
Still don't get it? • Markov chain: a special stochastic process • p(x_{t+1} | x1, x2, ..., x_t) = p(x_{t+1} | x_t): a transition matrix • Markov model • state transition matrix p_ij = p(s_j | s_i) <-- 1st Markov assumption • signal matrix a_ij = p(o_j | s_i) <-- 2nd Markov assumption • initial state vector v_i = p(s_i) • hidden Markov model • the state sequence is hidden • only the signal sequence can be observed
I think I'm starting to get it • entropy • for a discrete r.v. with p(x_i): how much uncertainty there is from not knowing the outcome of the r.v. • the average number of bits needed to encode each particular outcome • H[r] = E[-log2 P(r)] = -Σ_i p_i log2 p_i • maximal under a uniform distribution • joint entropy / conditional entropy • mutual information • MI[r1, r2] = E[log p(r1, r2) / (p(r1) p(r2))] = H[r1] - H[r1|r2] • MI >> 0: strong correlation • MI << 0: strong negative correlation
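A minimal sketch of the entropy and mutual-information definitions on this slide, computed for a small made-up joint distribution; the variables and probabilities are illustrative assumptions only.

```python
import math

def entropy(p):
    """H[r] = -sum_i p_i log2 p_i for a distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# joint[(x, y)] = p(x, y); toy values chosen to sum to 1.
joint = {("rain", "wet"): 0.4, ("rain", "dry"): 0.1,
         ("sun", "wet"): 0.1, ("sun", "dry"): 0.4}

px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p
    py[y] = py.get(y, 0) + p

def mutual_information(joint, px, py):
    """MI = E[log2 p(x,y) / (p(x) p(y))]; values well above 0 signal strong correlation."""
    return sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

print(entropy(list(px.values())))          # 1 bit: uniform over {rain, sun}
print(mutual_information(joint, px, py))   # positive: the two variables are correlated
```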
Is that all? • perplexity • perp[r] = e^{H[r]} = the branching factor in the word sequence • cross entropy: how good is your model? • H_p[q] = -Σ_x p(x) ln q(x) (minimal when q = p) • relative entropy (KL distance; KL divergence) • a distance measure between two distributions p, q • D[p||q] = H_p[q] - H[p] >= 0 • information radius (IRad): D[p||(p+q)/2] + D[q||(p+q)/2], one of the best divergence measures between probability distributions
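A minimal sketch of cross entropy, KL divergence, and perplexity as defined above, using base-2 logs (so perplexity is 2^H rather than e^H; the two agree when the log base matches). The distributions p ("truth") and q (a model) are toy assumptions.

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.4, "b": 0.4, "c": 0.2}

def cross_entropy(p, q):
    """H_p[q] = -sum_x p(x) log2 q(x); minimal exactly when q == p."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D[p || q] = H_p[q] - H[p] >= 0."""
    return cross_entropy(p, q) - entropy(p)

def perplexity(p):
    """2^H[p] with H in bits: the average branching factor."""
    return 2 ** entropy(p)

print(entropy(p), cross_entropy(p, q), kl_divergence(p, q), perplexity(p))
```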
Basic corpus linguistics • empirical evaluation • black-box evaluation: testing the system as a whole • glass-box evaluation: component-wise testing • needs designed test material + annotated corpora + an evaluation measure
Basic corpus linguistics • contingency table measures (a = selected and correct, b = selected but incorrect, c = correct but missed, d = correctly rejected) • recall = a / (a + c); measures completeness • precision = a / (a + b); measures correctness • fallout = b / (b + d)
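A minimal sketch of the contingency-table measures above; the counts a, b, c, d are made-up numbers for illustration.

```python
# a = selected and correct, b = selected but incorrect,
# c = correct but missed, d = correctly rejected (all toy values).
a, b, c, d = 80, 20, 40, 860

recall = a / (a + c)        # completeness: how much of what is relevant was found
precision = a / (a + b)     # correctness: how much of what was found is relevant
fallout = b / (b + d)       # how much irrelevant material was wrongly selected

print(f"recall={recall:.2f} precision={precision:.2f} fallout={fallout:.3f}")
```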
Basic corpus linguistics • corpora according to text type • balanced corpora (e.g. Brown corpus) • pyramidal corpora: large samples of a few representative genres down to small samples of a wide variety of genres • opportunistic corpora • corpora according to annotation type • raw corpus: tokenized/cleaned/meta-tagged • POS-tagged • treebanks • sense-tagged • corpora according to use • training • testing/evaluation • cross validation (10-fold)
Basic corpus linguistics • Zipf's law • f ∝ 1/r (f: frequency of a word, r: rank of the word) • the principle of least effort for both speaker (a small, frequent vocabulary) and listener (a large vocabulary of rarer, less ambiguous words) • only a few words will have enough examples ==> smoothing is always needed! • collocations: the whole is more than the sum of its parts (cf. co-occurrence: association with no ordering constraint) • compounds (e.g. disk drive) • phrasal verbs (e.g. make up) • stock phrases (e.g. bacon and eggs) • idioms (e.g. kick the bucket) • may be several words long and can be discontinuous • measures: variance, t-test, χ²-test, likelihood ratio, MI, etc. • KWIC (keyword in context)
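A minimal sketch of scoring candidate collocations with pointwise mutual information, one of the measures listed above. The tokenized text is a toy assumption; real use needs large corpora and a frequency cutoff, because rare pairs get inflated PMI scores (the Zipfian sparse-data problem the slide warns about).

```python
import math
from collections import Counter

tokens = ("the disk drive failed so we bought a new disk drive "
          "and made up a backup plan for the drive").split()

N = len(tokens)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def pmi(w1, w2):
    """log2 [ p(w1, w2) / (p(w1) p(w2)) ] with MLE probabilities over adjacent pairs."""
    p12 = bigrams[(w1, w2)] / (N - 1)
    p1, p2 = unigrams[w1] / N, unigrams[w2] / N
    return math.log2(p12 / (p1 * p2))

# Frequency cutoff: only consider pairs seen at least twice.
candidates = [b for b, c in bigrams.items() if c >= 2]
print(sorted(candidates, key=lambda b: pmi(*b), reverse=True))   # [('disk', 'drive')]
```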
contents • statistical vs. structured NLP • statistics for computational linguistics • POS tagging • PCFG parsing • other applications • conclusion
POS tagging • example: morpheme/POS lattice output of a Korean tagger for the sentence "포항공대 이근배 교수님께서 신을 신고 신고하러 가신다." (roughly: "Professor Geunbae Lee of POSTECH puts on his shoes and goes to make a report"; each line shows a span, scores, and the chosen morpheme with its tag)
[ 0,0 ( 0,0 )] 등 1.000000e+00(1.000000e+00) s<문장시작>([)
[ 1,10( 1,1 )]미 8.288423e-11(6.102822e-13) MPO<포항공대>(포항공대)
[11,11( 2,2 )] 등 8.736421e-02(2.559207e-20) s<#>(#)
[12,18( 3,3 )]미 9.236515e-08(7.008548e-24) MPN<이근배>(이근배)
[19,19( 4,4 )] 등 8.736421e-02(2.939022e-31) s<#>(#)
[20,23( 5,5 )] 등 4.469725e+00(1.564634e-25) MC<교수>(교수)
[24,26( 6,6 )] 등 1.373613e+02(1.504397e-25) -<님>(님)
[27,30( 7,7 )] 등 1.307859e+01(1.831031e-25) jC<이>(께서)
[31,31( 8,8 )] 등 8.736421e-02(7.678394e-33) s<#>(#)
[32,34( 9,9 )] 등 3.250709e+00(3.667919e-27) MC<신>(신)
[35,37(10,10)] 등 1.264760e+01(3.865534e-27) jC<을>(을)
[38,38(11,11)] 등 8.736421e-02(1.621005e-34) s<#>(#)
[39,41(12,12)] 등 5.807344e+00(1.021970e-28) DR<신>(신)
[42,43(13,13)] 등 3.936314e+01(1.918250e-28) eCC<고>(고)
[44,44(14,14)] 등 8.736421e-02(8.044147e-36) s<#>(#)
[45,49(15,15)] 등 8.588220e-04(1.297090e-33) MC<신고>(신고)
[50,51(16,16)] 등 2.626376e+01(1.404345e-33) y<하>(하)
[52,56(17,19)] 등 1.445488e+03(1.043073e-31) eCC<러>(러)
[52,56(17,19)] 등 1.445488e+03(1.043073e-31) s<#>(#)
[52,56(17,19)] 등 1.445488e+03(1.043073e-31) DI<가>(가)
[57,58(20,20)] 등 4.657808e+01(1.348953e-31) eGS<시>(시)
[59,61(21,21)] 등 1.841659e+01(4.754894e-31) eGE<는다>(ㄴ다)
[62,64(22,22)] 등 1.250000e-07(1.365400e-38) s.<.>(.)
[65,65(23,23)] 등 2.500000e-05(1.638481e-49) s<문장끝>(])
POS tagging • task: finding argmax_{t1,n} p(w1,n, t1,n) • HMM modeling • states = tags, signals = words • cf. ME (maximum entropy) modeling / NN (neural net) modeling • the 3 problems of HMM modeling, given a sequence of signals (o1,n) and states (s1,n): • estimate the signal-sequence probability p(o1,n) ==> e.g. language identification / language modeling • determine the most probable state sequence argmax_{s1,n} p(o1,n, s1,n) ==> e.g. POS tagging, speech recognition • determine the model parameters (P, A, v) for a given signal sequence ==> HMM training (MLE)
POS tagging • task: finding argmax_{t1,n} p(w1,n, t1,n) • ≈ argmax_{t1,n} ∏_i p(w_i | t_i) p(t_{i+1} | t_i) • two Markov assumptions after chain-rule conditionalization • Viterbi algorithm: time-synchronous dynamic programming • δ_i(t+1) = [max_k δ_k(t) p(s_i | s_k)] · p(w_{t+1} | s_i), keeping a back-pointer to the maximizing k • δ: probability of the best state (tag) sequence ending in s_i; s: state (tag); w: word • (trellis diagram: states s_k, s_j, s_i across times T, T+1, T+2)
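A minimal Viterbi sketch for the tagging model above; the tag set and all transition/emission probabilities are invented toy values, not taken from the tutorial.

```python
tags = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}                       # v_i = p(s_i)
trans = {"DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},               # p_ij = p(s_j | s_i)
         "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
         "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1}}
emit = {"DT": {"the": 0.9, "dog": 0.0, "barks": 0.0},           # a_ij = p(o_j | s_i)
        "NN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
        "VB": {"the": 0.0, "dog": 0.1, "barks": 0.9}}

def viterbi(words):
    """Time-synchronous dynamic programming over the tag trellis."""
    delta = [{t: start[t] * emit[t][words[0]] for t in tags}]   # best score ending in t
    back = [{}]
    for w in words[1:]:
        delta.append({})
        back.append({})
        for t in tags:
            # delta_i(t+1) = [max_k delta_k(t) * p(s_i | s_k)] * p(w_{t+1} | s_i)
            best_prev = max(tags, key=lambda k: delta[-2][k] * trans[k][t])
            delta[-1][t] = delta[-2][best_prev] * trans[best_prev][t] * emit[t][w]
            back[-1][t] = best_prev
    # Follow back-pointers from the best final state.
    best = max(tags, key=lambda t: delta[-1][t])
    path = [best]
    for bp in reversed(back[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))    # expected: ['DT', 'NN', 'VB']
```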
POS tagging • training the POS tagger • r_1(i) = probability of starting in state s_i • Σ_t e_t(i,j) = expected number of transitions from state s_i to state s_j • Σ_t r_t(i) = expected number of transitions out of state s_i • Σ_{t: w_t = w_j} r_t(i) = expected number of times word w_j is emitted in state s_i • v_i = r_1(i) (initial state vector) • p_ij = Σ_t e_t(i,j) / Σ_t r_t(i) (transition matrix) • a_ij = Σ_{t: w_t = w_j} r_t(i) / Σ_t r_t(i) (observation matrix) • tagged corpus (supervised): frequency counting (a visible Markov model, not an HMM) • raw corpus (unsupervised): EM algorithm (Baum-Welch re-estimation)
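A minimal sketch of the supervised case above: with a tagged corpus, the initial, transition, and emission parameters are just relative frequencies; no EM is needed. The tiny tagged corpus is an assumption for illustration.

```python
from collections import Counter

tagged = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("big", "JJ"), ("dog", "NN"), ("sleeps", "VB")],
]

init = Counter(sent[0][1] for sent in tagged)
trans = Counter((t1, t2) for sent in tagged
                for (_, t1), (_, t2) in zip(sent, sent[1:]))
emit = Counter((t, w) for sent in tagged for w, t in sent)
tag_count = Counter(t for sent in tagged for _, t in sent)

p_init = {t: c / len(tagged) for t, c in init.items()}                    # v_i
p_trans = {(t1, t2): c / tag_count[t1] for (t1, t2), c in trans.items()}  # p_ij
p_emit = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}         # a_ij

print(p_trans[("DT", "NN")])   # 0.5: half of the DT tokens are followed by NN
print(p_emit[("NN", "dog")])   # 1.0: NN only ever emits "dog" in this toy corpus
```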
POS tagging • problems of HMM training (using the EM algorithm = MLE) • critical points (the estimates never move): add random noise • over-fitting to the training data • local maxima (EM only finds a local optimum of the likelihood) • calculation of p(w1,n) • define α_i(t): probability of ending in state s_i having emitted w_{1,t-1} (forward variable) • define β_i(t): probability of seeing w_{t,n} if the state is s_i at time t (backward variable) • α_j(t+1) = [Σ_i α_i(t) p(s_j | s_i)] p(w_t | s_j) • p(w1,n) = Σ_i α_i(n+1)
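A minimal sketch of the forward recursion above for computing p(w1,n) by summing over all hidden tag paths. It reuses the same invented toy HMM parameters as the Viterbi sketch; none of the numbers come from the tutorial.

```python
tags = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans = {"DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},
         "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
         "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1}}
emit = {"DT": {"the": 0.9, "dog": 0.0, "barks": 0.0},
        "NN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
        "VB": {"the": 0.0, "dog": 0.1, "barks": 0.9}}

def forward(words):
    """alpha_j(t+1) = [sum_i alpha_i(t) p(s_j|s_i)] p(w|s_j); return sum_i alpha_i at the end."""
    alpha = {t: start[t] * emit[t][words[0]] for t in tags}
    for w in words[1:]:
        alpha = {t: sum(alpha[k] * trans[k][t] for k in tags) * emit[t][w]
                 for t in tags}
    return sum(alpha.values())

# Marginal probability of the word sequence (compare with Viterbi, which keeps only the max path).
print(forward(["the", "dog", "barks"]))
```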
contents • statistical vs. structured NLP • statistics for computational linguistics • POS tagging • PCFG parsing • other applications • conclusion
PCFG parsing • P(w1,n) = Σ_t P(t), summing over the parse trees t covering w1 ... wn • P(t) = ∏_{rules r used in t} P(r), using the subtree-independence assumption • (diagram: a nonterminal N^j_{k,l} dominating the span wk ... wl)
PCFG parsing • why PCFG? • ordering of the parses (structural ambiguity) • accounts for grammar induction (with only positive examples) • compatible with lexical language modeling • the green banana • PCFG: n --> banana / n --> time / n --> number ... • trigram: the green (time / number / banana) • Fred watered his mother's small garden • trigram: p(garden | mother's small) • PCFG: p(x = garden | x is the head of the direct object of "to water")
PCFG parsing • PCFG vs. HMM (probabilistic regular grammar) ==> they differ in how probability mass is assigned • in a PCFG: Σ_s p(s) = 1, summing over all sentences s • in an HMM: Σ_{w1,n} p(w1,n) = 1, summing over sentences of a fixed length n • p1: Alice went to the ____. • p2: Alice went to the office. • in the HMM p1 > p2 • in the PCFG p1 < p2
PCFG parsing • finding p(w1,n) • β_j(k,l) = P(w_{k,l} | N^j_{k,l}): inside probability (similar to the backward probability) • α_j(k,l) = P(w_{1,k-1}, N^j_{k,l}, w_{l+1,n}): outside probability (similar to the forward probability) • (diagram: the root N^1 spans w1 ... wn, with N^j_{k,l} dominating wk ... wl)
PCFG parsing • finding p(w1,n) using the inside probability • cf. finding the most likely parse for a sentence • β_1(1,n) = P(w_{1,n} | N^1_{1,n}) = P(w_{1,n}) • for a Chomsky-normal-form grammar: • β_j(k,k) = P(N^j --> w_k) • β_j(k,l) = Σ_{p,q} Σ_{m=k}^{l-1} P(N^j --> N^p N^q) β_p(k,m) β_q(m+1,l) • (diagram: N^j rewrites as N^p N^q, with N^p spanning wk ... wm and N^q spanning w_{m+1} ... wl)
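A minimal sketch of the inside recursion above for a Chomsky-normal-form PCFG, computed bottom-up over spans (CKY style). The toy grammar and its probabilities are assumptions; rule probabilities for each left-hand side sum to one.

```python
from collections import defaultdict

binary = {                      # p(A -> B C)
    ("S", "NP", "VP"): 1.0,
    ("NP", "DT", "NN"): 1.0,
    ("VP", "VB", "NP"): 1.0,
}
lexical = {                     # p(A -> w)
    ("DT", "the"): 1.0,
    ("NN", "dog"): 0.6, ("NN", "bone"): 0.4,
    ("VB", "ate"): 1.0,
}

def inside(words):
    n = len(words)
    beta = defaultdict(float)                     # beta[(A, k, l)] over 0-based spans
    for k, w in enumerate(words):                 # beta_j(k,k) = p(N_j -> w_k)
        for (A, word), p in lexical.items():
            if word == w:
                beta[(A, k, k)] = p
    for span in range(2, n + 1):                  # widen spans bottom-up
        for k in range(0, n - span + 1):
            l = k + span - 1
            for (A, B, C), p in binary.items():
                # beta_j(k,l) = sum_{p,q,m} p(N_j -> N_p N_q) beta_p(k,m) beta_q(m+1,l)
                beta[(A, k, l)] += sum(p * beta[(B, k, m)] * beta[(C, m + 1, l)]
                                       for m in range(k, l))
    return beta[("S", 0, n - 1)]                  # p(w_1..n) = beta_S(1,n)

print(inside("the dog ate the bone".split()))     # 0.6 * 0.4 = 0.24 for the single parse
```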
PCFG parsing • training a PCFG • inside-outside re-estimation (MLE training a la Baum-Welch) • for a Chomsky-normal-form grammar: • P̂(N^i --> s^j) = C(N^i --> s^j) / Σ_k C(N^i --> s^k) (s^j: sentential form j) • C(N^j --> N^p N^q) = (1 / P(w_{1,n})) Σ_{k,l,m} α_j(k,l) P(N^j --> N^p N^q) β_p(k,m) β_q(m+1,l) • C(N^i --> w^j) = (1 / P(w_{1,n})) Σ_{k: w_k = w^j} α_i(k,k) P(N^i --> w^j) • (diagram: as above, N^j --> N^p N^q over the span wk ... wl split at wm)
contents • statistical vs. structured NLP • statistics for computational linguistics • POS tagging • PCFG parsing • other applications • conclusion
Other applications • local syntactic disambiguation • PP attachment problems • She ate the soup with the spoon (n2) • decision rule: attach to n1 when p(A = n1 | prep, v, n1) > p(A = v | prep, v, n1), otherwise to the verb • relative-clause attachment • Fred awarded a prize to the dog and Bill who trained it • noun/noun and adjective/noun combination ambiguity • song bird feeder kit • metal bird feeder kit • novice bird feeder kit
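A minimal sketch of a lexical-association decision for PP attachment, in the spirit of the Hindle and Rooth work cited in the references: compare how strongly the preposition associates with the verb versus with the object noun. All counts below are invented; a real system estimates them from a (partially) parsed corpus.

```python
from collections import Counter

# (head, preposition) co-occurrence counts and head frequencies (toy assumptions).
pair_count = Counter({("ate", "with"): 30, ("soup", "with"): 2})
head_count = Counter({"ate": 200, "soup": 50})

def attach(verb, noun, prep):
    """Attach the PP to whichever head gives the preposition the higher conditional probability."""
    p_verb = pair_count[(verb, prep)] / head_count[verb]
    p_noun = pair_count[(noun, prep)] / head_count[noun]
    return "verb" if p_verb > p_noun else "noun"

# With these made-up counts, "with" attaches to the verb (the spoon as an instrument of eating).
print(attach("ate", "soup", "with"))
```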
Other applications • word clustering • n-dimensional vector --> distance metric --> clustering algorithm • n-dimensional vector (features): collocations, verb-noun case roles • distance metrics: mutual information, relative entropy • WSD (word-sense disambiguation) • sense tagging for polysemous and homonymous words • contextual properties of a target word --> n-dimensional vector • supervised method • use a sense-tagged corpus (or a bilingual parallel corpus) • find argmax_s p(s) ∏_{x ∈ c(wi)} p(x | s), treating the context features x as mutually independent • (s: a sense, c(wi): the local context of word wi) • unsupervised methods • dictionary-based, thesaurus-based, one-sense-per-collocation, local-syntax-based, sense discrimination (clustering)
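A minimal sketch of the supervised decision rule above: pick argmax_s p(s) ∏_{x in context} p(x | s) under the independence (naive Bayes) assumption. The sense-tagged training examples are toy assumptions; add-one smoothing is used so unseen context words do not zero out a sense.

```python
import math
from collections import Counter, defaultdict

training = [                                   # (sense, context words) for an ambiguous word
    ("finance", ["money", "deposit", "loan"]),
    ("finance", ["loan", "interest", "money"]),
    ("river",   ["water", "fishing", "shore"]),
]

sense_count = Counter(s for s, _ in training)
word_count = defaultdict(Counter)
vocab = set()
for s, ctx in training:
    word_count[s].update(ctx)
    vocab.update(ctx)

def disambiguate(context):
    """Log-space naive Bayes with add-one smoothing over the context vocabulary."""
    best, best_score = None, float("-inf")
    for s, cs in sense_count.items():
        score = math.log(cs / len(training))               # log p(s)
        total = sum(word_count[s].values())
        for x in context:                                  # + sum_x log p(x | s)
            score += math.log((word_count[s][x] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = s, score
    return best

print(disambiguate(["money", "loan"]))     # expected: "finance"
print(disambiguate(["water", "shore"]))    # expected: "river"
```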
Other applications • text alignment • align lists of pairs (words, sentences, paragraphs) from two texts • solution: a subset of the Cartesian product • statistical machine translation • assign a probability to every pair of sentences • find argmax_s p(s | t) = argmax_s p(s) p(t | s) (s: source, t: target) • (noisy-channel diagram: a source language model p(s) and a translation model p(t | s) feed a decoder that computes argmax_s p(s | t))
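A minimal sketch of the noisy-channel decision rule above, scoring a handful of source-sentence candidates by log p(s) + log p(t | s). Both the candidates and their model scores are invented; a real decoder searches a vastly larger candidate space.

```python
import math

candidates = {
    # source candidate s : (log p(s) from a language model, log p(t|s) from a translation model)
    "the cat sleeps":    (math.log(0.020),  math.log(0.30)),
    "the cat is asleep": (math.log(0.008),  math.log(0.60)),
    "cat the sleeps":    (math.log(0.0001), math.log(0.35)),
}

def decode(candidates):
    """Pick the source sentence maximizing log p(s) + log p(t | s)."""
    return max(candidates, key=lambda s: sum(candidates[s]))

# Trades fluency (the language model) against adequacy (the translation model).
print(decode(candidates))   # "the cat sleeps" with these toy scores
```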
Conclusion: structure or statistics? Hybrid NLP • hybrid: structural (linguistic theory; prior knowledge) + statistical preference • Statistical analysis: data driven, empirical, connectionist; the speech community • Structural analysis: rule driven, rationalist, symbolic; NLU, the Chomskian and Schankian AI community