CS730b 김병창 NLP Lab. 1998. 10. 29

Inside-outside reestimation from partially bracketed corpora
F. Pereira and Y. Schabes, ACL 30, 1992

Contents
• Motivation
• Partially Bracketed Text
• Grammar Reestimation
• The Inside-Outside Algorithm
• The Extended Algorithm
• Complexity
• Experimental Evaluation
• Inferring the Palindrome Language
• Experiments on the ATIS Corpus
• Conclusions and Further Work
NLP Lab., POSTECH

Motivation I
• Very simple method for learning SCFGs [Charniak]
  • Generate all possible SCFG rules
  • Assign some initial probabilities
  • Run the training algorithm on a sample text (raw text)
  • Remove the rules whose probability goes to zero
• Difficulties in using SCFGs
  • Time complexity: $O(n^3 |w|^3)$
    • $n$: the number of nonterminals; $w$: the training sentence
    • cf. $O(s^2 |w|)$ for training an HMM with $s$ states
  • Bad convergence properties
    • The larger the number of nonterminals, the worse the convergence
  • Sensible constituent structure is inferred only by chance

Motivation II
• Extension of the Inside-Outside algorithm
  • Infers grammars from a partially parsed corpus
• Advantages
  • Constituent boundary information is reflected in the inferred grammar
  • Fewer training iterations
  • Better time complexity

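Partially Bracketed Text
• Example
  • (((VB (DT NNS (IN ((NN) (NN CD)))))) .)
  • (((List (the fares (for ((flight) (number 891)))))) .)
• Notation
  • Corpus $C = \{ c \mid c = (w, B) \}$, where $w$ is a string and $B$ is a bracketing of $w$
  • $w = w_1 w_2 \cdots w_i w_{i+1} \cdots w_j \cdots w_{|w|}$
  • A span $(i, j)$ delimits the substring ${}_i w_j = w_{i+1} \cdots w_j$
  • consistent: no two spans in a bracketing overlap
  • compatible: the union of two bracketings is consistent
  • valid: a span is compatible with a bracketing
• Spans in a derivation $\alpha_0 \Rightarrow \alpha_1 \Rightarrow \cdots \Rightarrow \alpha_m = w$
  • If $j = m$, the span of $w_i$ in $\alpha_j$ is $(i-1, i)$
  • If $j < m$, $\alpha_j = \beta A \gamma$ and $\alpha_{j+1} = \beta X_1 \cdots X_k \gamma$, the span of $A$ in $\alpha_j$ is $(i_1, j_k)$, where $(i_l, j_l)$ is the span of $X_l$ in $\alpha_{j+1}$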
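The notions of overlapping, consistent, compatible, and valid can be sketched in a few lines of Python. This is only an illustration: the function names and the representation of a bracketing as a set of `(i, j)` tuples are assumptions, and the span positions for the example sentence were worked out by hand from the bracketing shown on the slide.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (i, j) with 0 <= i < j <= len(w); delimits words i+1 .. j


def overlap(a: Span, b: Span) -> bool:
    """Two spans overlap when they cross: neither nested nor disjoint."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def consistent(bracketing: Iterable[Span]) -> bool:
    """A bracketing is consistent when no two of its spans overlap."""
    spans = list(bracketing)
    return not any(
        overlap(a, b) for n, a in enumerate(spans) for b in spans[n + 1:]
    )


def compatible(b1: Iterable[Span], b2: Iterable[Span]) -> bool:
    """Two bracketings are compatible when their union is consistent."""
    return consistent(set(b1) | set(b2))


def valid(span: Span, bracketing: Iterable[Span]) -> bool:
    """A span is valid for a bracketing when {span} is compatible with it."""
    return compatible({span}, bracketing)


# "List the fares for flight number 891 ." has 8 tokens (positions 0..8).
example = {(0, 8), (0, 7), (1, 7), (3, 7), (4, 7), (4, 5), (5, 7)}
```

With this bracketing, `valid((1, 3), example)` holds because "the fares" crosses no bracket, while `valid((2, 5), example)` fails because "fares for flight" crosses the bracket around "for flight number 891".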

Grammar Reestimation
• Uses of the reestimation algorithm
  • Refining parameter estimates for an SCFG derived by other means
  • Inferring a grammar from scratch
• Grammar inference
  • Given a set $N$ of nonterminals and a set $\Sigma$ of terminals
  • $n = |N|$, $t = |\Sigma|$
  • $N = \{A_1, \ldots, A_n\}$, $\Sigma = \{b_1, \ldots, b_t\}$
  • A CNF SCFG over $N$, $\Sigma$ has $n^3 + nt$ probabilities
    • $B_{p,q,r}$ on $A_p \to A_q A_r$: $n^3$
    • $U_{p,m}$ on $A_p \to b_m$: $nt$
  • Meaning of rule probabilities: intuition of context-freeness
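A minimal sketch of how the $n^3 + nt$ parameters might be laid out and initialized before reestimation. The random initialization, the nested-list layout, and the names `init_cnf_scfg` and `num_rules` are assumptions made for illustration; the slides only say that some initial probabilities are assigned.

```python
import random


def num_rules(n: int, t: int) -> int:
    """Number of rule probabilities in a CNF SCFG: n^3 binary plus n*t unary."""
    return n ** 3 + n * t


def init_cnf_scfg(n: int, t: int, seed: int = 0):
    """Return (B, U) with B[p][q][r] for A_p -> A_q A_r and U[p][m] for A_p -> b_m.

    For each nonterminal A_p the probabilities of all its rules sum to 1.
    Random initial values are an assumption of this sketch.
    """
    rng = random.Random(seed)
    B = [[[rng.random() for _ in range(n)] for _ in range(n)] for _ in range(n)]
    U = [[rng.random() for _ in range(t)] for _ in range(n)]
    for p in range(n):
        total = sum(sum(row) for row in B[p]) + sum(U[p])
        B[p] = [[x / total for x in row] for row in B[p]]
        U[p] = [x / total for x in U[p]]
    return B, U
```

For 5 nonterminals and 2 terminals `num_rules` gives 135, and for 15 nonterminals and 48 terminals it gives 4095.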

The Inside-Outside Algorithm
• Definition of inner (e) and outer (f) probabilities
[Figure: inner probability and outer probability of nonterminal i under S, over positions 1, s-1, s, t, t+1, T]
Special thanks to ohwoog
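The formulas on this slide did not survive. One reading consistent with the surviving labels (S, i, 1, s-1, s, t, t+1, T) is the usual definition of inner and outer probabilities over word positions 1..T. The argument order and the symbols $w_s$ and $A_i$ below are assumptions, not taken from the slide.

```latex
e(s, t, i) = P\bigl(A_i \Rightarrow^{*} w_s \cdots w_t\bigr),
\qquad
f(s, t, i) = P\bigl(S \Rightarrow^{*} w_1 \cdots w_{s-1}\, A_i\, w_{t+1} \cdots w_T\bigr)
```

Under this reading the inner probability covers the words dominated by nonterminal $i$, and the outer probability covers everything in the sentence outside them.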

The Extended Algorithm
• Compatible function: 1 if a span is valid for the sentence's bracketing, 0 otherwise
• Extended algorithm
  • See Table 1.
  • Inside probabilities: (1), (2); (2) uses the compatible function.
  • Outside probabilities: (3), (4); (4) uses the compatible function.
  • Parameter reestimation: (5), (6); same as in the original algorithm.
• Stopping criterion
  • When the decrease in the cross-entropy estimate becomes negligible.
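As a rough sketch of the inside pass, equations (1) and (2), the compatible function simply zeroes out every span that crosses a bracket. The name `inside`, the 0-based indices, the dictionary layout, and the toy grammar in the usage note are assumptions for illustration; the outside pass and reestimation are not shown.

```python
from itertools import product


def overlap(a, b):
    """True when spans a and b cross (neither nested nor disjoint)."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def inside(words, B, U, bracketing=()):
    """Inside probabilities restricted to spans valid for `bracketing`.

    words: terminal indices m (0-based), one per word
    B[p][q][r]: probability of A_p -> A_q A_r
    U[p][m]: probability of A_p -> b_m
    Returns a dict I with I[p, i, k] for every span (i, k).
    """
    n, length = len(B), len(words)

    def c(i, k):
        # Compatible function: 1 if (i, k) is valid for the bracketing, else 0.
        return 0.0 if any(overlap((i, k), b) for b in bracketing) else 1.0

    I = {}
    for i in range(length):                 # equation (1): unit spans
        for p in range(n):
            I[p, i, i + 1] = U[p][words[i]]
    for width in range(2, length + 1):      # equation (2): longer spans
        for i in range(length - width + 1):
            k = i + width
            for p in range(n):
                if c(i, k) == 0.0:
                    I[p, i, k] = 0.0        # span crosses a bracket
                    continue
                I[p, i, k] = sum(
                    B[p][q][r] * I[q, i, j] * I[r, j, k]
                    for j in range(i + 1, k)
                    for q, r in product(range(n), repeat=2)
                )
    return I
```

With a one-nonterminal toy grammar S → S S (0.5), S → a (0.5), that is `B = [[[0.5]]]` and `U = [[0.5]]`, the sentence "a a a" gets `inside([0, 0, 0], B, U)[0, 0, 3] == 0.0625` (two trees), and adding the bracket `(0, 2)` halves it to `0.03125` because the right-branching tree is ruled out. This sketch still loops over every span; the linear best case needs the preprocessing step that enumerates only valid spans and split points.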

Complexity
• Complexity of the original algorithm: $O(|w|^3)$ for each sentence
  • Computing inside probabilities, computing outside probabilities, and reestimating rule probabilities each take $O(|w|^3)$ per sentence
• Complexity of the extended algorithm: $O(|w|)$ in the best case
  • The best case is a full binary bracketing $B$ of a string $w$
    • $O(|w|)$ spans in $B$
    • Only one split point for each $(i, k)$
    • Each valid span must be a member of $B$
• Preprocessing
  • Enumerating valid spans and split points
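The gap between the two cases can be made concrete by counting, for a sentence of a given length, the valid spans of width at least 2 and the usable split points. The sketch below is only a counting aid under assumed names (`count_work`), not the preprocessing step itself, and the brute-force enumeration it uses is deliberately naive.

```python
def overlap(a, b):
    """True when spans a and b cross (neither nested nor disjoint)."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def count_work(length, bracketing=()):
    """Count (valid spans of width >= 2, split points with both halves valid)."""
    bracketing = list(bracketing)

    def is_valid(span):
        return not any(overlap(span, b) for b in bracketing)

    spans = splits = 0
    for i in range(length):
        for k in range(i + 2, length + 1):
            if not is_valid((i, k)):
                continue
            spans += 1
            splits += sum(
                1
                for j in range(i + 1, k)
                if is_valid((i, j)) and is_valid((j, k))
            )
    return spans, splits


# A fully right-branching binary bracketing of a 6-word sentence.
right_branching = [(i, 6) for i in range(5)]
```

For 6 words, `count_work(6)` gives 15 spans and 35 split points, whereas `count_work(6, right_branching)` gives 5 spans with one split point each, which is the linear behavior described above.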

Experimental Evaluation
• Two experiments
  • Artificial language: palindromes
  • Natural language: Penn Treebank
• Evaluation
  • Bracketing accuracy
    • Proportion of phrases that are compatible with the treebank bracketing
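A minimal sketch of the bracketing accuracy measure for a single sentence, assuming the predicted phrases and the reference bracketing are both given as lists of `(i, j)` spans. The function name and the choice to raise an error on an empty prediction are assumptions of this sketch.

```python
def overlap(a, b):
    """True when spans a and b cross (neither nested nor disjoint)."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def bracketing_accuracy(predicted, reference):
    """Fraction of predicted phrases that cross no bracket in the reference."""
    predicted = list(predicted)
    reference = list(reference)
    if not predicted:
        raise ValueError("no predicted phrases to score")
    ok = sum(
        1 for span in predicted
        if not any(overlap(span, ref) for ref in reference)
    )
    return ok / len(predicted)
```

For example, scoring the predicted phrases `[(0, 3), (0, 2), (1, 3)]` against the reference `[(0, 3), (0, 2)]` gives 2/3, since only `(1, 3)` crosses a reference bracket.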

Inferring the Palindrome Language
• $L = \{ w w^R \mid w \in \{a, b\}^* \}$
• Initial grammar: 135 rules ($= 5^3 + 5 \cdot 2$)
• Training with 100 sentences
• Inferred grammar: a correct grammar for the palindrome language
• Bracketing accuracy: above 90% (100% in several cases)
• With unbracketed training: 15% - 69%

Experiments on the ATIS Corpus
• ATIS (Air Travel Information System) corpus: 770 sentences (7812 words)
  • 700 sentences for training, 70 for testing (901 words)
• Initial grammar: 4095 rules ($= 15^3 + 15 \cdot 48$)
  • 15 nonterminals, 48 terminal symbols for POS tags
• Bracketing accuracy: 90.36% after 75 iterations
  • With unbracketed training: 37.35%
• Case (A)
  • (Delta flight number): not compatible
  • (the cheapest): linguistically wrong; lack of information
  • 16 incompatible brackets in GR
• Case (B)
  • Fully compatible
  • 9 incompatible brackets in GR

Conclusions and Further Work
• The use of a partially bracketed corpus can
  • reduce the number of iterations needed for convergence
  • find good solutions
  • infer grammars specifying linguistically reasonable constituent boundaries
  • reduce time complexity (linear in the best case)
• Further extensions
  • Determining sensitivity to
    • the initial probability assignments
    • the training corpus
    • lack or misplacement of brackets
  • Larger terminal vocabularies
