Syntax-based and Factored Language Models. Rashmi Gangadharaiah, April 16th, 2008. Noisy Channel Model. Why is MT output still bad? Strong translation models, weak language models. Using other knowledge sources in model building? Parse trees, taggers etc. How much improvement?
Syntax-based and Factored Language Models Rashmi Gangadharaiah April 16th, 2008
Why is MT output still bad? • Strong translation models, weak language models • Using other knowledge sources in model building? • Parse trees, taggers etc. • How much improvement? • Models can be computationally expensive • n-gram models are the least expensive models • Other models have to be efficiently coded
Conventional Language models • n-gram word-based language model: p(wi|h)=p(wi|wi-1,…,w1) • Retain only the n-1 most recent words of history to avoid storing a large number of parameters: p(wi|h)=p(wi|wi-1,…,wi-n+1); for n=3, p(S)=p(w1)p(w2|w1)…p(wi|wi-1,wi-2) • Estimated using MLE • Inaccurate probability estimates for higher-order n-grams • Smoothing/discounting to overcome sparseness
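The MLE estimate and the effect of smoothing can be illustrated with a minimal sketch. The function names and the add-k smoothing scheme are assumptions made for illustration; the slides do not say which smoothing method was used.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts (hypothetical helper)."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for sent in sentences:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        vocab.update(toks)
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            ctx[(toks[i - 2], toks[i - 1])] += 1
    return tri, ctx, vocab

def trigram_prob(w, history, tri, ctx, vocab, k=0.0):
    """p(w | history) with add-k smoothing; k=0 gives the plain MLE estimate."""
    num = tri[(history[0], history[1], w)] + k
    den = ctx[history] + k * len(vocab)
    return num / den if den > 0 else 0.0

tri, ctx, vocab = train_trigram([["a", "b"], ["a", "c"]])
print(trigram_prob("b", ("<s>", "a"), tri, ctx, vocab))         # MLE
print(trigram_prob("d", ("<s>", "a"), tri, ctx, vocab, k=1.0))  # unseen word, smoothed
```

With k=0 an unseen trigram gets probability zero, which is the sparseness problem the last bullet refers to; any k>0 moves some mass to unseen events.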
Problems still present in the n-gram model • Do not make efficient use of the training corpus • Blindly discards relevant words that lie n positions in the past • Retains words of little or no value • Do not generalize well to unseen word sequences → main motivation for using class-based LMs and factored LMs • Lexical dependencies are structurally related rather than sequentially related → main motivation for using syntactic/structural LMs
Earlier work on incorporating low-level syntactic information (1) • Group words into classes (1) • P. F. Brown et al.: • Start with each word in a separate class, iteratively combine classes • Heeman's (1998) POS LM: • Achieved a perplexity reduction compared to a trigram LM by redefining the speech recognition problem to find the best word sequence together with its POS tags. P.F. Brown et al. 1992. Class-Based n-Gram Models of Natural Language. In Computational Linguistics, 18(4):467-479. P.A. Heeman. 1998. POS tagging versus classes in language modeling. In Proceedings of the 6th Workshop on Very Large Corpora, Montreal.
Earlier work on incorporating low-level syntactic information (2) • Use predictive clustering and conditional clustering • Predictive: P(Tuesday|party on)=P(WEEKDAY|party on)*P(Tuesday|party on WEEKDAY) • Conditional: P(Tuesday|party EVENT on PREPOSITION) • Backoff order: from P(wi|wi-2Wi-2wi-1Wi-1) to P(wi|Wi-2wi-1Wi-1) (=P(Tuesday|EVENT on PREPOSITION)) to P(wi|wi-1Wi-1) (=P(Tuesday|on PREPOSITION)) to P(wi|Wi-1) (=P(Tuesday|PREPOSITION)) to P(wi) (=P(Tuesday)) J. Goodman. 2000. Putting it all together: Language model combination. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 1647-1650, Istanbul
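The predictive-clustering decomposition can be sketched on toy counts. The event list, the class map, and the function name below are hypothetical; both factors are plain MLE estimates here.

```python
from collections import Counter

def predictive_cluster_prob(word, history, events, word_class):
    """P(word | history) = P(class | history) * P(word | history, class).

    events: list of (history, word) pairs; word_class: word -> class label.
    Both are toy inputs assumed for illustration.
    """
    hist_count = Counter(h for h, _ in events)
    hist_class = Counter((h, word_class[w]) for h, w in events)
    hist_word = Counter(events)
    cls = word_class[word]
    if hist_count[history] == 0 or hist_class[(history, cls)] == 0:
        return 0.0
    p_class = hist_class[(history, cls)] / hist_count[history]
    p_word = hist_word[(history, word)] / hist_class[(history, cls)]
    return p_class * p_word

events = [(("party", "on"), "Tuesday"), (("party", "on"), "Friday"),
          (("party", "on"), "Tuesday"), (("party", "on"), "fire")]
word_class = {"Tuesday": "WEEKDAY", "Friday": "WEEKDAY", "fire": "OTHER"}
print(predictive_cluster_prob("Tuesday", ("party", "on"), events, word_class))
```

With unsmoothed counts the product equals the direct estimate count(h, w)/count(h); the decomposition only pays off once each factor is smoothed or backed off separately, as in the backoff order listed on the slide.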
More Complex Language models that we will look at today… • LMs that incorporate syntax • Charniak et al. 2003 Syntax-based LM (in MT) • LMs that incorporate both syntax and semantics • Model Headword Dependency only • N-best rescoring strategy • Chelba et al. 1997 almost parsing (in ASR) • Full parsing for decoding word lattices • Chelba et al. 1998 full parsing in left-to-right fashion with Dependency LM (in ASR) • Model both Headword and non-Headword Dependencies • N-best rescoring strategy • Wang et al. 2007 SuperARV LMs (in MT) • Kirchhoff et al. 2005 Factored LMs (in MT) • Full parsing • Rens Bod 2001 Data Oriented Parsing (in ASR) • Wang et al. 2003 (in ASR)
link grammar to model long distance dependencies (1) • Maximum Entropy language model that incorporates both syntax and semantics via dependency grammar • Motivation: • dependencies are structurally related rather than sequentially related • Incorporates the predictive power of words that lie outside of bigram or trigram range • Elements of the model: A disjunct rule shows how a word must be connected to other words in a legal parse. Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Ronald Rosenfeld, Andreas Stolcke, Dekai Wu, 1997, "Structure and performance of a dependency language model", In Eurospeech
link grammar to model long distance dependencies (2) • Maps of histories • Mapping retains • finite context of 0,1, or 2 preceding words • a link stack consisting of open links at the current position and the identities of the words from which they emerge
link grammar to model long distance dependencies (3) • Maximum entropy formulation • to treat each of the numerous elements of [h] as a distinct predictor variable • Link grammar feature function • “[h] matches d”: d is a legal disjunct to occupy the next position in the parse • “yLz”: at least one of the links must bear label L and connect to word y
link grammar to model long distance dependencies (4) • Tagging and Parsing • Dependency parser of Michael Collins (requires POS tags). • P(S,K) = P(S|K) P(K|S) • Parser didn't operate in left-to-right direction, hence used N-best lists. • Training and testing data drawn from Switchboard corpus and from Treebank corpus • Trained tagger on 1 million words (Ratnaparkhi), applied it on 226,000 words of hand-parsed training set and finally applied this on 1.44 million words, tested on 11 time-marked telephone transcripts • Dependency model • Used the Maximum Entropy modeling toolkit • Generated 100 best hypotheses for each utterance • P(S)= • Achieved reduction in WER from 47.4% (adjacency bigram) to 46.4%
Syntactic structure to model long distance dependencies (1) • Language model develops syntactic structure and uses it to extract meaningful information from the word history • Motivation: • 3-gram approach would predict “after” from (7, cents) • strongest predictor should be “ended” • Syntactic structure in the past filters out irrelevant words Ciprian Chelba and Frederick Jelinek, 1998 “Exploiting Syntactic Structure for Language modeling”, ACL Headword of (ended(with(..))) Exposed headword when predicting “after”
Syntactic structure to model long distance dependencies (2) • Terminology • Wk: word k-prefix w0….wk of the sentence • WkTk: the word-parse k-prefix • A word-parse k-prefix contains – for a given parse – only those binary subtrees whose span is completely included in the word k-prefix, excluding w0 = <s> • Single words along with their POS tag can be regarded as root-only trees
Syntactic structure to model long distance dependencies (3) • Model operates by means of three modules • WORD-PREDICTOR • Predicts the next word wk+1 given the word-parse k-prefix and passes control to the TAGGER • TAGGER • Predicts the POS tag of the next word tk+1 given the word-parse k-prefix and wk+1 and passes control to the PARSER • PARSER • Grows the already existing binary branching structure by repeatedly generating the transitions: (unary, NTlabel), (adjoin-left, NTlabel) or (adjoin-right, NTlabel) until it passes control to the PREDICTOR by taking a null transition
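The PARSER transitions can be pictured as operations on a stack of exposed (headword, label) pairs. The sketch below is illustrative only: it assumes that adjoin-left keeps the headword of the left constituent and adjoin-right keeps the headword of the right one, and the function name, labels and toy stack are hypothetical.

```python
def apply_transition(stack, action, label):
    """Apply one PARSER transition to a stack of (headword, label) pairs.

    The rightmost exposed constituent is the last element of the list.
    Assumption: adjoin-left percolates the left headword upward,
    adjoin-right percolates the right headword.
    """
    if action == "unary":
        head, _ = stack.pop()
        stack.append((head, label))
    elif action in ("adjoin-left", "adjoin-right"):
        right = stack.pop()
        left = stack.pop()
        head = left[0] if action == "adjoin-left" else right[0]
        stack.append((head, label))
    else:
        raise ValueError(f"unknown transition: {action}")
    return stack

# Toy stack: three exposed constituents, rightmost last.
stack = [("ended", "VBD"), ("with", "IN"), ("cents", "NP")]
apply_transition(stack, "adjoin-left", "PP")   # with + cents -> (with, PP)
apply_transition(stack, "adjoin-left", "VP")   # ended + with -> (ended, VP)
print(stack)
```

After the two transitions the only exposed headword is "ended", which is the kind of context the WORD-PREDICTOR would then condition on.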
Syntactic structure to model long distance dependencies (4) • Probabilistic Model: • Word Level Perplexity
Syntactic structure to model long distance dependencies (5) • Search strategy • Synchronous multi-stack search algorithm • Each stack contains partial parses constructed by the same number of predictor and parser operations • Hypotheses ranked according to ln(P(W,T)) score • Width controlled by maximum stack depth and log-probability threshold • Parameter Estimation • Solution inspired by an HMM re-estimation technique that works on a pruned N-best trellis (Byrne) • Binarized the UPenn Treebank parse trees and percolated the headwords using a rule-based approach W. Byrne, A. Gunawardhana, and S. Khudanpur, 1998. “Information geometry and EM variants”. Technical Report CLSP Research Note 17.
Syntactic structure to model long distance dependencies (6) • Setup - UPenn Treebank corpus • Stack depth=10, log-probability threshold=6.91 nats • Training data: 1M words, word vocabulary=10k, POS tag vocabulary=40, non-terminal tag vocabulary=52 • Test data: 82,430 words • Results • Reduced test-set perplexity from 167.14 (trigram model) to 158.28 • Interpolating the model with a trigram model resulted in 148.90 (interpolation weight = 0.36)
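Linear interpolation and word-level perplexity can be sketched as follows. The slide does not say which model the weight 0.36 is applied to; the sketch assumes it multiplies the trigram probability, and the function names and toy probabilities are hypothetical.

```python
import math

def interpolate(p_structured, p_trigram, lam=0.36):
    """Linear interpolation; lam is assumed to weight the trigram model."""
    return lam * p_trigram + (1.0 - lam) * p_structured

def perplexity(word_probs):
    """Word-level perplexity: exp of the negative mean log-probability."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# Toy per-word probabilities from two models on the same four-word text.
p_slm = [0.02, 0.10, 0.05, 0.01]
p_tri = [0.01, 0.20, 0.02, 0.03]
mixed = [interpolate(a, b) for a, b in zip(p_slm, p_tri)]
print(perplexity(p_slm), perplexity(p_tri), perplexity(mixed))
```

The interpolated model can beat both components, as in the reported results, because each model is better on a different subset of words.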
Non-headword dependencies matter : DOP-based LM(1) • The DOP (Data Oriented Parsing) model learns a stochastic tree-substitution grammar (STSG) from a treebank by • extracting all subtrees from the treebank • assigning probabilities to the subtrees • DOP takes into account both headword and non-headword dependencies • Subtrees are lexicalized at their frontiers with one or more words • Motivation • Head-lexicalized grammar is limited • It cannot capture dependencies between non-headwords • E.g.: “more people than cargo”, “more couples exchanging rings in 1988 than in the previous year” (from WSJ) • Neither “more” nor “than” is a headword of these phrases • Dependency between “more” and “than” is captured by a subtree where “more” and “than” are the only frontier words. Rens Bod, 2000 “combining semantic and syntactic structure for language modeling”
Non-headword dependencies matter: DOP-based LM(2) • DOP learns an STSG from a treebank by taking all subtrees in that treebank • Eg: Consider a Treebank
Non-headword dependencies matter: DOP-based LM(3) • New sentences may be derived by combining subtrees from the treebank • Node substitution is left-associative • Other derivations may yield the same parse tree
Non-headword dependencies matter : DOP-based LM(4) • Model computes the probability of a subtree as: r(t): root label of t • Probability of a derivation • Probability of a parse tree • Probability of a word string W • Note: • does not maximize the likelihood of the corpus • implicit assumption that all derivations of a parse tree contribute equally to the total probability of the parse tree. • There is a hidden component, so DOP can be trained using EM
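For reference, the four quantities named on this slide are usually defined as below in the standard DOP formulation. This is a reconstruction consistent with the legend r(t), not a copy of the slide's own equations; |t| denotes the number of occurrences of subtree t in the treebank, which is a symbol assumed here.

```latex
% Subtree probability: relative frequency among subtrees with the same root label
P(t) = \frac{|t|}{\sum_{t' :\, r(t') = r(t)} |t'|}

% Derivation: product of the probabilities of the subtrees it combines
P(t_1 \circ \cdots \circ t_n) = \prod_{i=1}^{n} P(t_i)

% Parse tree: sum over all derivations D that yield T
P(T) = \sum_{D \text{ derives } T} \; \prod_{t \in D} P(t)

% Word string: sum over all parse trees T whose yield is W
P(W) = \sum_{T \text{ yields } W} P(T)
```

The sum over derivations in the third line is the hidden component mentioned in the note: the same parse tree can be built from different subtree combinations.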
Non-headword dependencies matter : DOP-based LM(5) • Combining semantic and syntactic structure
Non-headword dependencies matter : DOP-based LM(6) • Computation of the most probable string • NP-hard: employed Viterbi n-best search • Estimate the most probable string by the 1000 most probable derivations • OVIS corpus • 10,000 user utterances about Dutch public transport information, syntactically and semantically annotated • DOP model obtained by extracting all subtrees of depth up to 4
More and More features (A Hybrid) :SuperARV LMs (1) • SuperARV LM is a highly lexicalized probabilistic LM based on the Constraint Dependency Grammar (CDG) • CDG represents a parse as assignments of dependency relations to functional variables (roles) associated with each word in a sentence • Motivation • High levels of word prediction capability can be achieved by tightly integrating knowledge of words, structural constraints, morphological and lexical features at the word level. Wen Wang, Mary P. Harper, 2002 “The SuperARV Language Model: Investigating the Effectiveness of Tightly Integrating Multiple Knowledge Sources”, ACL
More and More features (A Hybrid) :SuperARV LMs (2) • CDG parse • Each word in the parse has a lexical category and a set of feature values • Each word has a governor role (G) • Comprised of a label (indicates position of the word's head/governor) and modifiee • Need roles are used to ensure the grammatical requirements of a word are met • Mechanism for using non-headword dependencies
More and More features (A Hybrid) :SuperARV LMs (3) • ARVs and ARVPs • Using the relationship between a role value’s position and its modifiee’s position, unary and binary constraints can be represented as a finite set of ARVs and ARVPs
More and More features (A Hybrid) :SuperARV LMs (4) • SuperARVs • Four-tuple for a word <C,F,(R,L,UC,MC)+,DC> • Abstraction of the joint assignment of dependencies for a word • a mechanism for lexicalizing CDG parse rules • Encode lexical information, syntactic and semantic constraints – much more fine grained than POS
More and More features (A Hybrid) :SuperARV LMs (5) • SuperARV LM estimates the joint probability of words w1N and their SuperARV tags t1N • SuperARV LM does not encode the word identity at the data structure level since this can cause serious data sparsity problems • Estimate the probability distributions • recursive linear interpolation • WER on WSJ CSR 20k test sets, • 3gram=14.74, SARV=14.28, Chelba=14.36
More and More features (A Hybrid) :SuperARV LMs (6) • SCDG Parser • Probabilistic generative model • For S, parser returns the parse T that maximizes its probability • First step: • N-best SuperARV assignments are generated • Each SuperARV sequence is represented as: (w1, s1), . . . , (wn, sn) • Second step: the modifiees are statistically specified in a left-to-right manner. • determine the left dependents of wk from the closest to the farthest • also determine whether wk could be the (d+1)th right dependent of a previously seen word wp, p = 1, . . . , k – 1 • d denotes the number of already assigned right dependents of wp Wen Wang and Mary P. Harper, 2004, A Statistical Constraint Dependency Grammar (CDG) Parser, ACL
More and More features (A Hybrid) :SuperARV LMs (7) • SCDG Parser (contd.) • Second step (contd.) • After processing word wk in each partial parse on the stack, the partial parses are re-ranked according to their updated probabilities. • parsing algorithm is implemented as a simple best first search • Two pruning thresholds: maximum stack depth and maximum difference between the log probabilities of the top and bottom partial parses in the stack • WER • LM training data for this task is composed of the 1987-1989 files containing 37,243,300 words • evaluate all LMs on the 1993 20k open vocabulary DARPA WSJ CSR evaluation set (denoted 93-20K), which consists of 213 utterances and 3,446 words. • 3gram=13.72, Chelba=13.0, SCDG LM=12.18
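The two pruning thresholds can be sketched as a single function over a stack of scored partial parses. The representation of a partial parse as a (parse, log-probability) pair and the function name are assumptions made for illustration.

```python
def prune_stack(partial_parses, max_depth, max_logprob_diff):
    """Re-rank partial parses and apply both pruning thresholds.

    partial_parses: list of (parse, log_prob) pairs (hypothetical layout).
    Keeps at most max_depth entries, then drops any entry whose log
    probability falls more than max_logprob_diff below the best one.
    """
    ranked = sorted(partial_parses, key=lambda p: p[1], reverse=True)[:max_depth]
    if not ranked:
        return []
    best = ranked[0][1]
    return [p for p in ranked if best - p[1] <= max_logprob_diff]

stack = [("a", -1.0), ("b", -3.0), ("c", -9.0), ("d", -2.0)]
print(prune_stack(stack, max_depth=4, max_logprob_diff=5.0))
```

In the example, "c" survives the depth limit but is removed by the log-probability difference threshold.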
More and More features (A Hybrid) :SuperARV LMs (8) • Employ LMs for N-best re-ranking in MT • Two-pass decoding • First pass: generate N-best lists • Uses a hierarchical phrase decoder with standard 4-gram LM • Second pass: • Rescore the N-best lists using several LMs trained on different corpora and estimated in different ways • Scores are combined in a log-linear modeling framework • Along with other features used in SMT • Rule probabilities P(f|e), P(e|f); lexical weights pw(f|e), pw(e|f); sentence length and rule counts • Optimized weights (GALE dev07) using minimum error training method to maximize BLEU • Blind test set: NIST MT eval06 GALE portion (eval06) Wen Wang, Andreas Stolcke and Jing Zheng, Dec 2007, "Reranking machine translation hypotheses with structured and web-based language models", ASRU, IEEE Workshop
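Second-pass rescoring in a log-linear framework amounts to a weighted sum of feature scores per hypothesis. The sketch below uses hypothetical feature names and weights; in the described system the weights would come from minimum error rate training.

```python
def rerank(nbest, weights):
    """Return the hypothesis with the highest log-linear score.

    nbest: list of dicts with a "features" map of log-domain scores.
    weights: feature name -> weight. Names and values are illustrative.
    """
    def score(hyp):
        return sum(weights[name] * value for name, value in hyp["features"].items())
    return max(nbest, key=score)

nbest = [
    {"text": "hypothesis a", "features": {"lm": -10.0, "tm": -5.0}},
    {"text": "hypothesis b", "features": {"lm": -8.0, "tm": -6.0}},
]
print(rerank(nbest, {"lm": 1.0, "tm": 0.5})["text"])
print(rerank(nbest, {"lm": 0.1, "tm": 1.0})["text"])
```

Changing the weights changes the winner, which is why the weights are tuned on a development set against the evaluation metric.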
More and More features (A Hybrid) :SuperARV LMs (9) • Structured LMs • Almost parsing LM • Parsing LM • Using baseNP model • Given W, generates the reduced sentence W’ by marking all baseNPs and then reducing all baseNPs to their headwords • Further simplification of the parser LM
More and More features (A Hybrid) :SuperARV LMs (10) • LMs for searching: • 4-gram (4g) LM • English side of Arabic-English and Chinese-English from LDC • All of English BN and BC, webtexts and translations for Mandarin and Arabic BC BN released under DARPA EARS and GALE • LDC2005T12, LDC95T21, LDC98T30 • Webdata collected by SRI and BBN • LMs for reranking (N-best list size =3000) • 5 gram count LM Google LM (google) (1 terawords) • 5 gram count LM Yahoo LM (yahoo) (3.4G words) • first two sources for training almost parsing LM(sarv) • the second source for training the parser LM(plm) • 5-gram count-LM on all BBN webdata (wlm)
Syntax-based LMs (1) • Performs translation by assuming the target language specifies not just words but a complete parse • Yamada 2002: incomplete use of syntactic information • Decoder optimized, language model was not • Develop a system in which the translation model of [Yamada 2001] is “married” to the syntax-based language model of [Charniak 2001] Kenji Yamada and Kevin Knight, “A Syntax-based statistical translation model”, 2001. Kenji Yamada and Kevin Knight, “A decoder for syntax-based statistical MT”, 2002. Eugene Charniak, “Immediate-head parsing for language models”, 2001. Eugene Charniak, Kevin Knight and Kenji Yamada, "Syntax-based Language Models for Statistical Machine Translation"
Syntax-based LMs (2) • Translation model has 3 operations: • Reorders child nodes, inserts an optional word, translates the leaf words • Ө varies over the possible alignments between F and E • Decoding algorithm similar to a regular parser • Build English parse tree from Chinese sentence • Extract CFG rules from parsed corpus of English • Supplement each non-lexical English CFG rule (VP → VB NP) with all possible reordered rules (VP → NP PP VB, VP → PP NP VB, etc.) • Add extra rules “VP → VP X” and “X → word” for insertion operations • Also add “englishword → chineseword” for translation
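Supplementing a CFG rule with all of its reordered variants is a permutation of the right-hand side. A minimal sketch, with a hypothetical function name and a rule represented as a (left-hand side, right-hand side tuple) pair:

```python
from itertools import permutations

def reordered_rules(lhs, rhs):
    """All distinct reorderings of a rule's children, original order included."""
    # dict.fromkeys drops duplicates (repeated child symbols) and keeps order.
    return [(lhs, perm) for perm in dict.fromkeys(permutations(rhs))]

for rule in reordered_rules("VP", ("VB", "NP", "PP")):
    print(rule[0], "->", " ".join(rule[1]))
```

The number of variants grows factorially with the number of children, which is one reason the forest built in decoding is large and needs pruning.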
Syntax-based LMs (3) • Now we can parse a Chinese sentence and extract an English parse tree by • Removing leaf Chinese words • Recovering the reordered child nodes into English order • Pick the best tree: the one for which the product of the LM probability and the TM probability is the highest.
Syntax-based LMs (4) • Decoding process • First build a forest using the bottom-up decoding parser using only P(F|E) • Pick the best tree from the forest with a LM • Parser/Language model (Charniak 2001) • Takes an English sentence and uses two parsing stages • Simple non-lexical PCFG to create a large parse forest • Pruning step • Sophisticated lexicalized PCFG is applied to the sentence
Syntax-based LMs (5) • Evaluation • 347 previously unseen Chinese newswire sentences • 780,000 English parse tree-Chinese sentence pairs • YC: TM of Yamada2001 and LM of Charniak2001 • YT: TM of Yamada2001, trigram LM Yamada2002 • BT: TM of Brown et al., 1993, trigram LM and greedy decoder Ulrich Germann 1997 Eugene Charniak. A maximum-entropy-inspired Parser, 2000. Ulrich Germann, Michael Jahr, Daniel Marcu, Kevin Knight, and Kenji Yamada. Fast decoding and optimal decoding for machine translation, 2001
Factored Language models(1) • Allow a larger set of conditioning variables predicting the current word • morphological, syntactic or semantic word features, etc. • Motivation • Statistical language modeling is a difficult problem for languages with rich morphology → high perplexity • Probability estimates are unreliable even with smoothing • Features mentioned above are shared by many words and hence can be used to obtain better smoothed probability estimates Katrin Kirchhoff and Mei Yang, 2005 "Improved Language modeling for Statistical Machine Translation", ACL
Factored Language models(2) • Factored Word Representation • Decompose words into sets of features (or factors) • Probabilistic language models constructed over subsets of word features. • Word is equivalent to a fixed number of factors W=f1:K
Factored Language models(3) • Probability model • Standard generalized parallel backoff • c: count of (wt,wt-1,wt-2), pML: ML distribution, dc: discounting factor, τ3: count threshold, α: normalization factor • N-grams whose counts are above the threshold retain their ML estimates discounted by a factor that redistributes probability mass to the lower-order distribution
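The backoff rule described by the legend on this slide can be written as follows. This is a reconstruction from the listed symbols (c, p_ML, d_c, τ_3, α) in the usual form of a backoff trigram, not a copy of the slide's equation; the name p_BO is an assumption.

```latex
p_{BO}(w_t \mid w_{t-1}, w_{t-2}) =
\begin{cases}
  d_c \, p_{ML}(w_t \mid w_{t-1}, w_{t-2}) & \text{if } c > \tau_3 \\[4pt]
  \alpha(w_{t-1}, w_{t-2}) \, p_{BO}(w_t \mid w_{t-1}) & \text{otherwise}
\end{cases}
```

The factor α is chosen so that the distribution over w_t sums to one for each history.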
Factored Language models(4) • Backoff paths
Factored Language models(5) • Backoff paths • Space of possible models is extremely large • Ways of choosing among different paths • Linguistically determined, • Eg: drop syntactic before morphological variables • Usually leads to sub-optimal results • Choose path at runtime based on statistical criteria • Choose multiple paths and combine their probability estimates C: count of (f,f1,f2,f3), pML:ML distribution, τ4:count threshold, α:normalization factor, g:determines the backoff strategy, can be any non-negative function of f,f1,f2,f3
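Using the symbols in the legend above, generalized backoff over factors can be written in the following form. This is a reconstruction, not the slide's own equation: the name p_GBO and the discount d_c are assumptions carried over from the standard backoff case.

```latex
p_{GBO}(f \mid f_1, f_2, f_3) =
\begin{cases}
  d_c \, p_{ML}(f \mid f_1, f_2, f_3) & \text{if } c > \tau_4 \\[4pt]
  \alpha(f_1, f_2, f_3) \, g(f, f_1, f_2, f_3) & \text{otherwise}
\end{cases}
```

Different choices of g give the different strategies listed in the bullets: a single fixed path, a path chosen at runtime, or a combination of several lower-order estimates.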
Factored Language models(6) • Learning FLM Structure • Three types of parameters need to be specified • Initial conditioning factors, backoff graph, smoothing options • Model space is extremely large • Find best model structure automatically • Genetic algorithms (GA) • Class of evolution-inspired search/optimization techniques • Encode problem solutions as strings (genes) and evolve and test successive populations of solutions through the use of genetic operators (selection, crossover, mutation) applied to encoded strings • Solutions are evaluated according to a fitness function which represents the desired optimization criteria • No guarantee of finding the optimal solution, but they find good solutions quickly
Factored Language models(7) • Structure Search using GA • Conditioning factors • Encoded as binary strings • E.g.: with 3 factors (A,B,C), 6 conditioning variables {A-1,B-1,C-1,A-2,B-2,C-2} • String 100011 corresponds to F={A-1,B-2,C-2} • Backoff graph: large number of possible paths • Encode a binary string in terms of graph grammar rules • 1 indicating the use of the rule and 0 for non-use
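Decoding a gene into a set of conditioning factors is a bit-by-bit selection over the ordered list of candidate variables. A minimal sketch with a hypothetical function name; the six-variable ordering follows the example on the slide, so the set {A-1,B-2,C-2} corresponds to the six-bit string 100011.

```python
def decode_factors(bits, variables):
    """Select the conditioning variables whose bit is '1'."""
    if len(bits) != len(variables):
        raise ValueError("need exactly one bit per conditioning variable")
    return [v for b, v in zip(bits, variables) if b == "1"]

variables = ["A-1", "B-1", "C-1", "A-2", "B-2", "C-2"]
print(decode_factors("100011", variables))
```

Mutation and crossover then operate directly on the bit string, and each resulting string decodes to a valid factor set.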
Factored Language models(8) • Structure Search using GA (contd.) • Smoothing options • Encoded as tuples of integers • First integer → discounting method • Second integer → backoff threshold • Integer string consists of successive concatenated tuples, each representing the smoothing option at a node in the graph • GA operators are applied to concatenations of all three substrings describing the set of factors, backoff graph and smoothing options to jointly optimize all parameters
Factored Language models(9) • Data: ACL05 shared MT task website for 4 language pairs • Finnish, Spanish, French to English • Development set provided by the website:2000 sentences • Trained using GIZA++ • Pharaoh for phrase based decoding • Trigram word LM trained using SRILM toolkit with Kneser-Ney smoothing and interpolation of higher and lower order n-grams • Combination weights trained using minimum error weight optimization (Pharaoh)