DKRLM: Discriminant Knowledge-Rich Language Modeling for Machine Translation
Alon Lavie
"Visionary Talk", LTI Faculty Retreat, May 4, 2007
Background: Search-based MT
• All state-of-the-art MT approaches work within a general search-based paradigm
• Translation models "propose" pieces of translation for various sub-sentential segments
• The decoder puts these pieces together into complete translation hypotheses and searches for the best-scoring hypothesis
• (Target) language modeling is the dominant source of information in scoring alternative translation hypotheses (see the scoring sketch below)
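As a rough illustration of where the LM sits in hypothesis scoring (a minimal sketch with invented feature names and weights, not any particular system), decoders typically combine model scores in a weighted, log-linear fashion:

```python
def score_hypothesis(features, weights):
    """Log-linear combination of model scores for one translation hypothesis."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values (log-probabilities / penalties) for one hypothesis.
hypothesis_features = {
    "translation_model": -12.4,   # sum of phrase translation log-probs
    "language_model":    -18.7,   # target LM log-prob (often the dominant term)
    "length_penalty":     -6.0,   # number of target words
}
weights = {"translation_model": 1.0, "language_model": 1.5, "length_penalty": -0.3}

print(score_hypothesis(hypothesis_features, weights))
```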
The Problem
• Most MT systems use standard statistical LMs borrowed from speech recognition (SR), usually "as is"
  • SRI-LM toolkit, CMU/CU LM, SALM toolkit
  • Until recently, usually trigram models
• The problem: these LMs are not good at discriminating between good and bad translations!
• How do we know?
  • Oracle experiments on n-best lists of MT output consistently show that far better translations are "hiding" in the n-best lists but are not being selected by our MT systems
  • Also true of our MEMT system… which led me to start thinking about this problem!
The Problem
• Why do standard statistical LMs not work well for MT?
• MT hypotheses are very different from SR hypotheses:
  • Speech: mostly correct word order, confusable homonyms
  • MT: garbled syntax and word order, wrong choices for some translated words
• MT violates some basic underlying assumptions of statistical LMs:
  • Indirect discrimination: better translations should have better LM scores, but LMs are not trained to directly discriminate between good and bad translations!
  • Fundamental probability estimation problems: backoff "smoothing" for unseen n-grams is based on an assumption of training data sparsity, but the majority of unseen n-grams in MT hypotheses are unseen because they are not grammatical (they really should have a zero probability!)
The New Idea
• Rather than attempting to model the probabilities of unseen n-grams, we look at the problem differently:
  • Extract instances of lexical, syntactic and semantic features from each translation hypothesis
  • Determine whether these instances have been "seen before" (at least once) in a large monolingual corpus (see the sketch below)
• The conjecture: more grammatical MT hypotheses are likely to contain higher proportions of feature instances that have been seen in a corpus of grammatical sentences
• Goals:
  • Find the set of features that provides the best discrimination between good and bad translations
  • Learn how to combine these into an LM-like function for scoring alternative MT hypotheses
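A minimal sketch of the "seen at least once" idea for the simplest feature type, word n-grams; the corpus index here is just a Python set of n-grams, standing in for the suffix-array machinery discussed later:

```python
def ngrams(tokens, n):
    """All n-gram tuples of order n in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def seen_fraction(hypothesis, seen_ngrams, n):
    """Fraction of the hypothesis's order-n n-grams observed at least once in the corpus."""
    grams = ngrams(hypothesis.split(), n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in seen_ngrams) / len(grams)

# Toy "corpus index": every 4-gram from a (tiny) grammatical corpus.
corpus = ["the boy ate the red apple", "the girl read the long book"]
seen = set(g for sent in corpus for g in ngrams(sent.split(), 4))

print(seen_fraction("the boy ate the red apple", seen, 4))   # 1.0 -- all 4-grams found
print(seen_fraction("boy the apple red ate the", seen, 4))   # 0.0 -- garbled word order
```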
Outline
• Knowledge-Rich Features
• Preliminary Experiments:
  • Compare feature occurrence statistics for MT hypotheses versus human-produced (reference) translations
  • Compare ranking of MT and "human" systems according to statistical LMs versus a function based on long n-gram occurrence statistics
  • Compare n-grams and n-chains as features for binary classification "human versus MT"
• Research Challenges
• New Connections with IR
Knowledge-Rich Features
• Lexical Features:
  • "Long" n-gram sequences (4 words and up)
• Syntactic/Semantic Features:
  • POS n-grams
  • Head-word chains
  • Specific types of dependencies:
    • Verbs and their dependents
    • Nouns and their dependents
    • "Long-range" dependencies
  • Content word co-occurrence statistics
• Mixtures of Lexical and Syntactic Features:
  • Abstracted versions of word n-gram sequences, where words are replaced by POS tags or named-entity tags (see the sketch below)
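A small sketch of the mixed lexical/syntactic features; the tags are written by hand here purely for illustration, whereas a real system would take them from POS and named-entity taggers:

```python
def abstract_tokens(tokens, tags, abstract_tags):
    """Replace each word whose tag is in abstract_tags with the tag itself."""
    return [tag if tag in abstract_tags else word for word, tag in zip(tokens, tags)]

tokens = ["President", "Smith", "visited", "the", "red", "factory"]
tags   = ["$PersonName", "$PersonName", "V", "DET", "ADJ", "N"]   # hypothetical tag set

# Abstract away person names and nouns, keep the other words lexical.
mixed = abstract_tokens(tokens, tags, {"$PersonName", "N"})
print(mixed)   # ['$PersonName', '$PersonName', 'visited', 'the', 'red', 'N']

# Abstracted 4-grams over the mixed sequence.
print([tuple(mixed[i:i + 4]) for i in range(len(mixed) - 3)])
```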
Head-Word Chains (n-chains)
Example: The boy ate the red apple
• Head-word chains are chains of syntactic dependency links, from a dependent to its head (see the extraction sketch below)
• Bi-chains: [the→boy] [boy→ate] [the→apple] [red→apple] [apple→ate]
• Tri-chains: [the→boy→ate] [the→apple→ate] [red→apple→ate]
• Four-chains: none (for this example)!
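A minimal sketch of chain extraction over the example above, assuming the dependency parse is already available as a head-index array (in practice it would come from a dependency parser):

```python
def head_chains(tokens, heads, n):
    """All head-word chains of length n: start at any word and follow head links n-1 times."""
    chains = []
    for i in range(len(tokens)):
        chain, j = [tokens[i]], i
        for _ in range(n - 1):
            j = heads[j]
            if j < 0:            # fell off the root; no full chain from this word
                chain = None
                break
            chain.append(tokens[j])
        if chain is not None:
            chains.append(tuple(chain))
    return chains

tokens = ["the", "boy", "ate", "the", "red", "apple"]
heads  = [1, 2, -1, 5, 5, 2]     # index of each word's head; -1 marks the root ("ate")

print(head_chains(tokens, heads, 2))   # the five bi-chains from the slide
print(head_chains(tokens, heads, 3))   # the three tri-chains
print(head_chains(tokens, heads, 4))   # [] -- no four-chains in this sentence
```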
Specific Types of Dependencies
• Some types of syntactic dependencies may be more important than others for MT
• Consider specific types of dependencies that are most important for syntactic and semantic structure:
  • Dependencies involving content words
  • Long-distance dependencies
  • Verb/argument dependencies: focus only on the bi-chains where the head is the verb: [boy→ate] and [apple→ate]
  • Noun/modifier dependencies: focus only on the bi-chains where the noun is the head: [the→boy] [the→apple] [red→apple] (see the filter sketch below)
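Continuing that sketch with hypothetical POS tags, the verb/argument and noun/modifier subsets are simply the bi-chains whose head carries the relevant tag:

```python
def bichains_with_head_pos(tokens, heads, pos, head_pos):
    """Bi-chains (dependent, head) whose head carries one of the given POS tags."""
    return [(tokens[i], tokens[heads[i]])
            for i in range(len(tokens))
            if heads[i] >= 0 and pos[heads[i]] in head_pos]

tokens = ["the", "boy", "ate", "the", "red", "apple"]
heads  = [1, 2, -1, 5, 5, 2]
pos    = ["DET", "N", "V", "DET", "ADJ", "N"]    # hypothetical tags

print(bichains_with_head_pos(tokens, heads, pos, {"V"}))  # verb/argument: [('boy','ate'), ('apple','ate')]
print(bichains_with_head_pos(tokens, heads, pos, {"N"}))  # noun/modifier: [('the','boy'), ('the','apple'), ('red','apple')]
```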
Feature Occurrence Statistics for MT Hypotheses
• The general idea: determine the fraction of feature instances that have been observed to occur in a large human-produced corpus
• For n-grams:
  • Extract all n-gram sequences of order n from the hypothesis
  • Look up whether each n-gram instance occurs in the corpus
  • Calculate the fraction of "found" n-grams for each order n
• For n-chains:
  • Parse the MT hypothesis (into a dependency structure)
  • Look up whether each n-chain instance occurs in a database of n-chains extracted from the large corpus
  • Calculate the fraction of "found" n-chains for each order n (see the lookup sketch below)
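A sketch of the lookup step for n-chains, assuming the chain database has already been built offline from the parsed corpus; here it is just a Python set of chain tuples, and the hypothesis chains mix a few well-formed chains with some garbled ones:

```python
def found_fraction(instances, database):
    """Fraction of feature instances that occur at least once in the database."""
    if not instances:
        return 0.0
    return sum(1 for x in instances if x in database) / len(instances)

# Hypothetical chain database extracted offline from a large parsed corpus.
chain_db = {("the", "boy"), ("boy", "ate"), ("red", "apple"), ("the", "boy", "ate")}

# Bi-chains and tri-chains extracted from one MT hypothesis (e.g. with head_chains above).
hyp_bichains  = [("the", "boy"), ("boy", "ate"), ("apple", "red"), ("ate", "apple")]
hyp_trichains = [("the", "boy", "ate"), ("apple", "red", "ate")]

print(found_fraction(hyp_bichains, chain_db))   # 0.5
print(found_fraction(hyp_trichains, chain_db))  # 0.5
```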
Content-word Co-occurrence Statistics
• Content-word co-occurrences: (unordered) pairs of content words (nouns, verbs, adjectives, adverbs) that co-occur in the same sentence
• Restricted version: the subset of co-occurrences that are in a direct syntactic dependency within the sentence (a subset of the bi-chains)
• Idea:
  • Learn co-occurrence pair strengths from large monolingual corpora using statistical association measures: Dice, t-score, chi-square, likelihood ratio (see the Dice sketch below)
  • Use the average co-occurrence pair strength as a feature for scoring MT hypotheses
• A weak way of capturing the syntax/semantics within sentences
• Preliminary experiments show that these features are somewhat effective in discriminating between MT output and human references
• Thanks Ben Han! [MT Lab Project, 2005]
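A sketch of one of the association measures named above (Dice), computed over a toy corpus; the sentences and content-word list are made up for illustration:

```python
from itertools import combinations
from collections import Counter

def dice_scores(sentences, content_words):
    """Dice association for unordered content-word pairs co-occurring in the same sentence."""
    word_count, pair_count = Counter(), Counter()
    for sent in sentences:
        words = sorted(set(w for w in sent.split() if w in content_words))
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    return {pair: 2 * c / (word_count[pair[0]] + word_count[pair[1]])
            for pair, c in pair_count.items()}

corpus = ["the judge sentenced the defendant",
          "the judge read the sentence aloud",
          "the defendant appealed the sentence"]
content = {"judge", "sentenced", "defendant", "sentence", "read", "appealed", "aloud"}

scores = dice_scores(corpus, content)
print(scores[("defendant", "sentence")])   # co-occurrence strength of this pair
```

The average strength of the pairs extracted from an MT hypothesis would then be the single feature fed to the scorer.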
Preliminary Experiments I
• Goal: compare n-gram occurrence statistics for MT hypotheses versus human-produced (reference) translations
• Setup:
  • Data: NIST Arabic-to-English MT-Eval 2003 (about 1000 sentences)
  • Output from three strong MT systems and four reference translations
  • Used the Suffix-Array LM (SALM) toolkit [Zhang and Vogel 2006], modified to return, for each query string, the length of the longest suffix of the string that occurs in the corpus
  • SALM used to index a subset of 600 million words from the Gigaword corpus
  • Searched for all n-gram sequences of length eight extracted from the translation
• Thanks to Greg Hanneman!
Preliminary Experiments II
• Goal: compare ranking of MT and "human" systems according to statistical LMs versus a function based on long n-gram occurrence statistics
• Same data setup as in the first experiment
• Calculate sentence scores as the average per-word LM score
• System score is the average over all its sentence scores
• Score each system with three different LMs:
  • SRI-LM trigram LM trained on 260 million words
  • SALM suffix-array LM trained on 600 million words
  • A new function that assigns exponentially more weight to longer n-gram "hits":
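The specific function is not reproduced here. Purely as an illustration of the idea, and as an assumed form rather than the actual function from the talk, one such score could average an exponential weight over the longest corpus match ending at each word:

\[ \mathrm{score}(w_1 \ldots w_L) \;=\; \frac{1}{L} \sum_{i=1}^{L} 2^{\,m_i} \]

where \(m_i\) is the length of the longest n-gram ending at position \(i\) that occurs in the indexed corpus (the quantity returned by the modified SALM lookup).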
Preliminary Experiments III
• Goal: directly discriminate between MT and human translations using a binary SVM classifier trained on n-gram versus n-chain occurrence statistics
• Setup:
  • Data: NIST Chinese-to-English MT-Eval 2003 (919 sentences)
  • Four MT system outputs and four human reference translations
  • N-chain database created using SALM by extracting all n-chains from a dependency-parsed version of the English Europarl corpus (600K sentences)
  • Train SVM classifier on 400 sentences from two MT systems and two human "systems"
  • Test classification accuracy on 200 unseen test sentences from the same MT and human systems
  • Features for SVM: n-gram "hit" fractions (all n) vs. n-chain fractions (see the toy sketch below)
• Thanks to Vamshi Ambati!
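A toy sketch of the classification setup, using scikit-learn's SVC on synthetic per-order hit-fraction vectors; the data, the class means, and the choice of scikit-learn are all illustrative assumptions, not what was actually used:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Each row: "found" fractions for orders n = 1..8 (n-gram or n-chain hits).
# Synthetic data: human sentences tend to have higher fractions at high orders.
def synthetic_fractions(n_sent, high_order_mean):
    low = rng.uniform(0.8, 1.0, size=(n_sent, 3))                                 # orders 1-3
    high = np.clip(rng.normal(high_order_mean, 0.15, size=(n_sent, 5)), 0, 1)     # orders 4-8
    return np.hstack([low, high])

X_mt, X_human = synthetic_fractions(200, 0.25), synthetic_fractions(200, 0.55)
X = np.vstack([X_mt, X_human])
y = np.array([0] * 200 + [1] * 200)          # 0 = MT output, 1 = human reference

clf = SVC(kernel="linear").fit(X[::2], y[::2])      # train on half the sentences
print("accuracy:", clf.score(X[1::2], y[1::2]))     # test on the held-out half
```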
Preliminary Experiments III
• Results:
  • Experiment 1:
    • N-gram classifier: 49% accuracy
    • N-chain classifier: 69% accuracy
  • Experiment 2:
    • N-gram classifier: 52% accuracy
    • N-chain classifier: 63% accuracy
• Observations:
  • Mixing n-grams and n-chains did not improve classification accuracy
  • Features include both high- and low-order instances (did not try with only high-order ones)
  • The n-chain database is from a different domain than the test data, and not a very large corpus
Preliminary Conclusions
• Statistical LMs do not discriminate well between MT hypotheses and human reference translations, and are also poor at discriminating between good and bad MT hypotheses
• Occurrence statistics of long n-grams and n-chains differ significantly between MT hypotheses and human reference translations
• These can potentially be useful as discriminant features for identifying better (more grammatical and fluent) translations
Research Challenges
• Develop Infrastructure for Computing with Knowledge-Rich Features:
  • Scale up to querying against much larger monolingual corpora (terabytes and up)
  • Parsing and annotation of such vast corpora
• Explore more complex features
• Finding the set of features that are most discriminant
• Develop Methodologies for training LM-like discriminant scoring functions:
  • SVM and/or other classifiers on MT versus human
  • SVM and/or other classifiers on MT versus MT "Oracle"
  • Direct regression against human judgments
  • Parameter optimization for maximizing automatic MT metric scores (BLEU, METEOR, etc.)
• "Incremental" features that can be used during decoding versus the full set of features for n-best list reranking (see the reranking sketch below)
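For the n-best reranking use case, a minimal sketch of how a discriminant score could be combined with the decoder's own score; the stand-in discriminant function, the weights, and the hypotheses below are invented for illustration:

```python
def rerank_nbest(nbest, baseline_weight, discriminant_weight, discriminant_score):
    """Re-score an n-best list by combining the decoder's score with a discriminant LM score."""
    rescored = [(baseline_weight * decoder_score
                 + discriminant_weight * discriminant_score(hyp), hyp)
                for hyp, decoder_score in nbest]
    return max(rescored)[1]        # hypothesis with the best combined score

# Stand-in discriminant score: fraction of 4-grams "seen" in a toy corpus index.
seen = {("the", "boy", "ate", "the"), ("boy", "ate", "the", "red"), ("ate", "the", "red", "apple")}
def discriminant_score(hyp):
    toks = hyp.split()
    grams = [tuple(toks[i:i + 4]) for i in range(len(toks) - 3)]
    return sum(g in seen for g in grams) / len(grams) if grams else 0.0

nbest = [("the boy the red apple ate", -9.5),     # decoder's top choice, but garbled
         ("the boy ate the red apple", -10.1)]    # better hypothesis further down the list
print(rerank_nbest(nbest, 0.1, 5.0, discriminant_score))
```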
New Connections with IR
• The "occurrence-based" formulation of the LM problem transforms it from a counting and estimation problem into an IR-like querying problem
• To be effective, we think this may require querying against extremely large volumes of monolingual text, and structured versions of such text. Can we do this against local snapshots of the entire web?
• The SALM suffix-array infrastructure can currently handle up to about the size of the Gigaword corpus (within 16 GB of memory) (see the suffix-array sketch below)
• Can IR engines such as LEMUR/Indri be adapted to the task?
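The appeal of the suffix-array approach is that "has this n-gram been seen at least once?" becomes a binary search over a single index of the corpus, rather than a table of all n-grams of all orders. A minimal word-level sketch of that lookup (a toy stand-in, not SALM itself):

```python
class SuffixArrayLookup:
    """Word-level suffix array over a token stream; answers 'does this n-gram occur?'."""
    def __init__(self, tokens):
        self.tokens = tokens
        # Sort suffix start positions by the suffix they begin (toy O(n^2 log n) construction).
        self.sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def contains(self, ngram):
        ngram = list(ngram)
        lo, hi = 0, len(self.sa)
        while lo < hi:                    # binary search for the first suffix >= ngram
            mid = (lo + hi) // 2
            start = self.sa[mid]
            if self.tokens[start:start + len(ngram)] < ngram:
                lo = mid + 1
            else:
                hi = mid
        if lo == len(self.sa):
            return False
        start = self.sa[lo]
        return self.tokens[start:start + len(ngram)] == ngram

corpus_tokens = "the boy ate the red apple </s> the girl read the long book </s>".split()
index = SuffixArrayLookup(corpus_tokens)
print(index.contains(("ate", "the", "red")))   # True
print(index.contains(("red", "the", "ate")))   # False
```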
New Connections with IR
• Challenges this type of task imposes on IR (insights from Jamie Callan):
• The larger issue: IR search engines as query interfaces to vast collections of structured text:
  • Building an index suitable for very fast "n-gram" lookups that satisfy certain properties
  • The n-gram sequences might be a mix of surface features and derived features based on text annotations, e.g., $PersonName, or POS=N
• Specific Challenges:
  • How to build such indexes for fast access?
  • What does the query language look like?
  • How to deal with memory/disk vs. speed tradeoff issues?
• Can we get LTI students to do this kind of research?
Final Words…
• A novel and exciting new research direction: there are at least one or two PhD theses hiding in here…
• Submitted as a grant proposal to NSF last December (jointly with Rebecca Hwa from Pitt)
• Influences: some of these ideas were influenced by Jaime's CBMT work, and by Rebecca's work on using syntactic features for automatic MT evaluation metrics
• Acknowledgments:
  • Thanks to Joy Zhang and Stephan Vogel for making the SALM toolkit available to us
  • Thanks to Rebecca Hwa and to my students Ben Han, Greg Hanneman and Vamshi Ambati for preliminary work on these ideas