DKRLM: Discriminant Knowledge-Rich Language Modeling for Machine Translation
Alon Lavie
"Visionary Talk", LTI Faculty Retreat, May 4, 2007
Background: Search-based MT
• All state-of-the-art MT approaches work within a general search-based paradigm
• Translation models "propose" pieces of translation for various sub-sentential segments
• The decoder puts these pieces together into complete translation hypotheses and searches for the best-scoring hypothesis
• (Target) language modeling is the dominant source of information in scoring alternative translation hypotheses (see the scoring sketch below)
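As a rough illustration of where the LM sits in hypothesis scoring (a minimal sketch with invented feature names and weights, not any particular system), decoders typically combine model scores in a weighted, log-linear fashion:

```python
def score_hypothesis(features, weights):
    """Log-linear combination of model scores for one translation hypothesis."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values (log-probabilities / penalties) for one hypothesis.
hypothesis_features = {
    "translation_model": -12.4,   # sum of phrase translation log-probs
    "language_model":    -18.7,   # target LM log-prob (often the dominant term)
    "length_penalty":     -6.0,   # number of target words
}
weights = {"translation_model": 1.0, "language_model": 1.5, "length_penalty": -0.3}

print(score_hypothesis(hypothesis_features, weights))
```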
The Problem
• Most MT systems use standard statistical LMs borrowed from speech recognition (SR), usually "as is"
  • SRI-LM toolkit, CMU/CU LM, SALM toolkit
  • Until recently, usually trigram models
• The problem: these LMs are not good at discriminating between good and bad translations!
• How do we know?
  • Oracle experiments on n-best lists of MT output consistently show that far better translations are "hiding" in the n-best lists but are not being selected by our MT systems
  • Also true of our MEMT system… which led me to start thinking about this problem!
The Problem
• Why do standard statistical LMs not work well for MT?
• MT hypotheses are very different from SR hypotheses:
  • Speech: mostly correct word order, confusable homonyms
  • MT: garbled syntax and word order, wrong choices for some translated words
• MT violates some basic underlying assumptions of statistical LMs:
  • Indirect discrimination: better translations should have better LM scores, but LMs are not trained to directly discriminate between good and bad translations!
  • Fundamental probability estimation problems: backoff "smoothing" for unseen n-grams is based on an assumption of training data sparsity, but the majority of unseen n-grams in MT hypotheses are unseen because they are not grammatical (they really should have a zero probability!)
The New Idea
• Rather than attempting to model the probabilities of unseen n-grams, we look at the problem differently:
  • Extract instances of lexical, syntactic and semantic features from each translation hypothesis
  • Determine whether these instances have been "seen before" (at least once) in a large monolingual corpus (see the sketch below)
• The conjecture: more grammatical MT hypotheses are likely to contain higher proportions of feature instances that have been seen in a corpus of grammatical sentences
• Goals:
  • Find the set of features that provides the best discrimination between good and bad translations
  • Learn how to combine these into an LM-like function for scoring alternative MT hypotheses
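A minimal sketch of the "seen at least once" idea for the simplest feature type, word n-grams; the corpus index here is just a Python set of n-grams, standing in for the suffix-array machinery discussed later:

```python
def ngrams(tokens, n):
    """All n-gram tuples of order n in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def seen_fraction(hypothesis, seen_ngrams, n):
    """Fraction of the hypothesis's order-n n-grams observed at least once in the corpus."""
    grams = ngrams(hypothesis.split(), n)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in seen_ngrams) / len(grams)

# Toy "corpus index": every 4-gram from a (tiny) grammatical corpus.
corpus = ["the boy ate the red apple", "the girl read the long book"]
seen = set(g for sent in corpus for g in ngrams(sent.split(), 4))

print(seen_fraction("the boy ate the red apple", seen, 4))   # 1.0 -- all 4-grams found
print(seen_fraction("boy the apple red ate the", seen, 4))   # 0.0 -- garbled word order
```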
Outline
• Knowledge-Rich Features
• Preliminary Experiments:
  • Compare feature occurrence statistics for MT hypotheses versus human-produced (reference) translations
  • Compare ranking of MT and "human" systems according to statistical LMs versus a function based on long n-gram occurrence statistics
  • Compare n-grams and n-chains as features for binary classification "human versus MT"
• Research Challenges
• New Connections with IR
Knowledge-Rich Features
• Lexical Features:
  • "Long" n-gram sequences (4 words and up)
• Syntactic/Semantic Features:
  • POS n-grams
  • Head-word chains
  • Specific types of dependencies:
    • Verbs and their dependents
    • Nouns and their dependents
    • "Long-range" dependencies
  • Content word co-occurrence statistics
• Mixtures of Lexical and Syntactic Features:
  • Abstracted versions of word n-gram sequences, where words are replaced by POS tags or named-entity tags (see the sketch below)
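A small sketch of the mixed lexical/syntactic features; the tags are written by hand here purely for illustration, whereas a real system would take them from POS and named-entity taggers:

```python
def abstract_tokens(tokens, tags, abstract_tags):
    """Replace each word whose tag is in abstract_tags with the tag itself."""
    return [tag if tag in abstract_tags else word for word, tag in zip(tokens, tags)]

tokens = ["President", "Smith", "visited", "the", "red", "factory"]
tags   = ["$PersonName", "$PersonName", "V", "DET", "ADJ", "N"]   # hypothetical tag set

# Abstract away person names and nouns, keep the other words lexical.
mixed = abstract_tokens(tokens, tags, {"$PersonName", "N"})
print(mixed)   # ['$PersonName', '$PersonName', 'visited', 'the', 'red', 'N']

# Abstracted 4-grams over the mixed sequence.
print([tuple(mixed[i:i + 4]) for i in range(len(mixed) - 3)])
```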
Head-Word Chains (n-chains)
Example: The boy ate the red apple
• Head-word chains are chains of syntactic dependency links, from a dependent to its head (see the extraction sketch below)
• Bi-chains: [the→boy] [boy→ate] [the→apple] [red→apple] [apple→ate]
• Tri-chains: [the→boy→ate] [the→apple→ate] [red→apple→ate]
• Four-chains: none (for this example)!
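A minimal sketch of chain extraction over the example above, assuming the dependency parse is already available as a head-index array (in practice it would come from a dependency parser):

```python
def head_chains(tokens, heads, n):
    """All head-word chains of length n: start at any word and follow head links n-1 times."""
    chains = []
    for i in range(len(tokens)):
        chain, j = [tokens[i]], i
        for _ in range(n - 1):
            j = heads[j]
            if j < 0:            # fell off the root; no full chain from this word
                chain = None
                break
            chain.append(tokens[j])
        if chain is not None:
            chains.append(tuple(chain))
    return chains

tokens = ["the", "boy", "ate", "the", "red", "apple"]
heads  = [1, 2, -1, 5, 5, 2]     # index of each word's head; -1 marks the root ("ate")

print(head_chains(tokens, heads, 2))   # the five bi-chains from the slide
print(head_chains(tokens, heads, 3))   # the three tri-chains
print(head_chains(tokens, heads, 4))   # [] -- no four-chains in this sentence
```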
Specific Types of Dependencies
• Some types of syntactic dependencies may be more important than others for MT
• Consider specific types of dependencies that are most important for syntactic and semantic structure:
  • Dependencies involving content words
  • Long-distance dependencies
  • Verb/argument dependencies: focus only on the bi-chains where the head is the verb: [boy→ate] and [apple→ate]
  • Noun/modifier dependencies: focus only on the bi-chains where the noun is the head: [the→boy] [the→apple] [red→apple] (see the filter sketch below)
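Continuing that sketch with hypothetical POS tags, the verb/argument and noun/modifier subsets are simply the bi-chains whose head carries the relevant tag:

```python
def bichains_with_head_pos(tokens, heads, pos, head_pos):
    """Bi-chains (dependent, head) whose head carries one of the given POS tags."""
    return [(tokens[i], tokens[heads[i]])
            for i in range(len(tokens))
            if heads[i] >= 0 and pos[heads[i]] in head_pos]

tokens = ["the", "boy", "ate", "the", "red", "apple"]
heads  = [1, 2, -1, 5, 5, 2]
pos    = ["DET", "N", "V", "DET", "ADJ", "N"]    # hypothetical tags

print(bichains_with_head_pos(tokens, heads, pos, {"V"}))  # verb/argument: [('boy','ate'), ('apple','ate')]
print(bichains_with_head_pos(tokens, heads, pos, {"N"}))  # noun/modifier: [('the','boy'), ('the','apple'), ('red','apple')]
```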
Feature Occurrence Statistics for MT Hypotheses
• The general idea: determine the fraction of feature instances that have been observed to occur in a large human-produced corpus
• For n-grams:
  • Extract all n-gram sequences of order n from the hypothesis
  • Look up whether each n-gram instance occurs in the corpus
  • Calculate the fraction of "found" n-grams for each order n
• For n-chains:
  • Parse the MT hypothesis (into a dependency structure)
  • Look up whether each n-chain instance occurs in a database of n-chains extracted from the large corpus
  • Calculate the fraction of "found" n-chains for each order n (see the lookup sketch below)
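A sketch of the lookup step for n-chains, assuming the chain database has already been built offline from the parsed corpus; here it is just a Python set of chain tuples, and the hypothesis chains mix a few well-formed chains with some garbled ones:

```python
def found_fraction(instances, database):
    """Fraction of feature instances that occur at least once in the database."""
    if not instances:
        return 0.0
    return sum(1 for x in instances if x in database) / len(instances)

# Hypothetical chain database extracted offline from a large parsed corpus.
chain_db = {("the", "boy"), ("boy", "ate"), ("red", "apple"), ("the", "boy", "ate")}

# Bi-chains and tri-chains extracted from one MT hypothesis (e.g. with head_chains above).
hyp_bichains  = [("the", "boy"), ("boy", "ate"), ("apple", "red"), ("ate", "apple")]
hyp_trichains = [("the", "boy", "ate"), ("apple", "red", "ate")]

print(found_fraction(hyp_bichains, chain_db))   # 0.5
print(found_fraction(hyp_trichains, chain_db))  # 0.5
```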
Content-word Co-occurrence Statistics
• Content-word co-occurrences: (unordered) pairs of content words (nouns, verbs, adjectives, adverbs) that co-occur in the same sentence
• Restricted version: the subset of co-occurrences that are in a direct syntactic dependency within the sentence (a subset of the bi-chains)
• Idea:
  • Learn co-occurrence pair strengths from large monolingual corpora using statistical association measures: Dice, t-score, chi-square, likelihood ratio (see the Dice sketch below)
  • Use the average co-occurrence pair strength as a feature for scoring MT hypotheses
• A weak way of capturing the syntax/semantics within sentences
• Preliminary experiments show that these features are somewhat effective in discriminating between MT output and human references
• Thanks Ben Han! [MT Lab Project, 2005]
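A sketch of one of the association measures named above (Dice), computed over a toy corpus; the sentences and content-word list are made up for illustration:

```python
from itertools import combinations
from collections import Counter

def dice_scores(sentences, content_words):
    """Dice association for unordered content-word pairs co-occurring in the same sentence."""
    word_count, pair_count = Counter(), Counter()
    for sent in sentences:
        words = sorted(set(w for w in sent.split() if w in content_words))
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    return {pair: 2 * c / (word_count[pair[0]] + word_count[pair[1]])
            for pair, c in pair_count.items()}

corpus = ["the judge sentenced the defendant",
          "the judge read the sentence aloud",
          "the defendant appealed the sentence"]
content = {"judge", "sentenced", "defendant", "sentence", "read", "appealed", "aloud"}

scores = dice_scores(corpus, content)
print(scores[("defendant", "sentence")])   # co-occurrence strength of this pair
```

The average strength of the pairs extracted from an MT hypothesis would then be the single feature fed to the scorer.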
Preliminary Experiments I
• Goal: compare n-gram occurrence statistics for MT hypotheses versus human-produced (reference) translations
• Setup:
  • Data: NIST Arabic-to-English MT-Eval 2003 (about 1000 sentences)
  • Output from three strong MT systems and four reference translations
  • Used the Suffix-Array LM (SALM) toolkit [Zhang and Vogel 2006], modified to return, for each query string, the length of the longest suffix of the string that occurs in the corpus
  • SALM used to index a subset of 600 million words from the Gigaword corpus
  • Searched for all n-gram sequences of length eight extracted from the translation
• Thanks to Greg Hanneman!
Preliminary Experiments II
• Goal: compare ranking of MT and "human" systems according to statistical LMs versus a function based on long n-gram occurrence statistics
• Same data setup as in the first experiment
• Calculate sentence scores as the average per-word LM score
• System score is the average over all its sentence scores
• Score each system with three different LMs:
  • SRI-LM trigram LM trained on 260 million words
  • SALM suffix-array LM trained on 600 million words
  • A new function that assigns exponentially more weight to longer n-gram "hits":
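The specific function is not reproduced here. Purely as an illustration of the idea, and as an assumed form rather than the actual function from the talk, one such score could average an exponential weight over the longest corpus match ending at each word:

\[ \mathrm{score}(w_1 \ldots w_L) \;=\; \frac{1}{L} \sum_{i=1}^{L} 2^{\,m_i} \]

where \(m_i\) is the length of the longest n-gram ending at position \(i\) that occurs in the indexed corpus (the quantity returned by the modified SALM lookup).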
Preliminary Experiments III
• Goal: directly discriminate between MT and human translations using a binary SVM classifier trained on n-gram versus n-chain occurrence statistics
• Setup:
  • Data: NIST Chinese-to-English MT-Eval 2003 (919 sentences)
  • Four MT system outputs and four human reference translations
  • N-chain database created using SALM by extracting all n-chains from a dependency-parsed version of the English Europarl corpus (600K sentences)
  • Train SVM classifier on 400 sentences from two MT systems and two human "systems"
  • Test classification accuracy on 200 unseen test sentences from the same MT and human systems
  • Features for SVM: n-gram "hit" fractions (all n) vs. n-chain fractions (see the toy sketch below)
• Thanks to Vamshi Ambati!
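A toy sketch of the classification setup, using scikit-learn's SVC on synthetic per-order hit-fraction vectors; the data, the class means, and the choice of scikit-learn are all illustrative assumptions, not what was actually used:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Each row: "found" fractions for orders n = 1..8 (n-gram or n-chain hits).
# Synthetic data: human sentences tend to have higher fractions at high orders.
def synthetic_fractions(n_sent, high_order_mean):
    low = rng.uniform(0.8, 1.0, size=(n_sent, 3))                                 # orders 1-3
    high = np.clip(rng.normal(high_order_mean, 0.15, size=(n_sent, 5)), 0, 1)     # orders 4-8
    return np.hstack([low, high])

X_mt, X_human = synthetic_fractions(200, 0.25), synthetic_fractions(200, 0.55)
X = np.vstack([X_mt, X_human])
y = np.array([0] * 200 + [1] * 200)          # 0 = MT output, 1 = human reference

clf = SVC(kernel="linear").fit(X[::2], y[::2])      # train on half the sentences
print("accuracy:", clf.score(X[1::2], y[1::2]))     # test on the held-out half
```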
Preliminary Experiments III
• Results:
  • Experiment 1:
    • N-gram classifier: 49% accuracy
    • N-chain classifier: 69% accuracy
  • Experiment 2:
    • N-gram classifier: 52% accuracy
    • N-chain classifier: 63% accuracy
• Observations:
  • Mixing n-grams and n-chains did not improve classification accuracy
  • Features include both high- and low-order instances (did not try with only high-order ones)
  • The n-chain database is from a different domain than the test data, and not a very large corpus
Preliminary Conclusions
• Statistical LMs do not discriminate well between MT hypotheses and human reference translations, and are also poor at discriminating between good and bad MT hypotheses
• Occurrence statistics of long n-grams and n-chains differ significantly between MT hypotheses and human reference translations
• These can potentially be useful as discriminant features for identifying better (more grammatical and fluent) translations
Research Challenges
• Develop Infrastructure for Computing with Knowledge-Rich Features:
  • Scale up to querying against much larger monolingual corpora (terabytes and up)
  • Parsing and annotation of such vast corpora
• Explore more complex features
• Finding the set of features that are most discriminant
• Develop Methodologies for training LM-like discriminant scoring functions:
  • SVM and/or other classifiers on MT versus human
  • SVM and/or other classifiers on MT versus MT "Oracle"
  • Direct regression against human judgments
  • Parameter optimization for maximizing automatic MT metric scores (BLEU, METEOR, etc.)
• "Incremental" features that can be used during decoding versus the full set of features for n-best list reranking (see the reranking sketch below)
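For the n-best reranking use case, a minimal sketch of how a discriminant score could be combined with the decoder's own score; the stand-in discriminant function, the weights, and the hypotheses below are invented for illustration:

```python
def rerank_nbest(nbest, baseline_weight, discriminant_weight, discriminant_score):
    """Re-score an n-best list by combining the decoder's score with a discriminant LM score."""
    rescored = [(baseline_weight * decoder_score
                 + discriminant_weight * discriminant_score(hyp), hyp)
                for hyp, decoder_score in nbest]
    return max(rescored)[1]        # hypothesis with the best combined score

# Stand-in discriminant score: fraction of 4-grams "seen" in a toy corpus index.
seen = {("the", "boy", "ate", "the"), ("boy", "ate", "the", "red"), ("ate", "the", "red", "apple")}
def discriminant_score(hyp):
    toks = hyp.split()
    grams = [tuple(toks[i:i + 4]) for i in range(len(toks) - 3)]
    return sum(g in seen for g in grams) / len(grams) if grams else 0.0

nbest = [("the boy the red apple ate", -9.5),     # decoder's top choice, but garbled
         ("the boy ate the red apple", -10.1)]    # better hypothesis further down the list
print(rerank_nbest(nbest, 0.1, 5.0, discriminant_score))
```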
New Connections with IR
• The "occurrence-based" formulation of the LM problem transforms it from a counting and estimation problem into an IR-like querying problem
• To be effective, we think this may require querying against extremely large volumes of monolingual text, and structured versions of such text. Can we do this against local snapshots of the entire web?
• The SALM suffix-array infrastructure can currently handle up to about the size of the Gigaword corpus (within 16 GB of memory) (see the suffix-array sketch below)
• Can IR engines such as LEMUR/Indri be adapted to the task?
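The appeal of the suffix-array approach is that "has this n-gram been seen at least once?" becomes a binary search over a single index of the corpus, rather than a table of all n-grams of all orders. A minimal word-level sketch of that lookup (a toy stand-in, not SALM itself):

```python
class SuffixArrayLookup:
    """Word-level suffix array over a token stream; answers 'does this n-gram occur?'."""
    def __init__(self, tokens):
        self.tokens = tokens
        # Sort suffix start positions by the suffix they begin (toy O(n^2 log n) construction).
        self.sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])

    def contains(self, ngram):
        ngram = list(ngram)
        lo, hi = 0, len(self.sa)
        while lo < hi:                    # binary search for the first suffix >= ngram
            mid = (lo + hi) // 2
            start = self.sa[mid]
            if self.tokens[start:start + len(ngram)] < ngram:
                lo = mid + 1
            else:
                hi = mid
        if lo == len(self.sa):
            return False
        start = self.sa[lo]
        return self.tokens[start:start + len(ngram)] == ngram

corpus_tokens = "the boy ate the red apple </s> the girl read the long book </s>".split()
index = SuffixArrayLookup(corpus_tokens)
print(index.contains(("ate", "the", "red")))   # True
print(index.contains(("red", "the", "ate")))   # False
```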
New Connections with IR
• Challenges this type of task imposes on IR (insights from Jamie Callan):
• The larger issue: IR search engines as query interfaces to vast collections of structured text:
  • Building an index suitable for very fast "n-gram" lookups that satisfy certain properties
  • The n-gram sequences might be a mix of surface features and derived features based on text annotations, e.g., $PersonName, or POS=N
• Specific Challenges:
  • How to build such indexes for fast access?
  • What does the query language look like?
  • How to deal with memory/disk vs. speed tradeoff issues?
• Can we get LTI students to do this kind of research?
Final Words…
• A novel and exciting new research direction: there are at least one or two PhD theses hiding in here…
• Submitted as a grant proposal to NSF last December (jointly with Rebecca Hwa from Pitt)
• Influences: some of these ideas were influenced by Jaime's CBMT work, and by Rebecca's work on using syntactic features for automatic MT evaluation metrics
• Acknowledgments:
  • Thanks to Joy Zhang and Stephan Vogel for making the SALM toolkit available to us
  • Thanks to Rebecca Hwa and to my students Ben Han, Greg Hanneman and Vamshi Ambati for preliminary work on these ideas