Course Summary LING 575 Fei Xia 03/06/07
Outline • Introduction to MT: 1 • Major approaches • SMT: 3 • Transfer-based MT: 2 • Hybrid systems: 2 • Other topics
Major challenges • Translation is hard. • Getting the right words: • Choosing the correct root form • Getting the correct inflected form • Inserting “spontaneous” words • Putting the words in the correct order: • Word order: SVO vs. SOV, … • Unique constructions: • Divergence
Lexical choice • Homonymy/Polysemy: bank, run • Concept gap: no corresponding concept in the other language: go Greek, go Dutch, feng shui, lame duck, … • Coding (concept-to-lexeme mapping) differences: • More distinctions in one language: e.g., kinship vocabulary. • Different division of conceptual space
Major approaches • Transfer-based • Interlingua • Example-based (EBMT) • Statistical MT (SMT) • Hybrid approach
The MT triangle (figure) • Apex: Meaning (interlingua) • Sides: Analysis, Synthesis • Levels, from apex to base: Transfer-based; Phrase-based SMT, EBMT; Word-based SMT, EBMT • Base: word to word
Evaluation • Unlike many NLP tasks (e.g., tagging, chunking, parsing, IE, pronoun resolution), there is no single gold standard for MT. • Human evaluation: accuracy, fluency, … • Problem: expensive, slow, subjective, non-reusable. • Automatic measures: • Edit distance • Word error rate (WER), Position-independent WER (PER) • Simple string accuracy (SSA), Generation string accuracy (GSA) • BLEU
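As an illustration of the edit-distance-based measures listed above, here is a minimal sketch of word error rate (WER), assuming a single reference translation and whitespace tokenization. The function names `edit_distance` and `wer` are chosen for this example only.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = edit distance divided by the reference length (in words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Because WER penalizes any reordering, a fluent translation with a different word order can score poorly, which is one motivation for the position-independent and n-gram-based measures in the list.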
Word-based SMT • IBM Models 1-5 • Main concepts: • Source channel model • Hidden word alignment • EM training
Source channel model for MT • Noisy channel: an Eng sent is generated with probability P(E), passes through the channel P(F | E), and comes out as a Fr sent • Two types of parameters: • Language model: P(E) • Translation model: P(F | E)
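The decoding objective implied by the source channel model can be written out with Bayes' rule. This is the standard derivation, stated here in the slide's notation (E for the English sentence, F for the French sentence); the hat notation is an addition for this example.

```latex
\hat{E} = \arg\max_{E} P(E \mid F)
        = \arg\max_{E} \frac{P(E)\, P(F \mid E)}{P(F)}
        = \arg\max_{E} P(E)\, P(F \mid E)
```

P(F) can be dropped because it does not depend on E, which leaves exactly the two parameter types on the slide: the language model and the translation model.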
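Modeling • Model 1: $P(F \mid E) = \frac{P(m \mid l)}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$ • Model 2: $P(F \mid E) = P(m \mid l) \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)\, d(i \mid j, m, l)$ • Parameters: • Length prob: $P(m \mid l)$ • Translation prob: $t(f_j \mid e_i)$ • Distortion prob (for Model 2): $d(i \mid j, m, l)$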
Training • Model 1:
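A minimal sketch of EM training for the Model 1 translation probabilities t(f | e), assuming a tokenized sentence-aligned bitext, a NULL word prepended to every English sentence, and uniform initialization. The names `train_model1`, `NULL`, and the toy German-English bitext are hypothetical and used only for illustration.

```python
from collections import defaultdict

NULL = "<NULL>"  # hypothetical token standing for the empty English word e_0


def train_model1(bitext, iterations=10):
    """Estimate t[(f, e)] = t(f | e) with EM.

    bitext: list of (english_tokens, foreign_tokens) pairs.
    """
    f_vocab = {f for _, fs in bitext for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)  # uniform start for every (f, e) pair

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        # E-step: distribute each foreign word over the English words
        # in its sentence, in proportion to the current t(f | e).
        for es, fs in bitext:
            es = [NULL] + list(es)
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize the expected counts per English word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


if __name__ == "__main__":
    toy = [
        (["the", "house"], ["das", "Haus"]),
        (["the", "book"], ["das", "Buch"]),
        (["a", "book"], ["ein", "Buch"]),
    ]
    t = train_model1(toy)
    for f in ["das", "Haus", "Buch"]:
        print(f, "| the:", round(t[(f, "the")], 3))
```

The hidden word alignment never appears explicitly: because Model 1 treats every alignment position as equally likely, the expected counts can be computed per word pair without enumerating alignments.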
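Finding the best alignment • Given E and F, we are looking for $\hat{a} = \arg\max_{a} P(a \mid E, F) = \arg\max_{a} P(F, a \mid E)$ • Model 1: $\hat{a}_j = \arg\max_{i} t(f_j \mid e_i)$, chosen independently for each position $j$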
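For Model 1 the best alignment decomposes per foreign word, since each f_j can pick its English word independently. A minimal sketch, assuming the translation table is a plain dictionary keyed by (f, e) and that index 0 stands for the NULL word; `best_alignment` and the toy probabilities are hypothetical.

```python
def best_alignment(es, fs, t, null="<NULL>"):
    """Return a list a where a[j] is the index of the English word
    aligned to fs[j]; index 0 is the NULL word, 1..l are the words of es."""
    es = [null] + list(es)
    alignment = []
    for f in fs:
        # Pick the English position with the highest t(f | e).
        best_i = max(range(len(es)), key=lambda i: t.get((f, es[i]), 0.0))
        alignment.append(best_i)
    return alignment


if __name__ == "__main__":
    # Toy translation probabilities, made up for this example.
    t = {
        ("das", "the"): 0.9, ("das", "house"): 0.1, ("das", "<NULL>"): 0.2,
        ("Haus", "the"): 0.05, ("Haus", "house"): 0.8, ("Haus", "<NULL>"): 0.1,
    }
    print(best_alignment(["the", "house"], ["das", "Haus"], t))
```

Model 2 would multiply each candidate by the distortion probability d(i | j, m, l) before taking the maximum; the per-position decomposition still holds.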
Clump-based SMT • The unit of translation is a clump. • Training stage: • Word alignment • Extracting clump pairs • Decoding stage: • Try all segmentations of the src sent and all the allowed permutations • For each src clump, try TopN tgt clumps • Prune the hypotheses
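The "extracting clump pairs" step can be sketched with the usual consistency criterion: a source span and a target span form a pair if no alignment link crosses the boundary of the pair. This is a simplified sketch, not necessarily the exact procedure used in the course; it takes the tightest target span only and does not extend spans over unaligned words. `extract_clump_pairs` and `max_len` are hypothetical names.

```python
def extract_clump_pairs(src, tgt, alignment, max_len=3):
    """Extract clump pairs consistent with a word alignment.

    src, tgt: token lists.
    alignment: set of (i, j) links meaning src[i] is aligned to tgt[j].
    max_len: maximum clump length on either side.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no target word inside [j1, j2] may be linked
            # to a source word outside [i1, i2].
            crosses = any(
                j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment
            )
            if crosses:
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


if __name__ == "__main__":
    src = ["maison", "bleue"]
    tgt = ["blue", "house"]
    links = {(0, 1), (1, 0)}
    for pair in sorted(extract_clump_pairs(src, tgt, links)):
        print(pair)
```

Multi-word pairs such as ("maison bleue", "blue house") capture local reordering inside the clump, which is what lets the decoder get by with a limited set of permutations between clumps.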
Transfer-based MT • Analysis, transfer, generation: • Example: (Quirk et al., 2005) • Parse the source sentence • Transform the parse tree with transfer rules • Translate source words • Get the target sentence from the tree • Translation as parsing: • Example: (Wu, 1995)
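The analysis, transfer, generation pipeline can be illustrated with a toy sketch. It assumes the parse is already given as nested (label, children) tuples, uses one hand-written reordering rule (VP: V NP becomes NP V, as in an SVO to SOV transfer), and a made-up three-word lexicon. It is not the treelet method of Quirk et al. (2005); all names and data here are hypothetical.

```python
def transfer(tree, rules):
    """Reorder children bottom-up. rules maps
    (parent_label, child_labels) to a permutation of child indices."""
    label, children = tree
    if isinstance(children, str):  # leaf: (POS tag, word)
        return tree
    children = [transfer(child, rules) for child in children]
    key = (label, tuple(child[0] for child in children))
    if key in rules:
        children = [children[i] for i in rules[key]]
    return (label, children)


def translate_leaves(tree, lexicon):
    """Replace each source word with its lexicon entry (unknown words pass through)."""
    label, children = tree
    if isinstance(children, str):
        return (label, lexicon.get(children, children))
    return (label, [translate_leaves(child, lexicon) for child in children])


def tree_yield(tree):
    """Read the sentence off the tree, left to right."""
    label, children = tree
    if isinstance(children, str):
        return [children]
    return [word for child in children for word in tree_yield(child)]


if __name__ == "__main__":
    # Hand-built parse of "I read books"; a real system would call a parser.
    parse = ("S", [("NP", [("PRP", "I")]),
                   ("VP", [("V", "read"), ("NP", [("NN", "books")])])])
    rules = {("VP", ("V", "NP")): [1, 0]}  # toy rule: verb moves after object
    lexicon = {"I": "watashi-wa", "read": "yomu", "books": "hon-o"}  # toy entries
    target = translate_leaves(transfer(parse, rules), lexicon)
    print(" ".join(tree_yield(target)))
```

The four functions map onto the four steps on the slide: the given parse, `transfer`, `translate_leaves`, and `tree_yield`.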
Hybrid approaches • Preprocessing with transfer rules: (Xia and McCord, 2004), (Collins et al., 2005) • Postprocessing with taggers, parsers, etc.: JHU 2003 workshop • Hierarchical phrase-based model: (Chiang, 2005) • …
Other issues • Resources • MT for low-density languages • Using comparable corpora and Wikipedia • Special translation modules • Identifying and translating named entities and abbreviations • …
To build an MT system (1) • Gather resources • Parallel corpora, comparable corpora • Grammars, dictionaries, … • Process data • Document alignment, sentence alignment • Tokenization, parsing, …
To build an MT system (2) • Modeling • Training • Word alignment and extracting clump pairs • Learning transfer rules • Decoding • Identifying entities and translating them with special modules (optional) • Translation as parsing, or parse + transfer + translation • Segmenting the src sentence, replacing each src clump with a target clump, …
To build an MT system (3) • Post-processing • System combination • Reranking • Using the system for other applications: • Cross-lingual IR • Computer-assisted translation • …
Misc • Grades • Assignments (hw1-hw3): 30% • Class participation: 20% • Project: • Presentation: 25% • Final paper: 25%