“Searching to Translate”, and “Translating to Search”: When Information Retrieval Meets Machine Translation • Ferhan Ture • Dissertation defense • May 24th, 2013 • Department of Computer Science, University of Maryland at College Park
Motivation • forum posts • multi-lingual text • clustered summaries • user’s native language • Fact 1: People want to access information e.g., web pages, videos, restaurants, products, … • Fact 2: Lots of data out there … but also lots of noise, redundancy, different languages • Goal: Find ways to efficiently and effectively • Search complex, noisy data • Deliver content in appropriate form
Information Retrieval
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). In our work, we assume that the material is a collection of documents written in natural language, and the information need is provided in the form of a query, ranging from a few words to an entire document. A typical approach in IR is to represent each document as a vector of weighted terms, where a term usually means either a word or its stem. A pre-determined list of stop words (e.g., ``the'', ``an'', ``my'') may be removed from the set of terms, since they have been found to create noise in the search process. Documents are scored, relative to the query, usually by scoring each query term independently and aggregating these term-document scores.
• Term weights for the passage above (stemmed terms): queri: 11.69, ir: 11.39, vector: 7.93, document: 7.09, nois: 6.92, stem: 6.56, score: 5.68, weight: 5.67, word: 5.46, materi: 5.42, search: 5.06, term: 5.03, text: 4.87, comput: 4.73, need: 4.61, collect: 4.48, natur: 4.36, languag: 4.12, find: 3.92, repres: 3.58
• The same passage after stop-word removal and stemming: retriev ir find materi usual document unstructur natur usual text satisfi need larg collect usual store comput work assum materi collect document written natur languag need form queri rang word entir document typic approach ir repres document vector weight term term mean word stem pre-determin list word may remov set term found creat nois search process document score relat queri score queri term independ aggreg term-docu score
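The vector-of-weighted-terms idea can be sketched in a few lines of Python. This is a minimal illustration, not the system used in the talk: the stop-word list, the `tf * log((N + 1) / df)` weighting, and all function names are assumptions, and stemming is left out for brevity.

```python
import math
from collections import Counter

# Assumed toy stop-word list; real systems use a longer, pre-determined list.
STOP_WORDS = {"the", "an", "a", "my", "of", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,;:()\"'`").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def build_index(docs):
    """Return per-document term frequencies and collection document frequencies."""
    tfs = [Counter(tokenize(d)) for d in docs]
    df = Counter(term for tf in tfs for term in tf)
    return tfs, df

def score(query, tf, df, n_docs):
    """Score each query term independently (tf * idf) and sum the results."""
    total = 0.0
    for term in tokenize(query):
        if tf[term] > 0:
            total += tf[term] * math.log((n_docs + 1) / df[term])
    return total

docs = [
    "Information retrieval is finding documents in large collections.",
    "Machine translation is translating text into a target language.",
    "A query is scored against each document vector.",
]
tfs, df = build_index(docs)
query = "translating text"
ranked = sorted(range(len(docs)),
                key=lambda i: score(query, tfs[i], df, len(docs)),
                reverse=True)
print(ranked[0])  # index of the best-scoring document
```

Because each query term is scored on its own, adding or removing a query term only adds or removes one summand, which is what makes this family of scoring functions easy to adapt to translated queries later on.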
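Cross-Language Information Retrieval
• Example German document: Information Retrieval (IR) bzw. Informationsrückgewinnung, gelegentlich ungenau Informationsbeschaffung, ist ein Fachgebiet, das sich mit computergestütztem Suchen nach komplexen Inhalten (also z. B. keine Einzelwörter) beschäftigt und in die Bereiche Informationswissenschaft, Informatik und Computerlinguistik fällt. Wie aus der Wortbedeutung von retrieval (deutsch Abruf, Wiederherstellung) hervorgeht, sind komplexe Texte oder Bilddaten, die in großen Datenbanken gespeichert werden, für Außenstehende zunächst nicht zugänglich oder abrufbar. Beim Information Retrieval geht es darum bestehende Informationen aufzufinden, nicht neue Strukturen zu entdecken (wie beim Knowledge Discovery in Databases, zu dem das Data Mining und Text Mining gehören).
• English gloss: Information retrieval (IR), also Informationsrückgewinnung or, loosely, Informationsbeschaffung, is a field concerned with computer-assisted search for complex content (i.e., not single words, for example), and it falls within information science, computer science, and computational linguistics. As the meaning of the word retrieval (German: Abruf, Wiederherstellung) suggests, complex texts or image data stored in large databases are at first not accessible or retrievable to outsiders. Information retrieval is about finding existing information, not discovering new structures (as in Knowledge Discovery in Databases, which includes data mining and text mining).
• [Figure residue: a column of counts (89,933 … 9,394) and a column of weights (3.4 … 0.8) for the German document's terms; the term labels are not preserved]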
Machine Translation • Machine translation (MT) is the task of translating text written in a source language into corresponding text in a target language. • Example German rendering: Maschinelle Übersetzung (MT) ist, um Text in einer Ausgangssprache in entsprechenden Text in der Zielsprache geschrieben übersetzen.
Motivation Cross-language IR • multi-lingual text • user’s native language • MT • Fact 1: People want to access information e.g., web pages, videos, restaurants, products, … • Fact 2: Lots of data out there … but also lots of noise, redundancy, different languages • Goal: Find ways to efficiently and effectively • Search complex, noisy data • Deliver content in appropriate form
Outline • Introduction • Searching to Translate (IR→MT) • Cross-Lingual Pairwise Document Similarity • Extracting Parallel Text From Comparable Corpora • Translating to Search (MT→IR) • Context-Sensitive Query Translation • Conclusions • Publications: (Ture et al., SIGIR’11), (Ture and Lin, NAACL’12), (Ture et al., SIGIR’12), (Ture et al., COLING’12), (Ture and Lin, SIGIR’13)
Extracting Parallel Text from the Web • Phase 1: source collection F → Preprocess → doc vectors F → Signature Generation → signatures F • target collection E → Preprocess → doc vectors E → Signature Generation → signatures E • signatures F + signatures E → Sliding Window Algorithm → cross-lingual document pairs • Phase 2: cross-lingual document pairs → Candidate Generation → candidate sentence pairs → 2-step Parallel Text Classifier → aligned bilingual sentence pairs (F-E parallel text)
Pairwise Similarity • Pairwise similarity: • finding similar pairs of documents in a large collection • Challenges • quadratic search space • measuring similarity effectively and efficiently • Focus on recall and scalability
Locality-Sensitive Hashing • Ne English articles → Preprocess → Ne English document vectors, e.g., <nobel=0.324, prize=0.227, book=0.01, …> → Signature generation → Ne signatures, e.g., [0111000010...] → Sliding window algorithm → Similar article pairs
Locality-Sensitive Hashing (Ravichandran et al., 2005) • LSH(vector) = signature, s.t. similarity(vector pair) ≈ similarity(signature pair) • faster similarity computation, e.g., ~20 times faster than computing (cosine) similarity from vectors • similarity error ≈ 0.03 • Sliding window algorithm • approximate similarity search based on LSH • linear run-time
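One common way to obtain bit signatures whose Hamming distance tracks cosine similarity is random-hyperplane projection. The sketch below assumes that scheme; the function names, the seed, and the toy vectors (the first reuses the weights shown on the earlier slide) are illustrative, and the speed and error figures quoted above are not reproduced by this toy.

```python
import math
import random

def make_hyperplanes(vocab, n_bits, seed=0):
    """Draw one random Gaussian hyperplane per signature bit (assumed scheme)."""
    rng = random.Random(seed)
    return [{term: rng.gauss(0.0, 1.0) for term in vocab} for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [1 if sum(w * plane[t] for t, w in vector.items()) >= 0 else 0
            for plane in hyperplanes]

def hamming(sig_a, sig_b):
    """Number of positions where two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

def estimated_cosine(sig_a, sig_b):
    """The fraction of differing bits estimates angle / pi, so take its cosine."""
    return math.cos(math.pi * hamming(sig_a, sig_b) / len(sig_a))

def cosine(u, v):
    """Exact cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm

u = {"nobel": 0.324, "prize": 0.227, "book": 0.01}
v = {"nobel": 0.3, "prize": 0.1, "award": 0.2}  # hypothetical second document
planes = make_hyperplanes(sorted(set(u) | set(v)), n_bits=1000)
sig_u, sig_v = signature(u, planes), signature(v, planes)
print(cosine(u, v), estimated_cosine(sig_u, sig_v))
```

Comparing two 1000-bit signatures is a fixed-cost bit operation regardless of how many terms the documents contain, which is where the speed-up over vector cosine comes from.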
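Sliding window algorithm: Generating tables • Signatures: …, (1, 11011011101), (2, 01110000101), (3, 10101010000), … • Map: permute every signature with each of the permutations p1 … pQ • list1 (after p1): …, (11111101010, 1), (10011000110, 2), (01100100100, 3), … • listQ (after pQ): …, (11111001011, 1), (00101001110, 2), (10010000101, 3), … • Reduce: sort each list to obtain a table; each entry keeps its document id • table1: …, (01100100100, 3), (10011000110, 2), (11111101010, 1), … • tableQ: …, (00101001110, 2), (10010000101, 3), (11111001011, 1), …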
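Sliding window algorithm: Detecting similar pairs • Map over table1 … tableQ, e.g., table1: …, (01100100100, 3), (10011000110, 2), (11111101010, 1), … • Slide a window over each sorted table and compare the signatures inside it, e.g.: 00000110101, 00010001111, 00100101101, 00110000000, 00110010000, 00110011111, 00110101000, 00111010010, 10010011011, 10010110011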
Sliding window algorithm: Example • # bits = 11, # tables = 2, window size = 2 • Signatures: (1, 11011011101), (2, 01110000101), (3, 10101010000) • Map (permute with p1) → list1: (<1,11111101010>, 1), (<1,10011000110>, 2), (<1,01100100100>, 3) • Reduce (sort) → table1: (<1,01100100100>, 3), (<1,10011000110>, 2), (<1,11111101010>, 1) • Distance(3,2) = 7 ✗ • Distance(2,1) = 5 ✓ • Map (permute with p2) → list2: (<2,11111001011>, 1), (<2,00101001110>, 2), (<2,10010000101>, 3) • Reduce (sort) → table2: (<2,00101001110>, 2), (<2,10010000101>, 3), (<2,11111001011>, 1) • Distance(2,3) = 7 ✗ • Distance(3,1) = 6 ✓
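The example above can be replayed in a short Python sketch. It takes the two permuted lists exactly as they appear on the slide, sorts each into a table, and compares every entry with the entries that follow it inside the window. The function names and the distance threshold of 6 are assumptions; the slide only shows that distances 5 and 6 are accepted and 7 is rejected.

```python
def hamming(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def sliding_window_pairs(permuted_lists, window, max_distance):
    """permuted_lists: one list of (permuted_signature, doc_id) per table."""
    found = {}
    for entries in permuted_lists:
        table = sorted(entries)  # lexicographic sort on the permuted bit string
        for i, (sig_i, doc_i) in enumerate(table):
            # Compare with the next (window - 1) entries only.
            for sig_j, doc_j in table[i + 1 : i + window]:
                d = hamming(sig_i, sig_j)
                if d <= max_distance:
                    found[tuple(sorted((doc_i, doc_j)))] = d
    return found

# Permuted signatures copied from the slide (permutations p1 and p2).
list1 = [("11111101010", 1), ("10011000110", 2), ("01100100100", 3)]
list2 = [("11111001011", 1), ("00101001110", 2), ("10010000101", 3)]

pairs = sliding_window_pairs([list1, list2], window=2, max_distance=6)
print(pairs)  # {(1, 2): 5, (1, 3): 6}
```

Each table costs one sort plus a linear scan, so the number of comparisons grows with (number of tables × window size × collection size) rather than with the square of the collection size.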
Cross-lingual Pairwise Similarity • MT: German Doc A → MT translate → English Doc A → doc vector vA; English Doc B → doc vector vB • CLIR: German Doc A → doc vector vA (German) → CLIR translate → doc vector vA (English); English Doc B → doc vector vB
MT vs. CLIR for Pairwise Similarity • [Figure: similarity distributions for clir-neg, clir-pos, mt-neg, mt-pos] • positive and negative pairs clearly separated • low similarity values • MT slightly better than CLIR, but 600 times slower!
Locality-Sensitive Hashing for Pairwise Similarity • Ne English articles → Preprocess → Ne English document vectors, e.g., <nobel=0.324, prize=0.227, book=0.01, …> → Signature generation → Ne signatures, e.g., [0111000010...] → Sliding window algorithm → Similar article pairs
Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity • Nf German articles → CLIR Translate; Ne English articles → Preprocess → Ne English document vectors • together: Ne+Nf English document vectors, e.g., <nobel=0.324, prize=0.227, book=0.01, …> → Signature generation → Ne+Nf signatures, e.g., [0111000010...] → Sliding window algorithm → Similar article pairs
Evaluation • Experiments with De/Es/Cs/Ar/Zh/Tr to En Wikipedia • Collection: 3.44m En + 1.47m De Wikipedia articles • Task: For each German Wikipedia article, find {all English articles s.t. cosine similarity > 0.30} • Parameters: # bits (D) = 1000, # tables (Q) = 100-1500, window size (B) = 100-2000
Evaluation • algorithm output: document vectors → Signature generation → Signatures → Sliding window algorithm → Similar article pairs (two sources of error) • upperbound: Signatures → Brute-force approach → Similar article pairs • ground truth: document vectors → Brute-force approach → Similar article pairs
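Comparing the three sets of similar pairs comes down to set recall. The snippet below is a minimal sketch with made-up pair sets; it assumes that the ground truth is the brute-force result on document vectors and the upper bound is the brute-force result on signatures, which is one reading of the diagram.

```python
def recall(found, reference):
    """Fraction of reference pairs that also appear in the found set."""
    return len(found & reference) / len(reference) if reference else 1.0

# Hypothetical pair sets, for illustration only.
ground_truth = {(1, 2), (1, 3), (2, 4), (3, 5)}   # brute force on document vectors
upper_bound = {(1, 2), (1, 3), (2, 4)}            # brute force on signatures
algorithm_output = {(1, 2), (1, 3)}               # sliding window on signatures

print(recall(upper_bound, ground_truth))       # loss due to signatures alone
print(recall(algorithm_output, upper_bound))   # loss due to the sliding window alone
print(recall(algorithm_output, ground_truth))  # combined loss
```

Reporting recall against both references separates the error introduced by hashing from the error introduced by the approximate search.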
Evaluation • 100% recall: no savings = no free lunch! • 95% recall at 39% cost • 99% recall at 70% cost • 99% recall at 62% cost • 95% recall at 40% cost
Phase 2: Extracting Parallel Text • Approach • Generate candidate sentence pairs from each document pair • Classify each candidate as ‘parallel’ or ‘not parallel’ • Challenge: tens of millions of document pairs ≈ hundreds of billions of sentence pairs • Solution: 2-step classification approach • a simple classifier efficiently filters out irrelevant pairs • a complex classifier effectively classifies remaining pairs
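The two-step idea can be sketched as a cascade: a cheap score prunes the Cartesian product of sentences, and a costlier score runs only on the survivors. Everything below is a hypothetical stand-in (the length-ratio filter, the lexicon-overlap score, the thresholds, and the toy sentences), not the classifiers used in the talk.

```python
from itertools import product

def candidate_pairs(src_sentences, tgt_sentences):
    """All source/target sentence combinations for one document pair."""
    return list(product(src_sentences, tgt_sentences))

def cheap_score(pair):
    """Hypothetical cheap feature: token-length ratio, in (0, 1]."""
    s, t = (len(x.split()) for x in pair)
    return min(s, t) / max(s, t)

def costly_score(pair, lexicon):
    """Hypothetical costlier feature: share of source words with a listed
    translation that occurs in the target sentence."""
    src = pair[0].lower().split()
    tgt = set(pair[1].lower().split())
    return sum(1 for w in src if lexicon.get(w, set()) & tgt) / len(src)

def two_step_classify(pairs, lexicon, cheap_min=0.5, costly_min=0.5):
    """Step 1 filters with the cheap score; step 2 scores only the survivors."""
    survivors = [p for p in pairs if cheap_score(p) >= cheap_min]
    accepted = [p for p in survivors if costly_score(p, lexicon) >= costly_min]
    return survivors, accepted

src = ["congé de maternité en Europe", "la plupart laisse leurs enfants"]
tgt = ["maternity leave in Europe", "most leave their children",
       "this long sentence has far too many words to be a translation of either one"]
lexicon = {"congé": {"leave"}, "maternité": {"maternity"}, "en": {"in"},
           "europe": {"europe"}, "plupart": {"most"}, "laisse": {"leave"},
           "leurs": {"their"}, "enfants": {"children"}}

survivors, accepted = two_step_classify(candidate_pairs(src, tgt), lexicon)
print(len(survivors), accepted)
```

The design point is that the expensive score is never computed for pairs the cheap score already rejected, which matters when the candidate set is in the hundreds of billions.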
Parallel Text (Bitext) Classifier • cosine similarity of the two sentences • sentence length ratio: the ratio of lengths of the two sentences • word translation ratio: ratio of words in source (target) sentence with a translation in target (source) sentence
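The three features listed above are straightforward to compute once a bilingual lexicon and comparable sentence vectors are available. The sketch below assumes both are given; the lexicons, the vectors, and the function names are made up for illustration, and the source-side vector is assumed to be already expressed in the target vocabulary.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors in the same vocabulary."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def length_ratio(src_tokens, tgt_tokens):
    """Ratio of the two sentence lengths, in tokens."""
    return len(src_tokens) / len(tgt_tokens)

def word_translation_ratio(from_tokens, to_tokens, lexicon):
    """Share of words on one side with a translation present on the other side."""
    to_set = set(to_tokens)
    hits = sum(1 for w in from_tokens if lexicon.get(w, set()) & to_set)
    return hits / len(from_tokens)

src = ["congé", "de", "maternité", "en", "europe"]
tgt = ["maternity", "leave", "in", "europe"]
fr_en = {"congé": {"leave"}, "maternité": {"maternity"},
         "en": {"in"}, "europe": {"europe"}}
en_fr = {"maternity": {"maternité"}, "leave": {"congé", "laisser"},
         "in": {"en", "dans"}, "europe": {"europe"}}
src_vec = {"maternity": 0.6, "leave": 0.3, "europe": 0.5}  # assumed translated vector
tgt_vec = {"maternity": 0.7, "leave": 0.4, "europe": 0.5}

features = {
    "cosine": cosine(src_vec, tgt_vec),
    "length_ratio": length_ratio(src, tgt),
    "src_to_tgt": word_translation_ratio(src, tgt, fr_en),
    "tgt_to_src": word_translation_ratio(tgt, src, en_fr),
}
print(features)
```

The word translation ratio is computed in both directions because the slide defines it for source (target) words with a translation in the target (source) sentence.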
Bitext Extraction Algorithm • Input: cross-lingual document pairs (source document, target document) • MAP: sentence detection + tf-idf → sentences and sent. vectors • shuffle & sort: 1.3 hours • REDUCE: cartesian product (X) → candidate sentence pairs • candidate generation: 2.4 hours • simple classification: 4.1 hours • complex classification: 0.5 hours • sentence pair counts shown in the figure: 400 billion, 214 billion, 132 billion • outputs: bitext S1, bitext S2
Conclusions (Part I) • Summary • Scalable approach to extract parallel text from a comparable corpus • Improvements over state-of-the-art MT baseline • General algorithm applicable to any data format • Future work • Domain adaptation • Experimenting with larger web collections
Cross-Language Information Retrieval • query → (ranked) documents • Information Retrieval (IR): Given an information need, find relevant material. • Cross-language IR (CLIR): query and documents in different languages • “Why does China want to import technology to build Maglev Railway?” • relevant information in Chinese documents • “Maternal Leave in Europe” • relevant information in French, Spanish, German, etc.
Machine Translation for CLIR • STATISTICAL MT SYSTEM: sentence-aligned parallel corpus → token aligner → token alignments → grammar extractor → translation grammar → decoder (with language model) • token alignments → token translation probabilities • query “maternal leave in Europe” → decoder → n best translations → 1-best translation “congé de maternité en Europe”
Token-based CLIR • Aligned sentences: “… most leave their children in …” ↔ “… la plupart laisse leurs enfants …” • “... aim of extending maternity leave to …” ↔ “… l’objectif de l’extension des congé de maternité à …” • Token-based probabilities • [Token translation formula]
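A simple way to turn token alignments into translation probabilities is relative frequency over alignment links. The sketch below assumes that estimator; the link list is a toy example loosely based on the two sentence pairs above, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def token_translation_probs(alignment_links):
    """alignment_links: iterable of (source_token, target_token) pairs.
    Returns Pr(target | source) as relative frequencies of the links."""
    counts = defaultdict(Counter)
    for e, f in alignment_links:
        counts[e][f] += 1
    return {e: {f: c / sum(fs.values()) for f, c in fs.items()}
            for e, fs in counts.items()}

# Toy alignment links (assumed), as a token aligner might emit them.
links = [
    ("leave", "laisse"), ("children", "enfants"),   # "... most leave their children ..."
    ("leave", "congé"), ("maternity", "maternité"), # "... maternity leave ..."
    ("leave", "laisse"),
]
probs = token_translation_probs(links)
print(probs["leave"])
```

Because every occurrence of a word is pooled regardless of its sentence, the resulting distribution keeps all senses of an ambiguous word such as "leave".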
Token-based CLIR • Query: “Maternal leave in Europe” • Candidate translations of “leave”: laisser (Eng. forget) 49% • congé (Eng. time off) 17% • quitter (Eng. quit) 9% • partir (Eng. disappear) 7% • …
Document Retrieval • Query q1: “maternal leave in Europe” • Documents d1, … • maternal → [maternité: 0.74, maternel: 0.26] • How to score a document, given a query? • Available statistics: tf(maternité), tf(maternel), df(maternité), df(maternel), …
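One established answer to this question is to take the probability-weighted sum of the target-language statistics for each query term, in the spirit of probabilistic structured queries. The sketch below assumes that approach and a plain tf-idf style score; the document statistics are made up, and only the translation distribution for "maternal" comes from the slide.

```python
import math

def translated_tf(translations, doc_tf):
    """Expected term frequency of a query term: sum of Pr(f|e) * tf(f, d)."""
    return sum(p * doc_tf.get(f, 0) for f, p in translations.items())

def translated_df(translations, df):
    """Expected document frequency of a query term: sum of Pr(f|e) * df(f)."""
    return sum(p * df.get(f, 0) for f, p in translations.items())

def clir_score(query_translations, doc_tf, df, n_docs):
    """Assumed tf-idf style score over the translated statistics."""
    score = 0.0
    for translations in query_translations.values():
        tf = translated_tf(translations, doc_tf)
        d = translated_df(translations, df)
        if tf > 0 and d > 0:
            score += tf * math.log((n_docs + 1) / d)
    return score

query_translations = {"maternal": {"maternité": 0.74, "maternel": 0.26}}
doc_tf = {"maternité": 2, "maternel": 1}   # hypothetical counts in document d1
df = {"maternité": 10, "maternel": 50}     # hypothetical collection statistics

print(translated_tf(query_translations["maternal"], doc_tf))
print(translated_df(query_translations["maternal"], df))
print(clir_score(query_translations, doc_tf, df, n_docs=1000))
```

With this formulation, any of the translation distributions discussed in the talk can be plugged in without changing the retrieval function.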
Context-Sensitive CLIR • Query: “Maternal leave in Europe” • Candidate translations of “leave” (token-based → context-sensitive): laisser (Eng. forget) 49% → 12% • congé (Eng. time off) 17% → 70% • quitter (Eng. quit) 9% → 6% • partir (Eng. disappear) 7% → 5% • … • This talk: MT for context-sensitive CLIR
Previous approach: MT as black box • Previous approach: Token-based CLIR • Our approach: Looking inside the box • STATISTICAL MT SYSTEM: sentence-aligned parallel corpus → token aligner → token alignments → grammar extractor → translation grammar → decoder (with language model) • token alignments → token translation probabilities • query “maternal leave in Europe” → decoder → n best derivations → 1-best translation “congé de maternité en Europe”
MT for Context-Sensitive CLIR • MT: sentence-aligned parallel corpus → token aligner → token alignments → grammar extractor → translation grammar → decoder (with language model) • token alignments → token translation probabilities • query “maternal leave in Europe” → decoder → n best translations → 1-best translation “congé de maternité en Europe”
CLIR from translation grammar • Synchronous Context-Free Grammar (SCFG) [Chiang, 2007] • S → [X : X], 1.0 • X → [X1 leave in europe : congé de X1 en europe], 0.9 • X → [maternal : maternité], 0.9 • X → [X1 leave : congé de X1], 0.74 • X → [leave : congé], 0.17 • X → [leave : laisser], 0.49 • ... • Synchronous hierarchical derivation: S1 → X1; X1 → [X2 leave in Europe : congé de X2 en Europe]; X2 → [maternal : maternité] • Grammar-based probabilities • [Token translation formula]
CLIR from n-best derivations • t(1): { 1st best derivation, 0.8 } • t(2): { 2nd best derivation, 0.11 } • . . . • t(k): { kth best derivation, score(t(k)|s) } • [Figure: derivation trees for “maternal leave in Europe” / “congé de maternité en Europe”] • Translation-based probabilities • [Token translation formula]
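One plausible way to read token translation probabilities off an n-best list is to credit each aligned token pair with the score of the derivation it occurs in and then normalize per source token. The sketch below assumes that scheme. The two scores 0.8 and 0.11 come from the slide; the token alignments, including the content of the second derivation, are hypothetical.

```python
from collections import defaultdict

def nbest_translation_probs(derivations):
    """derivations: list of (score, [(source_token, target_token), ...]).
    Accumulates derivation scores per aligned pair, then normalizes per source token."""
    weights = defaultdict(lambda: defaultdict(float))
    for score, links in derivations:
        for e, f in links:
            weights[e][f] += score
    return {e: {f: w / sum(fs.values()) for f, w in fs.items()}
            for e, fs in weights.items()}

derivations = [
    (0.8,  [("maternal", "maternité"), ("leave", "congé"), ("europe", "europe")]),
    (0.11, [("maternal", "maternité"), ("leave", "laisser"), ("europe", "europe")]),
]
probs = nbest_translation_probs(derivations)
print(probs["leave"])
```

Since every derivation translates the query as a whole, a word's distribution here reflects only the translations that fit this particular query context.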
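MT for Context-Sensitive CLIR • MT pipeline: sentence-aligned bitext → token alignments → translation grammar → n best derivations → 1-best translation • token based: Prtoken (from token alignments) • grammar based: PrSCFG (from translation grammar) • translation based: Prnbest (from n best derivations) • 1-best MT (from 1-best translation) • Ambiguity preserved (token-based end) ↔ Context sensitivity (1-best MT end)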
Combining Evidence • Translation distributions for “leave” (one each from PrSCFG, Prnbest, Prtoken), with interpolation weights: • laisser 0.72, congé 0.10, quitter 0.09, … (weight 35%) • laisser 0.14, congé 0.70, quitter 0.06, … (weight 40%) • laisser 0.09, congé 0.90, quitter 0.11, … (weight 25%) • For best results, we compute an interpolated probability distribution Printerp: leave → laisser 0.33, congé 0.54, quitter 0.08, …
Combining Evidence • Same three distributions for “leave”, with all weight on the first: • laisser 0.72, congé 0.10, quitter 0.09, … (weight 100%) • laisser 0.14, congé 0.70, quitter 0.06, … (weight 0%) • laisser 0.09, congé 0.90, quitter 0.11, … (weight 0%) • Printerp then equals the first distribution: leave → laisser 0.72, congé 0.10, quitter 0.09, …
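The interpolation is a weighted sum of the three distributions, and the numbers on the two slides above can be checked directly. In this sketch the function name is an assumption, and the three inputs are left unlabeled because the transcript does not preserve which row belongs to PrSCFG, Prnbest, or Prtoken.

```python
def interpolate(distributions, weights):
    """Linear interpolation of several translation distributions."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    terms = set().union(*distributions)
    return {t: sum(w * d.get(t, 0.0) for d, w in zip(distributions, weights))
            for t in terms}

# Values for "leave" as shown on the slides (truncated distributions).
first = {"laisser": 0.72, "congé": 0.10, "quitter": 0.09}
second = {"laisser": 0.14, "congé": 0.70, "quitter": 0.06}
third = {"laisser": 0.09, "congé": 0.90, "quitter": 0.11}

mixed = interpolate([first, second, third], [0.35, 0.40, 0.25])
print({t: round(p, 2) for t, p in sorted(mixed.items())})
# {'congé': 0.54, 'laisser': 0.33, 'quitter': 0.08}

only_first = interpolate([first, second, third], [1.0, 0.0, 0.0])
print(only_first == first)  # True
```

Setting one weight to 100% recovers a single model, so the individual approaches are special cases of the combined one.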
Combining Evidence For best results, we compute an interpolated probability distribution: