Ferhan Ture Dissertation defense May 24th, 2013

“Searching to Translate” and “Translating to Search”: When Information Retrieval Meets Machine Translation • Ferhan Ture • Dissertation defense • May 24th, 2013 • Department of Computer Science, University of Maryland at College Park

Motivation • forum posts • multi-lingual text • clustered summaries • user’s native language • Fact 1: People want to access information e.g., web pages, videos, restaurants, products, … • Fact 2: Lots of data out there … but also lots of noise, redundancy, different languages • Goal: Find ways to efficiently and effectively • Search complex, noisy data • Deliver content in appropriate form

Information Retrieval • Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). In our work, we assume that the material is a collection of documents written in natural language, and the information need is provided in the form of a query, ranging from a few words to an entire document. A typical approach in IR is to represent each document as a vector of weighted terms, where a term usually means either a word or its stem. A pre-determined list of stop words (e.g., "the", "an", "my") may be removed from the set of terms, since they have been found to create noise in the search process. Documents are scored relative to the query, usually by scoring each query term independently and aggregating these term-document scores. • Weighted term vector of this passage (stemmed terms): queri:11.69, ir:11.39, vector:7.93, document:7.09, nois:6.92, stem:6.56, score:5.68, weight:5.67, word:5.46, materi:5.42, search:5.06, term:5.03, text:4.87, comput:4.73, need:4.61, collect:4.48, natur:4.36, languag:4.12, find:3.92, repres:3.58 • [Slide also shows the same passage after stop word removal and stemming.]
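The term-vector representation and term-by-term scoring described on this slide can be sketched in a few lines. This is a minimal illustration, not the weighting used in the talk: the stop word list, the tf-idf formula (tf × log(N/df)), the absence of stemming, and the three toy documents are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop word list; the slide only names "the", "an", "my" as examples.
STOP_WORDS = {"the", "an", "my", "a", "of", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split on non-letters, drop stop words (no stemming in this sketch)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def build_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def score(query, doc_vector):
    """Score each query term independently and sum the term-document scores."""
    return sum(doc_vector.get(t, 0.0) for t in tokenize(query))

docs = [
    "Information retrieval is finding documents that satisfy an information need.",
    "Machine translation translates text into a target language.",
    "The query is a few words; the document is a vector of weighted terms.",
]
vectors = build_vectors(docs)
scores = [score("information retrieval", v) for v in vectors]
```

Only the first toy document contains the query terms, so it is the only one with a non-zero score.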

Cross-Language Information Retrieval • Example document in German: Information Retrieval (IR) bzw. Informationsrückgewinnung, gelegentlich ungenau Informationsbeschaffung, ist ein Fachgebiet, das sich mit computergestütztem Suchen nach komplexen Inhalten (also z. B. keine Einzelwörter) beschäftigt und in die Bereiche Informationswissenschaft, Informatik und Computerlinguistik fällt. Wie aus der Wortbedeutung von retrieval (deutsch Abruf, Wiederherstellung) hervorgeht, sind komplexe Texte oder Bilddaten, die in großen Datenbanken gespeichert werden, für Außenstehende zunächst nicht zugänglich oder abrufbar. Beim Information Retrieval geht es darum bestehende Informationen aufzufinden, nicht neue Strukturen zu entdecken (wie beim Knowledge Discovery in Databases, zu dem das Data Mining und Text Mining gehören). • (English: Information retrieval (IR), occasionally and imprecisely called information procurement, is a field concerned with computer-assisted search for complex content (i.e., not single words) that falls within information science, computer science, and computational linguistics. As the meaning of the word "retrieval" suggests, complex texts or image data stored in large databases are initially not accessible or retrievable for outsiders. Information retrieval is about finding existing information, not discovering new structures (as in knowledge discovery in databases, which includes data mining and text mining).)

Machine Translation • Machine translation (MT) translates text written in a source language into corresponding text in a target language. • In German: Maschinelle Übersetzung (MT) ist, um Text in einer Ausgangssprache in entsprechenden Text in der Zielsprache geschrieben übersetzen.

Motivation Cross-language IR • multi-lingual text • user’s native language • MT • Fact 1: People want to access information e.g., web pages, videos, restaurants, products, … • Fact 2: Lots of data out there … but also lots of noise, redundancy, different languages • Goal: Find ways to efficiently and effectively • Search complex, noisy data • Deliver content in appropriate form

Outline • Introduction • Searching to Translate (IR → MT) • Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR’11) • Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL’12) • Translating to Search (MT → IR) • Context-Sensitive Query Translation (Ture et al., SIGIR’12), (Ture et al., COLING’12), (Ture and Lin, SIGIR’13) • Conclusions

Extracting Parallel Text from the Web • Phase 1: source collection F → Preprocess → doc vectors F → Signature Generation → signatures F; target collection E → Preprocess → doc vectors E → Signature Generation → signatures E; signatures F and E → Sliding Window Algorithm → cross-lingual document pairs • Phase 2: cross-lingual document pairs → Candidate Generation → candidate sentence pairs → 2-step Parallel Text Classifier → aligned bilingual sentence pairs (F-E parallel text)

Pairwise Similarity • Pairwise similarity: finding similar pairs of documents in a large collection • Challenges: quadratic search space; measuring similarity effectively and efficiently • Focus on recall and scalability

Locality-Sensitive Hashing • Ne English articles → Preprocess → Ne English document vectors, e.g. <nobel=0.324, prize=0.227, book=0.01, …> → Signature generation → Ne signatures, e.g. [0111000010...] → Sliding window algorithm → Similar article pairs

Locality-Sensitive Hashing (Ravichandran et al., 2005) • LSH(vector) = signature, s.t. similarity(vector pair) ≈ similarity(signature pair) • faster similarity computation, e.g., ~20 times faster than computing (cosine) similarity from vectors, with similarity error ≈ 0.03 • Sliding window algorithm: approximate similarity search based on LSH; linear run-time
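One way to obtain signatures with the property similarity(vector pair) ≈ similarity(signature pair) is random-projection LSH, where each bit records on which side of a random hyperplane the vector falls. The sketch below assumes that variant; the slide does not say which hash family was used, and the dense toy vectors, the seed, and the function names are illustrative only.

```python
import math
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random Gaussian hyperplanes (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else 0
        for h in hyperplanes
    ]

def hamming(sig_a, sig_b):
    """Number of positions at which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

def estimated_cosine(sig_a, sig_b):
    """The fraction of differing bits estimates angle / pi between the vectors."""
    return math.cos(math.pi * hamming(sig_a, sig_b) / len(sig_a))

def cosine(u, v):
    """Exact cosine similarity, for comparison."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

u = [0.3, 0.2, 0.0, 0.5, 0.1]
v = [0.25, 0.1, 0.05, 0.6, 0.0]
planes = make_hyperplanes(1000, len(u))
sig_u, sig_v = signature(u, planes), signature(v, planes)
approx, exact = estimated_cosine(sig_u, sig_v), cosine(u, v)
```

Comparing two bit strings costs a Hamming distance rather than a sparse dot product, which is where the speed-up comes from; the price is the approximation error in `approx` versus `exact`.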

Sliding window algorithm: Generating tables • Signatures: …, (1,11011011101), (2,01110000101), (3,10101010000), … • Map: permute every signature with each of the permutations p1 … pQ, giving list1 … listQ, e.g. list1: …, (11111101010,1), (10011000110,2), (01100100100,3), … and listQ: …, (11111001011,1), (00101001110,2), (10010000101,3), … • Reduce: sort each list, giving table1 … tableQ, e.g. table1: …, (01100100100,3), (10011000110,2), (11111101010,1), … and tableQ: …, (00101001110,2), (10010000101,3), (11111001011,1), …

Sliding window algorithm: Detecting similar pairs • Map: a window slides over each sorted table (table1 … tableQ), e.g. over the sorted signatures 00000110101, 00010001111, 00100101101, 00110000000, 00110010000, 00110011111, 00110101000, 00111010010, 10010011011, 10010110011

Sliding window algorithm: Example • # tables = 2, window size = 2, # bits = 11 • Signatures: (1,11011011101), (2,01110000101), (3,10101010000) • Map (permute with p1): list1 = (<1,11111101010>,1), (<1,10011000110>,2), (<1,01100100100>,3) • Reduce (sort): table1 = (<1,01100100100>,3), (<1,10011000110>,2), (<1,11111101010>,1) • table1 comparisons: Distance(3,2) = 7 ✗, Distance(2,1) = 5 ✓ • Map (permute with p2): list2 = (<2,11111001011>,1), (<2,00101001110>,2), (<2,10010000101>,3) • Reduce (sort): table2 = (<2,00101001110>,2), (<2,10010000101>,3), (<2,11111001011>,1) • table2 comparisons: Distance(2,3) = 7 ✗, Distance(3,1) = 6 ✓
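The example on this slide can be replayed with a short single-machine sketch of the sliding window algorithm (no MapReduce). The three signatures are taken from the slide. The two permutations `P1` and `P2` are one choice of index orderings that reproduces the permuted bit strings shown on the slide, and the distance threshold of 6 is an assumption inferred from which comparisons are marked ✓ and ✗; neither is stated in the source.

```python
def permute(bits, perm):
    """Reorder a bit string: output position k takes input position perm[k]."""
    return "".join(bits[i] for i in perm)

def hamming(a, b):
    """Number of differing positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def sliding_window(signatures, perms, window, max_distance):
    """signatures: {doc_id: bit string}; perms: one index list per table."""
    found = {}
    for perm in perms:
        # "Map": permute every signature; "Reduce": sort the permuted list.
        table = sorted((permute(sig, perm), doc_id) for doc_id, sig in signatures.items())
        # Compare each entry only with the next (window - 1) entries.
        for i, (sig_i, id_i) in enumerate(table):
            for sig_j, id_j in table[i + 1 : i + window]:
                d = hamming(sig_i, sig_j)
                if d <= max_distance:
                    found[tuple(sorted((id_i, id_j)))] = d
    return found

signatures = {1: "11011011101", 2: "01110000101", 3: "10101010000"}
# Assumed permutations (0-based source indices) consistent with the slide's lists.
P1 = [1, 0, 4, 3, 8, 6, 5, 7, 2, 10, 9]
P2 = [0, 7, 1, 4, 3, 5, 9, 8, 2, 10, 6]
pairs = sliding_window(signatures, [P1, P2], window=2, max_distance=6)
```

Sorting places signatures that share a long prefix next to each other, so each table only needs a linear pass; using several permutations gives pairs that differ in early bits another chance to become neighbors.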

Cross-lingual Pairwise Similarity • MT approach: German Doc A → MT translate → English Doc A → doc vector vA; English Doc B → doc vector vB • CLIR approach: German Doc A → doc vector vA → CLIR translate → English doc vector vA; English Doc B → doc vector vB

MT vs. CLIR for Pairwise Similarity • [Plot of similarity values; series: clir-neg, clir-pos, mt-neg, mt-pos] • positive and negative pairs clearly separated • low similarity values • MT slightly better than CLIR, but 600 times slower!

Locality-Sensitive Hashing for Pairwise Similarity • Ne English articles → Preprocess → Ne English document vectors, e.g. <nobel=0.324, prize=0.227, book=0.01, …> → Signature generation → Ne signatures, e.g. [0111000010...] → Sliding window algorithm → Similar article pairs

Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity • Nf German articles → CLIR Translate; Ne English articles → Preprocess • together: Ne+Nf English document vectors, e.g. <nobel=0.324, prize=0.227, book=0.01, …> → Signature generation → signatures, e.g. [0111000010...] → Sliding window algorithm → Similar article pairs

Evaluation • Experiments with De/Es/Cs/Ar/Zh/Tr to En Wikipedia • Collection: 3.44m En + 1.47m De Wikipedia articles • Task: For each German Wikipedia article, find: {all English articles s.t. cosine similarity > 0.30} • Parameters: # bits (D) = 1000, # tables (Q) = 100-1500, window size (B) = 100-2000

Scalability

Evaluation • Algorithm output: document vectors → Signature generation → Signatures → Sliding window algorithm → Similar article pairs • Upper bound: Signatures → Brute-force approach → Similar article pairs • Ground truth: document vectors → Brute-force approach → Similar article pairs • two sources of error

Evaluation • 100% recall: no savings = no free lunch! • 95% recall: 39% cost • 99% recall: 70% cost • 99% recall: 62% cost • 95% recall: 40% cost

Phase 2: Extracting Parallel Text • Approach: generate candidate sentence pairs from each document pair; classify each candidate as ‘parallel’ or ‘not parallel’ • Challenge: tens of millions of document pairs ≈ hundreds of billions of sentence pairs • Solution: 2-step classification approach: a simple classifier efficiently filters out irrelevant pairs; a complex classifier effectively classifies the remaining pairs

Parallel Text (Bitext) Classifier • cosine similarity of the two sentences • sentence length ratio: the ratio of lengths of the two sentences • word translation ratio: ratio of words in source (target) sentence with a translation in target (source) sentence
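The three features listed above, combined with a cheap-filter-then-careful-check cascade, can be sketched as follows. This is only an illustration of the idea: the thresholds, the rule-based second step (standing in for a trained classifier), the tiny German-English lexicons, and the assumption that the source sentence vector has already been projected into the target vocabulary are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def translation_ratio(tokens, other_tokens, lexicon):
    """Fraction of tokens with at least one translation in the other sentence."""
    if not tokens:
        return 0.0
    other = set(other_tokens)
    return sum(1 for t in tokens if lexicon.get(t, set()) & other) / len(tokens)

def features(src, tgt, src_vec, tgt_vec, src2tgt, tgt2src):
    """The slide's features: cosine, length ratio, translation ratio in both directions."""
    return {
        "cosine": cosine(src_vec, tgt_vec),
        "length_ratio": len(src) / max(len(tgt), 1),
        "src_ratio": translation_ratio(src, tgt, src2tgt),
        "tgt_ratio": translation_ratio(tgt, src, tgt2src),
    }

def is_parallel(src, tgt, src_vec, tgt_vec, src2tgt, tgt2src,
                min_cosine=0.3, min_ratio=0.5):
    # Step 1 (simple, cheap): discard pairs with low cosine similarity.
    if cosine(src_vec, tgt_vec) < min_cosine:
        return False
    # Step 2 (hypothetical stand-in for the complex classifier): check all features.
    f = features(src, tgt, src_vec, tgt_vec, src2tgt, tgt2src)
    return (0.5 <= f["length_ratio"] <= 2.0
            and f["src_ratio"] >= min_ratio and f["tgt_ratio"] >= min_ratio)

src = ["das", "haus", "ist", "klein"]
tgt_good = ["the", "house", "is", "small"]
tgt_bad = ["the", "weather", "is", "nice"]
de2en = {"das": {"the"}, "haus": {"house"}, "ist": {"is"}, "klein": {"small"}}
en2de = {"the": {"das"}, "house": {"haus"}, "is": {"ist"}, "small": {"klein"}}
src_vec = {"house": 1.0, "small": 1.0}  # source sentence projected into English terms
good = is_parallel(src, tgt_good, src_vec, {"house": 1.0, "small": 1.0}, de2en, en2de)
bad = is_parallel(src, tgt_bad, src_vec, {"weather": 1.0, "nice": 1.0}, de2en, en2de)
```

The point of the cascade is that the first test needs only vectors that are already available, so the more expensive features are computed for far fewer pairs.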

Bitext Extraction Algorithm (MapReduce) • Input: cross-lingual document pairs (source document, target document) • sentence detection + tf-idf → sentences and sentence vectors • candidate generation (cartesian product): 2.4 hours • shuffle & sort: 1.3 hours • simple classification: 4.1 hours → bitext S1 • complex classification: 0.5 hours → bitext S2 • Sentence-pair counts shown on the slide: 400 billion, 214 billion, 132 billion

Extracting Bitext from Wikipedia

Evaluation on MT

Conclusions (Part I) • Summary • Scalable approach to extract parallel text from a comparable corpus • Improvements over state-of-the-art MT baseline • General algorithm applicable to any data format • Future work • Domain adaptation • Experimenting with larger web collections

Cross-Language Information Retrieval • query → (ranked) documents • Information Retrieval (IR): given an information need, find relevant material. • Cross-language IR (CLIR): query and documents in different languages • “Why does China want to import technology to build Maglev Railway?”: relevant information in Chinese documents • “Maternal Leave in Europe”: relevant information in French, Spanish, German, etc.

Machine Translation for CLIR • STATISTICAL MT SYSTEM: sentence-aligned parallel corpus → token aligner → token alignments (→ token translation probabilities) → grammar extractor → translation grammar → decoder (with language model) • query “maternal leave in Europe” → decoder → n best translations → 1-best translation “congé de maternité en Europe”

Token-based CLIR • Aligned sentence pairs: “… most leave their children in …” ↔ “… la plupart laisse leurs enfants …”; “... aim of extending maternity leave to …” ↔ “… l’objectif de l’extension des congé de maternité à …”; . . . • Token-based probabilities • Token translation formula

Token-based CLIR • Query: “Maternal leave in Europe” • Translations of “leave”: laisser (Eng. forget) 49% • congé (Eng. time off) 17% • quitter (Eng. quit) 9% • partir (Eng. disappear) 7% • …

Document Retrieval • Query q1: “maternal leave in Europe” • Document d1 • How to score a document, given a query? • Translations of “maternal”: [maternité: 0.74, maternel: 0.26] • Statistics involved: tf(maternité), tf(maternel), df(maternité), df(maternel), …
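One common way to score a document against a translated query term is to project the term and document frequencies of the translations back onto the query term, weighted by translation probability (in the style of probabilistic structured queries). The slide does not state its scoring formula, so the sketch below is an assumption; the translation probabilities come from the slide, while the counts, the collection size, and the tf × log(N/df) score are hypothetical.

```python
import math

def project(translations, stats):
    """Expected statistic of a query term: sum over translations f of Pr(f|e) * stat(f)."""
    return sum(p * stats.get(f, 0) for f, p in translations.items())

# Translation distribution for "maternal", as shown on the slide.
translations = {"maternité": 0.74, "maternel": 0.26}

# Hypothetical counts for one French document and for the whole collection.
doc_tf = {"maternité": 3, "maternel": 1}
collection_df = {"maternité": 120, "maternel": 40}
n_docs = 1000

tf = project(translations, doc_tf)         # 0.74 * 3 + 0.26 * 1
df = project(translations, collection_df)  # 0.74 * 120 + 0.26 * 40
term_score = tf * math.log(n_docs / df)    # simple tf-idf stand-in for the ranking function
```

Because every translation contributes in proportion to its probability, the quality of the translation distribution directly determines the quality of the projected tf and df values.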

Context-Sensitive CLIR • Query: “Maternal leave in Europe” • Translations of “leave”, token-based → context-sensitive: laisser (Eng. forget) 49% → 12% • congé (Eng. time off) 17% → 70% • quitter (Eng. quit) 9% → 6% • partir (Eng. disappear) 7% → 5% • … • This talk: MT for context-sensitive CLIR

Previous approach: MT as black box • Previous approach: Token-based CLIR • Our approach: Looking inside the box • STATISTICAL MT SYSTEM: sentence-aligned parallel corpus → token aligner → token alignments (→ token translation probabilities) → grammar extractor → translation grammar → decoder (with language model) • query “maternal leave in Europe” → decoder → n best derivations → 1-best translation “congé de maternité en Europe”

MT for Context-Sensitive CLIR • sentence-aligned parallel corpus → token aligner → token alignments (→ token translation probabilities) → grammar extractor → translation grammar → decoder (with language model) • query “maternal leave in Europe” → decoder → n best translations → 1-best translation “congé de maternité en Europe”

CLIR from translation grammar • Synchronous Context-Free Grammar (SCFG) [Chiang, 2007]: S → [X : X], 1.0 • X → [X1 leave in europe : congé de X1 en europe], 0.9 • X → [maternal : maternité], 0.9 • X → [X1 leave : congé de X1], 0.74 • X → [leave : congé], 0.17 • X → [leave : laisser], 0.49 • ... • Synchronous hierarchical derivation: S1 → X1; X1 → [X2 leave in Europe : congé de X2 en Europe]; X2 → [maternal : maternité] • Grammar-based probabilities • Token translation formula

CLIR from n-best derivations • t(1): { 1st best derivation, 0.8 }, e.g. the derivation of “congé de maternité en Europe” from “maternal leave in Europe” • t(2): { 2nd best derivation, 0.11 } • . . . • t(k): { kth best derivation, score(t(k)|s) } • Translation-based probabilities • Token translation formula
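A plausible way to turn an n-best list into token translation probabilities is to let each derivation vote for its aligned token pairs with a weight equal to its score, then normalize per source token. The slide's own formula is not recoverable from the transcript, so this is a sketch under that assumption. The scores 0.8 and 0.11 come from the slide; the token alignments, and the entire content of the second derivation, are hypothetical.

```python
from collections import defaultdict

def nbest_token_probs(derivations):
    """derivations: list of (score, [(source_token, target_token), ...])."""
    mass = defaultdict(lambda: defaultdict(float))
    for score, aligned_pairs in derivations:
        for src, tgt in aligned_pairs:
            mass[src][tgt] += score  # each derivation votes with its score
    # Normalize so that the probabilities for every source token sum to 1.
    return {
        src: {tgt: m / sum(tgts.values()) for tgt, m in tgts.items()}
        for src, tgts in mass.items()
    }

derivations = [
    # Best derivation: "congé de maternité en Europe" (alignments assumed).
    (0.8, [("maternal", "maternité"), ("leave", "congé"), ("europe", "europe")]),
    # Hypothetical second-best derivation, for illustration only.
    (0.11, [("maternal", "maternel"), ("leave", "laisser"), ("europe", "europe")]),
]
probs = nbest_token_probs(derivations)
```

Since the decoder translated the whole query, the votes already reflect context: "leave" next to "maternal" mostly ends up as "congé" rather than "laisser".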

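MT for Context-Sensitive CLIR • MT pipeline: sentence-aligned bitext → token alignments → translation grammar → n best derivations → 1-best translation • CLIR models along the pipeline: Prtoken (token based, from token alignments) • PrSCFG (grammar based, from translation grammar) • Prnbest (translation based, from n best derivations) • 1-best MT (from 1-best translation) • Trade-off: context sensitivity vs. ambiguity preserved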

Combining Evidence • For best results, we compute an interpolated probability distribution from PrSCFG, Prnbest, and Prtoken. • Example for “leave”: 35% × (laisser 0.72, congé 0.10, quitter 0.09, …) + 40% × (laisser 0.14, congé 0.70, quitter 0.06, …) + 25% × (laisser 0.09, congé 0.90, quitter 0.11, …) • Printerp: leave → laisser 0.33, congé 0.54, quitter 0.08, …
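The interpolation on this slide is a weighted average of the three distributions. The sketch below uses the slide's numbers for "leave" with the weights 35%, 40%, and 25%, paired with the distributions in the one order that reproduces the slide's interpolated values (0.33 and 0.54; the slide's "0.8" for quitter comes out as 0.083). The transcript does not preserve which distribution belongs to which model, so the variable names `dist_a`, `dist_b`, `dist_c` are deliberately neutral.

```python
def interpolate(distributions, weights):
    """Weighted average of several {translation: probability} distributions."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    terms = set().union(*distributions)
    return {
        t: sum(w * d.get(t, 0.0) for w, d in zip(weights, distributions))
        for t in terms
    }

# The three distributions for "leave" as listed on the slide.
dist_a = {"laisser": 0.72, "congé": 0.10, "quitter": 0.09}
dist_b = {"laisser": 0.14, "congé": 0.70, "quitter": 0.06}
dist_c = {"laisser": 0.09, "congé": 0.90, "quitter": 0.11}

pr_interp = interpolate([dist_a, dist_b, dist_c], [0.35, 0.40, 0.25])
```

Setting one weight to 1 and the others to 0 recovers that single model, so each individual approach is a special case of the interpolation.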

Combining Evidence • For best results, we compute an interpolated probability distribution from PrSCFG, Prnbest, and Prtoken. • Distributions for “leave”: (laisser 0.72, congé 0.10, quitter 0.09, …), (laisser 0.14, congé 0.70, quitter 0.06, …), (laisser 0.09, congé 0.90, quitter 0.11, …) • With weights of 100%, 0%, and 0%, the interpolation reduces to a single model: Printerp: leave → laisser 0.72, congé 0.10, quitter 0.09, …

Combining Evidence • For best results, we compute an interpolated probability distribution: Printerp(f|e) = λ1 Prtoken(f|e) + λ2 PrSCFG(f|e) + λ3 Prnbest(f|e), with λ1 + λ2 + λ3 = 1
