Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling
Ferhan Ture and Jimmy Lin
University of Maryland, College Park
NAACL-HLT'12, June 6, 2012
Extracting Bilingual Text
Problem: Mine bitext from comparable corpora
Application: Improve the quality of MT models
Approach:
• Phase 1: identify similar document pairs from comparable corpora
• Phase 2: generate candidate sentence pairs, then classify each candidate as 'parallel' or 'not parallel'
Extracting Bilingual Text
Phase 1: "No Free Lunch: Brute Force vs. Locality-Sensitive Hashing for Cross-lingual Pairwise Similarity." Ture et al., SIGIR 2011.
[System diagram] Phase 1: signature generation over document vectors of the source-language collection F (1.5m German Wikipedia articles) and the target-language collection E (3.5m English Wikipedia articles) yields 64m cross-lingual German-English article pairs. Phase 2: candidate generation followed by a 2-step classifier turns these document pairs into aligned bilingual sentence pairs.
Extracting Bilingual Text
Challenge: 64m document pairs yield hundreds of billions of candidate sentence pairs
Solution: 2-step classification approach
• a simple classifier efficiently filters out irrelevant pairs
• a complex classifier effectively classifies the remaining pairs
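A minimal Python sketch of this cascade, only to make the control flow concrete: `simple_clf` and `complex_clf` are hypothetical objects exposing a `score()` method, and the default thresholds echo values reported later in the talk rather than anything prescribed here.

```python
def extract_bitext(candidates, simple_clf, complex_clf,
                   simple_threshold=0.98, complex_threshold=0.60):
    """Two-step cascade: a cheap classifier discards most candidate
    sentence pairs; only the survivors reach the expensive classifier."""
    bitext = []
    for pair in candidates:
        if simple_clf.score(pair) < simple_threshold:
            continue                       # step 1: cheap filter
        if complex_clf.score(pair) >= complex_threshold:
            bitext.append(pair)            # step 2: accurate decision
    return bitext
```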
Related Work
Extracting bitext from web pages (Resnik & Smith '03), news stories (Munteanu & Marcu '05), and Wikipedia articles (Smith et al. '10)
• no heuristics on document/time structure (i.e., generalizable)
• scalable implementation
Recent Google paper with similar motivation (Uszkoreit et al. '10)
• far fewer computational resources
• control over "efficiency vs. effectiveness"
• not simply "more data is better"
• significant results with much less data
Bitext Classifier Features
• cosine similarity of the two sentences s1 and s2: cos(u1, u2) = (u1 · u2) / (||u1|| ||u2||), where u1 and u2 are vector representations of s1 and s2
• sentence length ratio: the ratio of the lengths of the two sentences
• word translation ratio: the ratio of words in s1 that have translations in s2 (only considering translations with probability at least 0.01)
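A minimal Python sketch of the three features, assuming each sentence is tokenized and represented as a sparse term-weight dict, and that `lexicon` is a hypothetical dict mapping a source word to its translation probabilities:

```python
import math

def cosine_similarity(u1, u2):
    """Cosine between two sparse term-weight vectors (dicts)."""
    dot = sum(w * u2.get(t, 0.0) for t, w in u1.items())
    n1 = math.sqrt(sum(w * w for w in u1.values()))
    n2 = math.sqrt(sum(w * w for w in u2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def sentence_length_ratio(s1_tokens, s2_tokens):
    """Ratio of the shorter sentence length to the longer one."""
    a, b = len(s1_tokens), len(s2_tokens)
    return min(a, b) / max(a, b) if a and b else 0.0

def word_translation_ratio(s1_tokens, s2_tokens, lexicon, min_prob=0.01):
    """Fraction of s1 tokens with a translation (p >= min_prob) present in s2."""
    s2_set = set(s2_tokens)
    covered = sum(
        1 for w in s1_tokens
        if any(p >= min_prob and t in s2_set
               for t, p in lexicon.get(w, {}).items())
    )
    return covered / len(s1_tokens) if s1_tokens else 0.0
```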
Bitext Classifier Evaluation
• Maximum entropy classifier (OpenNLP-MaxEnt)
• Europarl v6 German-English corpus
• Trained on 1,000 parallel and 5,000 non-parallel pairs (sampled from all possible)
• Tested on 1,000 parallel and 999,000 non-parallel pairs (all possible)
• good out-of-domain performance: comparable with Smith et al. '10 and 4 times faster
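For illustration only, a tiny sketch of training such a binary classifier on the three features above; the paper uses OpenNLP-MaxEnt, whereas this sketch substitutes scikit-learn's LogisticRegression (an equivalent binary maximum-entropy model) and uses made-up feature rows:

```python
from sklearn.linear_model import LogisticRegression

# rows: [cosine similarity, sentence length ratio, word translation ratio]
X_train = [[0.41, 0.92, 0.73],   # a parallel-looking pair
           [0.05, 0.35, 0.10]]   # a clearly non-parallel pair
y_train = [1, 0]                 # 1 = parallel, 0 = not parallel

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
prob_parallel = clf.predict_proba([[0.30, 0.85, 0.60]])[0][1]
```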
MapReduce
• Easy-to-understand programming model for designing scalable and distributed algorithms
• Experiments on a Hadoop cluster: 12 nodes, each with 2 quad-core 2.2 GHz Intel processors and 24 GB RAM
Bitext Extraction Algorithm
[Dataflow diagram] Sentence detection + tf-idf turns each source article (ne, de) and target article (nf, df) into sentences and sentence vectors ({se}, {ve}) and ({sf}, {vf}).
• Candidate generation (2.4 hours): MAP <ne, de> ↦ <(ne, nf), ({se}', {ve}')> over the cross-lingual document pairs; shuffle & sort (1.25 hours); REDUCE <(ne, nf), ({se}', {ve}', {sf}', {vf}')> ↦ <(ne, nf), (se, sf)> takes the Cartesian product of the two sides' sentences
• Simple classification (4.13 hours) over the candidate vectors {(ve, vf)} yields bitextS1; complex classification (0.52 hours) yields bitextS2
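A rough, non-authoritative Python rendering of the mapper/reducer signatures above, outside Hadoop and only to make the data flow concrete; `paired_docs` and `detect_sentences` are hypothetical stand-ins for the phase-1 document pairs and the sentence detection + tf-idf step.

```python
def map_article(n, d, side, paired_docs, detect_sentences):
    """Emit an article's sentences and sentence vectors once per
    cross-lingual document pair (ne, nf) it participates in."""
    sents, vecs = detect_sentences(d)          # sentence detection + tf-idf
    for pair_key in paired_docs(n, side):      # document pairs from phase 1
        yield pair_key, (side, sents, vecs)

def reduce_pair(pair_key, values):
    """For one document pair, emit the Cartesian product of the two sides'
    sentences as candidate pairs; the simple and complex classifiers
    sketched earlier then run over these candidates."""
    grouped = {side: (sents, vecs) for side, sents, vecs in values}
    s_es, v_es = grouped["E"]
    s_fs, v_fs = grouped["F"]
    for s_e, v_e in zip(s_es, v_es):
        for s_f, v_f in zip(s_fs, v_fs):
            yield pair_key, ((s_e, v_e), (s_f, v_f))
```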
Evaluation on MT
• Train with GIZA++ and a Hiero-style SCFG
• Tune with MIRA on the WMT10 development set (2,525 sentences)
• Decode the test set (2,489 sentences) with cdec, using a 5-gram English LM (SRILM)
• Baseline system: all standard cdec features; 21.37 BLEU on the test set
• 5th out of 9 WMT10 teams with comparable results; the best teams use novel techniques to exploit specific aspects
• a strong and competitive baseline
End-to-End Experiments
Candidate generation, starting from the 64 million German-English article pairs from phase 1:
• 400 billion candidate sentence pairs
• 214 billion after requiring # terms ≥ 3 and sentence length ≥ 5
• 132 billion after requiring 1/2 < sentence length ratio < 2
[Chart] Bitext size in millions of sentence pairs (0, 3.1, 5.3, 8.1, 16.9) for the conditions compared: WMT10 training data only, random sampling of candidates, 1-step classification (simple > 0.986 or simple > 0.992), and 2-step classification (simple > 0.98 with complex > 0.60 or complex > 0.65).
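A small sketch of the candidate pre-filters listed above, assuming both sentences are already tokenized and interpreting "# terms" as distinct terms per sentence (an assumption); the thresholds are the ones on the slide.

```python
def passes_prefilters(tokens_e, tokens_f):
    """Cheap per-candidate filters applied before any classification."""
    for toks in (tokens_e, tokens_f):
        if len(set(toks)) < 3 or len(toks) < 5:   # >= 3 distinct terms, length >= 5
            return False
    ratio = len(tokens_e) / len(tokens_f)
    return 0.5 < ratio < 2.0                      # length ratio strictly between 1/2 and 2
```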
Evaluation on MT
• S2 > S1 consistently; random sampling > S2 in some settings, suggesting that low-scoring sentence pairs may still be helpful in MT
• turning point when the benefits of more data exceed the extra noise introduced
• 2.39 BLEU improvement over the baseline (baseline = 21.37)
Conclusions
• Built an approach to extract parallel sentences from freely available resources
• 5m extracted sentence pairs give the highest BLEU in WMT'10
• data-driven > task-specific engineering
• Why not grab a free lunch?
• We plan to extend to more language pairs and share our findings with the community
• All of our code and data is freely available
Thank you! Code: ivory.cc Data: www.github.com/ferhanture/WikiBitext