Large Scale Crawling the Web for Parallel Texts
Chikayama Taura lab.
M1 Dai Saito
Parallel Texts
• Parallel texts: translated pairs of multilingual texts
• Parallel corpus: a set of parallel texts
• Example
  • English: One thing was certain, that the WHITE kitten had had nothing to do with it. --it was the black kitten's fault entirely.
  • 日本語: 一つ確実なのは、白い子ネコはなんの関係もなかったということ。――もうなにもかも、黒い子ネコのせいだったのです。
Parallel Texts
• Useful resource for
  • Statistical machine translation
  • Dictionary construction
• But… existing corpora are small
  • Number
    • Not enough texts
    • Building them requires human effort
  • Language
    • English-French
  • Genre
    • Public documents
    • Software manuals
Parallel Texts from the Web
• Crawling parallel texts from the Web
  • A very large number of texts exist
  • Varied languages are used
  • Little human effort is needed
• Problems
  - How to detect parallel texts automatically
  - Calculation cost
Parallel Texts from the Web
[Figure: pages from the Web pass through two filtering steps (①, ②); pairs are labeled "not parallel", "maybe parallel", or "parallel texts".]
Agenda
• Introduction
• Related work
• Proposal
  • Detecting parallel texts
  • Large scale crawling
• Experiment
• Conclusion
STRAND [Resnik et al. 03]
• URL Matching
  • Remove language-specific substrings (LSSs) from URLs (Japanese: ja, jp, jpn, euc, sjis, …)
  • Match the URLs that become identical after LSS removal
  • Make a detailed comparison of the matched pages
• Example: http://www.hostname.com/index.html.en ⇔ http://www.hostname.com/index.html.ja
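The URL matching step above can be sketched as follows. This is a minimal illustration, not the STRAND implementation: the Japanese LSSs come from the slide, while the English entries (`en`, `eng`, `english`), the `japanese` entry, and the function names `strip_lss` and `match_urls` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Japanese LSSs are taken from the slide; "japanese" and the English
# entries are assumed here so that both sides of a pair can be normalized.
LSS = ["japanese", "english", "jpn", "eng", "sjis", "euc", "ja", "jp", "en"]

# Match an LSS only when it is not embedded in a longer alphanumeric token,
# so that "en" inside a longer word is left alone.
_LSS_RE = re.compile(r"(?<![a-z0-9])(?:%s)(?![a-z0-9])" % "|".join(LSS))


def strip_lss(url):
    """Return the URL with language-specific substrings removed."""
    return _LSS_RE.sub("", url.lower())


def match_urls(urls):
    """Group URLs that become identical once LSSs are removed."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_lss(url)].append(url)
    # Only groups with at least two URLs are candidate parallel pages.
    return [group for group in groups.values() if len(group) > 1]
```

Candidates returned by `match_urls` would still need the detailed content comparison, since matching URLs alone does not guarantee that the pages are translations of each other.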
URL Matching Experiment
• URL matching applied to the URLs of crawled pages
  • 90,000,000 URLs
  • English ⇔ Japanese
• Looking only at URLs
  • 90,000,000 → 4,000
  • Too strict?
  • Useless pages are included
• Examples: japanese.php / english.php, index.html.ja / index.html.en
DOM Tree Alignment [Lei et al. 06]
• Searching linked pages
  • “alt” attribute
  • Link name: “English version”, “In English”, etc.
• HTML → DOM Tree
• Parallel link: a pair of the same hyperlinks in parallel texts
Pros and Cons
• URL Matching
  ○ High speed and easy to implement
  × Small number of pages found
• DOM Tree
  ○ High accuracy and small storage
  × Slow execution
Detecting Parallel Texts
• [Fukushima 06]
• Reducing comparison cost
  • Without HTML information
  • word (noun) → semantic ID → comparison
Semantic ID Conversion
• Constructing a graph from dictionaries
• Treating Japanese and English texts on the same level
• Number of semantic IDs: about 10,000
[Figure: example semantic IDs. 1: Sense / 感覚 / 意味; 2: Movie / Film / 映画; 3: Hobby / Taste / 趣味 / 味]
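One way to read "constructing a graph from dictionaries" is to treat each dictionary entry as an edge between an English word and a Japanese word and give every connected group of words one ID. The sketch below makes that assumption explicit; the slide does not say how the graph is actually partitioned, and `build_semantic_ids` and the entry list are hypothetical.

```python
def build_semantic_ids(entries):
    """Assign one ID per connected component of the dictionary graph.

    `entries` is a list of (word_a, word_b) translation pairs.
    Assumption: words linked through any chain of entries share an ID.
    """
    parent = {}

    def find(word):
        # Union-find lookup with path halving.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    for a, b in entries:
        parent[find(a)] = find(b)

    ids = {}      # component root -> semantic ID
    result = {}   # word -> semantic ID
    for word in parent:
        root = find(word)
        result[word] = ids.setdefault(root, len(ids) + 1)
    return result


# Hypothetical entries, chosen to resemble the words shown on the slide.
ENTRIES = [
    ("sense", "感覚"), ("sense", "意味"),
    ("movie", "映画"), ("film", "映画"),
    ("hobby", "趣味"), ("taste", "趣味"), ("taste", "味"),
]
SEMANTIC_ID = build_semantic_ids(ENTRIES)
```

With these entries, "movie", "film" and 映画 end up with the same ID, so an English text and its Japanese translation map to overlapping ID sets.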
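Texts to Vector
• Dictionary lookup: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173, …
• Example sentence: 辞書を使ってテキストを数列に変える。 ("Convert a text into a number sequence using a dictionary.")
• IDs in order of appearance: 1704 955 3173
• Sort: (955, 1704, 3173) + position information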
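A minimal sketch of this conversion, assuming the text has already been split into words (the slides do not describe the tokenizer) and that the dictionary is a plain word-to-ID mapping. `text_to_vector` is a hypothetical name; the three IDs are the ones shown on the slide.

```python
def text_to_vector(words, semantic_id):
    """Map known words to (semantic ID, position) pairs, sorted by ID."""
    pairs = [
        (semantic_id[word], position)
        for position, word in enumerate(words)
        if word in semantic_id  # words missing from the dictionary are skipped
    ]
    pairs.sort()
    return pairs


# IDs from the slide; the tokenization of the sentence is assumed.
DICTIONARY = {"テキスト": 955, "辞書": 1704, "数列": 3173}
WORDS = ["辞書", "を", "使って", "テキスト", "を", "数列", "に", "変える"]
VECTOR = text_to_vector(WORDS, DICTIONARY)
```

Keeping the position next to each ID preserves the "+ position information" part of the slide after sorting.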
Comparison
• tscore (translation score)
  • T1: (106, 335, 455, 567, 1704, 3173, 7421)
  • T2: (335, 567, 567, 1704, 4014, 5449, 7421)
  • score: counted up step by step (0, 1, 2, …) in the slide animation
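Because both vectors are sorted, shared semantic IDs can be counted in a single merge-style pass. The sketch below only counts shared IDs; the slides do not give the exact tscore formula or its normalization, so `shared_id_count` is an assumed building block rather than the tscore itself.

```python
def shared_id_count(t1, t2):
    """Count IDs common to two sorted ID sequences in one linear pass."""
    i = j = score = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            score += 1      # shared ID: advance both sequences
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1          # t1 is behind: advance it
        else:
            j += 1          # t2 is behind: advance it
    return score


T1 = (106, 335, 455, 567, 1704, 3173, 7421)
T2 = (335, 567, 567, 1704, 4014, 5449, 7421)
```

For the two vectors on the slide the shared IDs are 335, 567, 1704 and 7421, so the count is 4. The cost is linear in the lengths of the two vectors, which is in keeping with the high comparison throughput the slides aim for.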
tscore threshold
• Fry Corpus [Fry 05]
• F-measure
• tscore threshold: 0.102
• Speed: 250,000 pairs/sec
Large Scale Crawling
• Calculation cost of each comparison
• Calculation cost of the entire crawl
  • Number of comparisons:
• URL matching is too strict
• Alt text and link names are not available for all parallel pages
HTML on the Web to Natural Language
• Guess the language and character encoding
  • English, SJIS, EUC-JP, UTF-8
• Convert the character code
• Remove HTML tags
  • For crawling, <a> and <link> tags are used
  • <title> and <Hn> tags may be useful
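These steps can be sketched with the Python standard library alone. The encoding guess below simply tries each candidate in turn, which is a crude stand-in for whatever detector the crawler actually uses; `decode_page`, `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser

# Candidate encodings from the slide, tried in this order (an assumption).
CANDIDATE_ENCODINGS = ["utf-8", "euc_jp", "shift_jis"]


def decode_page(raw):
    """Decode bytes with the first candidate encoding that succeeds."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")


class TextExtractor(HTMLParser):
    """Collect visible text and <a href> targets, dropping all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.links = []
        self._skip = 0  # depth inside <script> / <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)  # kept for further crawling

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(raw):
    """Return (plain text, list of link targets) for a raw HTML page."""
    parser = TextExtractor()
    parser.feed(decode_page(raw))
    parser.close()
    return " ".join(parser.parts), parser.links
```

Trial decoding can pick the wrong encoding when a byte sequence happens to be valid in more than one of them, so a real crawler would likely combine it with the charset declared in the HTTP header or a meta tag.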
Calculation Cost Reduction
• Distance score of vectors
  • Compare only nearby vectors
  • Distance score: tscore
• Label every text with its nearest sample text
• If two texts are far apart by the distance score, they are not parallel texts
Calculation Cost Reduction
• Flow
  • Select sample texts (number of samples << n)
  • While crawling, calculate the distance score between each text and the sample texts
  • Classify each text under its top m scoring samples
  • Compare only texts in the same group
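The flow above can be sketched as below. The scoring function is a plain set overlap standing in for tscore, and the function names (`overlap`, `group_by_samples`, `candidate_pairs`) and the toy vectors are assumptions for illustration.

```python
from collections import defaultdict


def overlap(a, b):
    """Stand-in similarity: number of distinct IDs shared by two vectors."""
    return len(set(a) & set(b))


def group_by_samples(vectors, sample_ids, m, score=overlap):
    """Put each text into the groups of its top-m scoring sample texts."""
    groups = defaultdict(list)
    for index, vector in enumerate(vectors):
        ranked = sorted(
            sample_ids,
            key=lambda s: score(vector, vectors[s]),
            reverse=True,
        )
        for sample in ranked[:m]:
            groups[sample].append(index)
    return groups


def candidate_pairs(groups):
    """Return only the pairs of texts that share at least one group."""
    pairs = set()
    for members in groups.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


# Toy ID vectors: texts 0 and 1 are similar, as are texts 2 and 3.
VECTORS = [[1, 2, 3], [1, 2, 4], [10, 11, 12], [10, 11, 13]]
GROUPS = group_by_samples(VECTORS, sample_ids=[0, 2], m=1)
PAIRS = candidate_pairs(GROUPS)
```

In this toy case only 2 of the 6 possible pairs are left to compare. A larger m lowers the risk of missing a true pair at the price of more comparisons.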
Sampling
• Number of samples
  • Accuracy (risk of mislabeling)
  • Calculation cost
• Size of the groups
  • Should be equal
  • Large groups are recursively divided into smaller ones
Crawling link pages
• Pages reached by the same links from parallel texts will themselves be parallel texts
• Evaluating whether links are the same
  • DOM Tree [Lei et al. 06]
  • Evaluation function
    • Position of the <A> tag
    • Pages on the same host
    • Diff of URLs
      • hoge.html.en -> fuga.html.en : hoge - fuga
      • hoge.html.ja -> fuga.html.ja : hoge - fuga
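The "diff of URLs" idea can be sketched by stripping the common prefix and suffix of the source and target URL and comparing what is left. This is one possible reading of the slide; `url_diff` and `same_link` are hypothetical names.

```python
import os


def url_diff(src, dst):
    """Return the parts of src and dst left after removing the shared
    prefix and the shared suffix."""
    prefix = len(os.path.commonprefix([src, dst]))
    s, d = src[prefix:], dst[prefix:]
    # Reverse the remainders to find the common suffix.
    suffix = len(os.path.commonprefix([s[::-1], d[::-1]]))
    if suffix:
        s, d = s[:-suffix], d[:-suffix]
    return s, d


def same_link(src_a, dst_a, src_b, dst_b):
    """Treat two links as 'the same' when their URL diffs are identical."""
    return url_diff(src_a, dst_a) == url_diff(src_b, dst_b)
```

For the example on the slide, both the `.en` link and the `.ja` link reduce to the diff `("hoge", "fuga")`, so the two target pages become a candidate parallel pair.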
Evaluation of tscore
• Fry Corpus [Fry 05]
  • 200 (Japanese) x 200 (English)
• Flow
  • Convert all texts to vectors
  • Calculate the distance score for all pairs (40,000)
  • Check that the scores of the true parallel texts are high
  • The score of a parallel pair should be the top score
Evaluation of tscore
• Other distance scores, computed on sparse ID count vectors
  • ID sequence (1,1,1,2,4,4,…) → count vector (3,1,0,2,…)
• Example: (3,1,0,2,0) vs (3,0,0,1,2)
  • AND: 2
  • NOT XOR: 3
  • AND - XOR: 0
  • EUCLID
  • COS
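The alternative scores can be sketched on the two count vectors from the slide. AND and XOR are interpreted here as counting positions by presence or absence of an ID, which is an assumption, although it is consistent with the values 2, 3 and 0 that appear on the slide. All function names are hypothetical.

```python
import math


def and_score(a, b):
    """Positions where both vectors are non-zero."""
    return sum(1 for x, y in zip(a, b) if x and y)


def xor_score(a, b):
    """Positions where exactly one vector is non-zero."""
    return sum(1 for x, y in zip(a, b) if bool(x) != bool(y))


def not_xor_score(a, b):
    """Positions where both are non-zero or both are zero."""
    return len(a) - xor_score(a, b)


def and_minus_xor(a, b):
    return and_score(a, b) - xor_score(a, b)


def euclid(a, b):
    """Euclidean distance between the count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cos(a, b):
    """Cosine similarity between the count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


A = (3, 1, 0, 2, 0)
B = (3, 0, 0, 1, 2)
```

For `A` and `B`, the Euclidean distance is √6 and the cosine similarity is 11/14. Note that EUCLID is a distance (smaller means closer) while the others are similarities, so rankings have to be inverted when comparing them.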
Evaluation of tscore
• Number of misses for each score (200 + 200 texts)
Calculation Time
• Fry Corpus
  • 200, 400, 800, 1600, 3200
• NORMALtscore(Top3)
• # of samples: √(# of all texts)
• Mislabeling: 11 (in 200 pairs)
Conclusion and Future work
• Parallel texts from the Web
  • Detecting parallel texts
  • Large scale crawling
• Future work
  • Crawling many texts from the Web
  • Crawling with parallel link structure
  • Detecting parallel texts in real HTML pages
  • Proper sampling