Extracting Parallel Texts from Massive Web Documents
Chikayama Taura lab.
M2 Dai Saito
Purpose
Construct Parallel Corpora from the Web
• Parallel corpus: a set of parallel texts
• Parallel texts: translated pairs of texts
Example of a parallel text pair:
English: --it was the black kitten's fault entirely. One thing was certain, that the WHITE kitten had had nothing to do with it.
Japanese (日本語): ――もうなにもかも、黒い子ネコのせいだったのです。一つ確実なのは、白い子ネコはなんの関係もなかったということ。
Parallel Texts
• Useful resource for
  • Statistical machine translation
  • Dictionary construction
• But existing corpora are not enough
  • Amount: small, and building them takes large human resources
  • Genre: public documents, software manuals
  • Language: limited (English-French)
Parallel Texts from the Web
• Extracting parallel texts from massive Web documents
  • Very large amount of text
  • Varied languages
  • Small human resource
Problems
• How to detect parallel texts automatically
• How to reduce calculation cost
• To construct a parallel corpus
  • Extract candidate pairs
  • Judge whether they really are parallel texts
Agenda
• Introduction
• Related work
• Proposal
  • Detect parallel texts
  • Extract candidate pairs
• Experiment
• Conclusion
STRAND [Resnik et al. 03]
• URL matching
  • Remove language-specific substrings (LSSs; for Japanese: ja, jp, jpn, euc, sjis, …)
  • Match the LSS-removed URLs
• Make a detailed comparison
Example: http://www.hostname.com/index.html.en and http://www.hostname.com/index.html.ja match once the LSSs are removed
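The URL-matching step can be sketched as follows. This is a minimal illustration, not STRAND's actual implementation: the LSS list (the slide's Japanese examples plus "en" and "eng"), the token-boundary rule, and the names `strip_lss` and `match_urls` are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical LSS list: the slide's Japanese examples plus "en"/"eng" (assumed).
LSS = ["jpn", "eng", "sjis", "euc", "ja", "jp", "en"]

# Assumption: an LSS only counts when it is a whole token, i.e. not
# surrounded by other letters or digits (so "en" inside "index" is ignored).
_LSS_RE = re.compile(
    r"(?<![A-Za-z0-9])(?:%s)(?![A-Za-z0-9])" % "|".join(LSS),
    re.IGNORECASE,
)

def strip_lss(url):
    """Return the URL with every language-specific substring removed."""
    return _LSS_RE.sub("", url)

def match_urls(urls):
    """Group URLs that become identical after LSS removal."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_lss(url)].append(url)
    # Only groups with at least two URLs are candidate parallel pages.
    return [sorted(g) for g in groups.values() if len(g) > 1]

urls = [
    "http://www.hostname.com/index.html.en",
    "http://www.hostname.com/index.html.ja",
    "http://www.hostname.com/about.html",
]
print(match_urls(urls))
```

Each returned group would then go on to the detailed comparison step.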
DOM Tree Alignment [Lei et al. 06]
• HTML → DOM tree
• Searching linked pages
  • "alt" attribute
  • Link name ("English version", "In English", etc.)
• Parallel link: a pair of the same hyperlinks in parallel texts
Outline
Web → Crawler → Extract candidate pairs → Detect parallel texts
Detecting parallel texts
• Low comparison cost
• Without HTML information
• Word (noun) → semantic ID → comparison [Fukushima et al. 06]
Semantic ID Conversion
• Construct a graph from dictionaries
• Treat Japanese and English texts at the same level
• Number of semantic IDs: about 10,000
Example (figure):
  1: Sense, 感覚, 意味
  2: Movie, 映画, Film
  3: Hobby, 趣味, Taste, 味
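One way to read "constructing a graph from dictionaries" is to link every English word to its Japanese translations and give each connected component one semantic ID. The sketch below assumes that reading. The union-find approach, the function name `build_semantic_ids`, and the toy entries (taken from the figure's words) are illustrative assumptions, not the authors' stated procedure.

```python
def build_semantic_ids(entries):
    """Assign one ID per connected component of the translation graph.

    entries: iterable of (english_word, japanese_word) dictionary pairs.
    Returns a dict mapping every word to an integer semantic ID.
    """
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for english, japanese in entries:
        union(english.lower(), japanese)

    root_to_id = {}
    word_to_id = {}
    for word in parent:
        root = find(word)
        if root not in root_to_id:
            root_to_id[root] = len(root_to_id) + 1
        word_to_id[word] = root_to_id[root]
    return word_to_id

# Toy dictionary entries (hypothetical, based on the words in the figure).
entries = [
    ("sense", "感覚"), ("sense", "意味"),
    ("movie", "映画"), ("film", "映画"),
    ("hobby", "趣味"), ("taste", "趣味"), ("taste", "味"),
]
word_to_id = build_semantic_ids(entries)
print(word_to_id["film"], word_to_id["映画"])
```

With this mapping, an English word and its Japanese translation end up with the same ID, which is what lets the two languages be compared at the same level.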
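Texts to Vector
Example text: 辞書を使ってテキストを数列に変える。 (Convert a text into a number sequence using a dictionary.)
• Dictionary lookup: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173
• In text order: 1704, 955, 3173
• After sort: (955, 1704, 3173), plus position information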
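The conversion above can be sketched as a lookup followed by a sort. The sketch assumes the nouns have already been extracted by a morphological analyzer, and that "position information" means the noun's index in the text. Both are assumptions, as is the function name `text_to_vector`.

```python
def text_to_vector(nouns, word_to_id):
    """Convert a list of nouns (in text order) into sorted (semantic_id, position) pairs.

    Nouns missing from the dictionary are skipped.
    """
    pairs = [
        (word_to_id[noun], position)
        for position, noun in enumerate(nouns)
        if noun in word_to_id
    ]
    pairs.sort()  # sort by semantic ID, then by position
    return pairs

# IDs from the slide's example; the nouns are listed in text order.
word_to_id = {"テキスト": 955, "辞書": 1704, "数列": 3173}
nouns = ["辞書", "テキスト", "数列"]

vector = text_to_vector(nouns, word_to_id)
print([semantic_id for semantic_id, _ in vector])
```

Keeping the vector sorted is what makes the later pairwise comparison a single linear merge.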
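Comparison
• tscore (translation score): the number of semantic IDs the two texts share, divided by the sum of their lengths
T1: (106, 335, 455, 567, 1704, 3173, 7421)
T2: (335, 567, 567, 1704, 4014, 5449, 7421)
• Scanning both sorted sequences, the match count goes 0 → 1 → 2 → 3 → 4 (335, 567, 1704, 7421)
• tscore = 4/(7+7)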
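A minimal sketch of the comparison, assuming tscore is the number of matched IDs divided by the sum of the two lengths, and that a repeated ID matches at most as often as it occurs in the other text. Both assumptions are consistent with the 4/(7+7) example, but the slides do not state them explicitly.

```python
def tscore(t1, t2):
    """Translation score of two sorted semantic-ID sequences.

    Walks both sequences once (merge style), counting IDs present in both.
    """
    i = j = common = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            common += 1
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    total = len(t1) + len(t2)
    return common / total if total else 0.0

T1 = (106, 335, 455, 567, 1704, 3173, 7421)
T2 = (335, 567, 567, 1704, 4014, 5449, 7421)
print(tscore(T1, T2))  # 4 matches out of 7 + 7 IDs
```

Because both inputs are sorted, each comparison costs time linear in the text lengths, which fits the "low comparison cost" goal.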
tscore threshold
• Fry Corpus [Fry 05]: 400 pairs
• F-measure
• Speed: 200,000 pairs/sec
• tscore threshold: 0.102
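One plausible way to arrive at such a threshold is to pick the cut-off that maximizes the F-measure on labeled pairs. The slides do not spell out the procedure, so the sketch below is an assumption, and the scored pairs in it are made-up toy values, not Fry Corpus results.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 when there are no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scored_pairs):
    """Return (threshold, f_measure) maximizing F over the observed scores.

    scored_pairs: list of (tscore, is_parallel) tuples.
    A pair is predicted parallel when its tscore >= threshold.
    """
    best_f, best_thr = 0.0, None
    for thr in sorted({score for score, _ in scored_pairs}):
        tp = sum(1 for s, y in scored_pairs if s >= thr and y)
        fp = sum(1 for s, y in scored_pairs if s >= thr and not y)
        fn = sum(1 for s, y in scored_pairs if s < thr and y)
        f = f_measure(tp, fp, fn)
        if f > best_f:
            best_f, best_thr = f, thr
    return best_thr, best_f

# Toy data (hypothetical scores and labels, for illustration only).
scored = [(0.30, True), (0.20, True), (0.12, True),
          (0.15, False), (0.08, False), (0.05, False)]
print(best_threshold(scored))
```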
Extract candidate pairs
• Calculation cost of each comparison
• Calculation cost of extracting parallel texts
  • Number of comparisons: n^2
• URL matching is too strict
  • Japanese and English
  • 90,000,000 URLs → 4,000 URL pairs → 1,000 real pairs
Calculation Cost Reduction
→ Reducing the number of comparisons
• Distance score: tscore
• Compare only texts close to each other
• The two texts of a parallel pair (English and Japanese) should be at the same distance from a sample text
Calculation Cost Reduction
• Flow
  • Select sample texts (far fewer than n)
  • Calculate the distance score between each text and the sample texts
  • Classify each text into the groups of its top m scoring samples
  • Compare only texts in the same group
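The flow above can be sketched as follows. This is an illustrative reading, not the authors' code: the function names, the tie-breaking by sample order, and the toy ID sequences are assumptions, and `tscore` is the matched-ID ratio assumed from the Comparison slide.

```python
from collections import defaultdict

def tscore(t1, t2):
    """Matched semantic IDs divided by the sum of lengths (inputs sorted)."""
    i = j = common = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            common, i, j = common + 1, i + 1, j + 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    total = len(t1) + len(t2)
    return common / total if total else 0.0

def group_by_samples(texts, samples, m):
    """Put each text index into the groups of its m best-scoring samples."""
    groups = defaultdict(list)
    for idx, text in enumerate(texts):
        ranked = sorted(range(len(samples)),
                        key=lambda s: tscore(text, samples[s]),
                        reverse=True)
        for s in ranked[:m]:
            groups[s].append(idx)
    return groups

def candidate_pairs(ja_texts, en_texts, samples, m=3):
    """Return (ja_index, en_index) pairs that share at least one group."""
    ja_groups = group_by_samples(ja_texts, samples, m)
    en_groups = group_by_samples(en_texts, samples, m)
    pairs = set()
    for s, ja_indices in ja_groups.items():
        for ja_idx in ja_indices:
            for en_idx in en_groups.get(s, []):
                pairs.add((ja_idx, en_idx))
    return pairs

# Toy semantic-ID sequences (hypothetical).
samples = [[1, 2, 3], [10, 11, 12]]
ja_texts = [[1, 2, 4], [10, 11, 13]]
en_texts = [[1, 3, 5], [11, 12, 14]]
print(sorted(candidate_pairs(ja_texts, en_texts, samples, m=1)))
```

Only the surviving pairs are then scored against each other, so the all-to-all n^2 comparison is replaced by n × (number of samples) scores plus the within-group comparisons.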
Sampling
• Number of samples
  • Calculation cost
  • Accuracy (low risk of mislabeling)
• Methods to select samples
  • Random
  • k-means
k-means (k=2)
• Select k samples
• Classify all texts
• Calculate centers
• Re-classify
Calculation of tscore in k-means
• Normal
Text1: (106, 335, 455, 567, 1704, 3173, 7421)
Text2: (335, 567, 567, 1704, 4014, 5449, 7421)
tscore = 4/(7+7)
• k-means
Text1: (106, 335, 455, 567, 1704, 3173, 7421)
Average1: ((567, 0.2), (4014, 0.14), (7421, 0.5), …)
tscore = (0.2+0.5)
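The k-means variant can be sketched as summing the center's weights for every semantic ID that also occurs in the text. How the weights in Average1 are computed is not given on the slide, so the sketch takes the center as given and uses only the three entries shown. The function name `center_tscore` is hypothetical.

```python
def center_tscore(text_ids, center):
    """Score a text against a cluster center.

    text_ids: sequence of semantic IDs in the text.
    center: dict mapping semantic ID -> weight (the "Average" on the slide).
    Returns the sum of weights of center IDs that occur in the text.
    """
    present = set(text_ids)
    return sum(weight for sid, weight in center.items() if sid in present)

text1 = (106, 335, 455, 567, 1704, 3173, 7421)
# Only the entries visible on the slide; the elided ones are left out.
average1 = {567: 0.2, 4014: 0.14, 7421: 0.5}
print(center_tscore(text1, average1))  # 567 and 7421 match, 4014 does not
```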
Converting HTML on the Web
• Guess language and encoding
  • English, SJIS, EUC-JP, UTF-8
• Convert character code
• Remove HTML tags
• Morphological analysis → pick up nouns
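The first three steps can be sketched with the Python standard library. Trial decoding is a simple stand-in for the unspecified guessing method and can misidentify short or ambiguous byte strings. The candidate order and the helper names are assumptions. Morphological analysis needs an external analyzer and is not shown.

```python
from html.parser import HTMLParser

# Assumed trial order: stricter encodings first.
CANDIDATE_ENCODINGS = ["ascii", "utf-8", "euc_jp", "shift_jis"]

def guess_and_decode(raw):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Fallback: keep going with replacement characters.
    return "utf-8", raw.decode("utf-8", errors="replace")

class _TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_tags(html):
    """Remove HTML tags and collapse whitespace."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())

raw = "<html><body><p>テキスト</p></body></html>".encode("euc_jp")
encoding, html = guess_and_decode(raw)
print(encoding, strip_tags(html))
```

The resulting plain text would then be passed to a morphological analyzer to pick out the nouns.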
Experiment
• Calculation cost
• Accuracy vs. calculation time
• Clustering
  • k-means
Environment
• Dataset: Fry Corpus [Fry 05]
  • Corpus of Japanese-English news pages
  • HTML converted to semantic IDs in advance
• Machine
  • CPU: Xeon 2.4 GHz Dual
  • Memory: 2 GB
  • OS: Linux (Debian)
Calculation Cost
• Fry Corpus
  • 200 to 6,400 pairs
• Compared: normal all-to-all vs. random sampling (top 3)
• As the number of texts grows, the gap becomes wider
• Low cost with n^2 samples
Accuracy vs. Calculation time
• Fry Corpus
  • 400 pairs
• Random sampling
• As the number of samples grows,
  • Misclassification ratio → high
  • Execution time → low
• Trade-off between misclassification ratio and execution time
Sample selection with k-means
• Accuracy and execution time with k-means
• Flow
  • Random sampling
    • Number of samples: √n
  • Calculating the centers and re-sampling
  • Measuring misclassification ratio and execution time
Evaluation of k-means
• Low misclassification ratio → highly biased
Conclusion and Future work
• Parallel texts from the Web
  • Detecting parallel texts
  • Extracting candidate pairs
    • Random sampling
    • k-means
Future work
• Better clustering methods
  • Hierarchical
• Dimension reduction
  • About 10,000 dimensions is too many
• Processing real HTML texts from the Web