Extracting Parallel Texts from Massive Web Documents

Extracting Parallel Texts from Massive Web Documents Chikayama Taura lab. M2 Dai Saito

Purpose • Construct parallel corpora from the Web • Parallel corpus: a set of parallel texts • Parallel texts: translated pairs of texts • Example (English): --it was the black kitten's fault entirely. One thing was certain, that the WHITE kitten had had nothing to do with it. • Example (日本語, Japanese): ――もうなにもかも、黒い子ネコのせいだったのです。一つ確実なのは、白い子ネコはなんの関係もなかったということ。

Parallel Texts • Useful resource for: • Statistical machine translation • Dictionary construction • But existing corpora are not enough: • Amount: small, because building them takes a large amount of human effort • Genre: mostly public documents and software manuals • Language: limited pairs, e.g. English-French

Parallel Texts from the Web • Extracting parallel texts from massive Web documents offers: • A very large amount of text • Varied languages • Little human effort

Problems • How to detect parallel texts automatically • How to reduce calculation cost • Steps to construct a parallel corpus from the Web: • Extract candidate pairs • Judge whether they really are parallel texts

Agenda • Introduction • Related work • Proposal • Detect parallel texts • Extract candidate pairs • Experiment • Conclusion

STRAND [Resnik et al. 03] • URL matching • Remove language-specific substrings (LSSs) from URLs (for Japanese: ja, jp, jpn, euc, sjis, …) • Match the LSS-removed URLs • Make a detailed comparison of the matched pages • Example: http://www.hostname.com/index.html.en and http://www.hostname.com/index.html.ja
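The URL-matching step can be sketched as follows. This is a minimal illustration, not STRAND's actual implementation: the English LSS list, the delimiter rule, and the function names `strip_lss` and `candidate_pairs` are assumptions made for the example.

```python
import re
from collections import defaultdict

# Language-specific substrings (LSSs). The Japanese entries come from the
# slide; the English entries are an assumption for this sketch.
LSS = {
    "ja": ["ja", "jp", "jpn", "euc", "sjis"],
    "en": ["en", "eng", "english"],
}

def strip_lss(url, lang):
    """Remove LSS tokens that stand alone between URL delimiters."""
    tokens = "|".join(sorted(map(re.escape, LSS[lang]), key=len, reverse=True))
    # Require a delimiter on both sides so 'en' inside 'content' is kept.
    pattern = re.compile(r"(?<=[./_-])(?:%s)(?=[./_-]|$)" % tokens, re.IGNORECASE)
    return pattern.sub("", url)

def candidate_pairs(en_urls, ja_urls):
    """Pair URLs that become identical once their LSSs are removed."""
    by_key = defaultdict(list)
    for url in ja_urls:
        by_key[strip_lss(url, "ja")].append(url)
    return [(e, j) for e in en_urls for j in by_key.get(strip_lss(e, "en"), [])]

pairs = candidate_pairs(
    ["http://www.hostname.com/index.html.en", "http://www.hostname.com/about.html"],
    ["http://www.hostname.com/index.html.ja"],
)
```

Grouping by the stripped URL keeps the matching step linear in the number of URLs; only the matched pairs go on to the detailed comparison.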

DOM Tree Alignment [Lei et al. 06] • HTML → DOM tree • Search for linked pages using: • the “alt” attribute • the link name (“English version”, “In English”, etc.) • Parallel link: a pair of the same hyperlinks in parallel texts

Outline • Web crawler → Extract candidate pairs → Detect parallel texts

Detecting parallel texts • Low comparison cost • Without HTML information • Flow: word (noun) → semantic ID → comparison [Fukushima et al. 06]

Semantic ID Conversion • Constructing a graph from dictionaries • Treating Japanese and English texts at the same level • # of semantic IDs: about 10,000 • Example: ID 1 = {Sense, 感覚, 意味}; ID 2 = {Movie, Film, 映画}; ID 3 = {Hobby, Taste, 趣味, 味}
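One plausible way to derive such IDs from a bilingual dictionary is to treat each entry as a graph edge and give every connected component one ID. The slides do not spell out the exact construction, so the sketch below is an assumption; the entry list and the name `build_semantic_ids` are illustrative only.

```python
def build_semantic_ids(entries):
    """Assign one ID per connected component of the dictionary graph.

    entries: iterable of (english_word, japanese_word) dictionary pairs.
    Returns a dict mapping every word to its semantic ID (1, 2, 3, ...).
    """
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for en, ja in entries:
        root_en, root_ja = find(en), find(ja)
        if root_en != root_ja:
            parent[root_ja] = root_en  # merge the two components

    ids, result = {}, {}
    for word in parent:  # dicts keep insertion order, so IDs are stable
        root = find(word)
        ids.setdefault(root, len(ids) + 1)
        result[word] = ids[root]
    return result

# Hypothetical dictionary entries mirroring the example on the slide.
entries = [
    ("sense", "感覚"), ("sense", "意味"),
    ("movie", "映画"), ("film", "映画"),
    ("hobby", "趣味"), ("taste", "趣味"), ("taste", "味"),
]
semantic_id = build_semantic_ids(entries)
```

With these entries, "movie" and "film" share an ID because both translate to 映画, and "hobby" and "taste" share one through 趣味.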

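Texts to Vector • Example sentence: 辞書を使ってテキストを数列に変える。 ("Convert a text into a number sequence using a dictionary.") • Nouns to semantic IDs: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173 • In text order: 1704, 955, 3173 • Sort: (955, 1704, 3173) + position information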
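The conversion above can be sketched in a few lines. The three IDs are the ones shown on the slide; the function name `text_to_vector`, the input format (a list of already-extracted nouns), and the choice to keep the position as the second tuple element are assumptions for illustration.

```python
# Semantic IDs for the three nouns shown on the slide.
SEMANTIC_ID = {"テキスト": 955, "辞書": 1704, "数列": 3173}

def text_to_vector(nouns, semantic_id=SEMANTIC_ID):
    """Map nouns to (semantic ID, position) pairs, sorted by ID.

    Nouns missing from the dictionary are dropped; 'position' is the
    index of the noun in the input list.
    """
    pairs = [(semantic_id[n], pos) for pos, n in enumerate(nouns) if n in semantic_id]
    return sorted(pairs)

# Nouns in the order they appear in the example sentence, plus one
# unknown word that the dictionary does not cover.
vector = text_to_vector(["辞書", "テキスト", "数列", "未知語"])
ids_only = [sid for sid, _ in vector]
```

Sorting by ID is what later allows two texts to be compared in a single linear pass.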

Comparison • tscore (translation score) • T1: (106, 335, 455, 567, 1704, 3173, 7421) • T2: (335, 567, 567, 1704, 4014, 5449, 7421) • Matching IDs: 335, 567, 1704, 7421 (score = 4) • tscore = 4/(7+7)
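Because both vectors are sorted, the matching IDs can be counted with a single merge-style pass. The sketch below reproduces the T1/T2 example; matching duplicate IDs one-to-one is an assumption that happens to agree with the slide's count of 4.

```python
def tscore(t1, t2):
    """Translation score: matched IDs / (len(t1) + len(t2)).

    t1 and t2 must be sorted lists of semantic IDs.
    """
    i = j = matches = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            matches += 1
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    total = len(t1) + len(t2)
    return matches / total if total else 0.0

T1 = [106, 335, 455, 567, 1704, 3173, 7421]
T2 = [335, 567, 567, 1704, 4014, 5449, 7421]
score = tscore(T1, T2)  # 4 matches over 7 + 7 IDs
```

Each comparison costs time linear in the two vector lengths and needs no HTML information.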

tscore threshold • Fry Corpus [Fry 05]: 400 pairs • Measure: F-measure • Speed: 200,000 pairs/sec • tscore threshold: 0.102

Extract candidate pairs • Calculation cost of each comparison • Calculation cost of extracting parallel texts • Number of comparisons: n^2 • URL matching is too strict • Japanese and English: 90,000,000 URLs → 4,000 URL pairs → 1,000 real pairs

Calculation Cost Reduction • Goal: reduce the number of comparisons • Distance score: tscore • Compare only texts that are close to each other • The distances from a sample text to each text of a parallel pair (English and 日本語) should be equal

Calculation Cost Reduction • Flow • Select sample texts (number of samples << n) • Calculate the distance score of every text against the sample texts • Classify each text by its top m scores • Compare only texts in the same group
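The flow above can be sketched as a bucketing step. This is one reading of the slide, not the author's code: each text is assigned to the groups of its m best-scoring samples, and only English/Japanese texts sharing a group become candidates. The names `group_candidates` and `top_groups` and the toy vectors are assumptions.

```python
from collections import Counter, defaultdict

def tscore(t1, t2):
    """Matched IDs (as a multiset) divided by the total number of IDs."""
    total = len(t1) + len(t2)
    matches = sum((Counter(t1) & Counter(t2)).values())
    return matches / total if total else 0.0

def top_groups(vec, samples, m):
    """Indices of the m samples with the highest tscore against vec."""
    order = sorted(range(len(samples)), key=lambda s: tscore(vec, samples[s]), reverse=True)
    return set(order[:m])

def group_candidates(en_texts, ja_texts, samples, m=3):
    """Return (en_index, ja_index) pairs that share at least one group."""
    buckets = defaultdict(list)
    for j, vec in enumerate(ja_texts):
        for g in top_groups(vec, samples, m):
            buckets[g].append(j)
    pairs = set()
    for e, vec in enumerate(en_texts):
        for g in top_groups(vec, samples, m):
            pairs.update((e, j) for j in buckets[g])
    return pairs

# Toy semantic-ID vectors (hypothetical data).
samples = [[1, 2, 3], [10, 11, 12]]
en_texts = [[1, 2, 4], [10, 11, 13]]
ja_texts = [[1, 3, 5], [11, 12, 14]]
candidates = group_candidates(en_texts, ja_texts, samples, m=1)
```

Only the candidate pairs are then passed to the full tscore comparison, instead of all n^2 combinations.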

Sampling • Number of samples • Calculation cost • Accuracy (low risk of mislabeling) • Methods to select samples • Random • k-means

k-means k=2 • Select k samples • Classify all texts • Calculate centers • Re-classify

Calculation of tscore in k-means • Normal (text vs. text): Text1: (106, 335, 455, 567, 1704, 3173, 7421), Text2: (335, 567, 567, 1704, 4014, 5449, 7421), tscore = 4/(7+7) • k-means (text vs. cluster average): Text1: (106, 335, 455, 567, 1704, 3173, 7421), Average1: ((567, 0.2), (4014, 0.14), (7421, 0.5), …), tscore = (0.2+0.5)
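The weighted score against a cluster average can be sketched as summing the weights of the IDs that the text contains. Only the three (ID, weight) entries shown on the slide are used; how the weights themselves are computed is not stated there, so the sketch takes them as given. The name `tscore_to_average` is hypothetical.

```python
def tscore_to_average(text_ids, average):
    """Sum the weights of the average's IDs that occur in the text.

    text_ids: list of semantic IDs for one text.
    average:  dict mapping semantic ID -> weight for a cluster average.
    """
    present = set(text_ids)
    return sum(weight for sid, weight in average.items() if sid in present)

text1 = [106, 335, 455, 567, 1704, 3173, 7421]
# Only the entries visible on the slide; the rest of the average is elided there.
average1 = {567: 0.2, 4014: 0.14, 7421: 0.5}
score = tscore_to_average(text1, average1)  # 0.2 + 0.5 for IDs 567 and 7421
```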

Converting HTML on the Web • Guess language and character code (English, SJIS, EUC-JP, UTF-8) • Convert character code • Remove HTML tags • Morphological analysis → pick up nouns
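The first three steps can be sketched with the Python standard library alone. Guessing the encoding by trial decoding is a simplification assumed for this example, and the morphological analysis step is left out because it needs an external analyzer. The names `decode_html` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def decode_html(raw):
    """Try candidate encodings in order; return (text, encoding used)."""
    for enc in ("utf-8", "euc_jp", "shift_jis"):
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace"), "utf-8"

def html_to_text(raw):
    """Decode raw HTML bytes and strip the tags."""
    text, _ = decode_html(raw)
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)

page = "<html><body><p>日本語</p><script>x=1</script></body></html>"
plain = html_to_text(page.encode("euc_jp"))
```

The resulting plain text would then go to a morphological analyzer to pick up the nouns.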

Experiment • Calculation cost • Accuracy vs. calculation time • Clustering • k-means

Environment • Dataset: Fry Corpus [Fry 05] • Corpus of Japanese-English news pages • HTML converted to semantic IDs in advance • Machine • CPU: Xeon 2.4GHz Dual • Memory: 2GB • OS: Linux (Debian)

Calculation Cost • Fry Corpus • 200 - 6400 pairs • Compared: normal all-to-all vs. random sampling (top 3) • As the # of texts grows, the gap becomes wider • Low cost with n^2 samples

Accuracy vs. Calculation time • Fry Corpus • 400 pairs • Random sampling • As the # of samples grows: • Misclassification ratio → high • Execution time → low • Trade-off between misclassification ratio and execution time

Sample selection with k-means • Accuracy and execution time with k-means • Flow • Random sampling • Number of samples: √n • Calculating the centers and re-sampling • Measuring misclassification ratio and execution time

Evaluation of k-means • Low misclassification ratio → highly biased

Conclusion and Future work • Parallel texts from the Web • Detecting parallel texts • Extracting candidate pairs • Random sampling • k-means

Future work • Better clustering methods • Hierarchical • Dimension reduction • About 10,000 dimensions is too many • Processing real HTML texts from the Web

Thank you for your attention!
