The Web as a Parallel Corpus


  1. The Web as a Parallel Corpus
  • Parallel corpora are useful
    • Training data for statistical MT
    • Lexical correspondences for cross-lingual IR
  • Early work: Hansards
    • Canadian parliamentary proceedings
    • French/English only
  • Still, most existing resources cover only formal, newspaper-style text

  2. Harvesting parallel text from the web
  • STRAND: use similar structure to find likely translations
  • Use similar content to find translations
  • Apply these methods to the Internet Archive, dramatically increasing the quantity of data

  3. STRAND
  • Structural Translation Recognition, Acquiring Natural Data
  • Architecture:
    • Location of possible translations
    • Generation of candidate translations
    • Filtering of candidates based on structure

  4. Locating candidates: search for language names in anchor text (anchor: “English” OR anchor: “French”)
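
The slide's query locates pages through a search engine's anchor-text operator. As a rough local illustration of the same idea, the sketch below scans an already-crawled page for links whose anchor text names a language; the AnchorFinder class and the language set are assumptions for this example, not part of STRAND itself.

```python
# Illustrative sketch: find links whose anchor text names a language on a
# crawled page. AnchorFinder and LANGUAGES are assumptions for this example.
from html.parser import HTMLParser

class AnchorFinder(HTMLParser):
    """Collect (anchor_text, href) pairs whose anchor text names a language."""
    LANGUAGES = {"english", "french"}

    def __init__(self):
        super().__init__()
        self.hits = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if text.lower() in self.LANGUAGES:
                self.hits.append((text, self._href))
            self._href = None

finder = AnchorFinder()
finder.feed('<a href="/en/home.html">English</a> <a href="/fr/home.html">French</a>')
print(finder.hits)  # [('English', '/en/home.html'), ('French', '/fr/home.html')]
```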

  5. Structural Filtering
  • Linearize the HTML and discard the content
  • Run it through a transducer to produce:
    • [START element-label]
    • [END element-label]
    • [CHUNK length]
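
A minimal sketch of this linearization step, assuming Python's built-in html.parser; the Linearizer name is illustrative, and real pages would need more care (scripts, entities, malformed markup).

```python
# Minimal sketch of STRAND-style linearization using Python's html.parser.
# Content is discarded; only markup structure and text-chunk lengths remain.
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Reduce an HTML page to [START tag] / [END tag] / [CHUNK length] tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("START", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("END", tag))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("CHUNK", len(text)))  # keep only the length

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    return parser.tokens

print(linearize("<html><body><p>Hello world</p><p>Bye</p></body></html>"))
# [('START', 'html'), ('START', 'body'), ('START', 'p'), ('CHUNK', 11),
#  ('END', 'p'), ('START', 'p'), ('CHUNK', 3), ('END', 'p'),
#  ('END', 'body'), ('END', 'html')]
```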

  6. Align the two token sequences using dynamic programming
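
One way to realize this step is ordinary edit-distance dynamic programming over the two token sequences produced by the linearizer. The unit-cost model below, and the choice to let CHUNK tokens align regardless of length, are simplifying assumptions rather than STRAND's exact formulation.

```python
# Sketch: align two linearized token sequences with edit-distance style
# dynamic programming. Unit costs are an assumption, not STRAND's exact model.
def same(x, y):
    # Markup tokens must match exactly; CHUNK tokens align regardless of length,
    # since length differences are scored later (see the dp/n/r/p features).
    if x[0] == "CHUNK" and y[0] == "CHUNK":
        return True
    return x == y

def align(a, b):
    """Return aligned token pairs; None marks a token with no counterpart."""
    n, m = len(a), len(b)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if same(a[i - 1], b[j - 1]) else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,  # match or substitute
                             cost[i - 1][j] + 1,        # token only in a
                             cost[i][j - 1] + 1)        # token only in b
    pairs, i, j = [], n, m                              # trace back the best path
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (0 if same(a[i - 1], b[j - 1]) else 1):
            pairs.append((a[i - 1], b[j - 1])); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((a[i - 1], None)); i -= 1
        else:
            pairs.append((None, b[j - 1])); j -= 1
    return pairs[::-1]
```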

  7. Scalar values
  • dp: difference in structure, measured by the number of structural items that have no match
  • n: number of aligned non-markup chunks of different lengths
  • r: correlation of the aligned chunk lengths
  • p: significance level of the correlation
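
A sketch of deriving the four values from the alignment above. dp is taken here as a raw count of structural items lacking an identical counterpart (rather than a normalized measure), and scipy.stats.pearsonr is assumed to supply both the correlation r and its significance level p.

```python
# Sketch of the four STRAND decision features, computed from the alignment
# returned by align() above. Needs at least two aligned chunk pairs for the
# correlation to be defined.
from scipy.stats import pearsonr

def strand_features(pairs):
    dp = 0       # structural (markup) items with no identical counterpart
    lens = []    # (length in page 1, length in page 2) for aligned text chunks
    for x, y in pairs:
        x_chunk = x is not None and x[0] == "CHUNK"
        y_chunk = y is not None and y[0] == "CHUNK"
        if x_chunk and y_chunk:
            lens.append((x[1], y[1]))
        else:
            for tok, other in ((x, y), (y, x)):
                if tok is not None and tok[0] != "CHUNK" and tok != other:
                    dp += 1
    n = sum(1 for lx, ly in lens if lx != ly)   # aligned chunks of different lengths
    lens_x, lens_y = zip(*lens)
    r, p = pearsonr(lens_x, lens_y)             # correlation of chunk lengths + p-value
    return dp, n, r, p
```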

  8. Evaluation
  • Human judgments on 326 English-French paired pages
  • Using manually set thresholds on dp and n:
    • 100% precision
    • 68.6% recall
  • Similar results on English/Chinese and English/Spanish
  • Typically throws out about 1/3 of the data
  • Using machine learning instead: 84% recall, 96% precision

  9. Drawbacks of structural matching
  • Not all translations have similar structures
  • Not all texts use HTML markup

  10. Content-based matching
  • Seed: a bilingual lexicon
  • Link: a pair (x, y) with x in L1 and y in L2
    • Probability that x is a translation of y is given by the bilingual lexicon
  • Want the most probable link sequence that could account for a pair of texts
    • Score: the product of the link probabilities
  • Best set of links found by maximum weighted bipartite matching
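
The highest-probability one-to-one link set can be found with an off-the-shelf assignment solver. The sketch below uses scipy.optimize.linear_sum_assignment on log probabilities, so maximizing the summed weights maximizes the product of link probabilities; the lexicon format and the floor probability for unknown pairs are assumptions for this example.

```python
# Sketch of content-based linking via maximum weighted bipartite matching.
# The lexicon format and the floor probability for unseen pairs are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_links(words_l1, words_l2, lexicon, floor=1e-9):
    """Return the most probable one-to-one link set between two word lists.

    lexicon maps (l1_word, l2_word) -> translation probability.
    """
    # Work in log space: the product of link probabilities becomes a sum.
    weights = np.log([[lexicon.get((x, y), floor) for y in words_l2]
                      for x in words_l1])
    # linear_sum_assignment minimizes cost, so negate to maximize total weight.
    rows, cols = linear_sum_assignment(-weights)
    return [(words_l1[i], words_l2[j], float(np.exp(weights[i, j])))
            for i, j in zip(rows, cols)]

lexicon = {("house", "maison"): 0.8, ("the", "la"): 0.4, ("the", "le"): 0.4}
print(best_links(["the", "house"], ["la", "maison"], lexicon))
# links 'the'-'la' and 'house'-'maison', each with its lexicon probability
```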

  11. Cross-language similarity score: tsim
  • Computed on the first 500 words of a document, for efficiency
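
The slide gives only the name and the 500-word truncation, so the normalization below (lexicon-supported links over the total words considered) is an illustrative stand-in for tsim rather than the paper's exact definition; it reuses best_links() from the previous sketch.

```python
# Illustrative tsim-style score; the normalization is an assumption, not the
# paper's exact definition of tsim. Reuses best_links() from the sketch above.
def tsim(words_l1, words_l2, lexicon, limit=500):
    w1, w2 = words_l1[:limit], words_l2[:limit]      # first 500 words, for efficiency
    links = best_links(w1, w2, lexicon)
    supported = sum(1 for x, y, _ in links if (x, y) in lexicon)
    return 2.0 * supported / (len(w1) + len(w2))     # share of words linked by the lexicon
```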

  12. Experiment
  • Dictionary:
    • English/French dictionary: 34,808 entries
    • Dictionary of English/French cognates: 35,513 pairs
    • Additional pairs from the Bible: 11,264
    • Final lexicon: 132,155 pairs
  • Threshold for tsim trained on 32 pairs from the STRAND test set
  • Results:
    • STRAND (manual thresholds): F-measure 0.81
    • tsim: F-measure 0.88
    • Combined model: F-measure 0.977
