
Deriving Paraphrases for Highly Inflected Languages from Comparable Documents

Kfir Bar, Nachum Dershowitz. Tel Aviv University, Israel.


Presentation Transcript


  1. Deriving Paraphrases for Highly Inflected Languages from Comparable Documents. Kfir Bar, Nachum Dershowitz, Tel Aviv University, Israel

  2. Paraphrases
     • Phrase level: “I spilled the beans and told Jacky I loved her” / “I exposed my secret about my personal life”
     • Sentence level: “Beijing’s policy toward Taiwan remains unchanged” / “China did not change its policy toward Taiwan”

  3. Motivation: MT coverage problem. [Figure: Arabic covered n-grams vs. parallel corpus size]

  4. Related work on paraphrasing
     • Continuing our previous work on Arabic synonyms (Bar and Dershowitz, AMTA, 2010)
     • Using a parallel corpus (Callison-Burch et al., 2006)
     • Using a monolingual corpus (Marton et al., 2009)
     • Using comparable documents (Wang and Callison-Burch, 2011)

  5. Why Arabic?
     • Being a Semitic language, Arabic is highly inflected
     [Annotated example: وتدرسها = “and she learns it”, with labels: conjunction, direct object, root, pattern]

  6. Extracting paraphrases
     • Inspired by: Extracting Paraphrases from a Parallel Corpus, Regina Barzilay and Kathleen R. McKeown (2001)
     • Working on Arabic comparable documents

  7. Preparing the corpus
     • Using Arabic Gigaword, we automatically paired documents that:
       • were published on the same day
       • maximize the cosine similarity over the lemma-frequency vector
     [Figure: AFP and XIN documents listed by publication date (24.12.2002, 25.12.2002, 27.12.2002), with same-day documents linked by maximum cosine similarity]
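The pairing step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the input layout (`doc_id`, `date`, list of lemmas) and the function names `cosine` and `pair_documents` are assumptions, and the slide does not say whether a similarity threshold or one-to-one matching was applied, so neither is used here.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two lemma-frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_documents(afp, xin):
    """For each AFP document, pick the same-day XIN document with the
    highest cosine similarity. Inputs are lists of (doc_id, date, lemmas);
    this layout is hypothetical."""
    pairs = []
    for a_id, a_date, a_lemmas in afp:
        a_vec = Counter(a_lemmas)
        best = None
        for x_id, x_date, x_lemmas in xin:
            if x_date != a_date:
                continue  # only documents published on the same day
            score = cosine(a_vec, Counter(x_lemmas))
            if best is None or score > best[1]:
                best = (x_id, score)
        if best is not None:
            pairs.append((a_id, best[0], best[1]))
    return pairs


afp = [("a1", "2002-12-24", ["earthquake", "taiwan", "hit"]),
       ("a2", "2002-12-27", ["policy", "china"])]
xin = [("x1", "2002-12-24", ["earthquake", "taiwan", "richter"]),
       ("x2", "2002-12-24", ["football", "match"]),
       ("x3", "2002-12-25", ["earthquake", "taiwan", "hit"])]
pairs = pair_documents(afp, xin)
```

In this toy input, `x3` has the most similar content to `a1` but is skipped because it was published on a different day, and `a2` stays unpaired because no XIN document shares its date.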

  8. Preparing the corpus
     • 690 document pairs
     • Manual evaluation by two Arabic speakers:
       • randomly selected 120 document pairs
       • question: “Do both documents discuss the same event?”

  9. Preprocessing
     • AMIRAN [Diab et al., to appear] is a tool for finding context-sensitive morpho-syntactic information:
       • Segmentation
       • Diacritized lemma
       • Stem
       • Full part-of-speech tag
       • Base-phrase tag
       • Named-entity-recognition (NER) tag

  10. Extracting paraphrases: co-training technique. [Diagram labels: alignment → paraphrases (✗ ✗); extracting pairs of phrases → iterations of co-training (context <-> phrase) → paraphrases (✔)]

  11. Extracting pairs of phrases
     • Example sentences: “A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m.” / “A strong undersea earthquake hit eastern Taiwan Wednesday”
     • Phrases:
       • contain at least one non-functional word
       • do not break a base phrase in the middle
     [Figure: both example sentences split into word-by-word token lists]
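The two phrase constraints above could be enforced along these lines. This is only a sketch under stated assumptions: the input is taken to be a sentence already split into base-phrase chunks, each token carries a hypothetical `is_function_word` flag, and candidates are built from whole chunks so that no base phrase is cut. The `max_chunks` limit and the name `candidate_phrases` are illustrative and do not come from the slides.

```python
from typing import List, Tuple

# A chunk is one base phrase: a list of (token, is_function_word) pairs.
Chunk = List[Tuple[str, bool]]


def candidate_phrases(chunks: List[Chunk], max_chunks: int = 3) -> List[str]:
    """Enumerate spans of whole, contiguous base-phrase chunks that
    contain at least one non-functional word."""
    phrases = []
    for start in range(len(chunks)):
        last = min(start + max_chunks, len(chunks))
        for end in range(start + 1, last + 1):
            tokens = [tw for chunk in chunks[start:end] for tw in chunk]
            # keep the span only if some token is not a function word
            if any(not is_func for _, is_func in tokens):
                phrases.append(" ".join(tok for tok, _ in tokens))
    return phrases


# Hypothetical chunking of part of the first example sentence.
chunks = [
    [("on", True)],
    [("the", True), ("Richter", False), ("scale", False)],
    [("occurred", False)],
]
phrases = candidate_phrases(chunks, max_chunks=2)
```

Here the lone function word "on" is rejected, and "Richter scale" is never produced on its own because that would split the base phrase "the Richter scale".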

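  12. Co-training (examples in Buckwalter transliteration)
     • dEA xAfyyr swlAnA Almnsq Al>Ely llsyAsp AlxArjyp wAl>mnyp (roughly: “Javier Solana, the High Coordinator for Foreign and Security Policy, called”)
     • dEA xAfyyr swlAnA Almmvl Al>ElY llsyAsp AlxArjyp fy (roughly: “Javier Solana, the High Representative for Foreign Policy, called ... in”)
     [Figure labels over each sentence: Outer (Context), Inner (Phrase), Outer (Context)]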

  13. Extracting paraphrases
     • We maintain two sets of instances: labeled and unlabeled
     • Labeled instances are positive (= paraphrases) or negative (= NOT paraphrases)

  14. Single iteration
     • Step 1: training Outer
     • Step 2: using Outer
     • Step 3: training Inner
     • Step 4: using Inner
     • Aggregation: positive + negative instances, using confidence score
     [Diagram labels: Unlabeled set, Labeled set, Deterministic labeling, next iteration, paraphrases]
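The four steps of an iteration can be illustrated with a toy co-training loop. Several things here are assumptions rather than details from the slides: a nearest-centroid scorer (`CentroidClassifier`) stands in for the SVMs, each instance is a hypothetical `(outer_features, inner_features)` pair, and "aggregation using confidence score" is approximated by moving an instance to the labeled set once its absolute score reaches `threshold`.

```python
from typing import List, Tuple

Vector = Tuple[float, ...]


class CentroidClassifier:
    """Toy stand-in for the classifiers on the slides."""

    def fit(self, X: List[Vector], y: List[int]) -> "CentroidClassifier":
        self.centroids = {}
        for label in (0, 1):  # assumes both labels occur in y
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = tuple(sum(c) / len(rows) for c in zip(*rows))
        return self

    def score(self, x: Vector) -> float:
        """Positive favors label 1; the magnitude acts as confidence."""
        dist = {label: sum((a - b) ** 2 for a, b in zip(x, c))
                for label, c in self.centroids.items()}
        return dist[0] - dist[1]


def co_train(labeled, unlabeled, iterations=5, threshold=1.0):
    """labeled: (outer, inner, label) triples; unlabeled: (outer, inner)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(iterations):
        for view in (0, 1):  # 0 = outer (context), 1 = inner (phrase)
            if not unlabeled:
                break
            # train on one view of the current labeled set ...
            clf = CentroidClassifier().fit(
                [inst[view] for inst in labeled],
                [inst[2] for inst in labeled])
            # ... then use it to label confident unlabeled instances
            remaining = []
            for inst in unlabeled:
                s = clf.score(inst[view])
                if abs(s) >= threshold:
                    labeled.append((inst[0], inst[1], 1 if s > 0 else 0))
                else:
                    remaining.append(inst)
            unlabeled = remaining
    return labeled, unlabeled


seed = [((0.0,), (0.0,), 0), ((1.0,), (1.0,), 1)]
pool = [((0.9,), (0.5,)), ((0.5,), (0.1,)), ((0.5,), (0.5,))]
labeled_out, unlabeled_out = co_train(seed, pool, iterations=5, threshold=0.3)
```

The point of the two views is visible in the toy data: the first pool instance is labeled from its context, the second from its phrase features after the inner classifier has been retrained, and the ambiguous third instance stays unlabeled.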

  15. Deterministic labeling of potential paraphrases
     • Labeling similar phrases as positive
     • Document 1: “A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m. Wednesday in the waters off Hualian, eastern Taiwan, with no immediate reports of casualties or property damage, the Central Weather Bureau (CWB) said. The quake's epicenter was 76 kilometers southeast of Hualien, according to the CWB.”
     • Document 2: “A strong undersea earthquake hit eastern Taiwan Wednesday, and there are no immediate reports of damage or casualties, according to reports from Taipei. The earthquake registering 6.0 on the Richter scale struck at 11:24 a.m. local time (0324 GMT), was about 76 km southeast of Hualien on the eastern coast, at a depth of 4 km, Taiwan's Central Weather Bureau said in a statement.”

  16. Deterministic labeling of potential paraphrases
     • Negative examples are also labeled:
       • in the first iteration (single words): words that do not have similar gloss values
       • not used in subsequent iterations
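A rough sketch of this seeding step is given below. The slides do not define "similar", so two simplifications are assumed and should be read as such: a phrase pair is labeled positive when both sides have the identical lemma sequence, and a pair of single words is labeled negative when both words have glosses and the gloss sets do not overlap. The `glosses` lookup and the function name `seed_labels` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Phrase = Tuple[str, ...]  # a phrase as a tuple of lemmas


def seed_labels(phrases_a: List[Phrase], phrases_b: List[Phrase],
                glosses: Dict[str, Set[str]]):
    """Return (positives, negatives) as lists of phrase pairs."""
    positives, negatives = [], []
    for pa in phrases_a:
        for pb in phrases_b:
            if pa == pb:
                # assumption: identical lemma sequences count as "similar"
                positives.append((pa, pb))
            elif len(pa) == 1 and len(pb) == 1:
                ga = glosses.get(pa[0], set())
                gb = glosses.get(pb[0], set())
                # assumption: disjoint, non-empty gloss sets => negative
                if ga and gb and not (ga & gb):
                    negatives.append((pa, pb))
    return positives, negatives


phrases_a = [("earthquake",), ("richter", "scale")]
phrases_b = [("richter", "scale"), ("taiwan",), ("quake",)]
glosses = {
    "earthquake": {"earthquake", "quake"},
    "quake": {"quake", "tremor"},
    "taiwan": {"Taiwan"},
}
positives, negatives = seed_labels(phrases_a, phrases_b, glosses)
```

Pairs such as ("earthquake", "quake") end up in neither set, which leaves them for the classifiers to decide in later iterations.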

  17. The outer (context) classifier: features. Using SVM, quadratic kernel
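For readers unfamiliar with the term, a quadratic kernel is a polynomial kernel of degree 2. The slide does not give its exact form, so the inhomogeneous variant `(x·y + c)^2` with `c = 1` below is an assumption, as are the function names. The sketch also checks, for two-dimensional inputs, that the kernel equals an ordinary dot product in an expanded feature space that includes all pairwise feature products.

```python
import math
from typing import Sequence


def quadratic_kernel(x: Sequence[float], y: Sequence[float], c: float = 1.0) -> float:
    """Degree-2 polynomial kernel: (x . y + c) ** 2."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** 2


def explicit_map_2d(x: Sequence[float]) -> list:
    """Explicit feature map matching quadratic_kernel with c = 1
    for two-dimensional inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]


x, y = (1.0, 2.0), (3.0, 4.0)
k = quadratic_kernel(x, y)
dot = sum(a * b for a, b in zip(explicit_map_2d(x), explicit_map_2d(y)))
```

The practical consequence is that the classifier can weigh interactions between pairs of features without those products ever being computed explicitly.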

  18. The inner (phrase) classifier Features

  19. Experiments & results
     • Arabic
     • 240 document pairs (165K words)
     • 5 iterations

  20. Experiments & results

  21. Evaluation
     • 2 native speakers
     • Pairs are provided with their context
     • 4 labels:
       • paraphrases
       • entailment (e.g. a magnitude 6.0 earthquake → the quiver)
       • related (e.g. San Diego ~ Los Angeles)
       • wrong (e.g. a poor and little-developed province ≠ its resource-rich northwestern province)

  22. Manual evaluation

  23. Inner classifier, morphological features
     • Tested on 40 document pairs
     • Evaluation of 200 pairs

  24. Conclusions
     • We will try to better understand the effect of the morphological features on Arabic
     • Use the paraphrases to improve an Arabic-English translation system

  25. Thank you kfirbar@post.tau.ac.il

  26. Manual evaluation English

  27. Experiments & results English

  28. Co-training
     • was 76 kilometers southeast of Hualien according to the
     • about 76 km southeast of Hualien on the eastern
     [Figure labels over each fragment: Outer (Context), Inner (Phrase), Outer (Context)]

  29. Manual evaluation Arabic

  30. Experiments & results Arabic
