Deriving Paraphrases for Highly Inflected Languages from Comparable Documents
Kfir Bar, Nachum Dershowitz
Tel Aviv University, Israel
Paraphrases
• Phrase level:
  I spilled the beans and told Jacky I loved her
  I exposed my secret about my personal life
• Sentence level:
  Beijing’s policy toward Taiwan remains unchanged
  China did not change its policy toward Taiwan
Motivation?
• MT coverage problem
[Figure (Arabic): covered n-grams vs. parallel corpus size]
Related work on paraphrasing
• Continuing our previous work on Arabic synonyms (Bar and Dershowitz, AMTA, 2010)
• Using parallel corpus (Callison-Burch et al., 2006)
• Using monolingual corpus (Marton et al., 2009)
• Using comparable documents (Wang and Callison-Burch, 2011)
Why Arabic?
• Being a Semitic language, Arabic is highly inflected
• Example: "and she learns it" = وتدرسها, a single word that carries a conjunction, a direct object, and a root combined with a pattern
Extracting paraphrases
• Inspired by: Extracting Paraphrases from a Parallel Corpus, Regina Barzilay and Kathleen R. McKeown (2001)
• Working on Arabic comparable documents
Preparing the corpus
• Using Arabic Gigaword, we automatically paired documents that:
  • were published on the same day
  • maximize the cosine similarity over the lemma-frequency vectors
[Figure: AFP and XIN documents dated 24.12.2002 to 27.12.2002, paired by maximum cosine similarity]
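The pairing step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the document tuples, the IDs, and the greedy one-directional matching (each AFP document picks its best same-day XIN document) are assumptions, and the slides do not say whether a minimum similarity threshold was applied.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two lemma-frequency vectors."""
    dot = sum(count * b[lemma] for lemma, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_documents(afp_docs, xin_docs):
    """Each document is a hypothetical (doc_id, date, lemmas) tuple.

    For every AFP document, keep the XIN document published on the same
    day that has the highest cosine similarity (assumed greedy matching).
    """
    pairs = []
    for afp_id, afp_date, afp_lemmas in afp_docs:
        afp_vec = Counter(afp_lemmas)
        best_id, best_score = None, 0.0
        for xin_id, xin_date, xin_lemmas in xin_docs:
            if xin_date != afp_date:
                continue  # only same-day documents are candidates
            score = cosine(afp_vec, Counter(xin_lemmas))
            if score > best_score:
                best_id, best_score = xin_id, score
        if best_id is not None:
            pairs.append((afp_id, best_id, best_score))
    return pairs


# Toy data with English stand-ins for Arabic lemmas.
afp_docs = [("afp1", "2002-12-24", ["earthquake", "taiwan", "hit"])]
xin_docs = [
    ("xin1", "2002-12-24", ["earthquake", "taiwan", "strong"]),
    ("xin2", "2002-12-24", ["election", "vote"]),
    ("xin3", "2002-12-25", ["earthquake", "taiwan", "hit"]),
]
pairs = pair_documents(afp_docs, xin_docs)
# afp1 is paired with xin1; xin3 is identical in content but was
# published on a different day, so it is never considered.
```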
Preparing the corpus
• 690 document pairs
• Manual evaluation by two Arabic speakers:
  • randomly selected 120 document pairs
  • question: “Do both documents discuss the same event?”
Preprocessing
• AMIRAN [Diab et al., to appear] is a tool for finding context-sensitive morpho-syntactic information:
  • Segmentation
  • Diacritized lemma
  • Stem
  • Full part-of-speech tag
  • Base-phrase tag
  • Named-entity-recognition (NER) tag
Extracting paraphrases: co-training technique
[Diagram: alignment (✗) → paraphrases (✗); extracting pairs of phrases → iterations of co-training (context <-> phrase) → paraphrases (✔)]
Extracting pairs of phrases
Example sentences:
A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m.
A strong undersea earthquake hit eastern Taiwan Wednesday
• Phrases:
  • contain at least one non-functional word
  • do not break a base phrase in the middle
[Figure: each sentence split into its words (A, magnitude, 6.0, earthquake, on, the, Richter, scale, occurred, at, 11:24, a.m.; A, strong, undersea, earthquake, hit, eastern, Taiwan, Wednesday, …)]
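The two phrase constraints can be illustrated with a short sketch. It assumes BIO-style base-phrase tags and a per-token function-word flag (in the talk these would come from the preprocessing output); the `Token` type, the `max_len` cap, and the hand-assigned tags are hypothetical and only serve the example.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    is_function: bool  # True for function words such as articles
    chunk: str         # assumed BIO base-phrase tag, e.g. "B-NP", "I-NP"


def candidate_phrases(tokens: List[Token], max_len: int = 4) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, that
    - contain at least one non-function word, and
    - neither start nor end in the middle of a base phrase."""
    spans = []
    n = len(tokens)
    for start in range(n):
        if tokens[start].chunk.startswith("I-"):
            continue  # would start inside a base phrase
        for end in range(start + 1, min(n, start + max_len) + 1):
            if end < n and tokens[end].chunk.startswith("I-"):
                continue  # would cut a base phrase before it finishes
            if all(t.is_function for t in tokens[start:end]):
                continue  # needs at least one non-function word
            spans.append((start, end))
    return spans


sentence = [
    Token("A", True, "B-NP"), Token("strong", False, "I-NP"),
    Token("undersea", False, "I-NP"), Token("earthquake", False, "I-NP"),
    Token("hit", False, "B-VP"),
    Token("eastern", False, "B-NP"), Token("Taiwan", False, "I-NP"),
    Token("Wednesday", False, "B-NP"),
]
phrases = [" ".join(t.word for t in sentence[s:e])
           for s, e in candidate_phrases(sentence)]
# phrases:
# ['A strong undersea earthquake', 'hit', 'hit eastern Taiwan',
#  'hit eastern Taiwan Wednesday', 'eastern Taiwan',
#  'eastern Taiwan Wednesday', 'Wednesday']
```

Candidate pairs would then be formed by combining the phrases of one document with those of its paired document.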
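Co-training
dEA xAfyyr swlAnA Almnsq Al>Ely llsyAsp AlxArjyp wAl>mnyp
(roughly: "Javier Solana, the high coordinator for foreign and security policy, called")
dEA xAfyyr swlAnA Almmvl Al>ElY llsyAsp AlxArjyp fy
(roughly: "Javier Solana, the high representative for foreign policy, called ... in")
[Figure: each sentence is divided into Outer (Context) | Inner (Phrase) | Outer (Context)]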
Extracting paraphrases
We maintain two sets of instances:
• labeled: positive = paraphrases, negative = NOT paraphrases
• unlabeled
Single iteration
[Diagram: working on the Labeled and Unlabeled sets: (1) Training Outer, (2) Using Outer, (3) Training Inner, (4) Using Inner; Aggregation of positive + negative instances using a confidence score; Deterministic labeling; paraphrases; then the next iteration]
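One way to read the iteration is the loop sketched below. It is a simplified illustration under stated assumptions: the memorizing "classifier" is a toy stand-in for the real outer and inner classifiers, the instance representation (a context key and a phrase-pair key) and the 0.9 confidence threshold are hypothetical, and the deterministic labeling step is left out. The default of 5 iterations mirrors the number reported in the experiments.

```python
from typing import Callable, Dict, Iterable, Tuple

Instance = Tuple[str, str]            # (context key, phrase-pair key), hypothetical
Scorer = Callable[[Instance], float]  # confidence that an instance is a paraphrase
Trainer = Callable[[Dict[Instance, bool]], Scorer]


def make_view_trainer(view: int) -> Trainer:
    """Toy stand-in for a real classifier: memorizes which values of one
    view (0 = outer/context, 1 = inner/phrase) occurred with which label."""
    def train(labeled: Dict[Instance, bool]) -> Scorer:
        pos = {inst[view] for inst, is_para in labeled.items() if is_para}
        neg = {inst[view] for inst, is_para in labeled.items() if not is_para}

        def score(inst: Instance) -> float:
            value = inst[view]
            if value in pos and value not in neg:
                return 1.0
            if value in neg and value not in pos:
                return 0.0
            return 0.5  # unseen or ambiguous: no confident decision
        return score
    return train


def co_train(labeled: Dict[Instance, bool], unlabeled: Iterable[Instance],
             iterations: int = 5, threshold: float = 0.9) -> Dict[Instance, bool]:
    labeled = dict(labeled)
    unlabeled = set(unlabeled)
    train_outer, train_inner = make_view_trainer(0), make_view_trainer(1)
    for _ in range(iterations):
        # Steps 1-2 (train/use outer), then steps 3-4 (train/use inner).
        for train in (train_outer, train_inner):
            score = train(labeled)
            newly = {}
            for inst in unlabeled:
                confidence = score(inst)
                if confidence >= threshold:
                    newly[inst] = True
                elif confidence <= 1.0 - threshold:
                    newly[inst] = False
            # Aggregate confident positive and negative instances.
            labeled.update(newly)
            unlabeled.difference_update(newly)
    return labeled


seed = {("ctxA", "p1"): True, ("ctxB", "p2"): False}
pool = [("ctxA", "p3"), ("ctxC", "p3"), ("ctxB", "p4"), ("ctxD", "p5")]
result = co_train(seed, pool)
```

In the toy run, the context view first labels ("ctxA", "p3") and ("ctxB", "p4"); the phrase view then uses the newly labeled "p3" to label ("ctxC", "p3"), which is the kind of hand-off between the two views that co-training relies on. ("ctxD", "p5") shares nothing with the labeled set and stays unlabeled.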
Deterministic labeling of potential paraphrases
• Labeling similar phrases as positive

A magnitude 6.0 earthquake on the Richter scale occurred at 11:24 a.m. Wednesday in the waters off Hualian, eastern Taiwan, with no immediate reports of casualties or property damage, the Central Weather Bureau (CWB) said. The quake's epicenter was 76 kilometers southeast of Hualien, according to the CWB.

A strong undersea earthquake hit eastern Taiwan Wednesday, and there are no immediate reports of damage or casualties, according to reports from Taipei. The earthquake registering 6.0 on the Richter scale struck at 11:24 a.m. local time (0324 GMT), was about 76 km southeast of Hualien on the eastern coast, at a depth of 4 km, Taiwan's Central Weather Bureau said in a statement.
Deterministic labeling of potential paraphrases
• Negative examples are also labeled:
  • in the first iteration (single words): words that don’t have similar gloss values
  • not used in subsequent iterations
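A possible reading of the deterministic labeling rules is sketched here. The slides do not define "similar", so both criteria are assumptions: identical lemma sequences are treated as positive, and a pair of single words whose gloss sets are disjoint is treated as negative. The gloss dictionary and the English stand-in lemmas are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def deterministic_label(phrase_a: Tuple[str, ...],
                        phrase_b: Tuple[str, ...],
                        glosses: Dict[str, Set[str]]) -> Optional[bool]:
    """Return True (positive), False (negative), or None (left unlabeled).

    Assumed rules:
    - identical lemma sequences count as "similar" and are positive;
    - two single words with known, non-overlapping glosses are negative.
    """
    if phrase_a == phrase_b:
        return True
    if len(phrase_a) == 1 and len(phrase_b) == 1:
        gloss_a = glosses.get(phrase_a[0], set())
        gloss_b = glosses.get(phrase_b[0], set())
        if gloss_a and gloss_b and not (gloss_a & gloss_b):
            return False
    return None


# English stand-ins for Arabic lemmas and their glosses.
glosses = {
    "quake": {"earthquake", "tremor"},
    "earthquake": {"earthquake"},
    "province": {"province", "region"},
}
same = deterministic_label(("earthquake",), ("earthquake",), glosses)
different = deterministic_label(("quake",), ("province",), glosses)
undecided = deterministic_label(("quake",), ("earthquake",), glosses)
# same is True, different is False, undecided is None
```

Pairs left as `None` would stay in the unlabeled set for the classifiers to decide.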
The outer (context) classifier
• Features
• Using SVM, quadratic kernel
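The slide names a quadratic kernel without giving its exact form. A common choice, assumed here, is the degree-2 polynomial kernel K(x, y) = (x · y + c)^2; the offset `coef0` and the example vectors are illustrative only.

```python
from typing import Sequence


def quadratic_kernel(x: Sequence[float], y: Sequence[float], coef0: float = 1.0) -> float:
    """Degree-2 polynomial kernel (assumed form): (x . y + coef0) ** 2."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** 2


value = quadratic_kernel([1.0, 2.0], [3.0, 4.0])
# (1*3 + 2*4 + 1) ** 2 = 144.0
```

A kernel of this kind lets the SVM weigh pairwise combinations of features rather than only individual features.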
The inner (phrase) classifier Features
Experiments & results
• Arabic
• 240 document pairs (165K words)
• 5 iterations
Evaluation
• 2 native speakers
• Pairs are provided with their context
• 4 labels:
  • paraphrases
  • entailment (e.g. a magnitude 6.0 earthquake → the quiver)
  • related (e.g. San Diego ~ Los Angeles)
  • wrong (e.g. a poor and little-developed province ≠ its resource-rich northwestern province)
Inner classifier, morphological features
• Tested on 40 document pairs
• Evaluation of 200 pairs
Conclusions
• We will try to better understand the effect of the morphological features on Arabic
• We plan to use the paraphrases to improve an Arabic-English translation system
Thank you kfirbar@post.tau.ac.il
Manual evaluation English
Experiments & results English
Co-training
Outer (Context): was 76 kilometers | Inner (Phrase): southeast of Hualien | Outer (Context): according to the
Outer (Context): about 76 km | Inner (Phrase): southeast of Hualien | Outer (Context): on the eastern
Manual evaluation Arabic
Experiments & results Arabic