A Monolingual Tree-based Translation Model for Sentence Simplification
Zhemin Zhu, UKP, TU Darmstadt, Germany
Delphine Bernhard, LIMSI-CNRS, France
Iryna Gurevych, UKP, TU Darmstadt, Germany
COLING 2010, Beijing, China
Presenter: Zhemin Zhu
24.08.2010 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Zhemin Zhu
An Example of Sentence Simplification
This month was originally named Sextilis in Latin, because it was the sixth month in the original [ten-month] Roman calendar under Romulus in 753 BC, when March was the first month of the year. -- Wikipedia
This month was first called Sextilis in Latin, because it was the sixth month in the old Roman calendar. The Roman calendar began in March about 735 BC with Romulus. -- Simple Wikipedia
Sentence Simplification Targeted at Humans
Reading and Speech Assistance
• People with Comprehension Disabilities [Carroll et al., 1999; Inui et al., 2003]
• Low-literacy People [Watanabe et al., 2009]
• Non-native Speakers [Siddharthan, 2002]
• Children
Sentence Simplification Targeted at NLP Applications
• Question Generation [Heilman and Smith, 2009]
• Relation Extraction [Miwa et al., COLING 2010]
• Information Extraction [Jonnalagadda and Gonzalez, 2009]
• Robot Command [Young KY and Liu SH, 2002]
• Parsing and Translation [Chandrasekar et al., 1996]
• Summarization [Knight and Marcu, 2000]
• Sentence Fusion [Filippova and Strube, 2008b]
• Semantic Role Labeling [Vickrey and Koller, 2008]
What Makes a Sentence Difficult?
This month was originally named Sextilis in Latin, because it was the sixth month in the original ten-month Roman calendar under Romulus in 753 BC, when March was the first month of the year. -- Wikipedia
• Difficult Vocabulary → Vocabulary (Word/Phrase) Substitution
• Complex Syntax
  • Length → Splitting, Dropping
  • Order → Reordering, such as passive and active
• Simplification operations: Splitting, Dropping, Reordering and Substitution
Simplification Operation: Sentence Splitting
August is the eighth month of the year in the Gregorian Calendar and one of seven Gregorian months with a length of 31 days. -- Wikipedia
August is the eighth month of the year. It has 31 days. -- Simple Wikipedia
Simplification Operation: Dropping
April is the fourth month of the year [in the Gregorian Calendar, and one of four months] with [a length of] 30 days. -- Wikipedia
April is the fourth month of the year with 30 days. -- Simple Wikipedia
Simplification Operation: Reordering
Mr. Anthony, who runs an employment agency, decries program trading, but he isn't sure it should be strictly regulated. -- [Siddharthan, 2006]
Mr. Anthony decries program trading. Mr. Anthony runs an employment agency. But he isn't sure it should be strictly regulated. -- [Siddharthan, 2006]
Simplification Operation: Substitution
The traditional etymology is from the Latin aperire, "to open," in allusion to its being the season when trees and flowers begin to "open," which is supported by comparison with the modern Greek use of ἁνοιξις (opening) for spring. -- Wikipedia
The name April comes from that Latin word aperire which means "to open". -- Simple Wikipedia
Motivation
• Most existing methods cover only one simplification operation:
  • [Siddharthan, 2006] and [Petersen and Ostendorf, 2007]: Splitting
  • Sentence Compression: Dropping
  • [Carroll et al., 1999]: Word Substitution
• In most cases, several simplification operations occur in the same sentence.
• The different simplification operations therefore need to be modeled jointly, in a single model.
Our Contributions
• TSM (Tree-based Simplification Model): the first statistical model for sentence simplification
  • Covers splitting, dropping, reordering and word/phrase substitution in one integrated model
  • Builds on successful parsing and translation techniques
• An Efficient Training Method for TSM
  • Sped up by monolingual word mapping
• PWKP: Parallel Complex-Simple Dataset
  • Obtained from Wikipedia and Simple Wikipedia
Tree-based Simplification Model: TSM
Parse Trees of Complex Sentences → Simple Sentences
Probabilistic Model: EM Training
Parallel Complex-Simple Dataset: PWKP
• Paired articles from Wikipedia and Simple Wikipedia
• Article Pairing: following the "language links"
• Plain Text Extraction: JWPL [Zesch et al., 2008]
• Pre-processing: sentence boundary detection and tokenization with the Stanford Parser package [Klein and Manning, 2003], lemmatization with the TreeTagger [Schmid, 1994]
• Monolingual Sentence Alignment: sentence-level TF*IDF [Nelken and Shieber, 2006]
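The sentence alignment step above can be illustrated with a small sketch. It is only an approximation of sentence-level TF*IDF alignment, not the authors' implementation: the tokenizer, the smoothed IDF formula, the `threshold` value and the function names (`tfidf_vectors`, `cosine`, `align`) are all assumptions made for this example, and lemmatization is omitted.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase and keep alphanumeric tokens (no lemmatization in this sketch)."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_vectors(sentences):
    """Treat every sentence as a document; return one sparse TF*IDF dict each."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        # Smoothed IDF (an assumption of this sketch) keeps all weights positive.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(complex_sents, simple_sents, threshold=0.2):
    """For each simple sentence, pick the most similar complex sentence.

    Returns (simple_index, complex_index, score) triples whose score reaches
    the threshold. The threshold value is hypothetical.
    """
    vectors = tfidf_vectors(complex_sents + simple_sents)
    c_vecs = vectors[:len(complex_sents)]
    s_vecs = vectors[len(complex_sents):]
    pairs = []
    for si, sv in enumerate(s_vecs):
        scores = [cosine(sv, cv) for cv in c_vecs]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((si, best, scores[best]))
    return pairs
```

Aligning from the simple side lets several simple sentences map to the same complex sentence, which is what a split produces.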
Parallel Complex-Simple Dataset: PWKP
Table 1: Monolingual Sentence Alignment
Table 2: Statistics for the PWKP dataset
TSM: Splitting Example
Complex Sentence: August was the sixth month in the ancient Roman calendar which started in 735BC.
TSM: Splitting
• Question 1: Where to split the sentence?
  • Step 1: Segmentation
• Question 2: How to make the split sentences complete and grammatical?
  • Step 2: Completion
TSM: Splitting
Step 1: Segmentation
Table 3: Segmentation Feature Table (SFT)
TSM: Splitting
Step 2: Completion
• Should the "which" be dropped?
Table 4: Border Drop Feature Table (BDFT)
TSM: Splitting
Step 2: Completion
• Which parts should be copied?
• Where to put these parts in the new sentences?
Table 5: Copy Feature Table (CFT)
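The two splitting steps can be traced on the example sentence with a deterministic toy sketch. This is not TSM itself: TSM works on parse trees and takes each decision from its probability tables, whereas here the split position, the border-drop decision and the copied span are simply passed in by hand. The function names and the token-index interface are hypothetical.

```python
def finish(tokens):
    """Drop sentence-final periods, capitalize the first word and add a period."""
    words = [t for t in tokens if t != "."]
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."


def split_sentence(tokens, border_index, drop_border, copied_span):
    """Toy version of segmentation followed by completion.

    border_index: position of the border word (the segmentation decision).
    drop_border:  whether the border word is removed (a completion decision).
    copied_span:  (start, end) token range copied to the front of the new
                  sentence (a completion decision).
    """
    first = tokens[:border_index]
    second = tokens[border_index:]
    if drop_border:
        second = second[1:]
    start, end = copied_span
    second = tokens[start:end] + second
    return finish(first), finish(second)


tokens = ("August was the sixth month in the ancient Roman calendar "
          "which started in 735BC .").split()
border = tokens.index("which")
first, second = split_sentence(tokens, border, drop_border=True,
                               copied_span=(6, border))
print(first)
print(second)
```

Here the border word "which" is dropped and the noun phrase "the ancient Roman calendar" is copied to the start of the second sentence, which answers the three completion questions for this one example.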
TSM: Dropping & Reordering
Table 6: Dropping Feature Table (DFT)
Table 7: Reordering Feature Table (RFT)
TSM: Word/Phrase Substitution
• Word Substitution: terminal nodes
• Phrase Substitution: non-terminal nodes
Table 8: Substitution Feature Table (SubFT)
Speeding up
• Unpromising candidates are filtered out at an early stage of training, using monolingual word mapping.
Experiments
• Testing dataset: 100 complex sentences and 131 parallel simple sentences from PWKP
• Baseline systems:
  • Moses: state-of-the-art phrase-based SMT
  • Compression [Filippova and Strube, 2008a]
  • Compression + Substitution
    • Substitution: WordNet + frequency in Simple Wikipedia articles
  • Compression + Substitution + Splitting
    • Splitting: split at conjunctions and relatives
Experiments: Basic Statistics
Experiments: Translation Assessment
Experiments: Readability Assessment
PE: Plain English
Grade: School Year
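To show what a readability score measures, here is a minimal sketch of one well-known formula, Lix: the average sentence length plus the percentage of words longer than six letters. The surviving slide text does not name the formulas used in the experiments, so choosing Lix is an assumption made for this illustration, as are the regex-based word and sentence counting and the function name `lix`.

```python
import re


def lix(text):
    """Lix = words / sentences + 100 * long_words / words (long: > 6 letters)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100.0 * len(long_words) / len(words)


complex_text = ("August is the eighth month of the year in the Gregorian "
                "Calendar and one of seven Gregorian months with a length "
                "of 31 days.")
simple_text = "August is the eighth month of the year. It has 31 days."

print(lix(complex_text))
print(lix(simple_text))
```

A lower Lix value indicates easier text, so on the splitting example from the earlier slide the Simple Wikipedia version scores lower than the Wikipedia version.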
Conclusions
Moses is not good at simplification tasks.
BLEU and NIST are not good evaluation metrics for sentence simplification systems.
TSM achieves the best overall readability scores.
We contributed the PWKP dataset: http://www.ukp.tu-darmstadt.de/software-data/data/quality-assessment/
Future Work
More sophisticated features and rules to improve TSM
Extend TSM's expressiveness to model more complex transformations: synchronous syntax is a promising direction
Evaluation methods for simplification systems: Readability Assessment
Thank you for your interest! Comments and questions?
Backup: Training
EM algorithm:
Training(dataset) {
    initialize all probability tables using the uniform distribution;
    for (several iterations) {
        reset all cnt = 0;
        for (each sentence pair <c, s> in dataset) {
            tt = buildTrainingTree(<c, s>);
            calcInsideProb(tt);
            calcOutsideProb(tt);
            update cnt for each conditioning feature in each node of tt:
                cnt = cnt + node.insideProb * node.outsideProb / root.insideProb;
        }
        updateProbability();
    }
}
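The count update and the final `updateProbability()` step of the pseudocode can be sketched as follows. This is a simplified reading, not the authors' code: the slide does not say how counts are keyed or normalized, so the `(feature, outcome)` keys, the per-feature normalization and the names `expected_count` and `m_step` are assumptions, and the feature and outcome labels in the usage lines are made up.

```python
from collections import defaultdict


def expected_count(inside, outside, root_inside):
    """Expected contribution of one node: inside * outside / root inside."""
    return inside * outside / root_inside


def m_step(counts):
    """Normalize expected counts into P(outcome | feature).

    counts maps (feature, outcome) pairs to accumulated expected counts;
    this key layout is an assumption of the sketch.
    """
    totals = defaultdict(float)
    for (feature, _), value in counts.items():
        totals[feature] += value
    return {(feature, outcome): value / totals[feature]
            for (feature, outcome), value in counts.items()
            if totals[feature] > 0}


# Accumulate counts for one hypothetical conditioning feature, then re-estimate.
counts = defaultdict(float)
counts[("NP", "drop")] += expected_count(0.2, 0.5, 0.4)
counts[("NP", "keep")] += expected_count(0.6, 0.5, 0.4)
probabilities = m_step(counts)
print(probabilities)
```

Dividing by the root's inside probability turns each product into a posterior weight for the node, so the accumulated counts behave like fractional observations.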