Evaluation of a Cross-lingual Romanian-English Multi-document Summariser Constantin Orasan and Oana Andreea Chiorean Research Group in Computational Linguistics University of Wolverhampton Wolverhampton, United Kingdom
Structure • Introduction • The summarisation method used • Evaluation of Romanian summaries • Evaluation of English summaries • Conclusions & future work
Introduction • Automatic summarisation (AS) offers a way to access large amounts of information by giving the reader its gist • Machine translation (MT) offers a way to access information in a language not known by the reader • AS + MT = Cross-lingual multi-document summarisation • We investigate whether cross-lingual multi-document summarisation offers a good way for English speakers to access information in Romanian
The summarisation method • The method used to produce summaries is Maximal Marginal Relevance (MMR) (Goldstein et al., 2000) • The method works on clusters of related documents linked to a user topic • ht://dig was used to extract these clusters • Snippets of up to 10,000 characters were used, and only the first 50 snippets returned were kept
Maximal Marginal Relevance (MMR) • Chosen because it requires very few language-dependent tools: a sentence splitter, a tokenizer and a stoplist • The formula used has two components: • One maximises the similarity to the user topic • The other minimises the redundant information in the summary • A factor λ controls the influence of each of these components • The summary is built in an iterative process
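The iterative selection described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not give the exact formula, so the common form score = λ · sim(sentence, topic) − (1 − λ) · max sim(sentence, already selected) is assumed. The function names (`terms`, `cosine`, `mmr_summary`), the use of raw term frequencies instead of TF*IDF, and the character-based length limit are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def terms(text, stoplist=frozenset(), truncate=6):
    """Lowercase and tokenise, drop stoplist words, cut each word to `truncate` chars."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w[:truncate] for w in words if w not in stoplist)


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mmr_summary(sentences, topic, lam=0.6, max_chars=2000, stoplist=frozenset()):
    """Greedy MMR: repeatedly add the sentence with the best trade-off between
    similarity to the topic and redundancy with what is already selected.
    Assumed scoring form; raw term frequency stands in for TF*IDF."""
    vectors = [terms(s, stoplist) for s in sentences]
    query = terms(topic, stoplist)
    selected, remaining, length = [], list(range(len(sentences))), 0

    def score(i):
        redundancy = max((cosine(vectors[i], vectors[j]) for j in selected), default=0.0)
        return lam * cosine(vectors[i], query) - (1 - lam) * redundancy

    while remaining:
        best = max(remaining, key=score)
        if length + len(sentences[best]) > max_chars:
            break  # the next sentence would exceed the summary length limit
        selected.append(best)
        remaining.remove(best)
        length += len(sentences[best])
    return [sentences[i] for i in selected]
```

With λ close to 1 the selection favours sentences similar to the topic even when they repeat each other; lower values penalise repetition more strongly.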
Evaluation • For both Romanian and English summaries a task-based method was used • A corpus of Romanian articles published between 2001 and 2005 was built • 5 topics were selected: • ARDAF wants to pay to stop Petrovschi scandal • Basescu forms the government with UDMR and PUR • American bases in Romania • Flat-tax rate from 1st of January 2005 • Romanian journalists kidnapped in Iraq
Evaluation (II) • Multiple-choice questions were manually produced without looking at the produced summaries • Judges had to answer the questions on the basis of the summaries given and not their knowledge about the events • The quality of a summary was given by the number of correctly answered questions • An “I don’t know” answer was added so that the judges would not try to guess the answer • Coherence was marked on a scale from 1 to 5.
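The scoring rule above (quality = number of correctly answered questions, with “I don’t know” kept separate from wrong guesses) can be sketched as below. The function name `score_judge`, the dictionary layout and the question identifiers are hypothetical; the slides do not describe how the answers were stored.

```python
DONT_KNOW = "I don't know"  # assumed label for the extra answer option


def score_judge(answers, answer_key):
    """Count correct, wrong and "I don't know" answers for one judge on one summary.

    answers:    {question_id: answer chosen by the judge}
    answer_key: {question_id: correct answer}
    """
    correct = sum(1 for q, a in answers.items() if a == answer_key[q])
    unknown = sum(1 for a in answers.values() if a == DONT_KNOW)
    wrong = len(answers) - correct - unknown
    return {"correct": correct, "wrong": wrong, "unknown": unknown}


# Hypothetical usage with two questions from the examples slide.
key = {"q1": "PUR and UDMR", "q2": "Yes"}
judge = {"q1": "PUR and UDMR", "q2": DONT_KNOW}
result = score_judge(judge, key)
```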
Examples of questions • With which parties is Basescu hoping to achieve a parliamentary majority? • PUR and UDMR • PSD and PRM • PUR and PSD • UDMR and PSD • I don’t know • Is NATO interested in establishing military bases in Romania? • Yes • No • I don’t know
Evaluation of Romanian summaries • Summaries evaluated: • Baseline: the first sentence of each retrieved article, added until the desired length limit was reached • “Perfect summaries”: human-produced summaries • 4 versions of MMR: λ = 0.5 and 0.6, with word truncation to 5 and 6 characters • Summaries of about 2000 characters including whitespace • 60 judges; each summary was evaluated by 10 different people
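A rough sketch of the baseline and of the word truncation mentioned above, under stated assumptions: the sentence splitter is a naive regular expression (the slides do not say which splitter was used), the function names are hypothetical, and the Romanian words in the usage lines are illustrative examples only.

```python
import re


def first_sentence(article):
    """Return the first sentence, splitting naively after '.', '!' or '?'."""
    return re.split(r"(?<=[.!?])\s+", article.strip(), maxsplit=1)[0]


def baseline_summary(articles, max_chars=2000):
    """Concatenate the first sentence of each article, in retrieval order,
    stopping before the character limit (whitespace included) is exceeded."""
    chosen, length = [], 0
    for article in articles:
        sentence = first_sentence(article)
        extra = len(sentence) + (1 if chosen else 0)  # +1 for the joining space
        if length + extra > max_chars:
            break
        chosen.append(sentence)
        length += extra
    return " ".join(chosen)


def truncate_words(text, n):
    """Crude stemming substitute: keep only the first n characters of each word."""
    return [w[:n] for w in re.findall(r"\w+", text.lower())]


# Illustrative usage: two inflected forms collapse to the same truncated term.
stems = truncate_words("Guvernul guvernului", 6)
summary = baseline_summary(["First one. More text.", "Second one! Extra.", "Third one? Yes."], 25)
```

Truncation is attractive here because it needs no language-specific stemmer, at the cost of occasionally merging unrelated words that share a prefix.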
Evaluation results • 1 – Human summaries • 2 – Baseline • 3 to 6 – MMR variants • The best results for automatic summaries were obtained with: • truncation to 6 characters • λ = 0.6 • a stoplist • TF*IDF
Evaluation of English summaries • The summaries produced by the best method were automatically translated using eTranslator • The questions and their answers were manually translated • 29 judges answered 414 questions
Evaluation of English summaries • On all the topics the number of correctly answered questions decreased • Attempts to identify whether a category of questions (Yes/No questions, questions whose answer is a number) could be answered better than others did not reveal any pattern • Feedback from the judges indicated that even though they could locate the answer to a question, in many cases they could not understand the whole summary due to poor translation
Poor translation • Romanian source: In momentul de fata, Belu cistiga aproximativ 1.000 de euro pe luna, in timp ce Bitang - 500 de euro. • Machine translation: In the girlish moment, Belu gains about 1.000 of euro on month, in while Bitang - 500 of euro. • Romanian source: Ministrul Mircea Pascu a declarat ieri ca instalarea unor baze americane in Romania este o consecinta a statului nostru de viitor membru al NATO, aflat la granita Aliantei. • Machine translation: The minister Mircea stated Pascu yesterday as the the of a installation american bases in Romania is stood our of future limb of NATO, finded out to boundary Aliantei.
Poor translation • Judges: • said “The meaning of the texts seemed almost graspable, but just beyond my mental powers.” • compared the texts with a certain character’s speech from ‘The Fast Show’, a British comedy programme. • It seems that the summaries contain important information, but it is highly unlikely that anyone will discover it, because readers will give up reading the summary after one or two sentences
Conclusions & future work • Cross-lingual Romanian-English multi-document summarisation may be an option if the quality of the translation engine improves (try the Romanian-to-English Google translation engine?) • Develop a summarisation method which “guesses” how easy it is to translate a sentence • Translate and evaluate more summaries • Try this approach for other pairs of languages which have better translation engines