
DAISY Dutch lAnguage Investigation of Summarization technologY



Presentation Transcript


  1. DAISY: Dutch lAnguage Investigation of Summarization technologY
     Katholieke Universiteit Leuven, Rijksuniversiteit Groningen, Q-go

  2. DAISY on one slide
     • Segmentation
     • Rhetorical classification
     • Sentence compression
     • Sentence generation
     • Summarization of web content
     • Improving question answering, e.g. e-mail answering
     • Multi-document summarization: detecting differences

  3. Overview
     Report of our current progress in:
     • Corpus building and preprocessing
     • Segmentation
     • Sentence generation

  4. Corpus Building and Preprocessing
     Target: a corpus of questions, short texts and web pages about the same topic
     • Freely available:
        • UWV (questions and answer texts)
        • SVB (questions)
     • Available for internal use: KLM (questions, answer texts, web pages)
     • To do:
        • Web pages SVB
        • ABN AMRO (committed, not delivered)

  5. Corpus Building and Preprocessing
     • POS-tagged and parsed: KLM and UWV
     • SVB corpus: in progress
     • Coreference resolution: in progress

  6. Segmentation
     • Find the main content in a web page
     • Smaller segments
        • Can be obtained from the HTML structure: <H#>, <P>, <BR>, <UL>, ...
        • Hierarchical
        • Will be refined in relation to rhetorical roles
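As a rough illustration of how smaller segments might be read off the HTML structure, the sketch below uses Python's standard `html.parser` to split text at block-level tags and to record the level of the most recent heading as a simple hierarchy cue. The class name `SegmentExtractor`, the chosen tag sets, and the sample markup are assumptions made for illustration; the slides do not describe the project's actual implementation.

```python
from html.parser import HTMLParser


class SegmentExtractor(HTMLParser):
    """Collect (heading_level, kind, text) segments from an HTML fragment."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    BREAK_TAGS = {"p", "br", "ul", "ol", "li", "div"}  # assumed segment boundaries

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buffer = []
        self._level = 0  # level of the most recent heading, 0 if none seen yet

    def _flush(self, kind="text"):
        # Normalise whitespace and store the buffered text as one segment.
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if text:
            self.segments.append((self._level, kind, text))

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._flush()
            self._level = int(tag[1])
        elif tag in self.BREAK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._flush("heading")
        elif tag in self.BREAK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)


def segment(html):
    """Return the list of segments found in `html`."""
    parser = SegmentExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text
    return parser.segments


page = ("<h1>Baggage</h1><p>One bag is free.<br>Extra bags cost more.</p>"
        "<h2>Fees</h2><ul><li>Second bag</li><li>Third bag</li></ul>")
for level, kind, text in segment(page):
    print(level, kind, text)
```

Each segment carries the heading level it falls under, so a later step could group segments into a hierarchy or attach rhetorical roles to them.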

  7. Segmentation

  8. Segmentation

  9. Segmentation
     • Search for the block with the highest density of text
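The slide does not say how text density is measured. One possible reading, sketched below under that caveat, is to score every word positively and every tag negatively, and then pick the contiguous token span with the highest total score (a maximum-sum subarray search). The function names, the tokenizer regex, and the `tag_penalty` value are all assumptions for illustration.

```python
import re

# A tag is "<...>", anything else between whitespace and "<" is a word.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split HTML into a flat list of tag tokens and word tokens."""
    return TOKEN_RE.findall(html)


def densest_span(tokens, tag_penalty=3.0):
    """Return (start, end) of the token slice with the highest density score.

    Words count +1 and tags count -tag_penalty (an assumed weight);
    the best slice is found with Kadane's maximum-subarray scan.
    """
    best_score, best_span = 0.0, (0, 0)
    score, start = 0.0, 0
    for i, token in enumerate(tokens):
        if score <= 0:
            score, start = 0.0, i  # restart the candidate span here
        score += -tag_penalty if token.startswith("<") else 1.0
        if score > best_score:
            best_score, best_span = score, (start, i + 1)
    return best_span


page = ('<div><a href="/">Home</a> <a href="/faq">FAQ</a></div>'
        "<div><p>You may take one bag of at most 23 kilos on board for free.</p></div>"
        '<div><a href="/c">Contact</a></div>')
tokens = tokenize(page)
start, end = densest_span(tokens)
print(" ".join(tokens[start:end]))
```

A span found this way can start or end in the middle of an element, which is one reason a selection of this kind needs further heuristics to extend it to well-formed boundaries.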

  10. Segmentation

  11. Segmentation
     • Additional heuristics to extend the selection:
        • Find closing tags for all tags that were opened in the selection
        • Include all text delimited by known tag patterns occurring just before and after the selection
        • Take the smallest enclosing DIV block
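Two of these heuristics can be sketched over a flat token list, assuming the selection is given as a `(start, end)` token span. The helper names, the tokenizer regex, and the list of void tags are assumptions for illustration; the heuristic based on known tag patterns is not covered here because the slides do not list those patterns.

```python
import re

TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # never closed


def tag_info(token):
    """Return (name, is_closing) for a tag token, or None for text."""
    match = TAG_RE.match(token)
    if not match:
        return None
    return match.group(2).lower(), match.group(1) == "/"


def close_open_tags(tokens, start, end):
    """Extend `end` until every tag opened in tokens[start:end] is closed."""
    stack = []
    i = start
    while i < len(tokens) and (i < end or stack):
        info = tag_info(tokens[i])
        if info:
            name, closing = info
            self_closed = tokens[i].endswith("/>")
            if not closing and name not in VOID_TAGS and not self_closed:
                stack.append(name)
            elif closing and stack and stack[-1] == name:
                stack.pop()
        i += 1
    return start, i


def smallest_enclosing_div(tokens, start, end):
    """Return the span of the innermost <div>...</div> around [start, end)."""
    open_divs, best = [], None
    for i, token in enumerate(tokens):
        info = tag_info(token)
        if not info or info[0] != "div":
            continue
        if not info[1]:
            open_divs.append(i)
        elif open_divs:
            opened = open_divs.pop()
            encloses = opened <= start and i + 1 >= end
            if encloses and (best is None or i + 1 - opened < best[1] - best[0]):
                best = (opened, i + 1)
    return best or (start, end)


tokens = TOKEN_RE.findall(
    '<div id="page"><div id="nav">Home</div>'
    '<div id="main"><p>One <b>free bag</b> per person.</p></div></div>')
span = close_open_tags(tokens, 7, 9)         # selection cut off inside <b>
print(tokens[span[0]:span[1]])
span = smallest_enclosing_div(tokens, *span)
print(tokens[span[0]], "...", tokens[span[1] - 1])
```

Working on tokens rather than a parsed tree keeps the sketch close to a selection that was made by position; with a DOM parser the same ideas would become walks up the tree.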

  12. Sentence generation
     • Specification of abstract dependency trees
        • Specify grammatical relations between lexical items and the constituents dominating those lexical items
        • Alpino dependency trees without adjacency information
        • More variation through underspecification of lexical items and of the handling of particles
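To make the idea of a dependency tree without adjacency information concrete, the sketch below models a node as a relation label plus either a category or a lemma, with its children stored in an unordered set. The `DepNode` class is a hypothetical structure for illustration and is not Alpino's own input format; the relation and category labels (`su`, `hd`, `obj1`, `det`, `smain`, `np`) follow Alpino's conventions, while the `top` label and the example tree are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DepNode:
    """One node of an abstract dependency tree (hypothetical structure)."""
    rel: str                      # grammatical relation to the parent
    cat: Optional[str] = None     # category, for constituents
    root: Optional[str] = None    # lemma, for lexical items (no inflection given)
    children: FrozenSet["DepNode"] = frozenset()  # unordered: no adjacency


def lemmas(node):
    """Collect the lemmas of all lexical items below `node`, sorted."""
    found = [node.root] if node.root is not None else []
    for child in node.children:
        found.extend(lemmas(child))
    return sorted(found)


def np(rel, det, head):
    """Build a simple determiner + noun constituent."""
    return DepNode(rel, cat="np", children=frozenset(
        [DepNode("det", root=det), DepNode("hd", root=head)]))


# Two specifications that list the same dependents in a different order.
tree_a = DepNode("top", cat="smain", children=frozenset(
    [np("su", "de", "man"), DepNode("hd", root="lees"), np("obj1", "een", "boek")]))
tree_b = DepNode("top", cat="smain", children=frozenset(
    [np("obj1", "een", "boek"), np("su", "de", "man"), DepNode("hd", root="lees")]))

print(tree_a == tree_b)  # the order of dependents carries no information
print(lemmas(tree_a))
```

Because word order and inflection are left open, one such tree can correspond to several surface sentences, which is what makes a later fluency ranking step necessary.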

  13. Sentence generation
     • Initial implementation of the generator:
        • Chart generator (Kay, 1996)
        • Top-down guidance through expected dependency relations
        • Generates a substantial part of the input created from the Alpino test suites
        • Included in recent Alpino versions
     • Further work: optimization (time and space)

  14. Sentence generation
     • Selecting the most fluent sentence through fluency ranking:
        • N-gram language model
        • Log-linear model
     • Experiments with Velldal (2007) and parse disambiguation feature templates
        • Need more insight into feature overlap
        • Experiment with more feature templates
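A minimal sketch of fluency ranking along these lines: an add-one smoothed bigram language model supplies one feature, a log-linear model combines features with weights, and the candidate realizations are sorted by the resulting probability. The toy corpus, the `length` feature, and the weights are assumptions for illustration; the feature templates named on the slide are not reproduced here.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their histories over whitespace-tokenized sentences."""
    histories, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    return histories, bigrams, len(vocab) + 1  # +1 reserves mass for unseen words


def lm_logprob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence."""
    histories, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (histories[a] + vocab_size))
               for a, b in zip(tokens, tokens[1:]))


def features(sentence, model):
    """Feature vector of one realization (both features are illustrative)."""
    return {"lm_logprob": lm_logprob(sentence, model),
            "length": float(len(sentence.split()))}


def rank(candidates, model, weights):
    """Sort candidates by log-linear probability: p(c) ~ exp(w . f(c))."""
    scores = [sum(weights.get(name, 0.0) * value
                  for name, value in features(c, model).items())
              for c in candidates]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    ranked = [(c, math.exp(s - log_z)) for c, s in zip(candidates, scores)]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


corpus = ["de man leest een boek",
          "de vrouw leest een krant",
          "een man koopt de krant"]
model = train_bigram(corpus)
candidates = ["een boek leest de man",
              "de man leest een boek",
              "leest de man een boek"]
for sentence, prob in rank(candidates, model, {"lm_logprob": 1.0, "length": -0.1}):
    print(f"{prob:.3f}  {sentence}")
```

In a real setting the weights would be estimated from annotated data rather than set by hand, and overlapping features (for example several that all reflect word order) would make the individual weights harder to interpret.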

  15. Sentence generation
     • Evaluation:
        • Corpus sentences used as a reference for the most fluent realization
        • Fairly strict, since there can be multiple fluent sentences
        • Where is the ceiling?
     • More annotated material!
        • FLAN: FLuency ANnotator (web application)
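The strictness of using the corpus sentence as the single reference can be illustrated with a small sketch. It computes exact-match accuracy of the top-ranked realization against the reference and, as an assumed softer alternative that the slides do not mention, a token-level similarity from Python's `difflib`. The function names and the example pairs are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(sentence):
    """Lowercase and tokenize on whitespace."""
    return sentence.lower().split()


def exact_match_accuracy(pairs):
    """Share of (generated, reference) pairs that are identical."""
    hits = sum(normalize(gen) == normalize(ref) for gen, ref in pairs)
    return hits / len(pairs)


def mean_similarity(pairs):
    """Average token-level similarity ratio (assumed softer measure)."""
    ratios = [SequenceMatcher(None, normalize(gen), normalize(ref)).ratio()
              for gen, ref in pairs]
    return sum(ratios) / len(ratios)


pairs = [
    # (top-ranked realization, corpus sentence)
    ("de man leest vandaag een boek", "de man leest vandaag een boek"),
    ("vandaag leest de man een boek", "de man leest vandaag een boek"),
]
print(exact_match_accuracy(pairs))
print(mean_similarity(pairs))
```

The second pair is a fluent sentence that still counts as an error under exact match, which is why additional fluency annotations would help to estimate where the ceiling lies.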

  16. Thanks!
