
Summarization of XML Documents

K Sarath Kumar


Presentation Transcript


  1. Summarization of XML Documents K Sarath Kumar

  2. Outline
     • Motivation
     • System for XML Summarization
     • Ranking Model and Summary Generation
     • Example Summaries
     • Conclusion and Future Work

  3. Motivation
     [Figure: an XML document collection (e.g., IMDB) and a single XML document]
     • Types of XML document summaries
       • Generic summary: summarizes the entire contents of the document.
       • Query-biased summary: summarizes the parts of the document that are relevant to the user's query.

  4. Aims
     • We aim at summaries that are:
       • Generated automatically
       • Highly constrained by size
       • Highly informative
       • High in coverage
     • Challenges
       • Structure is as important as text
       • Varying text length

  5. System for XML Summarization
     [Figure: system architecture. Labels: XML Doc, Info Unit Generator, Tag Units, Text Units, Ranking Unit (Tag Ranker, Text Ranker), Corpus Statistics, Ranked Tag Units, Ranked Text Units, Summary Size, Summary Generator, Summary]

  6. Information Units of an XML Document
     • Tag
       • Regarded as metadata
       • Can be highly redundant
       • Can be encoded into a schema or DTD
     • Text
       • Instance for the tag
       • Much less redundant
       • Have different sizes
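
The split into tag units and text units can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function name `info_units` and the toy document are assumptions, and only the standard-library `xml.etree.ElementTree` parser is used.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def info_units(xml_string):
    """Split an XML document into tag units and text units.

    Returns (tag_counts, text_units), where tag_counts maps each tag name
    to its number of occurrences and text_units is a list of (tag, text)
    pairs for elements that carry non-empty text.
    """
    root = ET.fromstring(xml_string)
    tag_counts = Counter()
    text_units = []
    for elem in root.iter():  # document order, root included
        tag_counts[elem.tag] += 1
        text = (elem.text or "").strip()
        if text:
            text_units.append((elem.tag, text))
    return tag_counts, text_units


# Toy document (hypothetical, in the spirit of an IMDB-style record).
doc = (
    "<movie><title>Titanic</title>"
    "<actor>Kate Winslet</actor><actor>Leonardo DiCaprio</actor></movie>"
)
tags, texts = info_units(doc)
```

Counting tags separately from their text reflects the slide's point that tags repeat heavily while text instances rarely do.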

  7. Ranking Unit
     I. Tag Ranking
     Typicality: How salient is the tag in the corpus? E.g.: <title>
       • Typical tags define the context of the document
       • They occur regularly in most or all of the documents
       • Quantified by the fraction of documents in which the tag occurs (df)
     Specialty: Does the tag occur more or less frequently in this document than usual?
       • Special tags denote a special aspect of the current document
       • They occur many more or many fewer times in the current document than usual
       • Quantified by the deviation from the average number of occurrences per document
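
The two quantities can be sketched in a few lines. This is an illustration under stated assumptions: the slides do not say which deviation measure is used, so the absolute difference from the corpus mean below is an assumption, as are the function names and the toy corpus of per-document tag counts.

```python
from collections import Counter


def typicality(tag, corpus):
    """Document frequency (df): fraction of documents containing the tag."""
    return sum(1 for doc in corpus if doc[tag] > 0) / len(corpus)


def specialty(tag, doc, corpus):
    """Deviation of this document's tag count from the corpus average.

    Assumption: absolute deviation from the mean; the slides only say
    "deviation from average number of occurrences per document".
    """
    average = sum(d[tag] for d in corpus) / len(corpus)
    return abs(doc[tag] - average)


# Hypothetical corpus: one Counter of tag occurrences per document.
corpus = [
    Counter(title=1, actor=10),
    Counter(title=1, actor=2, trivia=4),
    Counter(title=1, actor=3),
]

title_df = typicality("title", corpus)            # tag present in every document
actor_dev = specialty("actor", corpus[0], corpus)  # first doc has unusually many actors
```

A tag such as `title` scores high on typicality but zero on specialty, while the `actor` tag in the first document is special because its count is far from the corpus average.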

  8. II. Text Ranking
     • Two categories of text
       • Entities
       • Regular text

  9. Ranking is done based on the context of occurrence.
     • No redundancy in tag context (e.g., actor names, genre)
     • Redundancy in tag context (e.g., plots, goofs, trivia items)
     [Figure labels: tag context, document context, corpus context]

  10. Correlated tags and text
      • Related tag units are often found as siblings of each other, e.g., Actor and Role
      Inclusion Principle
      Let and
      Case 1:
      Case 2:

  11. Generation of Summary
      Consider the following tag rank table:
      To generate a summary with 30 tags, 15 actor tags, 9 keyword tags and 6 trivia tags would be required.
      • When the document has fewer tags of some type than required, distribute the remaining “tag-budget” by re-normalizing the distribution over the available tags
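
The budget distribution can be sketched as follows. The tag rank table is not preserved in this transcript, so the probabilities 0.5, 0.3 and 0.2 are inferred from the 15/9/6 split of 30 tags; the function name, the rounding rule (largest remainder) and the availability counts are assumptions made for illustration.

```python
import math


def allocate_tag_budget(probs, available, budget):
    """Distribute a tag budget in proportion to tag probabilities.

    If a tag type has fewer instances than its share, it is saturated and
    the leftover budget is re-normalized over the remaining tag types.
    """
    alloc = {t: 0 for t in probs}
    open_tags = [t for t in probs if available.get(t, 0) > 0]
    remaining = budget
    while remaining > 0 and open_tags:
        total = sum(probs[t] for t in open_tags)
        shares = {t: remaining * probs[t] / total for t in open_tags}
        capped = [t for t in open_tags if shares[t] >= available[t]]
        if capped:
            # Take everything these tag types offer, then re-normalize.
            for t in capped:
                alloc[t] = available[t]
                remaining -= available[t]
            open_tags = [t for t in open_tags if t not in capped]
            continue
        # No cap binds: round the shares with the largest-remainder rule.
        floors = {t: math.floor(shares[t]) for t in open_tags}
        leftover = remaining - sum(floors.values())
        by_remainder = sorted(
            open_tags, key=lambda t: shares[t] - floors[t], reverse=True
        )
        for t in open_tags:
            alloc[t] = floors[t]
        for t in by_remainder[:leftover]:
            alloc[t] += 1
        remaining = 0
    return alloc


# Probabilities inferred from the slide's 15/9/6 example (assumption).
probs = {"actor": 0.5, "keyword": 0.3, "trivia": 0.2}
plenty = {"actor": 100, "keyword": 100, "trivia": 100}
few_actors = {"actor": 5, "keyword": 100, "trivia": 100}

full = allocate_tag_budget(probs, plenty, 30)
redistributed = allocate_tag_budget(probs, few_actors, 30)
```

With ample tags the sketch reproduces the 15/9/6 split. When only 5 actor tags exist, the 25 remaining slots are shared between keyword and trivia in the ratio 0.3 : 0.2.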

  12. Generating the summary with 30 tags

  13. A Few Example Summaries
      Titanic.xml - Summaries

  14. Conclusion
      • A fully automated XML summary generator
      • Ranking of tags and text based on the ranking model
      • Generation of the summary from ranked tags and text within a memory budget
      • User evaluation is underway
      Future Work
      • Rewriting the structure of XML documents during summarization
      • Possible use of text summarizers for long text
      • Query-biased XML summary generation

  15. Thanks!

  16. Appendix: Informativeness

  17. Coverage

  18. Ranking Model
      I. TAG RANKER
      Mixture model of typicality and specialty
      • Typicality: How typical is the tag in the corpus?

  19. Specialty : How unusually frequent/infrequent is the tag in the current document compared to an average document of the corpus?

  20. Text with redundancy in tag context
      Sort terms by frequency and take the top ‘m’ terms as the centroid query
      Relevance:
      Similarity:
      Calculated using Maximal Marginal Relevance (MMR)
      Finally,
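
The formulas for this slide are not preserved in the transcript, so the following is only a sketch of the standard MMR selection scheme applied to a centroid query. The whitespace tokenizer, cosine similarity over raw term counts, the trade-off weight `lam` and all function names are assumptions, not the presenter's definitions.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words term frequencies (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid_query(texts, m):
    """Top-m terms by total frequency across all texts of the tag."""
    total = Counter()
    for text in texts:
        total.update(term_counts(text))
    return Counter(dict(total.most_common(m)))


def mmr_rank(texts, m=5, lam=0.5):
    """Order texts by MMR: relevant to the centroid, dissimilar to picks so far."""
    query = centroid_query(texts, m)
    vectors = [term_counts(t) for t in texts]
    selected, candidates = [], list(range(len(texts)))

    def score(i):
        relevance = cosine(vectors[i], query)
        redundancy = max(
            (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1 - lam) * redundancy

    while candidates:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [texts[i] for i in selected]


# Hypothetical trivia items; the first two are exact duplicates.
items = [
    "the ship hits an iceberg",
    "the ship hits an iceberg",
    "rose survives the sinking",
]
ranked = mmr_rank(items, m=5, lam=0.5)
```

In this toy input the duplicate item is pushed to the end of the ranking, because its similarity to the already selected copy cancels its relevance to the centroid query.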

  21. Text without redundancy in tag context
      Redundancy at tag level:
      No redundancy at tag level:
      is set empirically

  22. A Relative Count Matrix is constructed
      • Given two tags Ti and Tj, where Ti is ranked higher, the importance of Ti relative to Tj is calculated by dividing both P(Ti|D) and P(Tj|D) by P(Tj|D); the resulting ratio P(Ti|D)/P(Tj|D) shows how many Ti tags correspond to one Tj tag.
      • Tj is considered only after P(Ti|D)/P(Tj|D) Ti tags have been considered.
      • Extending this concept to all pairs of tags, a matrix of relative counts can be formed.
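
A relative count matrix of this kind can be sketched as a table of pairwise probability ratios. The probabilities below are hypothetical values chosen for illustration, and the function name and dictionary layout are assumptions.

```python
def relative_count_matrix(tag_probs):
    """Build matrix[ti][tj] = P(ti|D) / P(tj|D) for every pair of tags.

    Tags are returned in descending order of probability. All
    probabilities are assumed to be strictly positive.
    """
    tags = sorted(tag_probs, key=tag_probs.get, reverse=True)
    matrix = {
        ti: {tj: tag_probs[ti] / tag_probs[tj] for tj in tags} for ti in tags
    }
    return tags, matrix


# Hypothetical P(T|D) values for one document.
tag_probs = {"trivia": 0.2, "actor": 0.5, "keyword": 0.3}
order, matrix = relative_count_matrix(tag_probs)
actors_per_trivia = matrix["actor"]["trivia"]
```

Read along a row, the entry for (actor, trivia) says roughly how many actor tags are considered before one trivia tag is admitted, which is 2.5 for these values.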

  23. Ocean’s Eleven.xml - Summaries

  24. Generating the summary with 30 tags
