
Topic Segmentation with Shared Topic Detection and Alignment of Multiple Documents




Presentation Transcript


  1. Topic Segmentation with Shared Topic Detection and Alignment of Multiple Documents Bingjun Sun, Prasenjit Mitra, Hongyuan Zha, C. Lee Giles, John Yen

  2. Introduction • Capture the local and sequential information of documents • Topic detection and tracking • Topic segmentation • Two issues of topic segmentation • Text stream segmentation • Application: automatic speech recognition • Coherent document segmentation • A document is split into sub-topics • Application: partial-text queries of long documents in information retrieval • Problem of topic segmentation • Traditional approaches perform topic segmentation on documents one at a time, so they cannot exploit information shared across related documents, and most of them perform poorly.

  3. Introduction • Dealing with stop words: • Do not remove stop words outright • Removing common stop words may result in the loss of useful information in a specific domain • A hard classification into stop words and non-stop words cannot represent the gradually changing amount of information carried by each word • Instead, employ a soft classification using term weights

  4. The main structure of this paper [Diagram: each document is divided into segments, and each segment consists of sentences]

  5. The main structure of this paper

  6. Mutual Information • Measure how dependent T and S are: I(T; S) = Σ_t Σ_s p(t, s) log [ p(t, s) / ( p(t) p(s) ) ] • Minimizing the loss of mutual information caused by clustering terms and segments is equivalent to maximizing the mutual information of the clustered representation
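The MI objective above can be made concrete with a short illustrative implementation (not code from the paper) that computes I(T; S) from a joint count matrix of term clusters versus segments:

```python
import numpy as np

def mutual_information(counts):
    """I(T; S) in bits from a joint count matrix
    (rows: term clusters T, columns: segments S)."""
    p = counts / counts.sum()          # joint distribution p(t, s)
    pt = p.sum(axis=1, keepdims=True)  # marginal p(t), column vector
    ps = p.sum(axis=0, keepdims=True)  # marginal p(s), row vector
    mask = p > 0                       # skip zero cells to avoid log(0)
    return float((p[mask] * np.log2(p[mask] / (pt @ ps)[mask])).sum())
```

A uniform joint matrix (terms independent of segments) gives MI near zero; a matrix where each term cluster concentrates in one segment gives a high value, which is what the segmentation objective rewards.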

  7. Weighted Mutual Information • Categorize terms into four types • Common stop words • Common stop words are common along both the document and segment dimensions. • Document-dependent stop words • Document-dependent stop words, which depend on personal writing style, are common only along the segment dimension, and only for some documents. • Cue words • Cue words are common along the document dimension only for the same segment; they are not common along the segment dimension. • Noisy words • Noisy words are the remaining words, which are not common along either dimension.
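As an illustration of this taxonomy (the paper computes soft term weights rather than hard labels; the function names, threshold, and data layout below are assumptions), a term can be classified by the normalized entropy of its counts along the document and segment dimensions:

```python
import math

def norm_entropy(freqs):
    """Shannon entropy of a frequency list, normalized to [0, 1]."""
    total = sum(freqs)
    if total == 0 or len(freqs) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in freqs if c > 0)
    return h / math.log(len(freqs))

def categorize(term_counts, threshold=0.5):
    """term_counts[d][s] = count of one term in segment s of document d.
    A term is 'common' along a dimension when its counts are spread out
    (high normalized entropy) along that dimension."""
    doc_totals = [sum(row) for row in term_counts]        # per-document counts
    seg_totals = [sum(col) for col in zip(*term_counts)]  # per-segment counts
    common_docs = norm_entropy(doc_totals) > threshold
    common_segs = norm_entropy(seg_totals) > threshold
    if common_docs and common_segs:
        return "common stop word"
    if common_segs:
        return "document-dependent stop word"
    if common_docs:
        return "cue word"
    return "noisy word"
```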

  8. Weighted Mutual Information

  9. Weighted Mutual Information • Common stop words • Document-dependent stop words • Cue words • Noisy words

  10. Weighted Mutual Information

  11. Algorithm • Stage 1: initialization • Segmentation: • Segment each document equally by sentences. • Find, for each document independently, the segmentation that maximizes the WMI. • Alignment: • Assume that the order of segments is the same in every document. • Term clusters: • Randomly assign cluster labels.
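The equal-length initialization step can be sketched as follows (an illustrative helper, not code from the paper):

```python
def equal_segmentation(num_sentences, num_segments):
    """Initial segmentation: split the sentences of a document into
    (nearly) equal-sized contiguous segments. Returns the segment
    index of each sentence, in order."""
    base, extra = divmod(num_sentences, num_segments)
    labels = []
    for seg in range(num_segments):
        size = base + (1 if seg < extra else 0)  # spread the remainder
        labels.extend([seg] * size)
    return labels

print(equal_segmentation(7, 3))  # → [0, 0, 0, 1, 1, 2, 2]
```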

  12. Algorithm • Stage 2: iterative optimization • Find the best term clustering • For each document, check all sequential segmentations and keep the best one. • The cycle is repeated until it converges to a local maximum of the MI.
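The inner step of Stage 2 (exhaustively checking the sequential segmentations of one document and keeping the one with the highest MI) can be sketched for the two-segment case; this is illustrative code, and the paper's full algorithm also reassigns term clusters and uses weighted MI:

```python
import numpy as np

def mi(joint):
    """Mutual information (bits) of a joint count matrix."""
    p = joint / joint.sum()
    pt = p.sum(axis=1, keepdims=True)
    ps = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log2(p[m] / (pt @ ps)[m])).sum())

def best_boundary(sent_term):
    """sent_term: sentences x term-clusters count matrix for one document.
    Try every single boundary position and return the one whose
    two-segment split maximizes MI between term clusters and segments."""
    n = sent_term.shape[0]
    best, best_mi = None, -1.0
    for b in range(1, n):  # boundary placed after sentence b-1
        joint = np.stack([sent_term[:b].sum(axis=0),
                          sent_term[b:].sum(axis=0)], axis=1)
        score = mi(joint)
        if score > best_mi:
            best, best_mi = b, score
    return best, best_mi
```

On a toy document whose first two sentences use only term cluster 0 and last two only term cluster 1, the search places the boundary in the middle, where the term/segment dependence is strongest.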

  13. Evaluation • There are predicted segmentation results and real segmentation results • real: the true segmentation • pred: the predicted segmentation • diff: pair of words in different segments • same: pair of words really in the same segment • miss: pair of words in different segments in the real segmentation, but predicted to be in the same segment • false alarm: pair of words really in the same segment, but predicted to be in different segments
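These pair counts can be turned into error rates with a short sketch. Note that this simplified version considers all pairs, whereas metrics of this family typically restrict to pairs a fixed distance k apart; the function name and label-list representation are assumptions:

```python
from itertools import combinations

def pair_error(real, pred):
    """real, pred: segment label of each item, in document order.
    Returns (miss rate, false-alarm rate) over all item pairs:
    miss = really different segments but predicted the same;
    false alarm = really the same segment but predicted different."""
    miss = false_alarm = 0
    pairs = list(combinations(range(len(real)), 2))
    for i, j in pairs:
        real_same = real[i] == real[j]
        pred_same = pred[i] == pred[j]
        if not real_same and pred_same:
            miss += 1
        elif real_same and not pred_same:
            false_alarm += 1
    return miss / len(pairs), false_alarm / len(pairs)
```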

  14. Evaluation [Formula: p(error) = p(miss) · p(diff) + p(false alarm) · p(same)]

  15. Experiment 1 • 700 samples • Each is a concatenation of ten segments; each segment consists of the first n sentences of a document selected randomly from the Brown corpus.

  16. Experiment 2 - shared topic detection • 80 news articles from Google News • 8 topics, each with 10 articles • [Formula: similarity of a pair of documents, based on the sets of sentences assigned to the same topic]

  17. Experiment 3 - multi-document segmentation • The data set consists of the introduction parts of lab reports selected from the corpus of Biol 240W, Pennsylvania State University. • 102 samples and 2264 sentences. • Each sample has two segments. • The samples range in length from 2 to 56 sentences.

  18. Experiment 3 - multi-document segmentation

  19. Result

  20. Result

  21. Conclusion • This paper proposed a novel method for multi-document topic segmentation and alignment based on weighted mutual information, which can also handle single-document cases. • Future work: natural segmentations such as paragraphs are hints that can be used to find the optimal boundaries; supervised learning could also be applied.

  22. Title generation

  23. System design • Training corpus → term frequency of each word in the documents and in the titles → count P(T | W) • New document input → term frequency of each word in the new document → P(w) • Combine using the NBL and NBF systems → output the top K words → title generation

  24. NBL (Naïve Bayesian Title Generation with Limited Vocabulary) • From the training corpus, count the occurrence of each document-word/title-word pair whose document word and title word are the same. • Term frequency of word w in the training corpus • Term frequency of word w in the corresponding titles in the training corpus

  25. NBL (Naïve Bayesian Title Generation with Limited Vocabulary) • To generate a title for a new document • The more often w appears in the document, the more important w is for the document's title.
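A minimal sketch of the NBL idea, assuming the limited-vocabulary estimate is P(w appears in the title | w appears in the document), taken as the ratio of the two term frequencies on the previous slide (the original estimator's details may differ, and the function names are illustrative):

```python
from collections import Counter

def train_nbl(corpus):
    """corpus: list of (document_words, title_words) pairs.
    For each word w, estimate P(w appears in the title | w appears
    in the document) from matching document-word/title-word pairs."""
    doc_tf, title_tf = Counter(), Counter()
    for doc_words, title_words in corpus:
        doc_tf.update(doc_words)
        title_set = set(title_words)
        # count only document occurrences whose word also appears in the title
        title_tf.update(w for w in doc_words if w in title_set)
    return {w: title_tf[w] / doc_tf[w] for w in doc_tf}

def generate_title(model, doc_words, k=5):
    """Score each word by tf(w, doc) * P(title | w); return the top k,
    reflecting that frequent document words matter more for the title."""
    tf = Counter(doc_words)
    scores = {w: tf[w] * model.get(w, 0.0) for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```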

  26. NBF (Naïve Bayesian Approach with Full Vocabulary) • Count the occurrence of document-word/title-word pairs where the document word is dw and the title word is tw. • Estimate the probability P(tw | dw). • Combine with the information of the new document to generate a probability for each possible title word.
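A minimal sketch of NBF under the same illustrative assumptions: count every document-word/title-word co-occurrence, normalize into P(tw | dw), then score candidate title words for a new document:

```python
from collections import Counter, defaultdict

def train_nbf(corpus):
    """corpus: list of (document_words, title_words) pairs.
    Count every document-word / title-word co-occurrence (full
    vocabulary) and normalize each row into P(tw | dw)."""
    pair_counts = defaultdict(Counter)
    for doc_words, title_words in corpus:
        for dw in doc_words:
            for tw in title_words:
                pair_counts[dw][tw] += 1
    return {dw: {tw: c / sum(cnt.values()) for tw, c in cnt.items()}
            for dw, cnt in pair_counts.items()}

def score_title_words(model, doc_words, k=5):
    """P(tw | document) ≈ sum over document words of P(dw) * P(tw | dw);
    return the k highest-scoring candidate title words."""
    tf = Counter(doc_words)
    total = sum(tf.values())
    scores = Counter()
    for dw, n in tf.items():
        for tw, p in model.get(dw, {}).items():
            scores[tw] += (n / total) * p
    return [tw for tw, _ in scores.most_common(k)]
```

Unlike NBL, a title word can be generated even if it never appears in the new document, because any co-occurring document word can vote for it.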

  27. NBL

  28. NBF
